HTTPS via a Proxy #64

I've been trying to crawl an HTTPS site through a Squid proxy and keep seeing errors like these:

```
java.io.IOException: RIS already open for ToeThread #12: https://XXX/robots.txt
   at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
   at org.archive.util.Recorder.inputWrap(Recorder.java:185)
   at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:649)
   at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
```

HTTP sites are fine, but HTTPS just doesn't seem to work. The problem seems to come down to `RecordingInputStream` and `RecordingOutputStream`, both of which throw an `IOException` from `open()` if the underlying stream is already set (`!= null`).

If, however, I comment out those checks, the HTTPS crawl works perfectly (as far as I can tell...). I'm not sure whether this is the `webarchive-commons` library being overly cautious or `heritrix3` failing to do something for HTTPS sites.
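To make the failure mode concrete, here is a minimal sketch of the kind of guard described above. It is not the actual `webarchive-commons` code: `GuardedRecordingStream` and `ProxyTunnelDemo` are hypothetical names, and the idea that the connection's input stream gets wrapped twice (once for the plain-text `CONNECT` exchange with the proxy, then again after the socket is upgraded to TLS) is my assumption about why only proxied HTTPS is affected, not something I have confirmed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a recording stream; NOT the webarchive-commons class. */
class GuardedRecordingStream {
    private InputStream in; // null while the recorder is closed

    /** Same style of guard as described above: refuse a second open(). */
    void open(InputStream wrapped) throws IOException {
        if (in != null) {
            throw new IOException("stream already open");
        }
        in = wrapped;
    }

    /** Releases the wrapped stream so the recorder can be opened again. */
    void close() throws IOException {
        if (in != null) {
            in.close();
            in = null;
        }
    }
}

class ProxyTunnelDemo {
    /**
     * Simulates wrapping a connection's input twice: first the plain socket
     * used for CONNECT, then the (assumed) TLS-upgraded socket.
     * Returns true if the second open() is rejected.
     */
    static boolean secondOpenFails(boolean closeBetween) {
        GuardedRecordingStream ris = new GuardedRecordingStream();
        byte[] connectReply = "HTTP/1.1 200 Connection established\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        try {
            ris.open(new ByteArrayInputStream(connectReply)); // plain-text proxy reply
            if (closeBetween) {
                ris.close(); // reset the recorder before the TLS layer is attached
            }
        } catch (IOException e) {
            throw new IllegalStateException("first open should not fail", e);
        }
        try {
            ris.open(new ByteArrayInputStream(new byte[0])); // stands in for the TLS stream
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("no close between opens: rejected = " + secondOpenFails(false));
        System.out.println("close between opens:    rejected = " + secondOpenFails(true));
    }
}
```

If that assumption holds, closing (or otherwise resetting) the recorder between the tunnel setup and the TLS upgrade might be a safer fix than removing the checks outright, since the guard would still catch genuine double-opens.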
