Remove dependency on Apache Commons HttpClient 3.1 #107

Merged: ato merged 1 commit into master from remove-httpclient-3.1 on May 20, 2025

ato (Member) commented on May 19, 2025

HttpClient 3 was discontinued in 2007 and frequently triggers alerts in dependency vulnerability scanners. We're also not using much of it anymore, with one big exception.

The URI class is the foundation of UsableURI and central to Heritrix, which has made removing the library difficult. URIException in particular appears a lot in client code. HttpClient 4+ has switched to java.net.URI, and the main reason Heritrix was built on HttpClient's URI instead is that java.net.URI is not flexible and differs from how browsers behave. (Although how browsers behave has shifted over time.)
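
To illustrate the strictness mentioned above, here is a minimal sketch using only the JDK. The class name `StrictUriDemo` and the sample URLs are made up for illustration and are not part of webarchive-commons. It shows java.net.URI rejecting inputs that contain an unescaped space or pipe character, which browsers generally tolerate:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictUriDemo {
    /** Returns true if java.net.URI accepts the input without throwing. */
    static boolean parses(String input) {
        try {
            new URI(input);
            return true;
        } catch (URISyntaxException e) {
            // java.net.URI refuses characters outside its legal set
            return false;
        }
    }

    public static void main(String[] args) {
        // Already percent-encoded, so it is accepted
        System.out.println(parses("http://example.com/a%20b"));
        // Unescaped space in the path
        System.out.println(parses("http://example.com/a b"));
        // Unescaped pipe in the query
        System.out.println(parses("http://example.com/?q=a|b"));
    }
}
```

A crawler regularly meets URLs like the last two in real pages, which is presumably why a more lenient URI class was attractive for Heritrix.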

Eventually we'll probably need to rework Heritrix's URI handling to follow the WHATWG URL spec. However, to let us remove the dependency while keeping UsableURI working, this copies HttpClient 3's URI, URIException and ChunkedInputStream, with some small tweaks to remove their dependency on other classes in HttpClient. The HttpClient Header class is replaced with our existing HttpHeader. URI and ChunkedInputStream are marked package-private for now.

This is a breaking API change and will require a major version bump. If you're using webarchive-commons, you'll need to make the following updates to your code:

  1. Update URIException import

    Before:

    import org.apache.commons.httpclient.URIException;

    After:

    import org.archive.url.URIException;

  2. Replace Header with HttpHeader

    Before:

    import org.apache.commons.httpclient.Header;
    
    Header[] headers = LaxHttpParser.parseHeaders(stream, charset);

    After:

    import org.archive.format.http.HttpHeader;
    
    HttpHeader[] headers = LaxHttpParser.parseHeaders(stream, charset);

Fixes #78

ato merged commit 52f8abf into master on May 20, 2025
7 checks passed
ato deleted the remove-httpclient-3.1 branch on May 20, 2025 06:29
Linked issue: commons-httpclient-3.1 vulnerability