WAT extractor: Overlong truncated HTTP request header line causes an exception and loss of the request record #32

If a WARC request record contains an overlong and truncated HTTP request header line (normally of the form `GET /path HTTP/1.1`), HttpRequestMessageParser throws an exception, so the request record is not transformed into a WAT record. If the calling code does not handle the exception, the entire WAT/WET extractor job (commoncrawl/ia-hadoop-tools) may fail.
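
A minimal sketch of how calling code could handle such a failure per record instead of aborting the whole job. All names here (`SkipBadRecordsSketch`, `RecordParseException`, `convert`, `extractAll`) are hypothetical stand-ins and not part of ia-web-commons or ia-hadoop-tools; the toy `convert` only mimics a parser that rejects a request line without an HTTP version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: hypothetical names, not the ia-web-commons API. */
final class SkipBadRecordsSketch {

    /** Stand-in for a checked per-record parse failure. */
    static final class RecordParseException extends Exception {
        RecordParseException(String message) {
            super(message);
        }
    }

    /** Toy converter: rejects request lines that lack an HTTP version. */
    static String convert(String requestLine) throws RecordParseException {
        String[] parts = requestLine.trim().split(" +");
        if (parts.length != 3 || !parts[2].startsWith("HTTP/")) {
            throw new RecordParseException("unparseable request line");
        }
        return parts[0] + " " + parts[2];
    }

    /** Converts what it can; failing lines are collected instead of aborting the loop. */
    static List<String> extractAll(List<String> requestLines, List<String> skipped) {
        List<String> converted = new ArrayList<>();
        for (String line : requestLines) {
            try {
                converted.add(convert(line));
            } catch (RecordParseException e) {
                skipped.add(line); // remember the bad record and continue
            }
        }
        return converted;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<String> converted = extractAll(
                Arrays.asList("GET /a HTTP/1.1", "GET /path-truncated", "POST /b HTTP/1.0"),
                skipped);
        System.out.println("converted: " + converted.size() + ", skipped: " + skipped.size());
    }
}
```

This only keeps the job alive; the request record is still lost unless the parser itself tolerates the truncated line.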

The issue was observed on a couple of WARC files of CC-MAIN-2023-40:
1. The overlong HTTP request header lines were truncated after 8 KiB, so the line was only `GET /path-truncated`, which caused HttpRequestMessageParser to fail (no HTTP version). To be investigated in separate issues:
   a. why the truncation happened (commoncrawl/nutch: in the WARC writer, or at the protocol level when recording the HTTP communication between crawler and web server?)
   b. where these URLs stem from and whether the URL filters need to be tightened to avoid similar errors.
2. Fix HttpRequestMessageParser: it should correctly recognize that the header line is truncated instead of failing with `Response Message too long`:
    ```
   $> java -cp target/webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -wat CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz >CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.wat.gz
   org.archive.resource.ResourceParseException: org.archive.format.http.HttpParseException: Response Message too long
        at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:34)
        at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:40)
        at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
        at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:63)
   Caused by: org.archive.format.http.HttpParseException: Response Message too long
        at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:43)
        at org.archive.format.http.HttpRequestParser.parse(HttpRequestParser.java:18)
        at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:27)
        ... 4 more
    ```
3. Also fix HttpResponseMessageParser (the code was evidently copy-pasted).
4. If a WARC file containing only the single request record is parsed, the exception changes:
    ```
   $> java -cp target/webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -wat CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz >CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.wat.gz
   org.archive.resource.ResourceParseException: org.archive.format.http.HttpParseException: No spaces in message
        at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:34)
        at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:40)
        at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
        at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:63)
   Caused by: org.archive.format.http.HttpParseException: No spaces in message
        at org.archive.format.http.HttpRequestMessageParser.parseLax(HttpRequestMessageParser.java:176)
        at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:49)
        at org.archive.format.http.HttpRequestMessageParser.parse(HttpRequestMessageParser.java:39)
        at org.archive.format.http.HttpRequestParser.parse(HttpRequestParser.java:18)
        at org.archive.resource.http.HTTPRequestResourceFactory.getResource(HTTPRequestResourceFactory.java:27)
        ... 4 more
   ```
[CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz](https://github.com/commoncrawl/ia-web-commons/files/12827549/CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request.warc.gz)
[CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz](https://github.com/commoncrawl/ia-web-commons/files/12827550/CC-MAIN-20230922102329-20230922132329-00140.overlong-get-request-only.warc.gz)
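
One possible direction for the parser fix in item 2, as a rough sketch only: treat a request line without a trailing `HTTP/x.y` token as truncated and still return the method and the (cut-off) target, rather than throwing. The class `LenientRequestLineSketch`, its nested `RequestLine` type, and the `truncated` flag are hypothetical and do not reflect the actual HttpRequestMessageParser code.

```java
/** Illustration only: hypothetical names, not the HttpRequestMessageParser implementation. */
final class LenientRequestLineSketch {

    /** Parsed request line; version is null and truncated is true when the line was cut off. */
    static final class RequestLine {
        final String method;
        final String target;
        final String version;
        final boolean truncated;

        RequestLine(String method, String target, String version, boolean truncated) {
            this.method = method;
            this.target = target;
            this.version = version;
            this.truncated = truncated;
        }
    }

    static RequestLine parse(String line) {
        String s = line.trim();
        int firstSpace = s.indexOf(' ');
        if (firstSpace < 0) {
            // Only a single token: keep it as the method and flag the line.
            return new RequestLine(s, null, null, true);
        }
        String method = s.substring(0, firstSpace);
        String rest = s.substring(firstSpace + 1).trim();
        int lastSpace = rest.lastIndexOf(' ');
        if (lastSpace >= 0 && rest.startsWith("HTTP/", lastSpace + 1)) {
            // Complete line: method, target, version.
            String target = rest.substring(0, lastSpace).trim();
            return new RequestLine(method, target, rest.substring(lastSpace + 1), false);
        }
        // No HTTP version at the end: assume the line was truncated.
        return new RequestLine(method, rest, null, true);
    }

    public static void main(String[] args) {
        RequestLine r = parse("GET /path-truncated");
        System.out.println(r.method + " " + r.target + " truncated=" + r.truncated);
    }
}
```

Whether a truncated line should yield a WAT record with a marker or be reported in some other way is a design decision this sketch leaves open.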
