Skip to content

Do not add value of preceding HTTP header field if there is no value#74

Merged
ldko merged 1 commit into
iipc:masterfrom
sebastian-nagel:empty-http-header-field
May 2, 2017
Merged

Do not add value of preceding HTTP header field if there is no value#74
ldko merged 1 commit into
iipc:masterfrom
sebastian-nagel:empty-http-header-field

Conversation

@sebastian-nagel

Copy link
Copy Markdown
Collaborator

If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl#11 for an example and test data.

This PR fixes the problem and adds a unit test.

@ldko

ldko commented May 2, 2017

Copy link
Copy Markdown
Member

Looking at this, I think the code change is fine. However, I ran it with and without the change on the sample WARC file attached at commoncrawl#11 and found that in both cases I couldn't check the value of the Server header because the relevant records in the WAT files in both cases didn't have Actual-Content-Type or HTTP-Response-Metadata fields in the Payload-Metadata. I think this is because of this difference between your code and the iipc code regarding WARCs created with wget:

diff --git a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
index ad10be4..0afe16f 100644
--- a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
+++ b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
@@ -153,7 +153,10 @@ public class ExtractingResourceFactoryMapper implements ResourceFactoryMapper {
        private boolean isHTTPResponseWARCResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
                                WARCConstants.CONTENT_TYPE,
-                               WARCConstants.HTTP_RESPONSE_MIMETYPE);
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE)
+                       || childFieldEquals(envelope,WARC_HEADER_METADATA,
+                               WARCConstants.CONTENT_TYPE,
+                               WARCConstants.HTTP_RESPONSE_MIMETYPE_NS);
        }
        private boolean isWARCJSONResource(MetaData envelope) {
                return childFieldEquals(envelope,WARC_HEADER_METADATA,
diff --git a/src/main/java/org/archive/format/warc/WARCConstants.java b/src/main/java/org/archive/format/warc/WARCConstants.java
index 93a81f9..4f2fa57 100644
--- a/src/main/java/org/archive/format/warc/WARCConstants.java
+++ b/src/main/java/org/archive/format/warc/WARCConstants.java
@@ -209,7 +209,9 @@ public interface WARCConstants extends ArchiveFileConstants {
        "application/http; msgtype=request";
     public static final String HTTP_RESPONSE_MIMETYPE =
        "application/http; msgtype=response";
-    
+  public static final String HTTP_RESPONSE_MIMETYPE_NS =
+      "application/http;msgtype=response";                   // wget does this
+
     public static final String FTP_CONTROL_CONVERSATION_MIMETYPE =
         "text/x-ftp-control-conversation";

Does this seem correct to you, @sebastian-nagel ?

@sebastian-nagel

Copy link
Copy Markdown
Collaborator Author

Ok, I see. Of course, application/http;msgtype=request should also be covered. The fork commoncrawl/ia-web-commons contains still a couple of changes not merged into iipc/webarchive-commons, mainly the WETExtractor. I'm on course to push the remaining changes. But feel free to pull anything!

@ldko

ldko commented May 2, 2017

Copy link
Copy Markdown
Member

Ok, I will merge this along with the differences from WARCConstants.java and ExtractingResourceFactoryMapper.java. Thanks!

@ldko ldko merged commit cf34a3e into iipc:master May 2, 2017
@sebastian-nagel sebastian-nagel deleted the empty-http-header-field branch November 11, 2025 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants