You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl#11 for an example and test data.
Looking at this, I think the code change is fine. However, I ran it with and without the change on the sample WARC file attached at commoncrawl#11 and found that in both cases I couldn't check the value of the Server header because the relevant records in the WAT files in both cases didn't have Actual-Content-Type or HTTP-Response-Metadata fields in the Payload-Metadata. I think this is because of this difference between your code and the iipc code regarding WARCs created with wget:
diff --git a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
index ad10be4..0afe16f 100644
--- a/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
+++ b/src/main/java/org/archive/extract/ExtractingResourceFactoryMapper.java
@@ -153,7 +153,10 @@ public class ExtractingResourceFactoryMapper implements ResourceFactoryMapper {
private boolean isHTTPResponseWARCResource(MetaData envelope) {
return childFieldEquals(envelope,WARC_HEADER_METADATA,
WARCConstants.CONTENT_TYPE,
- WARCConstants.HTTP_RESPONSE_MIMETYPE);
+ WARCConstants.HTTP_RESPONSE_MIMETYPE)
+ || childFieldEquals(envelope,WARC_HEADER_METADATA,
+ WARCConstants.CONTENT_TYPE,
+ WARCConstants.HTTP_RESPONSE_MIMETYPE_NS);
}
private boolean isWARCJSONResource(MetaData envelope) {
return childFieldEquals(envelope,WARC_HEADER_METADATA,
diff --git a/src/main/java/org/archive/format/warc/WARCConstants.java b/src/main/java/org/archive/format/warc/WARCConstants.java
index 93a81f9..4f2fa57 100644
--- a/src/main/java/org/archive/format/warc/WARCConstants.java
+++ b/src/main/java/org/archive/format/warc/WARCConstants.java
@@ -209,7 +209,9 @@ public interface WARCConstants extends ArchiveFileConstants {
"application/http; msgtype=request";
public static final String HTTP_RESPONSE_MIMETYPE =
"application/http; msgtype=response";
-
+ public static final String HTTP_RESPONSE_MIMETYPE_NS =
+ "application/http;msgtype=response"; // wget does this
+
public static final String FTP_CONTROL_CONVERSATION_MIMETYPE =
"text/x-ftp-control-conversation";
Ok, I see. Of course, application/http;msgtype=request should also be covered. The fork commoncrawl/ia-web-commons contains still a couple of changes not merged into iipc/webarchive-commons, mainly the WETExtractor. I'm on course to push the remaining changes. But feel free to pull anything!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
If a HTTP header field is empty (or contains only white space), the HttpHeaderParser does not use the empty value but uses the value from the preceding HTTP header field. See commoncrawl#11 for an example and test data.
This PR fixes the problem and adds a unit test.