Fix issues #42 #43 #44 #45 and #47 #46
…nd WARCMetadataRecordExtractorOutput.java to match new WARC header (eg 'WARC' --> 'WARC/1.0')
needs whitespace around assignment operator
#42 is resolved for me (just a whitespace issue). While looking at this pull request and the WAT specification, I opened #48, which is relevant and would be easy to fix and add to this pull request.
…-Trailing-Slop-Bytes' into 'Entity-Trailing-Slop-Length'
These defaults will put strings like <enter operator name here> in the WAT file if the user doesn't think to edit the commons.properties file (is there documentation anywhere on using the WAT extractor?). One option would be to leave the values for operator, publisher, and wat.warcinfo.description empty and, if they are still empty at WAT writing time, omit those fields from the warcinfo content, since the fields are optional anyway.
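The suggestion above could be sketched roughly as follows. This is only an illustration, not the extractor's actual code: the class name, the method names, and the property keys wat.warcinfo.operator and wat.warcinfo.publisher are assumptions made for the example (only wat.warcinfo.description is named in this thread).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: collects optional warcinfo fields and skips empty ones.
class WarcinfoFields {

    /** Returns only the optional fields whose configured value is non-blank. */
    static Map<String, String> optionalFields(Properties props) {
        // Property key -> warcinfo field name. The first two keys are assumed
        // for illustration; adjust them to whatever commons.properties uses.
        String[][] mapping = {
            {"wat.warcinfo.operator", "operator"},
            {"wat.warcinfo.publisher", "publisher"},
            {"wat.warcinfo.description", "description"},
        };
        Map<String, String> fields = new LinkedHashMap<>();
        for (String[] m : mapping) {
            String value = props.getProperty(m[0], "").trim();
            if (!value.isEmpty()) {
                fields.put(m[1], value); // keep only fields the user filled in
            }
        }
        return fields;
    }

    /** Renders the fields as "name: value" lines terminated by CRLF. */
    static String render(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append("\r\n");
        }
        return sb.toString();
    }
}
```

With empty defaults, a user who never edits the properties file would simply get a warcinfo record without these optional fields, rather than placeholder text.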
For users who don't change the commons.properties file, won't they have the literal strings <enter operator name here>, <enter publisher name here>, and <enter warc info description here> in their WAT files? If there is no documentation for using the WAT extractor (please point me to it if it exists), people may not change the commons.properties file, so the default values here could be an issue. Perhaps others could voice opinions on this.
My apologies, I think I forgot to push my commit, and I can't access my desktop right now. I'll do it tomorrow morning.
There are a few places where comments need spaces for consistency, but the changes to commons.properties work for me. The WAT files now appear to conform to the specification.
Hi Lauren, I've added spaces in comments as you suggested.
These changes look good to me. Thanks.
Could you add a line in here about the change you made renaming Entity-Trailing-Slop-Bytes to Entity-Trailing-Slop-Length (#48)? Thank you!
Hey Lam,
I was generating some WATs with this code yesterday, and noticed that the hostname written into the warcinfo record differs from what Heritrix writes into WARCs. Heritrix uses getCanonicalHostName() to get the FQDN:
https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java#L777
So in the WATs I generated I get:
hostname: somehostname
In other WARCs generated by Heritrix I get:
hostname: somehostname.library.unt.edu
That seems like an undesirable discrepancy in behavior. @scheylord @kngenie Could either of you weigh in on this?
Hi,
I'm sorry for answering so late.
I've just corrected the method (getHostName -> getCanonicalHostName). Could you please check?
@scheylord I'm keen to get this merged in - can you take a look at @ldko's feedback about
Hi,
@scheylord I ran the WAT extractor code on a different server than usual today and it threw a java.net.UnknownHostException on the getLocalHost() here. I was able to avoid getting the exception by modifying the server's /etc/hosts file, but we may want to catch that exception similar to the Heritrix code again. Beyond that, the change you made to getCanonicalHostName() worked for me.
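A minimal sketch of the kind of handling described above, assuming a fallback value is acceptable when the local host cannot be resolved. The class name, method name, and fallback string are hypothetical; this is not the code from Heritrix or from this pull request.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper illustrating the exception handling discussed above.
class HostnameResolver {

    /**
     * Returns the fully qualified name of the local host, or the given
     * fallback if the local host name cannot be resolved.
     */
    static String resolveHostname(String fallback) {
        try {
            // getCanonicalHostName() asks for the FQDN; it returns the
            // textual IP address if no name can be determined.
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (UnknownHostException e) {
            // getLocalHost() throws when the machine's own name does not
            // resolve (for example, no matching entry in /etc/hosts).
            return fallback;
        }
    }
}
```

A caller could write `HostnameResolver.resolveHostname("localhost")` when filling in the hostname field of the warcinfo record, so a misconfigured server would no longer abort the extraction.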
Yes, the last release was 1.1.5.
Manually merged due to merge conflict in CHANGES.md
Each commit of this pull request corresponds to a fix for a single issue
Fix issue #42 : WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself
Fix issue #43 : WAT extractor: WARC-Date in all records should be the WAT record generation date
Fix issue #44 : WAT extractor: envelope structure does not conform to the WAT specification
Fix issue #45 : WAT extractor: missing WARC format version
Fix issue #47 : WAT extractor: adding information in WAT's warcinfo
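For #43, a rough sketch of producing a generation timestamp in the UTC, second-precision form that WARC-Date uses (YYYY-MM-DDThh:mm:ssZ). The class and method names are hypothetical and are not taken from the commits in this pull request.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: formats the WAT record generation time for WARC-Date.
class WarcDate {

    /** Formats an instant as a UTC timestamp truncated to whole seconds. */
    static String format(Instant instant) {
        // Instant.toString() emits ISO-8601 in UTC; truncating to seconds
        // drops the fractional part so the result looks like 2015-03-01T12:34:56Z.
        return instant.truncatedTo(ChronoUnit.SECONDS).toString();
    }

    /** Timestamp for a record being written right now. */
    static String now() {
        return format(Instant.now());
    }
}
```

Calling `WarcDate.now()` at the moment each WAT record is written would yield the generation date, rather than a date carried over from the source WARC record.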