Skip to content

Fix issues #42 #43 #44 #45 and #47#46

Closed
scheylord wants to merge 23 commits into
iipc:masterfrom
scheylord:1.1.7-BnF
Closed

Fix issues #42 #43 #44 #45 and #47#46
scheylord wants to merge 23 commits into
iipc:masterfrom
scheylord:1.1.7-BnF

Conversation

@scheylord

Copy link
Copy Markdown

Each commit of this pull request corresponds to a fix for a single issue

Fix issue #42 : WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself
Fix issue #43 : WAT extractor: WARC-Date in all records should be the WAT record generation date
Fix issue #44 : WAT extractor: envelope structure does not conform to the WAT specification
Fix issue #45 : WAT extractor: missing WARC format version
Fix issue #47 : WAT extractor: adding information in WAT's warcinfo

@scheylord scheylord changed the title Fix issue #43 Fix issues #42 #43 #44 #45 Apr 1, 2015
@scheylord scheylord changed the title Fix issues #42 #43 #44 #45 Fix issues #42 #43 #44 #45 and #47 Apr 1, 2015

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs whitespace around assignment operator

@ldko

ldko commented Apr 9, 2015

Copy link
Copy Markdown
Member

#42 is resolved for me (just a whitespace issue)
#43 is resolved for me
#44 still has issue. "Actual-Content-Length" is still showing in the Envelope. It should be moved to Payload-Metadata.
#45 is resolved for me
#47 fix puts institution specific default data into operator, publisher, and description

In looking at this pull request and the specification for the WATs, I opened this #48 that would be relevant and easy to fix and add to this pull request.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These defaults will put strings like <enter operator name here> in the WAT file if the user doesn't think to edit the commons. properties file (is there documentation anywhere on using the WAT extractor?). One option to help this could be to leave the values for operator, publisher, and wat.warcinfo.description empty, and if at WAT writing time they are still empty, don't include the fields in the warcinfo content since the fields are optional anyway.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users who don't change the commons.properties file, won't they have the literal strings of <enter operator name here> , <enter publisher name here>, and <enter warc info description here> in their WAT files? If there is no documentation for using WAT extractor (please point me to the documentation if it exists), people may not change the commons.properties file, so the default values here could be an issue. Perhaps others could voice opinions on this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies, I think I forgot to push my commit, and I can't access my desktop right now, I'll do it tomorrow morning.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be "// optional arguments"

@ldko

ldko commented Apr 24, 2015

Copy link
Copy Markdown
Member

There are a few places where comments need spaces for consistency, but the changes for the commons.properties works for me. The WAT files now look to be to specification.

@scheylord

Copy link
Copy Markdown
Author

Hi Lauren, I've added spaces in comments as you suggested.

@vinaygoel

Copy link
Copy Markdown
Contributor

These changes look good to me. Thanks.

Comment thread CHANGES.md Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a line in here about the change you did for Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length (#48)? Thank you!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnerikhalse johnerikhalse added this to the 1.1.5 Release milestone May 7, 2015

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Lam,
I was generating some WATs with this code yesterday, and noticed that the hostname written into the warcinfo record differs from what Heritrix writes into WARCs. Heritrix uses getCanonicalHostName() to get the FQDN:
https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java#L777

So in the WATs I generated I get:
hostname: somehostname
In other WARCs generated by Heritrix I get:
hostname: somehostname.library.unt.edu

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like an undesirable discrepancy in behavior. @scheylord @kngenie Could either of you weigh in on this?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I'm sorry for answering so late.
I've just corrected the method, can you check please ? (getHostName -> getCanonicalHostName)

@anjackson

Copy link
Copy Markdown
Member

@scheylord I'm keen to get this merged in - can you take a look at @ldko's feedback about hostname:?

@scheylord

Copy link
Copy Markdown
Author

Hi,
Sorry for answering so late. I've just corrected it (getHostName -> getCanonicalHostName), can you check it please ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scheylord I ran the WAT extractor code on a different server than usual today and it threw a java.net.UnknownHostException on the getLocalHost() here. I was able to avoid getting the exception by modifying the server's /etc/hosts file, but we may want to catch that exception similar to the Heritrix code again. Beyond that, the change you made to getCanonicalHostName() worked for me.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread pom.xml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anjackson should this still be 1.1.6?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the last release was 1.1.5.

@RogerMathisen

Copy link
Copy Markdown
Contributor

Manually merged due to merge conflict in CHANGES.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants