Skip to content

Store origin-code in ARCRecord header#52

Merged
johnerikhalse merged 3 commits into
iipc:masterfrom
jrwiebe:master
Apr 26, 2016
Merged

Store origin-code in ARCRecord header#52
johnerikhalse merged 3 commits into
iipc:masterfrom
jrwiebe:master

Conversation

@jrwiebe

@jrwiebe jrwiebe commented Mar 8, 2016

Copy link
Copy Markdown
Contributor

It would be useful for our purposes if the origin-code from the version block of an ARC file were stored and made accessible by a method in ARCRecordMetaData. In my fork this method is called getOrigin().

bodyOffset += getTokenizedHeaderLine(in, secondLineValues);
version = ((String)secondLineValues.get(0) +
"." + (String)secondLineValues.get(1));
origin = (String)secondLineValues.get(2);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to assume there is always an Origin string at position 2? Or do we need to check secondLineValues.size() > 2?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it is safe to assume it is there, but I do know that that field will frequently contain erroneous information.

I just had a look at our own ARCs and they all have InternetArchive set for this "origin". Seems webarchive-commons (and thus Heritrix) ARCWriter just hardcodes that value: https://github.com/jrwiebe/webarchive-commons/blob/master/src/main/java/org/archive/io/arc/ARCWriter.java#L272

Given how prevalent this abuse is by now, I think it is safe to say that this field has zero informational value.

@anjackson

Copy link
Copy Markdown
Member

Looks good to me, although I don't know if the origin field is always present. Maybe @kris-sigur or @johnerikhalse have more experience of ARC files?

@jrwiebe

jrwiebe commented Mar 10, 2016

Copy link
Copy Markdown
Contributor Author

According to the ARC format grammar origin-code is required, but I have no idea if all ARC-generating tools respect this: https://archive.org/web/researcher/ArcFileFormat.php

@kris-sigur

Copy link
Copy Markdown
Member

Backwards compatibility is very important when dealing with ARCs. Historically, there have been all types of "mangled" ARCs as different tools have come and gone. I don't know if that applies to this origin field, but at a minimum the code should gracefully handle its absence so that we do not introduce a regression.

Or to put it another way, any ARC (even technically invalid according to the spec) that could be parsed before this change, should still parse after this change. Which is clearly not the case.

@anjackson

Copy link
Copy Markdown
Member

Having said all that, I think I'd accept this change as long as it checked there was a third element in that list before get()ing it.

@kris-sigur

Copy link
Copy Markdown
Member

@anjackson Yes. That was substantially what I meant. Perhaps also adding a note to the Javadoc for the ORIGIN_FIELD_KEY constant about how this field has been (and still is!) incorrectly hardcoded by this library so the value should be used with care.

@jrwiebe

jrwiebe commented Mar 11, 2016

Copy link
Copy Markdown
Contributor Author

I was about to add a check as per @anjackson's suggestion and discovered this is actually already done by getTokenizedHeaderLine.

@anjackson

Copy link
Copy Markdown
Member

Excellent. So, if you add a 1.1.7 section and a note about this change to the CHANGES.md then I think we're good to go.

@johnerikhalse johnerikhalse merged commit 2e8fe92 into iipc:master Apr 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants