diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 00000000..0dfd3f7f
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,26 @@
+language: java
+
+jdk:
+ - oraclejdk7
+
+before_install:
+ - "git clone https://github.com/iipc/travis.git target/travis"
+
+before_script:
+ - "export JAVA_OPTS=-Xmx1024m"
+ - "export MAVEN_OPTS=-Xmx512m"
+ - "ulimit -u 2048"
+
+script:
+ - "target/travis/deploy-if.sh"
+
+# whitelist the master branch only
+branches:
+ only:
+ - master
+
+env:
+ global:
+ - secure: "qDKjVdoe4Qcz4WfXiQydU7tyl51T62FUJrjqu4FUPBcgeQhFQiggwhpaE6xCOzOpxbsuBi2R1c8gMQf5esE5iDL5jZMu+kz++dYbuzMTd13ttvZWMW5wRPH0H8iHk609FP/RDtVKKBr7WO0JvvIAZEhWNHZrLXBrrKgdTey171g="
+ - secure: "FXGBKJNP9X7ePJfS4eYTZtoFo4RT1sxor34XxncSJr7uV6ggtZb4B4WNd16IlLcDk6E32sx8YoWdltaOGwQ5Vg/kux5Ko/wKZCoccS018Ln1bRT86dD1KoPY34rGoNJVQxe7J/1MPqpBKwmi2XCKfzpsEh3W7bbIqg8w9MEOOZA="
+
diff --git a/CHANGES.md b/CHANGES.md
new file mode 100644
index 00000000..b872846d
--- /dev/null
+++ b/CHANGES.md
@@ -0,0 +1,28 @@
+1.1.5
+-----
+* [Escape redirect URLs in RealCDXExtractorOutput](https://github.com/iipc/webarchive-commons/pull/36)
+* [Tests fail on Windows](https://github.com/iipc/webarchive-commons/issues/2)
+* [Test fails on Java 8](https://github.com/iipc/webarchive-commons/issues/31)
+* [RecordingOutputStream can affect tcp packets sent in an undesirable way](https://github.com/iipc/webarchive-commons/issues/38)
+
+1.1.4
+-----
+* [All dates should be independent of locale settings](https://github.com/iipc/webarchive-commons/pull/22)
+* [Resolved fastutil conflict in dependencies](https://github.com/iipc/webarchive-commons/pull/24)
+
+1.1.3
+-----
+* [Synchronised with IA fork](https://github.com/iipc/webarchive-commons/pull/18)
+* [Updated to more recent Guava APIs](https://github.com/iipc/webarchive-commons/pull/17)
+* [Fixed handling of uncompressed ARC files #13 and #14](https://github.com/iipc/webarchive-commons/pull/14)
+* [Avoid pulling in the logback dependency IA#13](https://github.com/internetarchive/webarchive-commons/pull/13)
+
+1.1.2
+-----
+* Fixed support for reading uncompressed WARCs, along with some unit testing. (https://github.com/iipc/webarchive-commons/pull/12)
+
+1.1.1
+-----
+* Renamed from commons-webarchive to webarchive-commons (https://github.com/iipc/webarchive-commons/pull/8)
+* Cope with malformed GZip extra fields as produced by wget 1.14 (https://github.com/iipc/webarchive-commons/pull/10)
+* Switch to httpcomponents, and add IA deployment information. (https://github.com/iipc/webarchive-commons/pull/11)
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 00000000..37ec93a1
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,191 @@
+Apache License
+Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction, and
+distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by the copyright
+owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all other entities
+that control, are controlled by, or are under common control with that entity.
+For the purposes of this definition, "control" means (i) the power, direct or
+indirect, to cause the direction or management of such entity, whether by
+contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
+outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity exercising
+permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications, including
+but not limited to software source code, documentation source, and configuration
+files.
+
+"Object" form shall mean any form resulting from mechanical transformation or
+translation of a Source form, including but not limited to compiled object code,
+generated documentation, and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or Object form, made
+available under the License, as indicated by a copyright notice that is included
+in or attached to the work (an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object form, that
+is based on (or derived from) the Work and for which the editorial revisions,
+annotations, elaborations, or other modifications represent, as a whole, an
+original work of authorship. For the purposes of this License, Derivative Works
+shall not include works that remain separable from, or merely link (or bind by
+name) to the interfaces of, the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including the original version
+of the Work and any modifications or additions to that Work or Derivative Works
+thereof, that is intentionally submitted to Licensor for inclusion in the Work
+by the copyright owner or by an individual or Legal Entity authorized to submit
+on behalf of the copyright owner. For the purposes of this definition,
+"submitted" means any form of electronic, verbal, or written communication sent
+to the Licensor or its representatives, including but not limited to
+communication on electronic mailing lists, source code control systems, and
+issue tracking systems that are managed by, or on behalf of, the Licensor for
+the purpose of discussing and improving the Work, but excluding communication
+that is conspicuously marked or otherwise designated in writing by the copyright
+owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity on behalf
+of whom a Contribution has been received by Licensor and subsequently
+incorporated within the Work.
+
+2. Grant of Copyright License.
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable copyright license to reproduce, prepare Derivative Works of,
+publicly display, publicly perform, sublicense, and distribute the Work and such
+Derivative Works in Source or Object form.
+
+3. Grant of Patent License.
+
+Subject to the terms and conditions of this License, each Contributor hereby
+grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+irrevocable (except as stated in this section) patent license to make, have
+made, use, offer to sell, sell, import, and otherwise transfer the Work, where
+such license applies only to those patent claims licensable by such Contributor
+that are necessarily infringed by their Contribution(s) alone or by combination
+of their Contribution(s) with the Work to which such Contribution(s) was
+submitted. If You institute patent litigation against any entity (including a
+cross-claim or counterclaim in a lawsuit) alleging that the Work or a
+Contribution incorporated within the Work constitutes direct or contributory
+patent infringement, then any patent licenses granted to You under this License
+for that Work shall terminate as of the date such litigation is filed.
+
+4. Redistribution.
+
+You may reproduce and distribute copies of the Work or Derivative Works thereof
+in any medium, with or without modifications, and in Source or Object form,
+provided that You meet the following conditions:
+
+You must give any other recipients of the Work or Derivative Works a copy of
+this License; and
+You must cause any modified files to carry prominent notices stating that You
+changed the files; and
+You must retain, in the Source form of any Derivative Works that You distribute,
+all copyright, patent, trademark, and attribution notices from the Source form
+of the Work, excluding those notices that do not pertain to any part of the
+Derivative Works; and
+If the Work includes a "NOTICE" text file as part of its distribution, then any
+Derivative Works that You distribute must include a readable copy of the
+attribution notices contained within such NOTICE file, excluding those notices
+that do not pertain to any part of the Derivative Works, in at least one of the
+following places: within a NOTICE text file distributed as part of the
+Derivative Works; within the Source form or documentation, if provided along
+with the Derivative Works; or, within a display generated by the Derivative
+Works, if and wherever such third-party notices normally appear. The contents of
+the NOTICE file are for informational purposes only and do not modify the
+License. You may add Your own attribution notices within Derivative Works that
+You distribute, alongside or as an addendum to the NOTICE text from the Work,
+provided that such additional attribution notices cannot be construed as
+modifying the License.
+You may add Your own copyright statement to Your modifications and may provide
+additional or different license terms and conditions for use, reproduction, or
+distribution of Your modifications, or for any such Derivative Works as a whole,
+provided Your use, reproduction, and distribution of the Work otherwise complies
+with the conditions stated in this License.
+
+5. Submission of Contributions.
+
+Unless You explicitly state otherwise, any Contribution intentionally submitted
+for inclusion in the Work by You to the Licensor shall be under the terms and
+conditions of this License, without any additional terms or conditions.
+Notwithstanding the above, nothing herein shall supersede or modify the terms of
+any separate license agreement you may have executed with Licensor regarding
+such Contributions.
+
+6. Trademarks.
+
+This License does not grant permission to use the trade names, trademarks,
+service marks, or product names of the Licensor, except as required for
+reasonable and customary use in describing the origin of the Work and
+reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty.
+
+Unless required by applicable law or agreed to in writing, Licensor provides the
+Work (and each Contributor provides its Contributions) on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
+including, without limitation, any warranties or conditions of TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
+solely responsible for determining the appropriateness of using or
+redistributing the Work and assume any risks associated with Your exercise of
+permissions under this License.
+
+8. Limitation of Liability.
+
+In no event and under no legal theory, whether in tort (including negligence),
+contract, or otherwise, unless required by applicable law (such as deliberate
+and grossly negligent acts) or agreed to in writing, shall any Contributor be
+liable to You for damages, including any direct, indirect, special, incidental,
+or consequential damages of any character arising as a result of this License or
+out of the use or inability to use the Work (including but not limited to
+damages for loss of goodwill, work stoppage, computer failure or malfunction, or
+any and all other commercial damages or losses), even if such Contributor has
+been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability.
+
+While redistributing the Work or Derivative Works thereof, You may choose to
+offer, and charge a fee for, acceptance of support, warranty, indemnity, or
+other liability obligations and/or rights consistent with this License. However,
+in accepting such obligations, You may act only on Your own behalf and on Your
+sole responsibility, not on behalf of any other Contributor, and only if You
+agree to indemnify, defend, and hold each Contributor harmless for any liability
+incurred by, or claims asserted against, such Contributor by reason of your
+accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+APPENDIX: How to apply the Apache License to your work
+
+To apply the Apache License to your work, attach the following boilerplate
+notice, with the fields enclosed by brackets "[]" replaced with your own
+identifying information. (Don't include the brackets!) The text should be
+enclosed in the appropriate comment syntax for the file format. We also
+recommend that a file or class name and description of purpose be included on
+the same "printed page" as the copyright notice for easier identification within
+third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/README.md b/README.md
new file mode 100644
index 00000000..72858a52
--- /dev/null
+++ b/README.md
@@ -0,0 +1,8 @@
+IIPC Web Archive Commons
+========================
+
+[](https://travis-ci.org/iipc/webarchive-commons/)
+
+This repository contains common utility code for [OpenWayback][1] and other projects.
+
+[1]: https://github.com/iipc/openwayback
diff --git a/pom.xml b/pom.xml
index d2004a27..222a4c78 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1,19 +1,63 @@
-
+4.0.0
- org.archive
- ia-web-commons
- 1.1.1-SNAPSHOT
+
+ org.sonatype.oss
+ oss-parent
+ 7
+
+
+ org.netpreserve.commons
+ webarchive-commons
+ 1.1.5-IAjar
- ia-web-commons
- http://maven.apache.org
+ webarchive-commons
+ https://github.com/iipc/webarchive-commons
+
+
+ The International Internet Preservation Consortium
+ http://netpreserve.org/
+
+
+
+ The Apache Software License, Version 2.0
+ http://www.apache.org/licenses/LICENSE-2.0.txt
+ repo
+
+
+
+
+ many-devs
+ Many Other Developers Preceded Me
+ many@dev.org
+
+
+ anjackson
+ Andrew Jackson
+ Andrew.Jackson@bl.uk
+
+
+
+ GitHub Issues
+ https://github.com/iipc/webarchive-commons/issues
+
+
+ scm:git:git@github.com:iipc/webarchive-commons.git
+ scm:git:git@github.com:iipc/webarchive-commons.git
+ git@github.com:iipc/webarchive-commons.git
+ UTF-8${maven.build.timestamp}yyyyMMddhhmmss
+
+
+ sonatype-nexus-staging
+ https://oss.sonatype.org/service/local/staging/deploy/maven2/
+ sonatype-nexus-snapshots
+ https://oss.sonatype.org/content/repositories/snapshots/
@@ -21,13 +65,12 @@
junitjunit3.8.1
- testcom.google.guavaguava
- 14.0.1
+ 17.0
@@ -42,7 +85,7 @@
- org.mozilla
+ com.googlecode.juniversalchardetjuniversalchardet1.0.3
@@ -86,6 +129,10 @@
tomcatjasper-compiler
+
+ hsqldb
+ hsqldb
+
@@ -115,9 +162,15 @@
it.unimi.dsi
- mg4j
- 1.0.1
+ dsiutils
+ 2.0.12compile
+
+
+ ch.qos.logback
+ logback-classic
+
+ org.apache.httpcomponents
@@ -129,12 +182,6 @@
joda-time1.6
-
- fastutil
- fastutil
- 5.0.7
- compile
-
@@ -155,7 +202,7 @@
jar-with-dependencies
- ia-web-commons
+ webarchive-commons
@@ -176,24 +223,6 @@
-
- internetarchive
- Internet Archive Maven Repository
- http://builds.archive.org:8080/maven2
- default
-
-
- true
- daily
- warn
-
-
- true
- daily
- warn
-
-
-
clouderaCloudera Hadoop
@@ -216,10 +245,13 @@
- repository
-
+ ${repository.id}${repository.url}
+
+ ${snapshotRepository.id}
+ ${snapshotRepository.url}
+
diff --git a/src/main/java/org/archive/extract/DumpingExtractorOutput.java b/src/main/java/org/archive/extract/DumpingExtractorOutput.java
index a4151076..69591931 100644
--- a/src/main/java/org/archive/extract/DumpingExtractorOutput.java
+++ b/src/main/java/org/archive/extract/DumpingExtractorOutput.java
@@ -9,8 +9,8 @@
import org.archive.util.StreamCopy;
import org.json.JSONException;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingOutputStream;
-import com.google.common.io.NullOutputStream;
public class DumpingExtractorOutput implements ExtractorOutput {
private static final Logger LOG =
@@ -22,7 +22,7 @@ public DumpingExtractorOutput(OutputStream out) {
}
public void output(Resource resource) throws IOException {
- NullOutputStream nullo = new NullOutputStream();
+ OutputStream nullo = ByteStreams.nullOutputStream();
CountingOutputStream co = new CountingOutputStream(nullo);
StreamCopy.copy(resource.getInputStream(), co);
long bytes = co.getCount();
diff --git a/src/main/java/org/archive/extract/RealCDXExtractorOutput.java b/src/main/java/org/archive/extract/RealCDXExtractorOutput.java
index 306f67a3..8ca3ff82 100644
--- a/src/main/java/org/archive/extract/RealCDXExtractorOutput.java
+++ b/src/main/java/org/archive/extract/RealCDXExtractorOutput.java
@@ -1,8 +1,10 @@
package org.archive.extract;
import java.io.IOException;
+import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.MalformedURLException;
+import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
@@ -23,8 +25,8 @@
import org.json.JSONException;
import org.json.JSONObject;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingOutputStream;
-import com.google.common.io.NullOutputStream;
public class RealCDXExtractorOutput implements ExtractorOutput {
private static final Logger LOG =
@@ -72,7 +74,7 @@ public RealCDXExtractorOutput(PrintWriter out) {
// SimpleJSONPathSpec gzFooterLengthSpec = new SimpleJSONPathSpec("Container.Gzip-Metadata.Footer-Length");
// SimpleJSONPathSpec gzHeaderLengthSpec = new SimpleJSONPathSpec("Container.Gzip-Metadata.Header-Length");
public void output(Resource resource) throws IOException {
- NullOutputStream nullo = new NullOutputStream();
+ OutputStream nullo = ByteStreams.nullOutputStream();
CountingOutputStream co = new CountingOutputStream(nullo);
try {
StreamCopy.copy(resource.getInputStream(), co);
@@ -306,12 +308,14 @@ private String extractHTMLMetaRefresh(String origUrl, MetaData m) {
return "-";
}
- private String resolve(String context, String spec) {
+ static String resolve(String context, String spec) {
// TODO: test!
try {
URL cUrl = new URL(context);
- URL resolved = new URL(cUrl,spec);
- return resolved.toURI().toASCIIString();
+ URL url = new URL(cUrl, spec);
+ // this constructor escapes its arguments, if necessary
+ URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), url.getRef());
+ return uri.toASCIIString();
} catch (URISyntaxException e) {
} catch (MalformedURLException e) {
diff --git a/src/main/java/org/archive/extract/WARCMetadataRecordExtractorOutput.java b/src/main/java/org/archive/extract/WARCMetadataRecordExtractorOutput.java
index 0d564a6f..ff46a914 100644
--- a/src/main/java/org/archive/extract/WARCMetadataRecordExtractorOutput.java
+++ b/src/main/java/org/archive/extract/WARCMetadataRecordExtractorOutput.java
@@ -1,6 +1,7 @@
package org.archive.extract;
import java.io.IOException;
+import java.io.OutputStream;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
@@ -21,8 +22,8 @@
import org.json.JSONException;
import org.json.JSONObject;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingOutputStream;
-import com.google.common.io.NullOutputStream;
public class WARCMetadataRecordExtractorOutput implements ExtractorOutput {
private static final Logger LOG =
@@ -47,7 +48,7 @@ public WARCMetadataRecordExtractorOutput(PrintWriter out) {
}
public void output(Resource resource) throws IOException {
- NullOutputStream nullo = new NullOutputStream();
+ OutputStream nullo = ByteStreams.nullOutputStream();
CountingOutputStream co = new CountingOutputStream(nullo);
try {
StreamCopy.copy(resource.getInputStream(), co);
diff --git a/src/main/java/org/archive/format/gzip/GZIPFExtraRecord.java b/src/main/java/org/archive/format/gzip/GZIPFExtraRecord.java
index a4ed6260..0a9a82e0 100644
--- a/src/main/java/org/archive/format/gzip/GZIPFExtraRecord.java
+++ b/src/main/java/org/archive/format/gzip/GZIPFExtraRecord.java
@@ -98,12 +98,17 @@ public void writeTo(OutputStream os) throws IOException {
os.write(value);
}
}
- public int read(InputStream is) throws IOException {
+ public int read(InputStream is, int maxRead) throws IOException {
byte tmpName[] = null;
byte tmpVal[] = null;
int valLen = 0;
tmpName = ByteOp.readNBytes(is, GZIP_FEXTRA_NAME_BYTES);
valLen = ByteOp.readShort(is);
+ if (valLen > (maxRead - BYTES_IN_SHORT - GZIP_FEXTRA_NAME_BYTES)) {
+ /* read in what's left, but throw an exception */
+ tmpVal = ByteOp.readNBytes(is, maxRead - BYTES_IN_SHORT - GZIP_FEXTRA_NAME_BYTES);
+ throw new GZIPFormatException.GZIPExtraFieldShortException(maxRead);
+ }
if(valLen > 0) {
tmpVal = ByteOp.readNBytes(is, valLen);
}
diff --git a/src/main/java/org/archive/format/gzip/GZIPFExtraRecords.java b/src/main/java/org/archive/format/gzip/GZIPFExtraRecords.java
index 7dc0de44..e5920552 100755
--- a/src/main/java/org/archive/format/gzip/GZIPFExtraRecords.java
+++ b/src/main/java/org/archive/format/gzip/GZIPFExtraRecords.java
@@ -53,12 +53,17 @@ public void readRecords(InputStream is)
ArrayList tmpList = new ArrayList();
while(bytesRemaining > 0) {
GZIPFExtraRecord tmpRecord = new GZIPFExtraRecord();
- int bytesRead = tmpRecord.read(is);
- bytesRemaining -= bytesRead;
+ try {
+ int bytesRead = tmpRecord.read(is, bytesRemaining);
+ bytesRemaining -= bytesRead;
+ tmpList.add(tmpRecord);
+ } catch (GZIPFormatException.GZIPExtraFieldShortException ex) {
+ /* not enough bytes for the extra field; move on */
+ bytesRemaining -= ex.bytesRead;
+ }
if(bytesRemaining < 0) {
throw new GZIPFormatException("Invalid FExtra length/records");
}
- tmpList.add(tmpRecord);
}
this.addAll(tmpList);
}
diff --git a/src/main/java/org/archive/format/gzip/GZIPFormatException.java b/src/main/java/org/archive/format/gzip/GZIPFormatException.java
index ca627a88..3916dafa 100644
--- a/src/main/java/org/archive/format/gzip/GZIPFormatException.java
+++ b/src/main/java/org/archive/format/gzip/GZIPFormatException.java
@@ -21,4 +21,11 @@ public GZIPFormatException(Exception e) {
public GZIPFormatException(String message, IOException e) {
super(message,e);
}
+ public static class GZIPExtraFieldShortException extends GZIPFormatException {
+ int bytesRead;
+ public GZIPExtraFieldShortException(int bytesRead) {
+ super("Extra Field short.");
+ this.bytesRead = bytesRead;
+ }
+ }
}
diff --git a/src/main/java/org/archive/format/gzip/zipnum/ZipNumCluster.java b/src/main/java/org/archive/format/gzip/zipnum/ZipNumCluster.java
index bc773a58..a3d34a4b 100644
--- a/src/main/java/org/archive/format/gzip/zipnum/ZipNumCluster.java
+++ b/src/main/java/org/archive/format/gzip/zipnum/ZipNumCluster.java
@@ -21,6 +21,7 @@
import java.util.Date;
import java.util.HashMap;
import java.util.List;
+import java.util.Locale;
import java.util.Map.Entry;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
@@ -102,7 +103,7 @@ public void run() {
public final static String LATEST_TIMESTAMP = "_LATEST";
public final static String OFF = "OFF";
- protected SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+ protected SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH);
protected Date startDate, endDate;
class BlockSize
diff --git a/src/main/java/org/archive/io/RecordingOutputStream.java b/src/main/java/org/archive/io/RecordingOutputStream.java
index fe05701c..7d2ff212 100644
--- a/src/main/java/org/archive/io/RecordingOutputStream.java
+++ b/src/main/java/org/archive/io/RecordingOutputStream.java
@@ -242,6 +242,26 @@ public void write(int b) throws IOException {
checkLimits();
}
+ private int findMessageBodyBeginMark(byte[] b, int off, int len) {
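+ if ((lastTwoBytes[1] == '\n' || (lastTwoBytes[0] == '\n' && lastTwoBytes[1] == '\r'))
+ && len >= 1 && b[off] == '\n') {
+ return off + 1; // absolute index into b, consistent with the loop below
+ } else if (lastTwoBytes[1] == '\n' && len >= 2 && b[off] == '\r' && b[off+1] == '\n') {
+ return off + 2; // absolute index into b, consistent with the loop below
+ }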
+
+ for (int i = off; i < off + len - 1; i++) {
+ if (b[i] == '\n' && b[i+1] == '\n') {
+ return i + 2;
+ } else if (b[i] == '\n' && b[i+1] == '\r'
+ && i + 2 < off + len && b[i+2] == '\n') {
+ return i + 3;
+ }
+ }
+
+ return -1;
+ }
+
public void write(byte[] b, int off, int len) throws IOException {
if(position < maxPosition) {
if(position+len<=maxPosition) {
@@ -255,20 +275,35 @@ public void write(byte[] b, int off, int len) throws IOException {
off += consumeRange;
len -= consumeRange;
}
-
- // see comment on int[] lastTwoBytes
- while (messageBodyBeginMark < 0 && len > 0) {
- write(b[off]);
- off++;
- len--;
+
+ if (messageBodyBeginMark < 0) {
+ // see comment on int[] lastTwoBytes
+ int mark = findMessageBodyBeginMark(b, off, len);
+ if (mark > 0) {
+ if(recording) {
+ record(b, off, mark - off);
+ }
+ if (this.out != null) {
+ this.out.write(b, off, mark - off);
+ }
+ markMessageBodyBegin();
+ len = len - (mark - off);
+ off = mark;
+ }
}
-
+
if(recording) {
record(b, off, len);
}
if (this.out != null) {
this.out.write(b, off, len);
}
+ if (len >= 1) {
+ lastTwoBytes[1] = b[off + len - 1];
+ if (len >= 2) {
+ lastTwoBytes[0] = b[off + len - 2];
+ }
+ }
checkLimits();
}
diff --git a/src/main/java/org/archive/io/arc/ARCReaderFactory.java b/src/main/java/org/archive/io/arc/ARCReaderFactory.java
index e7dc1625..44437ed7 100644
--- a/src/main/java/org/archive/io/arc/ARCReaderFactory.java
+++ b/src/main/java/org/archive/io/arc/ARCReaderFactory.java
@@ -147,11 +147,11 @@ protected ArchiveReader getArchiveReader(final String arc,
possiblyWrapped.mark(100);
boolean compressed = testCompressedARCStream(possiblyWrapped);
possiblyWrapped.reset();
-
+
if (compressed) {
return new CompressedARCReader(arc, possiblyWrapped, atFirstRecord);
} else {
- return new UncompressedARCReader(arc, possiblyWrapped);
+ return new UncompressedARCReader(arc, possiblyWrapped, atFirstRecord);
}
}
@@ -330,10 +330,11 @@ public UncompressedARCReader(final File f, final long offset)
* @param f Uncompressed arc to read.
* @param is InputStream.
*/
- public UncompressedARCReader(final String f, final InputStream is) {
+ public UncompressedARCReader(final String f, final InputStream is, boolean atFirstRecord) {
// Arc file has been tested for existence by time it has come
// to here.
setIn(new CountingInputStream(is));
+ setAlignedOnFirstRecord(atFirstRecord);
initialize(f);
}
}
diff --git a/src/main/java/org/archive/io/warc/WARCReaderFactory.java b/src/main/java/org/archive/io/warc/WARCReaderFactory.java
index 9c6c7e77..c3e5baa0 100644
--- a/src/main/java/org/archive/io/warc/WARCReaderFactory.java
+++ b/src/main/java/org/archive/io/warc/WARCReaderFactory.java
@@ -100,12 +100,20 @@ public static ArchiveReader get(final String s, final InputStream is,
atFirstRecord);
}
+ /*
+ * Note that the ARC companion does this differently, with quite a lot of duplication.
+ *
+ * @see org.archive.io.arc.ARCReaderFactory#getArchiveReader(String, InputStream, boolean)
+ */
protected ArchiveReader getArchiveReader(final String f,
final InputStream is, final boolean atFirstRecord)
throws IOException {
- // For now, assume stream is compressed. Later add test of input
- // stream or handle exception thrown when figure not compressed stream.
- return new CompressedWARCReader(f, is, atFirstRecord);
+ // Check if it's compressed, based on file extension.
+ if( f.endsWith(".gz") ) {
+ return new CompressedWARCReader(f, is, atFirstRecord);
+ } else {
+ return new UncompressedWARCReader(f, is);
+ }
}
public static WARCReader get(final URL arcUrl, final long offset)
diff --git a/src/main/java/org/archive/io/warc/WARCWriter.java b/src/main/java/org/archive/io/warc/WARCWriter.java
index b9558263..e2d28ee9 100644
--- a/src/main/java/org/archive/io/warc/WARCWriter.java
+++ b/src/main/java/org/archive/io/warc/WARCWriter.java
@@ -245,10 +245,11 @@ public void writeRecord(WARCRecordInfo recordInfo)
write(bytes);
totalBytes += bytes.length;
+ // Write out the header/body separator.
+ write(CRLF_BYTES);
+ totalBytes += CRLF_BYTES.length;
+
if (recordInfo.getContentStream() != null && recordInfo.getContentLength() > 0) {
- // Write out the header/body separator.
- write(CRLF_BYTES); // TODO: should this be written even for zero-length?
- totalBytes += CRLF_BYTES.length;
contentBytes += copyFrom(recordInfo.getContentStream(),
recordInfo.getContentLength(),
recordInfo.getEnforceLength());
diff --git a/src/main/java/org/archive/resource/AbstractResource.java b/src/main/java/org/archive/resource/AbstractResource.java
index 409e7408..301c53d4 100755
--- a/src/main/java/org/archive/resource/AbstractResource.java
+++ b/src/main/java/org/archive/resource/AbstractResource.java
@@ -5,7 +5,7 @@
import org.archive.util.StreamCopy;
-import com.google.common.io.NullOutputStream;
+import com.google.common.io.ByteStreams;
public abstract class AbstractResource implements Resource {
protected ResourceContainer container;
@@ -44,7 +44,7 @@ public static void dumpShort(PrintStream out, Resource resource) throws IOExcept
// out.println("Headers Before");
// out.print(m.toString());
- long bytes = StreamCopy.copy(resource.getInputStream(), new NullOutputStream());
+ long bytes = StreamCopy.copy(resource.getInputStream(), ByteStreams.nullOutputStream());
out.println("Resource Was:"+bytes+" Long");
out.println("[\n]Headers After");
diff --git a/src/main/java/org/archive/resource/arc/ARCResource.java b/src/main/java/org/archive/resource/arc/ARCResource.java
index 5d63fd4d..b6e0a1c1 100644
--- a/src/main/java/org/archive/resource/arc/ARCResource.java
+++ b/src/main/java/org/archive/resource/arc/ARCResource.java
@@ -18,8 +18,8 @@
import org.archive.util.io.EOFObserver;
import org.archive.util.io.PushBackOneByteInputStream;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingInputStream;
-import com.google.common.io.LimitInputStream;
public class ARCResource extends AbstractResource
@@ -54,7 +54,7 @@ public ARCResource(MetaData metaData, ResourceContainer container,
fields.putLong(DECLARED_LENGTH_KEY, arcMetaData.getLength());
countingIS = new CountingInputStream(
- new LimitInputStream(raw, arcMetaData.getLength()));
+ ByteStreams.limit(raw, arcMetaData.getLength()));
try {
digIS = new DigestInputStream(countingIS,
diff --git a/src/main/java/org/archive/resource/http/HTTPResponseResource.java b/src/main/java/org/archive/resource/http/HTTPResponseResource.java
index b5d189bc..cc325427 100644
--- a/src/main/java/org/archive/resource/http/HTTPResponseResource.java
+++ b/src/main/java/org/archive/resource/http/HTTPResponseResource.java
@@ -7,7 +7,6 @@
import java.security.NoSuchAlgorithmException;
import java.util.logging.Logger;
-
import org.archive.format.http.HttpHeader;
import org.archive.format.http.HttpResponse;
import org.archive.format.http.HttpResponseMessage;
@@ -20,8 +19,8 @@
import org.archive.util.io.EOFNotifyingInputStream;
import org.archive.util.io.EOFObserver;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingInputStream;
-import com.google.common.io.LimitInputStream;
@@ -65,7 +64,7 @@ public HTTPResponseResource(MetaData metaData,
headers.putString(h.getName(),h.getValue());
}
if(forceCheck && (length >= 0)) {
- LimitInputStream lis = new LimitInputStream(response, length);
+ InputStream lis = ByteStreams.limit(response, length);
countingIS = new CountingInputStream(lis);
} else {
countingIS = new CountingInputStream(response);
diff --git a/src/main/java/org/archive/resource/warc/WARCResource.java b/src/main/java/org/archive/resource/warc/WARCResource.java
index ab9b6900..80929206 100644
--- a/src/main/java/org/archive/resource/warc/WARCResource.java
+++ b/src/main/java/org/archive/resource/warc/WARCResource.java
@@ -19,8 +19,8 @@
import org.archive.util.io.EOFObserver;
import org.archive.util.io.PushBackOneByteInputStream;
+import com.google.common.io.ByteStreams;
import com.google.common.io.CountingInputStream;
-import com.google.common.io.LimitInputStream;
public class WARCResource extends AbstractResource implements EOFObserver, ResourceConstants {
CountingInputStream countingIS;
@@ -51,7 +51,7 @@ public WARCResource(MetaData metaData, ResourceContainer container,
if(length >= 0) {
countingIS = new CountingInputStream(
- new LimitInputStream(response, length));
+ ByteStreams.limit(response, length));
} else {
throw new ResourceParseException(null);
}
diff --git a/src/main/java/org/archive/url/BasicURLCanonicalizer.java b/src/main/java/org/archive/url/BasicURLCanonicalizer.java
index c09ad6e6..5f39ce76 100644
--- a/src/main/java/org/archive/url/BasicURLCanonicalizer.java
+++ b/src/main/java/org/archive/url/BasicURLCanonicalizer.java
@@ -74,15 +74,15 @@ public void canonicalize(HandyURL url) {
url.setPath(escapeOnce(normalizePath(path)));
}
- private static final Pattern SINGLE_FORWARDSLASH_PATTERN = Pattern
- .compile("/");
+ private static final Pattern SINGLE_FORWARDANDBACKSLASH_PATTERN = Pattern
+ .compile("[/\\\\]");
public String normalizePath(String path) {
if (path == null) {
path = "/";
} else {
// -1 gives an empty trailing element if path ends with '/':
- String[] paths = SINGLE_FORWARDSLASH_PATTERN.split(path, -1);
+ String[] paths = SINGLE_FORWARDANDBACKSLASH_PATTERN.split(path, -1);
ArrayList keptPaths = new ArrayList();
boolean first = true;
for (String p : paths) {
diff --git a/src/main/java/org/archive/url/LaxURI.java b/src/main/java/org/archive/url/LaxURI.java
index 807333d3..e1cea9b7 100644
--- a/src/main/java/org/archive/url/LaxURI.java
+++ b/src/main/java/org/archive/url/LaxURI.java
@@ -211,7 +211,7 @@ protected void setURI() {
if (_scheme.length == 4 && Arrays.equals(_scheme, HTTP_SCHEME)) {
_scheme = HTTP_SCHEME;
} else if (_scheme.length == 5
- && Arrays.equals(_scheme, HTTP_SCHEME)) {
+ && Arrays.equals(_scheme, HTTPS_SCHEME)) {
_scheme = HTTPS_SCHEME;
}
}
diff --git a/src/main/java/org/archive/url/URLRegexTransformer.java b/src/main/java/org/archive/url/URLRegexTransformer.java
index 930f5b34..c5505a74 100644
--- a/src/main/java/org/archive/url/URLRegexTransformer.java
+++ b/src/main/java/org/archive/url/URLRegexTransformer.java
@@ -101,7 +101,7 @@ public static String hostToPublicSuffix(String host) {
InternetDomainName idn;
try {
- idn = InternetDomainName.fromLenient(host);
+ idn = InternetDomainName.from(host);
} catch(IllegalArgumentException e) {
return host;
}
@@ -109,7 +109,7 @@ public static String hostToPublicSuffix(String host) {
if(tmp == null) {
return host;
}
- String pubSuff = tmp.name();
+ String pubSuff = tmp.toString();
int idx = host.lastIndexOf(".", host.length() - (pubSuff.length()+2));
if(idx == -1) {
return host;
diff --git a/src/main/java/org/archive/url/UsableURI.java b/src/main/java/org/archive/url/UsableURI.java
index b9c4ff9d..ed40f41a 100644
--- a/src/main/java/org/archive/url/UsableURI.java
+++ b/src/main/java/org/archive/url/UsableURI.java
@@ -18,6 +18,7 @@
*/
package org.archive.url;
+import gnu.inet.encoding.IDNA;
import java.io.File;
import java.io.IOException;
import java.io.ObjectOutputStream;
@@ -271,6 +272,55 @@ public String toString() {
return toCustomString();
}
+ /**
+ * In the case of a Punycode-encoded IDN, this method returns the decoded Unicode version.
+ *
+ * Most of this implementation is copied from {@link org.apache.commons.httpclient.URI#setURI()}.
+ *
+ * @return decoded IDN version of URI
+ */
+ public String toUnicodeHostString() {
+ if (!_is_hostname) {
+ return toString();
+ }
+
+ try {
+ StringBuilder buf = new StringBuilder();
+
+ if (_scheme != null) {
+ buf.append(_scheme);
+ buf.append(':');
+ }
+ if (_is_net_path) {
+ buf.append("//");
+ if (_authority != null) { // has_authority
+ if (_userinfo != null) {
+ buf.append(_userinfo).append('@');
+ }
+ buf.append(IDNA.toUnicode(getHost()));
+ if (_port >= 0) {
+ buf.append(':').append(_port);
+ }
+ }
+ }
+ if (_opaque != null && _is_opaque_part) {
+ buf.append(_opaque);
+ } else if (_path != null) {
+ // _is_hier_part or _is_relativeURI
+ if (_path.length != 0) {
+ buf.append(_path);
+ }
+ }
+ if (_query != null) { // has_query
+ buf.append('?');
+ buf.append(_query);
+ }
+ return buf.toString();
+ } catch (URIException ex) {
+ throw new RuntimeException(ex);
+ }
+ }
+
public synchronized String getEscapedURI() {
if (this.cachedEscapedURI == null) {
this.cachedEscapedURI = super.getEscapedURI();
diff --git a/src/main/java/org/archive/url/UsableURIFactory.java b/src/main/java/org/archive/url/UsableURIFactory.java
index 46b8e119..9118b850 100644
--- a/src/main/java/org/archive/url/UsableURIFactory.java
+++ b/src/main/java/org/archive/url/UsableURIFactory.java
@@ -20,7 +20,7 @@
import gnu.inet.encoding.IDNA;
import gnu.inet.encoding.IDNAException;
-import it.unimi.dsi.mg4j.util.MutableString;
+import it.unimi.dsi.lang.MutableString;
import java.io.UnsupportedEncodingException;
import java.util.BitSet;
diff --git a/src/main/java/org/archive/util/ArchiveUtils.java b/src/main/java/org/archive/util/ArchiveUtils.java
index c41c0bc0..e4224384 100644
--- a/src/main/java/org/archive/util/ArchiveUtils.java
+++ b/src/main/java/org/archive/util/ArchiveUtils.java
@@ -104,7 +104,7 @@ public class ArchiveUtils {
private static ThreadLocal threadLocalDateFormat(final String pattern) {
ThreadLocal tl = new ThreadLocal() {
protected SimpleDateFormat initialValue() {
- SimpleDateFormat df = new SimpleDateFormat(pattern);
+ SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.ENGLISH);
df.setTimeZone(TimeZone.getTimeZone("GMT"));
return df;
}
@@ -393,9 +393,9 @@ public static Date getDate(String d) throws ParseException {
}
final static SimpleDateFormat dateToTimestampFormats[] =
- {new SimpleDateFormat("MM/dd/yyyy"),
- new SimpleDateFormat("MM/yyyy"),
- new SimpleDateFormat("yyyy")};
+ {new SimpleDateFormat("MM/dd/yyyy", Locale.ENGLISH),
+ new SimpleDateFormat("MM/yyyy", Locale.ENGLISH),
+ new SimpleDateFormat("yyyy", Locale.ENGLISH)};
/**
* Convert a user-entered date into a timestamp
diff --git a/src/main/java/org/archive/util/DateUtils.java b/src/main/java/org/archive/util/DateUtils.java
index e7fe78b7..d01b63ce 100755
--- a/src/main/java/org/archive/util/DateUtils.java
+++ b/src/main/java/org/archive/util/DateUtils.java
@@ -65,7 +65,7 @@ public class DateUtils {
private static ThreadLocal threadLocalDateFormat(final String pattern) {
ThreadLocal tl = new ThreadLocal() {
protected SimpleDateFormat initialValue() {
- SimpleDateFormat df = new SimpleDateFormat(pattern);
+ SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.ENGLISH);
df.setTimeZone(TimeZone.getTimeZone("GMT"));
return df;
}
diff --git a/src/main/java/org/archive/util/TextUtils.java b/src/main/java/org/archive/util/TextUtils.java
index 707f93c7..9061a161 100644
--- a/src/main/java/org/archive/util/TextUtils.java
+++ b/src/main/java/org/archive/util/TextUtils.java
@@ -36,8 +36,9 @@
import org.apache.commons.lang.StringEscapeUtils;
-import com.google.common.base.Function;
-import com.google.common.collect.MapMaker;
+import com.google.common.cache.CacheBuilder;
+import com.google.common.cache.CacheLoader;
+import com.google.common.cache.LoadingCache;
public class TextUtils {
private static final String FIRSTWORD = "^([^\\s]*).*$";
@@ -51,11 +52,11 @@ protected Map initialValue() {
};
/** global soft-cache of Patterns, by string key */
- private static final ConcurrentMap PATTERNS = new MapMaker()
+ private static final LoadingCache PATTERNS = CacheBuilder.newBuilder()
.concurrencyLevel(16)
.softValues()
- .makeComputingMap(new Function() {
- public Pattern apply(String regex) {
+ .build(new CacheLoader() {
+ public Pattern load(String regex) {
return Pattern.compile(regex);
}
});
@@ -84,7 +85,7 @@ public static Matcher getMatcher(String pattern, CharSequence input) {
final Map matchers = TL_MATCHER_MAP.get();
Matcher m = (Matcher)matchers.get(pattern);
if(m == null) {
- m = PATTERNS.get(pattern).matcher(input);
+ m = PATTERNS.getUnchecked(pattern).matcher(input);
} else {
matchers.put(pattern,null);
m.reset(input);
diff --git a/src/main/java/org/archive/util/TmpDirTestCase.java b/src/main/java/org/archive/util/TmpDirTestCase.java
new file mode 100644
index 00000000..09ec345b
--- /dev/null
+++ b/src/main/java/org/archive/util/TmpDirTestCase.java
@@ -0,0 +1,119 @@
+/*
+ * This file is part of the Heritrix web crawler (crawler.archive.org).
+ *
+ * Licensed to the Internet Archive (IA) by one or more individual
+ * contributors.
+ *
+ * The IA licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.archive.util;
+
+import java.io.File;
+import java.io.IOException;
+
+import junit.framework.TestCase;
+
+
+/**
+ * Base class for TestCases that want access to a tmp dir for the writing
+ * of files.
+ *
+ * @author stack
+ */
+public abstract class TmpDirTestCase extends TestCase
+{
+ /**
+ * Name of the system property that holds pointer to tmp directory into
+ * which we can safely write files.
+ */
+ public static final String TEST_TMP_SYSTEM_PROPERTY_NAME = "testtmpdir";
+
+ /**
+ * Default test tmp directory.
+ */
+ public static final String DEFAULT_TEST_TMP_DIR = File.separator + "tmp" +
+ File.separator + "heritrix-junit-tests";
+
+ /**
+ * Directory to write temporary files to.
+ */
+ private File tmpDir = null;
+
+
+ public TmpDirTestCase()
+ {
+ super();
+ }
+
+ public TmpDirTestCase(String testName)
+ {
+ super(testName);
+ }
+
+ /*
+ * @see TestCase#setUp()
+ */
+ protected void setUp() throws Exception {
+ super.setUp();
+ this.tmpDir = tmpDir();
+ }
+
+ /**
+ * @return Returns the tmpDir.
+ */
+ public File getTmpDir()
+ {
+ return this.tmpDir;
+ }
+
+ /**
+ * Delete any files left over from previous run.
+ *
+ * @param basename Base name of files we're to clean up.
+ */
+ public void cleanUpOldFiles(String basename) {
+ cleanUpOldFiles(getTmpDir(), basename);
+ }
+
+ /**
+ * Delete any files left over from previous run.
+ *
+ * @param basedir Directory to start cleaning in.
+ * @param prefix Prefix of the files we're to clean up.
+ */
+ public void cleanUpOldFiles(File basedir, String prefix) {
+ File [] files = FileUtils.getFilesWithPrefix(basedir, prefix);
+ if (files != null) {
+ for (int i = 0; i < files.length; i++) {
+ org.apache.commons.io.FileUtils.deleteQuietly(files[i]);
+ }
+ }
+ }
+
+
+ public static File tmpDir() throws IOException {
+ String tmpDirStr = System.getProperty(TEST_TMP_SYSTEM_PROPERTY_NAME);
+ tmpDirStr = (tmpDirStr == null)? DEFAULT_TEST_TMP_DIR: tmpDirStr;
+ File tmpDir = new File(tmpDirStr);
+ FileUtils.ensureWriteableDirectory(tmpDir);
+
+ if (!tmpDir.canWrite())
+ {
+ throw new IOException(tmpDir.getAbsolutePath() +
+ " is unwriteable.");
+ }
+
+ return tmpDir;
+ }
+}
diff --git a/src/main/java/org/archive/util/binsearch/impl/HDFSSeekableLineReader.java b/src/main/java/org/archive/util/binsearch/impl/HDFSSeekableLineReader.java
index 621c6bce..93757a45 100644
--- a/src/main/java/org/archive/util/binsearch/impl/HDFSSeekableLineReader.java
+++ b/src/main/java/org/archive/util/binsearch/impl/HDFSSeekableLineReader.java
@@ -6,7 +6,7 @@
import org.apache.hadoop.fs.FSDataInputStream;
import org.archive.util.binsearch.AbstractSeekableLineReader;
-import com.google.common.io.LimitInputStream;
+import com.google.common.io.ByteStreams;
public class HDFSSeekableLineReader extends AbstractSeekableLineReader {
private FSDataInputStream fsdis;
@@ -23,7 +23,7 @@ public InputStream doSeekLoad(long offset, int maxLength) throws IOException {
fsdis.seek(offset);
if (maxLength >= 0) {
- return new LimitInputStream(fsdis, maxLength);
+ return ByteStreams.limit(fsdis, maxLength);
} else {
return fsdis;
}
diff --git a/src/main/java/org/archive/util/binsearch/impl/RandomAccessFileSeekableLineReader.java b/src/main/java/org/archive/util/binsearch/impl/RandomAccessFileSeekableLineReader.java
index b211db16..5131dd06 100644
--- a/src/main/java/org/archive/util/binsearch/impl/RandomAccessFileSeekableLineReader.java
+++ b/src/main/java/org/archive/util/binsearch/impl/RandomAccessFileSeekableLineReader.java
@@ -7,7 +7,7 @@
import org.archive.util.binsearch.AbstractSeekableLineReader;
-import com.google.common.io.LimitInputStream;
+import com.google.common.io.ByteStreams;
public class RandomAccessFileSeekableLineReader extends AbstractSeekableLineReader {
@@ -24,7 +24,7 @@ public InputStream doSeekLoad(long offset, int maxLength) throws IOException {
FileInputStream fis = new FileInputStream(raf.getFD());
if (maxLength > 0) {
- return new LimitInputStream(fis, maxLength);
+ return ByteStreams.limit(fis, maxLength);
} else {
return fis;
}
diff --git a/src/main/java/org/archive/util/binsearch/impl/http/ApacheHttp31SLRFactory.java b/src/main/java/org/archive/util/binsearch/impl/http/ApacheHttp31SLRFactory.java
index 9bd7542b..bc5b83f4 100644
--- a/src/main/java/org/archive/util/binsearch/impl/http/ApacheHttp31SLRFactory.java
+++ b/src/main/java/org/archive/util/binsearch/impl/http/ApacheHttp31SLRFactory.java
@@ -3,6 +3,7 @@
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
+import java.util.Locale;
import java.util.logging.Logger;
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
@@ -156,7 +157,7 @@ public boolean isStaleChecking()
public long getModTime()
{
HTTPSeekableLineReader reader = null;
- SimpleDateFormat lastModFormat = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz");
+ SimpleDateFormat lastModFormat = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.ENGLISH);
try {
reader = get();
diff --git a/src/test/java/org/archive/extract/RealCDXExtractorOutputTest.java b/src/test/java/org/archive/extract/RealCDXExtractorOutputTest.java
new file mode 100644
index 00000000..14f8489d
--- /dev/null
+++ b/src/test/java/org/archive/extract/RealCDXExtractorOutputTest.java
@@ -0,0 +1,28 @@
+package org.archive.extract;
+
+import java.net.MalformedURLException;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.net.URL;
+import java.net.URLEncoder;
+
+import junit.framework.TestCase;
+
+
+public class RealCDXExtractorOutputTest extends TestCase {
+
+ public void testEscapeResolvedUrl() throws Exception {
+ String context ="http://www.uni-giessen.de/cms/studium/dateien/informationberatung/merkblattpdf";
+ String spec = "http://fss.plone.uni-giessen.de/fß/studium/dateien/informationberatung/merkblattpdf/file/Mérkblatt zur Gestaltung von Nachteilsausgleichen.pdf?föo=bar#änchor";
+ String escaped = RealCDXExtractorOutput.resolve(context, spec);
+ assertTrue(escaped.indexOf(" ") < 0);
+ URI parsed = new URI(escaped);
+ assertEquals("änchor", parsed.getFragment());
+ }
+
+ public void testNoDoubleEscaping() throws Exception {
+ String spec = "https://www.google.com/search?q=java+escape+url+spaces&ie=utf-8&oe=utf-8";
+ String resolved = RealCDXExtractorOutput.resolve(spec, spec);
+ assertEquals(spec, resolved);
+ }
+}
diff --git a/src/test/java/org/archive/format/gzip/GZIPMemberSeriesTest.java b/src/test/java/org/archive/format/gzip/GZIPMemberSeriesTest.java
index 95c7e96f..2eec46ec 100644
--- a/src/test/java/org/archive/format/gzip/GZIPMemberSeriesTest.java
+++ b/src/test/java/org/archive/format/gzip/GZIPMemberSeriesTest.java
@@ -374,6 +374,9 @@ public void testAutoSkip() throws IOException {
assertNull(m);
assertTrue(s.gotEOF());
}
-
+ public void testWgetProblem() throws IndexOutOfBoundsException, FileNotFoundException, IOException {
+ InputStream is = getClass().getResourceAsStream("IAH-urls-wget.warc.gz");
+ new GZIPDecoder().parseHeader(is);
+ }
}
diff --git a/src/test/java/org/archive/format/gzip/GZIPMemberWriterTest.java b/src/test/java/org/archive/format/gzip/GZIPMemberWriterTest.java
index 5cd75ccf..483d2baf 100644
--- a/src/test/java/org/archive/format/gzip/GZIPMemberWriterTest.java
+++ b/src/test/java/org/archive/format/gzip/GZIPMemberWriterTest.java
@@ -12,8 +12,8 @@
public class GZIPMemberWriterTest extends TestCase {
public void testWrite() throws IOException {
- String outPath = "/tmp/tmp.gz";
- GZIPMemberWriter gzw = new GZIPMemberWriter(new FileOutputStream(new File(outPath)));
+ File outFile = File.createTempFile("tmp", ".gz");
+ GZIPMemberWriter gzw = new GZIPMemberWriter(new FileOutputStream(outFile));
gzw.write(new ByteArrayInputStream("Here is record 1".getBytes(IAUtils.UTF8)));
gzw.write(new ByteArrayInputStream("Here is record 2".getBytes(IAUtils.UTF8)));
}
diff --git a/src/test/java/org/archive/io/RecordingOutputStreamTest.java b/src/test/java/org/archive/io/RecordingOutputStreamTest.java
new file mode 100644
index 00000000..f697ff31
--- /dev/null
+++ b/src/test/java/org/archive/io/RecordingOutputStreamTest.java
@@ -0,0 +1,360 @@
+/*
+ * This file is part of the Heritrix web crawler (crawler.archive.org).
+ *
+ * Licensed to the Internet Archive (IA) by one or more individual
+ * contributors.
+ *
+ * The IA licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.archive.io;
+
+import java.io.ByteArrayOutputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+
+import org.archive.util.Base32;
+import org.archive.util.TmpDirTestCase;
+
+
+/**
+ * Test cases for RecordingOutputStream.
+ *
+ * @author stack
+ */
+public class RecordingOutputStreamTest extends TmpDirTestCase
+{
+ /**
+ * Size of buffer used in tests.
+ */
+ private static final int BUFFER_SIZE = 5;
+
+ /**
+ * Total number of bytes to write to the RecordingOutputStream under test.
+ */
+ private static final int WRITE_TOTAL = 10;
+
+
+ /*
+ * @see TmpDirTestCase#setUp()
+ */
+ protected void setUp() throws Exception
+ {
+ super.setUp();
+ }
+
+ /**
+ * Test reusing instance of RecordingOutputStream.
+ *
+ * @throws IOException Failed open of backing file or opening of
+ * input streams verifying recording.
+ */
+ public void testReuse()
+ throws IOException
+ {
+ final String BASENAME = "testReuse";
+ cleanUpOldFiles(BASENAME);
+ RecordingOutputStream ros = new RecordingOutputStream(BUFFER_SIZE,
+ (new File(getTmpDir(), BASENAME + "Bkg.txt")).getAbsolutePath());
+ for (int i = 0; i < 3; i++)
+ {
+ reuse(BASENAME, ros, i);
+ }
+ }
+
+ private void reuse(String baseName, RecordingOutputStream ros, int index)
+ throws IOException
+ {
+ final String BASENAME = baseName + Integer.toString(index);
+ File f = writeIntRecordedFile(ros, BASENAME, WRITE_TOTAL);
+ verifyRecording(ros, f, WRITE_TOTAL);
+ // Do again to test that I can get a new ReplayInputStream on same
+ // RecordingOutputStream.
+ verifyRecording(ros, f, WRITE_TOTAL);
+ }
+
+ /**
+ * Method to test for void write(int).
+ *
+ * Uses small buffer size and small write size. Test mark and reset too.
+ *
+ * @throws IOException Failed open of backing file or opening of
+ * input streams verifying recording.
+ */
+ public void testWriteint()
+ throws IOException
+ {
+ final String BASENAME = "testWriteint";
+ cleanUpOldFiles(BASENAME);
+ RecordingOutputStream ros = new RecordingOutputStream(BUFFER_SIZE,
+ (new File(getTmpDir(), BASENAME + "Backing.txt")).getAbsolutePath());
+ File f = writeIntRecordedFile(ros, BASENAME, WRITE_TOTAL);
+ verifyRecording(ros, f, WRITE_TOTAL);
+ // Do again to test that I can get a new ReplayInputStream on same
+ // RecordingOutputStream.
+ verifyRecording(ros, f, WRITE_TOTAL);
+ }
+
+ /**
+ * Method to test for void write(byte []).
+ *
+ * Uses small buffer size and small write size.
+ *
+ * @throws IOException Failed open of backing file or opening of
+ * input streams verifying recording.
+ */
+ public void testWritebytearray()
+ throws IOException
+ {
+ final String BASENAME = "testWritebytearray";
+ cleanUpOldFiles(BASENAME);
+ RecordingOutputStream ros = new RecordingOutputStream(BUFFER_SIZE,
+ (new File(getTmpDir(), BASENAME + "Backing.txt")).getAbsolutePath());
+ File f = writeByteRecordedFile(ros, BASENAME, WRITE_TOTAL);
+ verifyRecording(ros, f, WRITE_TOTAL);
+ // Do again to test that I can get a new ReplayInputStream on same
+ // RecordingOutputStream.
+ verifyRecording(ros, f, WRITE_TOTAL);
+ }
+
+ /**
+ * Test mark and reset.
+ * @throws IOException
+ */
+ public void testMarkReset() throws IOException
+ {
+ final String BASENAME = "testMarkReset";
+ cleanUpOldFiles(BASENAME);
+ RecordingOutputStream ros = new RecordingOutputStream(BUFFER_SIZE,
+ (new File(getTmpDir(), BASENAME + "Backing.txt")).getAbsolutePath());
+ File f = writeByteRecordedFile(ros, BASENAME, WRITE_TOTAL);
+ verifyRecording(ros, f, WRITE_TOTAL);
+ ReplayInputStream ris = ros.getReplayInputStream();
+ ris.mark(10 /*Arbitrary value*/);
+ // Read from the stream.
+ ris.read();
+ ris.read();
+ ris.read();
+ // Reset it. It should be back at zero.
+ ris.reset();
+ assertEquals("Reset to zero", 0, ris.read());
+ assertEquals("Reset to zero char 1", 1, ris.read());
+ assertEquals("Reset to zero char 2", 2, ris.read());
+ // Mark the stream here. The next byte read should be 3.
+ ris.mark(10 /* Arbitrary value*/);
+ ris.read();
+ ris.read();
+ ris.reset();
+ assertEquals("Reset to mark at char 3", 3, ris.read());
+ }
+
+ /**
+ * Record a file write.
+ *
+ * Write a file w/ characters that start at null and ascend to
+ * filesize. Record the writing w/ passed ros
+ * recordingoutputstream. Return the file recorded as result of method.
+ * The file output stream that is recorded is named
+ * basename + ".txt".
+ *
+ *
+ * This method writes a character at a time.
+ *
+ * @param ros RecordingOutputStream to record with.
+ * @param basename Basename of file.
+ * @param size How many characters to write.
+ * @return The file that was written while recording.
+ */
+ private File writeIntRecordedFile(RecordingOutputStream ros,
+ String basename, int size)
+ throws IOException
+ {
+ File f = new File(getTmpDir(), basename + ".txt");
+ FileOutputStream fos = new FileOutputStream(f);
+ ros.open(fos);
+ for (int i = 0; i < size; i++)
+ {
+ ros.write(i);
+ }
+ ros.close();
+ fos.close();
+ assertEquals("Content-Length test", size,
+ ros.getResponseContentLength());
+ return f;
+ }
+
+ /**
+ * Record a file byte array write.
+ *
+ * Write a file w/ characters that start at null and ascend to
+ * filesize. Record the writing w/ passed ros
+ * recordingoutputstream. Return the file recorded as result of method.
+ * The file output stream that is recorded is named
+ * basename + ".txt".
+ *
+ *
+ * This method writes using a byte array.
+ *
+ * @param ros RecordingOutputStream to record with.
+ * @param basename Basename of file.
+ * @param size How many characters to write.
+ * @return The file that was written while recording.
+ */
+ private File writeByteRecordedFile(RecordingOutputStream ros,
+ String basename, int size)
+ throws IOException
+ {
+ File f = new File(getTmpDir(), basename + ".txt");
+ FileOutputStream fos = new FileOutputStream(f);
+ ros.open(fos);
+ byte [] b = new byte[size];
+ for (int i = 0; i < size; i++)
+ {
+ b[i] = (byte)i;
+ }
+ ros.write(b);
+ ros.close();
+ fos.close();
+ assertEquals("Content-Length test", size,
+ ros.getResponseContentLength());
+ return f;
+ }
+
+ /**
+ * Verify what was written is both in the file written to and in the
+ * recording stream.
+ *
+ * @param ros Stream to check.
+ * @param f File that was recorded. Stream should have its content
+ * exactly.
+ * @param size Number of bytes written.
+ *
+ * @exception IOException Failure reading streams.
+ */
+ private void verifyRecording(RecordingOutputStream ros, File f,
+ int size) throws IOException
+ {
+ assertEquals("Recorded file size.", size, f.length());
+ FileInputStream fis = new FileInputStream(f);
+ assertNotNull("FileInputStream not null", fis);
+ ReplayInputStream ris = ros.getReplayInputStream();
+ assertNotNull("ReplayInputStream not null", ris);
+ for (int i = 0; i < size; i++)
+ {
+ assertEquals("ReplayInputStream content verification", i,
+ ris.read());
+ assertEquals("Recorded file content verification", i,
+ fis.read());
+ }
+ assertEquals("ReplayInputStream at EOF", -1, ris.read());
+ fis.close();
+ ris.close();
+ }
+
+ public void testMessageBodyBegin() throws IOException {
+ final String BASENAME = "testMessageBodyBegin";
+ cleanUpOldFiles(BASENAME);
+ RecordingOutputStream ros = new RecordingOutputStream(BUFFER_SIZE,
+ (new File(getTmpDir(), BASENAME + "Backing.txt")).getAbsolutePath());
+ ros.setSha1Digest();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n\nabcdefghij".getBytes());
+ assertEquals(12, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\r\n\r\nabcdefghij".getBytes());
+ assertEquals(14, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n\r\nabcdefghij".getBytes());
+ assertEquals(13, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n".getBytes());
+ assertEquals(-1, ros.getMessageBodyBegin());
+ ros.write("\nabcdefghij".getBytes());
+ assertEquals(12, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n".getBytes());
+ assertEquals(-1, ros.getMessageBodyBegin());
+ ros.write("\r\nabcdefghij".getBytes());
+ assertEquals(13, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n\r".getBytes());
+ assertEquals(-1, ros.getMessageBodyBegin());
+ ros.write("\nabcdefghij".getBytes());
+ assertEquals(13, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789".getBytes());
+ ros.write('\n');
+ assertEquals(-1, ros.getMessageBodyBegin());
+ ros.write("\nabcdefghij".getBytes());
+ assertEquals(12, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789".getBytes());
+ ros.write('\n');
+ ros.write('\n');
+ for (int b: "abcdefghij".getBytes()) {
+ ros.write(b);
+ }
+ assertEquals(12, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789".getBytes());
+ ros.write('\n');
+ ros.write('\r');
+ ros.write('\n');
+ for (int b: "abcdefghij".getBytes()) {
+ ros.write(b);
+ }
+ assertEquals(13, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n".getBytes());
+ ros.write('\n');
+ ros.write("abcdefghij".getBytes());
+ assertEquals(12, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+
+ ros.open(new ByteArrayOutputStream());
+ ros.write("0123456789\n\r".getBytes());
+ ros.write('\n');
+ ros.write("abcdefghij".getBytes());
+ assertEquals(13, ros.getMessageBodyBegin());
+ assertEquals("22GBTIFDIW36VN4NLYI6TEOAE3WGBW3D", Base32.encode(ros.getDigestValue()));
+ ros.close();
+ }
+}
diff --git a/src/test/java/org/archive/io/arc/ARCReaderFactoryTest.java b/src/test/java/org/archive/io/arc/ARCReaderFactoryTest.java
new file mode 100644
index 00000000..0721f795
--- /dev/null
+++ b/src/test/java/org/archive/io/arc/ARCReaderFactoryTest.java
@@ -0,0 +1,57 @@
+package org.archive.io.arc;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileNotFoundException;
+import java.io.InputStream;
+import java.io.RandomAccessFile;
+
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveRecord;
+
+import junit.framework.TestCase;
+
+/**
+ *
+ * Based on https://github.com/iipc/openwayback/pull/104/files
+ *
+ * @author csr@statsbiblioteket.dk (Colin Rosenthal)
+ *
+ */
+public class ARCReaderFactoryTest extends TestCase {
+
+ private File testfile1 = new File("src/test/resources/org/archive/format/arc/IAH-20080430204825-00000-blackbook-truncated.arc");
+
+ /**
+ * Test reading uncompressed arcfile for issue
+ * https://github.com/iipc/openwayback/issues/101
+ * @throws Exception
+ */
+ public void testGetResource() throws Exception {
+ this.offsetResourceTest(testfile1, 1515, "http://www.archive.org/robots.txt" );
+ this.offsetResourceTest(testfile1, 36420, "http://www.archive.org/services/collection-rss.php" );
+ }
+
+ private void offsetResourceTest( File testfile, long offset, String uri ) throws Exception {
+ RandomAccessFile raf = new RandomAccessFile(testfile, "r");
+ raf.seek(offset);
+ InputStream is = new FileInputStream(raf.getFD());
+ String fPath = testfile.getAbsolutePath();
+ ArchiveReader reader = ARCReaderFactory.get(fPath, is, false);
+ // This one works:
+ //ArchiveReader reader = ARCReaderFactory.get(testfile, offset);
+ ArchiveRecord record = reader.get();
+
+ final String url = record.getHeader().getUrl();
+ assertEquals("URL of record is not as expected.", uri, url);
+
+ final long position = record.getPosition();
+ final long recordLength = record.getHeader().getLength();
+ assertTrue("Position " + position + " is after end of record " + recordLength, position <= recordLength);
+
+ // Clean up:
+ if( raf != null )
+ raf.close();
+ }
+
+}
diff --git a/src/test/java/org/archive/io/warc/WARCReaderFactoryTest.java b/src/test/java/org/archive/io/warc/WARCReaderFactoryTest.java
new file mode 100644
index 00000000..25028797
--- /dev/null
+++ b/src/test/java/org/archive/io/warc/WARCReaderFactoryTest.java
@@ -0,0 +1,34 @@
+package org.archive.io.warc;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+
+import org.archive.format.warc.WARCConstants;
+import org.archive.format.warc.WARCConstants.WARCRecordType;
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveRecord;
+
+import junit.framework.TestCase;
+
+public class WARCReaderFactoryTest extends TestCase {
+
+ // Test files:
+ String[] files = new String[] {
+ "src/test/resources/org/archive/format/gzip/IAH-urls-wget.warc.gz",
+ "src/test/resources/org/archive/format/warc/IAH-urls-wget.warc"
+ };
+
+ public void testGetStringInputstreamBoolean() throws IOException {
+ // Check the test files can be opened:
+ for( String file : files ) {
+ FileInputStream is = new FileInputStream(file);
+ ArchiveReader ar = WARCReaderFactory.get(file, is, true);
+ ArchiveRecord r = ar.get();
+ String type = (String) r.getHeader().getHeaderValue(WARCConstants.HEADER_KEY_TYPE);
+ // Check the first record comes out as a 'warcinfo' record.
+ assertEquals(WARCRecordType.warcinfo.name(), type);
+ ar.close(); // close the reader before moving on to the next file
+ }
+ }
+
+}
diff --git a/src/test/java/org/archive/net/PublicSuffixesTest.java b/src/test/java/org/archive/net/PublicSuffixesTest.java
index b88acb6d..ca6e6408 100644
--- a/src/test/java/org/archive/net/PublicSuffixesTest.java
+++ b/src/test/java/org/archive/net/PublicSuffixesTest.java
@@ -36,6 +36,7 @@
*/
public class PublicSuffixesTest extends TestCase {
// test of low level implementation
+ private final String NL = System.getProperty("line.separator");
public void testCompare() {
Node n = new Node("hoge");
@@ -78,27 +79,26 @@ public void testTrie1() {
Node alt = new Node(null, new ArrayList());
alt.addBranch("ac,");
// specifically, should not have empty string as match.
- assertEquals("(null)\n" +
- " \"ac,\"\n", dump(alt));
+ assertEquals("(null)" + NL + " \"ac,\"" + NL, dump(alt));
alt.addBranch("ac,com,");
- assertEquals("(null)\n" +
- " \"ac,\"\n" +
- " \"com,\"\n" +
- " \"\"\n", dump(alt));
+ assertEquals("(null)" + NL +
+ " \"ac,\"" + NL +
+ " \"com,\"" + NL +
+ " \"\"" + NL, dump(alt));
alt.addBranch("ac,edu,");
- assertEquals("(null)\n" +
- " \"ac,\"\n" +
- " \"com,\"\n" +
- " \"edu,\"\n" +
- " \"\"\n", dump(alt));
+ assertEquals("(null)" + NL +
+ " \"ac,\"" + NL +
+ " \"com,\"" + NL +
+ " \"edu,\"" + NL +
+ " \"\"" + NL, dump(alt));
}
public void testTrie2() {
Node alt = new Node(null, new ArrayList());
alt.addBranch("ac,");
alt.addBranch("*,");
- assertEquals("(null)\n" +
- " \"ac,\"\n" +
- " \"*,\"\n", dump(alt));
+ assertEquals("(null)" + NL +
+ " \"ac,\"" + NL +
+ " \"*,\"" + NL, dump(alt));
}
public void testTrie3() {
@@ -107,11 +107,11 @@ public void testTrie3() {
alt.addBranch("ac,!hoge,");
alt.addBranch("ac,*,");
// exception goes first.
- assertEquals("(null)\n" +
- " \"ac,\"\n" +
- " \"!hoge,\"\n" +
- " \"*,\"\n" +
- " \"\"\n", dump(alt));
+ assertEquals("(null)" + NL +
+ " \"ac,\"" + NL +
+ " \"!hoge,\"" + NL +
+ " \"*,\"" + NL +
+ " \"\"" + NL, dump(alt));
}
// test of higher-level functionality
diff --git a/src/test/java/org/archive/url/UsableURITest.java b/src/test/java/org/archive/url/UsableURITest.java
index 2aec0e96..2a2f41f5 100644
--- a/src/test/java/org/archive/url/UsableURITest.java
+++ b/src/test/java/org/archive/url/UsableURITest.java
@@ -21,7 +21,6 @@
import java.net.URISyntaxException;
import org.apache.commons.httpclient.URIException;
-import org.archive.url.UsableURI;
import junit.framework.TestCase;
@@ -53,4 +52,31 @@ public void testSchemalessRelative() throws URIException {
UsableURI test = new UsableURI(base, relative);
assertEquals("http://www.facebook.com/?href=http://www.archive.org/a", test.toString());
}
+
+ /**
+ * Test of toUnicodeHostString method, of class UsableURI.
+ */
+ public void testToUnicodeHostString() throws URIException {
+ assertEquals("http://øx.dk", new UsableURI("http://xn--x-4ga.dk", true, "UTF-8").toUnicodeHostString());
+ assertEquals("xn--x-4ga.dk", new UsableURI("xn--x-4ga.dk", true, "UTF-8").toUnicodeHostString());
+ assertEquals("http://user:pass@øx.dk:8080", new UsableURI("http://user:pass@xn--x-4ga.dk:8080", true, "UTF-8").toUnicodeHostString());
+ assertEquals("http://user@øx.dk:8080", new UsableURI("http://user@xn--x-4ga.dk:8080", true, "UTF-8").toUnicodeHostString());
+ assertEquals("http://øx.dk/foo/bar?query=q", new UsableURI("http://xn--x-4ga.dk/foo/bar?query=q", true, "UTF-8").toUnicodeHostString());
+ assertEquals("http://127.0.0.1/foo/bar?query=q", new UsableURI("http://127.0.0.1/foo/bar?query=q", true, "UTF-8").toUnicodeHostString());
+
+ // Test the IDN round trip.
+ // XXX This fails because IDN is not handled here (the host is converted to Punycode in UsableURIFactory.fixupDomainlabel()):
+ // assertEquals("http://øx.dk", new UsableURI("http://øx.dk", false, "UTF-8").toUnicodeHostString());
+ // Checking the round trip therefore requires the factory method in UsableURIFactory.
+ assertEquals("http://øx.dk/", UsableURIFactory.getInstance("http://øx.dk/", "UTF-8").toUnicodeHostString());
+
+ // non-idn domain name
+ assertEquals("http://example.org", new UsableURI("http://example.org", true, "UTF-8").toUnicodeHostString());
+
+ // ensure a call to toUnicodeHostString() has no effect on toString()
+ UsableURI uri = new UsableURI("http://xn--x-4ga.dk", true, "UTF-8");
+ assertEquals("http://øx.dk", uri.toUnicodeHostString());
+ uri.setPath(uri.getPath()); // force toString() cached value to be recomputed
+ assertEquals("http://xn--x-4ga.dk", uri.toString());
+ }
}
diff --git a/src/test/java/org/archive/util/ArchiveUtilsTest.java b/src/test/java/org/archive/util/ArchiveUtilsTest.java
index 8251615a..586a1821 100644
--- a/src/test/java/org/archive/util/ArchiveUtilsTest.java
+++ b/src/test/java/org/archive/util/ArchiveUtilsTest.java
@@ -229,16 +229,19 @@ public void testByteArrayEquals() {
/** test doubleToString() */
public void testDoubleToString(){
- double test = 12.345;
- assertTrue(
+ double test = 12.121d;
+ assertEquals(
"cecking zero precision",
- ArchiveUtils.doubleToString(test, 0).equals("12"));
- assertTrue(
+ "12",
+ ArchiveUtils.doubleToString(test, 0));
+ assertEquals(
"cecking 2 character precision",
- ArchiveUtils.doubleToString(test, 2).equals("12.34"));
- assertTrue(
+ "12.12",
+ ArchiveUtils.doubleToString(test, 2));
+ assertEquals(
"cecking precision higher then the double has",
- ArchiveUtils.doubleToString(test, 65).equals("12.345"));
+ "12.121",
+ ArchiveUtils.doubleToString(test, 65));
}
diff --git a/src/test/java/org/archive/util/binsearch/SortedTextFileTest.java b/src/test/java/org/archive/util/binsearch/SortedTextFileTest.java
index 2c9d19e8..8f812b75 100644
--- a/src/test/java/org/archive/util/binsearch/SortedTextFileTest.java
+++ b/src/test/java/org/archive/util/binsearch/SortedTextFileTest.java
@@ -25,7 +25,7 @@ private void createFile(File target, int max) throws FileNotFoundException {
public void testGetRecordIteratorStringBoolean() throws IOException {
- File test = new File("/tmp/test.tmp");
+ File test = File.createTempFile("test", null);
int max = 1000000;
createFile(test,max);
RandomAccessFileSeekableLineReaderFactory factory =
diff --git a/src/test/java/org/archive/util/iterator/SortedCompositeIteratorTest.java b/src/test/java/org/archive/util/iterator/SortedCompositeIteratorTest.java
index f1c2a0ec..11ea1229 100644
--- a/src/test/java/org/archive/util/iterator/SortedCompositeIteratorTest.java
+++ b/src/test/java/org/archive/util/iterator/SortedCompositeIteratorTest.java
@@ -4,6 +4,7 @@
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
+import java.io.IOException;
import java.io.PrintWriter;
import java.util.Comparator;
@@ -11,21 +12,11 @@
public class SortedCompositeIteratorTest extends TestCase {
- public void testHasNext() throws FileNotFoundException {
+ public void testHasNext() throws IOException {
- long t = 210000;
- long c = 134;
- float f = (float)c / (float)t;
- System.err.format("F(%f)\n",f);
+ File a = File.createTempFile("filea", null);
+ File b = File.createTempFile("fileb", null);
- File a = new File("/tmp/a");
- File b = new File("/tmp/b");
- if(a.isFile()) {
- a.delete();
- }
- if(b.isFile()) {
- b.delete();
- }
PrintWriter apw = new PrintWriter(a);
PrintWriter bpw = new PrintWriter(b);
apw.println("1");
@@ -38,6 +29,7 @@ public void testHasNext() throws FileNotFoundException {
BufferedReader bbr = new BufferedReader(new FileReader(b));
SortedCompositeIterator<String> sci = new SortedCompositeIterator<String>(new Comparator<String>() {
+ @Override
public int compare(String o1, String o2) {
return o1.compareTo(o2);
}
diff --git a/src/test/java/org/archive/util/zip/GZIPMembersInputStreamTest.java b/src/test/java/org/archive/util/zip/GZIPMembersInputStreamTest.java
index d3dc1ff6..710ff069 100644
--- a/src/test/java/org/archive/util/zip/GZIPMembersInputStreamTest.java
+++ b/src/test/java/org/archive/util/zip/GZIPMembersInputStreamTest.java
@@ -30,7 +30,7 @@
import org.archive.util.ArchiveUtils;
import org.archive.util.zip.GZIPMembersInputStream;
-import com.google.common.io.NullOutputStream;
+import com.google.common.io.ByteStreams;
import com.google.common.primitives.Bytes;
/**
@@ -70,14 +70,14 @@ public static void main(String [] args) {
public void testFullReadAllFour() throws IOException {
GZIPMembersInputStream gzin =
new GZIPMembersInputStream(new ByteArrayInputStream(allfour_gz));
- int count = IOUtils.copy(gzin, new NullOutputStream());
+ int count = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong length uncompressed data", 1024+(32*1024)+1+5, count);
}
public void testFullReadSixSmall() throws IOException {
GZIPMembersInputStream gzin =
new GZIPMembersInputStream(new ByteArrayInputStream(sixsmall_gz));
- int count = IOUtils.copy(gzin, new NullOutputStream());
+ int count = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong length uncompressed data", 1+5+1+5+1+5, count);
}
@@ -85,31 +85,31 @@ public void testReadPerMemberAllFour() throws IOException {
GZIPMembersInputStream gzin =
new GZIPMembersInputStream(new ByteArrayInputStream(allfour_gz));
gzin.setEofEachMember(true);
- int count0 = IOUtils.copy(gzin, new NullOutputStream());
+ int count0 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 1k member count", 1024, count0);
assertEquals("wrong member number", 0, gzin.getMemberNumber());
assertEquals("wrong member0 start", 0, gzin.getCurrentMemberStart());
assertEquals("wrong member0 end", noise1k_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int count1 = IOUtils.copy(gzin, new NullOutputStream());
+ int count1 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 32k member count", (32*1024), count1);
assertEquals("wrong member number", 1, gzin.getMemberNumber());
assertEquals("wrong member1 start", noise1k_gz.length, gzin.getCurrentMemberStart());
assertEquals("wrong member1 end", noise1k_gz.length+noise32k_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int count2 = IOUtils.copy(gzin, new NullOutputStream());
+ int count2 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 1-byte member count", 1, count2);
assertEquals("wrong member number", 2, gzin.getMemberNumber());
assertEquals("wrong member2 start", noise1k_gz.length+noise32k_gz.length, gzin.getCurrentMemberStart());
assertEquals("wrong member2 end", noise1k_gz.length+noise32k_gz.length+a_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int count3 = IOUtils.copy(gzin, new NullOutputStream());
+ int count3 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 5-byte member count", 5, count3);
assertEquals("wrong member number", 3, gzin.getMemberNumber());
assertEquals("wrong member3 start", noise1k_gz.length+noise32k_gz.length+a_gz.length, gzin.getCurrentMemberStart());
assertEquals("wrong member3 end", noise1k_gz.length+noise32k_gz.length+a_gz.length+hello_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int countEnd = IOUtils.copy(gzin, new NullOutputStream());
+ int countEnd = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong eof count", 0, countEnd);
}
@@ -118,14 +118,14 @@ public void testReadPerMemberSixSmall() throws IOException {
new GZIPMembersInputStream(new ByteArrayInputStream(sixsmall_gz));
gzin.setEofEachMember(true);
for(int i = 0; i < 3; i++) {
- int count2 = IOUtils.copy(gzin, new NullOutputStream());
+ int count2 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 1-byte member count", 1, count2);
gzin.nextMember();
- int count3 = IOUtils.copy(gzin, new NullOutputStream());
+ int count3 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 5-byte member count", 5, count3);
gzin.nextMember();
}
- int countEnd = IOUtils.copy(gzin, new NullOutputStream());
+ int countEnd = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong eof count", 0, countEnd);
}
@@ -172,19 +172,19 @@ public void testMemberSeek() throws IOException {
new GZIPMembersInputStream(new ByteArrayInputStream(allfour_gz));
gzin.setEofEachMember(true);
gzin.compressedSeek(noise1k_gz.length+noise32k_gz.length);
- int count2 = IOUtils.copy(gzin, new NullOutputStream());
+ int count2 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 1-byte member count", 1, count2);
// assertEquals("wrong Member number", 2, gzin.getMemberNumber());
assertEquals("wrong Member2 start", noise1k_gz.length+noise32k_gz.length, gzin.getCurrentMemberStart());
assertEquals("wrong Member2 end", noise1k_gz.length+noise32k_gz.length+a_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int count3 = IOUtils.copy(gzin, new NullOutputStream());
+ int count3 = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong 5-byte member count", 5, count3);
// assertEquals("wrong Member number", 3, gzin.getMemberNumber());
assertEquals("wrong Member3 start", noise1k_gz.length+noise32k_gz.length+a_gz.length, gzin.getCurrentMemberStart());
assertEquals("wrong Member3 end", noise1k_gz.length+noise32k_gz.length+a_gz.length+hello_gz.length, gzin.getCurrentMemberEnd());
gzin.nextMember();
- int countEnd = IOUtils.copy(gzin, new NullOutputStream());
+ int countEnd = IOUtils.copy(gzin, ByteStreams.nullOutputStream());
assertEquals("wrong eof count", 0, countEnd);
}
@@ -195,7 +195,7 @@ public void testMemberIterator() throws IOException {
Iterator<GZIPMembersInputStream> iter = gzin.memberIterator();
assertTrue(iter.hasNext());
GZIPMembersInputStream gzMember0 = iter.next();
- int count0 = IOUtils.copy(gzMember0, new NullOutputStream());
+ int count0 = IOUtils.copy(gzMember0, ByteStreams.nullOutputStream());
assertEquals("wrong 1k member count", 1024, count0);
assertEquals("wrong member number", 0, gzin.getMemberNumber());
assertEquals("wrong member0 start", 0, gzin.getCurrentMemberStart());
@@ -203,7 +203,7 @@ public void testMemberIterator() throws IOException {
assertTrue(iter.hasNext());
GZIPMembersInputStream gzMember1 = iter.next();
- int count1 = IOUtils.copy(gzMember1, new NullOutputStream());
+ int count1 = IOUtils.copy(gzMember1, ByteStreams.nullOutputStream());
assertEquals("wrong 32k member count", (32*1024), count1);
assertEquals("wrong member number", 1, gzin.getMemberNumber());
assertEquals("wrong member1 start", noise1k_gz.length, gzin.getCurrentMemberStart());
@@ -211,7 +211,7 @@ public void testMemberIterator() throws IOException {
assertTrue(iter.hasNext());
GZIPMembersInputStream gzMember2 = iter.next();
- int count2 = IOUtils.copy(gzMember2, new NullOutputStream());
+ int count2 = IOUtils.copy(gzMember2, ByteStreams.nullOutputStream());
assertEquals("wrong 1-byte member count", 1, count2);
assertEquals("wrong member number", 2, gzin.getMemberNumber());
assertEquals("wrong member2 start", noise1k_gz.length+noise32k_gz.length, gzin.getCurrentMemberStart());
@@ -219,7 +219,7 @@ public void testMemberIterator() throws IOException {
assertTrue(iter.hasNext());
GZIPMembersInputStream gzMember3 = iter.next();
- int count3 = IOUtils.copy(gzMember3, new NullOutputStream());
+ int count3 = IOUtils.copy(gzMember3, ByteStreams.nullOutputStream());
assertEquals("wrong 5-byte member count", 5, count3);
assertEquals("wrong member number", 3, gzin.getMemberNumber());
assertEquals("wrong member3 start", noise1k_gz.length+noise32k_gz.length+a_gz.length, gzin.getCurrentMemberStart());
diff --git a/src/test/resources/org/archive/format/arc/IAH-20080430204825-00000-blackbook-truncated.arc b/src/test/resources/org/archive/format/arc/IAH-20080430204825-00000-blackbook-truncated.arc
new file mode 100644
index 00000000..3cbffb81
--- /dev/null
+++ b/src/test/resources/org/archive/format/arc/IAH-20080430204825-00000-blackbook-truncated.arc
@@ -0,0 +1,1006 @@
+filedesc://IAH-20080430204825-00000-blackbook-truncated.arc 0.0.0.0 20080430204825 text/plain 1300
+1 1 InternetArchive
+URL IP-address Archive-date Content-type Archive-length
+
+
+Heritrix @VERSION@ http://crawler.archive.org
+blackbook
+192.168.1.13
+archive.org-shallow
+archive.org shallow
+Admin
+2008-04-30T20:48:24+00:00
+Mozilla/5.0 (compatible; heritrix/1.14.0 +http://crawler.archive.org)
+archive-crawler-agent@lists.sourceforge.net
+classic
+ARC file version 1.1
+http://www.archive.org/web/researcher/ArcFileFormat.php
+
+dns:www.archive.org 68.87.76.178 20080430204825 text/dns 56
+20080430204825
+www.archive.org. 589 IN A 207.241.229.39
+http://www.archive.org/robots.txt 207.241.229.39 20080430204825 text/plain 782
+HTTP/1.1 200 OK
+Date: Wed, 30 Apr 2008 20:48:24 GMT
+Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
+Last-Modified: Sat, 02 Feb 2008 19:40:44 GMT
+ETag: "47c3-1d3-11134700"
+Accept-Ranges: bytes
+Content-Length: 467
+Connection: close
+Content-Type: text/plain; charset=UTF-8
+
+##############################################
+#
+# Welcome to the Archive!
+#
+##############################################
+# Please crawl our files.
+# We appreciate if you can crawl responsibly.
+# Stay open!
+##############################################
+User-agent: *
+Disallow: /nothing---please-crawl-us--
+
+# slow down the ask jeeves crawler which was hitting our SE a little too fast
+# via collection pages. --Feb2008 tracey--
+User-agent: Teoma
+Crawl-Delay: 10
+http://www.archive.org/ 207.241.229.39 20080430204826 text/html 680
+HTTP/1.1 200 OK
+Date: Wed, 30 Apr 2008 20:48:25 GMT
+Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
+Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT
+ETag: "47ac-16e-4f9e5b40"
+Accept-Ranges: bytes
+Content-Length: 366
+Connection: close
+Content-Type: text/html; charset=UTF-8
+
+
+
+
+
+
+
+
+Please visit our website at:
+http://www.archive.org
+
+
+http://www.archive.org/index.php 207.241.229.39 20080430204826 text/html 29000
+HTTP/1.1 200 OK
+Date: Wed, 30 Apr 2008 20:48:25 GMT
+Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
+X-Powered-By: PHP/5.0.5-2ubuntu1.4
+Set-Cookie: PHPSESSID=657fa9749e9426f2ffa75f14b54ed4ac; path=/; domain=.archive.org
+Connection: close
+Content-Type: text/html; charset=UTF-8
+
+
+
+
+
+
+ Internet Archive
+
+
+
+
+
+The Internet Archive is building a digital library of Internet
+ sites and other cultural artifacts in digital form. Like a paper
+ library, we provide free access to researchers, historians,
+ scholars, and the general public.