Merged
27 commits
ca8ae7f
NUTCH-3126 Report JUnit test results in GitHub pull request thread (#…
lewismc Dec 11, 2025
1d8106c
NUTCH-3132 Standardize existing Nutch metrics naming and implementati…
lewismc Dec 11, 2025
595cf6c
NUTCH-3132 Standardize existing Nutch metrics naming and implementation
sebastian-nagel Feb 25, 2026
d9571d3
NUTCH-3134 Add latency metrics with percentile support to Fetcher, Pa…
lewismc Dec 18, 2025
d989f76
NUTCH-3133 Upgrade GitHub workflows to JDK 17
sebastian-nagel Dec 11, 2025
e864568
NUTCH-3135 Cache downloaded ant-eclipse.jar
sebastian-nagel Dec 12, 2025
1c835c1
NUTCH-3136 Upgrade crawler-commons dependency
sebastian-nagel Dec 12, 2025
bdbc897
NUTCH-3136 Upgrade crawler-commons dependency
sebastian-nagel Dec 12, 2025
488eacb
NUTCH-3139 protocol-okhttp: add support for zstd content-encoding
sebastian-nagel Dec 15, 2025
2df25d1
NUTCH-3141 Cache Hadoop Counter References in Hot Paths (#878)
lewismc Jan 8, 2026
1a22db3
NUTCH-3143 GitHub workflow does not run all unit tests (#884)
lewismc Jan 12, 2026
e632e55
NUTCH-3143 GitHub workflow does not run all unit tests (#885)
lewismc Jan 12, 2026
e3d0af3
NUTCH-3144 URLUtil unit tests fail after upgrade to crawler-commons 1.6
sebastian-nagel Jan 7, 2026
226ac7e
NUTCH-1564: fix immediate refetch for pages not modified
Jan 3, 2026
89e6b87
NUTCH-1564: fix AdaptiveFetchSchedule for unmodified pages
Jan 4, 2026
366a601
NUTCH-1564: address code review comments.
Jan 8, 2026
fb8538b
NUTCH-3148 Cache Ivy dependencies in GitHub CI builds (#886)
lewismc Jan 12, 2026
cc74d71
NUTCH-3148 Cache Ivy dependencies in GitHub CI builds
sebastian-nagel Feb 25, 2026
e742fc5
NUTCH-3143 GitHub workflow does not run all unit tests (#889)
lewismc Jan 21, 2026
b8d1fc9
NUTCH-3143 GitHub workflow does not run all unit tests (#890)
lewismc Jan 21, 2026
1db8e7d
NUTCH-3142 Add Error Context to Metrics (#882)
lewismc Feb 5, 2026
2e2374d
NUTCH-3150 Expand Caching Hadoop Counter References (#892)
lewismc Feb 10, 2026
fef49b9
NUTCH-3152 Job counters getGroup to use metrics constants
sebastian-nagel Feb 8, 2026
023010a
NUTCH-3153 Update of license and notice files
sebastian-nagel Feb 11, 2026
0eda915
NUTCH-3132 Standardize existing Nutch metrics naming and implementation
sebastian-nagel Feb 26, 2026
044dfd2
NUTCH-3132 Standardize existing Nutch metrics naming and implementation
sebastian-nagel Feb 26, 2026
bf01b43
WARC writer: log capture date as ISO date
sebastian-nagel Feb 26, 2026
11 changes: 9 additions & 2 deletions .github/workflows/cc-build.yml
@@ -29,9 +29,9 @@ jobs:
os: [ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- name: Set up JDK ${{ matrix.java }}
uses: actions/setup-java@v4
uses: actions/setup-java@v5
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
@@ -53,5 +53,12 @@ jobs:
- name: Install recent public suffix list
run: |
curl https://publicsuffix.org/list/public_suffix_list.dat -o conf/effective_tld_names.dat
- name: Cache Ivy dependencies
uses: actions/cache@v4
with:
path: ~/.ivy2/cache
key: ${{ runner.os }}-ivy-${{ hashFiles('ivy/ivy.xml', 'src/plugin/**/ivy.xml') }}
restore-keys: |
${{ runner.os }}-ivy-
- name: Test
run: ant clean test -buildfile build.xml
37 changes: 24 additions & 13 deletions .github/workflows/junit-report.yml
@@ -25,30 +25,41 @@ jobs:
checks:
runs-on: ubuntu-latest
steps:
- name: Download Test Report
- name: Download Test Report (Ubuntu)
uses: dawidd6/action-download-artifact@v11
with:
name: junit-test-results
name: junit-test-results-ubuntu-latest
workflow: master-build.yml
run_id: ${{ github.event.workflow_run.id }}
continue-on-error: true
- name: Publish Test Report
uses: mikepenz/action-junit-report@v5
uses: mikepenz/action-junit-report@v6
with:
report_paths: |-
./test/TEST-*.xml
./**/test/TEST-*.xml
check_name: |-
JUnit Test Report
JUnit Test Report Plugins
commit: ${{ github.event.workflow_run.head_sha }}
comment: true
pr_id: ${{ github.event.workflow_run.pull_requests[0].number }}
fail_on_failure: true
fail_on_failure: false
fail_on_parse_error: true
require_tests: true
require_passed_tests: true
include_passed: false
include_skipped: true
check_annotations: true
annotate_notice: true
job_summary: true
detailed_summary: true
truncate_stack_traces: false
fail_on_parse_error: false # temporary while debugging TestMimeUtil
require_tests: true
flaky_summary: true
skip_success_summary: true
include_time_in_summary: true
include_passed: true
group_suite: true
comment: true
updateComment: true
skip_comment_without_tests: true
job_name: tests
check_name: |-
JUnit Test Report Core
JUnit Test Report Plugins
truncate_stack_traces: false
annotations_limit: 50
pr_id: ${{ github.event.workflow_run.pull_requests[0].number || '' }}
47 changes: 39 additions & 8 deletions .github/workflows/master-build.yml
@@ -24,7 +24,7 @@ jobs:
javadoc:
strategy:
matrix:
java: ['11']
java: ['17']
os: [ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
@@ -34,12 +34,19 @@ jobs:
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
- name: Cache Ivy dependencies
uses: actions/cache@v4
with:
path: ~/.ivy2/cache
key: ${{ runner.os }}-ivy-${{ hashFiles('ivy/ivy.xml', 'src/plugin/**/ivy.xml') }}
restore-keys: |
${{ runner.os }}-ivy-
- name: Javadoc
run: ant clean javadoc -buildfile build.xml
rat:
strategy:
matrix:
java: ['11']
java: ['17']
os: [ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
@@ -49,6 +56,13 @@ jobs:
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
- name: Cache Ivy dependencies
uses: actions/cache@v4
with:
path: ~/.ivy2/cache
key: ${{ runner.os }}-ivy-${{ hashFiles('ivy/ivy.xml', 'src/plugin/**/ivy.xml') }}
restore-keys: |
${{ runner.os }}-ivy-
- name: Run Apache Rat
run: ant clean run-rat -buildfile build.xml
- name: Cache unknown licenses
@@ -62,17 +76,24 @@ jobs:
tests:
strategy:
matrix:
java: ['11']
java: ['17']
os: [ubuntu-latest, macos-latest]
runs-on: ${{ matrix.os }}
timeout-minutes: 30
timeout-minutes: 45
steps:
- uses: actions/checkout@v5
- name: Set up JDK ${{ matrix.java }}
uses: actions/setup-java@v5
with:
java-version: ${{ matrix.java }}
distribution: 'temurin'
- name: Cache Ivy dependencies
uses: actions/cache@v4
with:
path: ~/.ivy2/cache
key: ${{ runner.os }}-ivy-${{ hashFiles('ivy/ivy.xml', 'src/plugin/**/ivy.xml') }}
restore-keys: |
${{ runner.os }}-ivy-
- uses: dorny/paths-filter@de90cc6fb38fc0963ad72b210f1f284cd68cea36
id: filter
with:
@@ -99,13 +120,23 @@ jobs:
- name: test plugins
if: ${{ steps.filter.outputs.plugins == 'true' && steps.filter.outputs.core == 'false' && steps.filter.outputs.buildconf == 'false' }}
run: ant clean test-plugins -buildfile build.xml
- name: Check for test results
id: check_tests
if: always() && matrix.os == 'ubuntu-latest'
run: |
shopt -s globstar nullglob
files=(./build/test/TEST-*.xml ./build/**/test/TEST-*.xml)
if [ ${#files[@]} -gt 0 ]; then
echo "has_results=true" >> $GITHUB_OUTPUT
else
echo "has_results=false" >> $GITHUB_OUTPUT
fi
- name: Upload Test Report
uses: actions/upload-artifact@v4
if: always()
if: always() && matrix.os == 'ubuntu-latest' && steps.check_tests.outputs.has_results == 'true'
with:
name: junit-test-results
name: junit-test-results-${{ matrix.os }}
path: |
./build/test/TEST-*.xml
./build/**/test/TEST-*.xml
retention-days: 1
overwrite: true
retention-days: 1
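Note on the `Check for test results` step: it depends on two bash options. `globstar` lets `**` match nested directories, and `nullglob` makes a pattern with no matches expand to nothing, so the array length is 0 when no reports exist. The following is a minimal standalone sketch of that check, not part of the workflow. The function name `has_junit_results` and the directory argument are hypothetical, and the result is printed to stdout instead of being appended to `$GITHUB_OUTPUT`.

```shell
#!/usr/bin/env bash
# Sketch only: has_junit_results is a hypothetical helper, not part of the Nutch workflow.
# It prints has_results=true if JUnit XML reports exist under <root>/build, else has_results=false.
has_junit_results() {
  local root="$1"
  (
    # Run in a subshell so the shopt settings and the cd do not leak to the caller.
    shopt -s globstar nullglob
    cd "$root" || exit 1
    # With nullglob, patterns that match nothing are dropped from the array.
    files=(./build/test/TEST-*.xml ./build/**/test/TEST-*.xml)
    if [ ${#files[@]} -gt 0 ]; then
      echo "has_results=true"
    else
      echo "has_results=false"
    fi
  )
}
```

Calling `has_junit_results .` from a checkout after a test run reports whether any `TEST-*.xml` files were produced. `globstar` requires bash 4 or later, which is one reason to restrict such a step to the `ubuntu-latest` runner.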
38 changes: 18 additions & 20 deletions LICENSE-binary
@@ -245,7 +245,6 @@ com.google.inject.extensions:guice-servlet
com.google.j2objc:j2objc-annotations
com.healthmarketscience.jackcess:jackcess
com.healthmarketscience.jackcess:jackcess-encrypt
com.intellij:annotations
com.maxmind.db:maxmind-db
com.maxmind.geoip2:geoip2
com.nimbusds:nimbus-jose-jwt
@@ -257,7 +256,12 @@ com.rometools:rome-utils
com.shapesecurity:salvation2
com.squareup.okhttp3:okhttp
com.squareup.okhttp3:okhttp-brotli
com.squareup.okhttp3:okhttp-jvm
com.squareup.okhttp3:okhttp-zstd
com.squareup.okio:okio
com.squareup.okio:okio-jvm
com.squareup.zstd:zstd-kmp-jvm
com.squareup.zstd:zstd-kmp-okio-jvm
com.tdunning:t-digest
com.typesafe.netty:netty-reactive-streams
com.typesafe.scala-logging:scala-logging_2.12
@@ -275,13 +279,14 @@ commons-lang:commons-lang
commons-logging:commons-logging
commons-net:commons-net
commons-validator:commons-validator
de.l3s.boilerpipe:boilerpipe
de.vandermeer:ascii-utf-themes
de.vandermeer:asciitable
de.vandermeer:char-translation
de.vandermeer:skb-interfaces
dev.failsafe:failsafe
info.picocli:picocli
io.dropwizard.metrics:metrics-core
io.netty:netty
io.netty:netty-all
io.netty:netty-buffer
io.netty:netty-codec
@@ -378,7 +383,7 @@ org.apache.hadoop:hadoop-yarn-api
org.apache.hadoop:hadoop-yarn-client
org.apache.hadoop:hadoop-yarn-common
org.apache.hadoop.thirdparty:hadoop-shaded-guava
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_25
org.apache.httpcomponents:httpasyncclient
org.apache.httpcomponents:httpclient
org.apache.httpcomponents:httpcore
@@ -398,21 +403,13 @@ org.apache.kafka:kafka-storage
org.apache.kafka:kafka-storage-api
org.apache.kafka:kafka-tools-api
org.apache.kafka:kafka_2.12
org.apache.kerby:kerb-admin
org.apache.kerby:kerb-client
org.apache.kerby:kerb-common
org.apache.kerby:kerb-core
org.apache.kerby:kerb-crypto
org.apache.kerby:kerb-identity
org.apache.kerby:kerb-server
org.apache.kerby:kerb-simplekdc
org.apache.kerby:kerb-util
org.apache.kerby:kerby-asn1
org.apache.kerby:kerby-config
org.apache.kerby:kerby-pkix
org.apache.kerby:kerby-util
org.apache.kerby:kerby-xdr
org.apache.kerby:token-provider
org.apache.logging.log4j:log4j-api
org.apache.logging.log4j:log4j-core
org.apache.logging.log4j:log4j-slf4j2-impl
@@ -435,6 +432,7 @@ org.apache.pdfbox:fontbox
org.apache.pdfbox:jbig2-imageio
org.apache.pdfbox:jempbox
org.apache.pdfbox:pdfbox
org.apache.pdfbox:pdfbox-io
org.apache.pdfbox:pdfbox-tools
org.apache.pdfbox:xmpbox
org.apache.poi:poi
@@ -443,6 +441,7 @@ org.apache.poi:poi-ooxml-lite
org.apache.poi:poi-scratchpad
org.apache.solr:solr-solrj
org.apache.tika:tika-core
org.apache.tika:tika-handler-boilerpipe
org.apache.tika:tika-langdetect-optimaize
org.apache.tika:tika-parser-apple-module
org.apache.tika:tika-parser-audiovideo-module
@@ -476,8 +475,6 @@ org.asynchttpclient:async-http-client
org.asynchttpclient:async-http-client-netty-utils
org.bitbucket.b_c:jose4j
org.ccil.cowan.tagsoup:tagsoup
org.codehaus.jackson:jackson-core-asl
org.codehaus.jackson:jackson-mapper-asl
org.codehaus.jettison:jettison
org.eclipse.jetty:jetty-alpn-client
org.eclipse.jetty:jetty-alpn-java-client
@@ -515,9 +512,6 @@ org.gagravarr:vorbis-java-core
org.gagravarr:vorbis-java-tika
org.jetbrains:annotations
org.jetbrains.kotlin:kotlin-stdlib
org.jetbrains.kotlin:kotlin-stdlib-common
org.jetbrains.kotlin:kotlin-stdlib-jdk7
org.jetbrains.kotlin:kotlin-stdlib-jdk8
org.jspecify:jspecify
org.littleshoot:littleproxy
org.locationtech.spatial4j:spatial4j
@@ -595,9 +589,7 @@ BSD 2-Clause

com.barchart.udt:barchart-udt-bundle
com.github.luben:zstd-jni
com.google.protobuf:protobuf-java
dk.brics:automaton
dnsjava:dnsjava
org.codehaus.woodstox:stax2-api
org.jline:jline

@@ -609,6 +601,7 @@ BSD 3-Clause

com.adobe.xmp:xmpcore
com.github.virtuald:curvesapi
dnsjava:dnsjava
org.fusesource.leveldbjni:leveldbjni-all
org.ow2.asm:asm

@@ -633,7 +626,7 @@ Bouncy Castle Licence

(licenses-binary/LICENSE-bouncy-castle-licence.txt)

org.bouncycastle:bcmail-jdk18on
org.bouncycastle:bcjmail-jdk18on
org.bouncycastle:bcpkix-jdk18on
org.bouncycastle:bcprov-jdk18on
org.bouncycastle:bcutil-jdk18on
@@ -717,13 +710,17 @@ jakarta.jws:jakarta.jws-api
jakarta.xml.bind:jakarta.xml.bind-api
jakarta.xml.soap:jakarta.xml.soap-api
jakarta.xml.ws:jakarta.xml.ws-api
org.eclipse.angus:angus-activation
org.glassfish.jaxb:jaxb-core
org.glassfish.jaxb:jaxb-runtime
org.glassfish.jaxb:txw2


Eclipse Public License - Version 2.0
------------------------------------

(licenses-binary/LICENSE-eclipse-public-license---version-2.0.txt)

org.eclipse.jetty:jetty-http
org.eclipse.jetty:jetty-io
org.eclipse.jetty:jetty-security
@@ -734,6 +731,8 @@ org.eclipse.jetty:jetty-util
MIT
---

(licenses-binary/LICENSE-mit-license.txt)

net.sourceforge.argparse4j:argparse4j
org.slf4j:slf4j-api

@@ -781,7 +780,6 @@ Public Domain
(licenses-binary/LICENSE-public-domain.txt)

aopalliance:aopalliance
org.tukaani:xz


Public Domain, per Creative Commons CC0