
Commons IO 2.21.0 and more #722

Merged
garydgregory merged 41 commits into master from commons_io_2_21_0
Nov 7, 2025

Conversation

@garydgregory (Member) commented Oct 6, 2025

Uses Apache Commons IO 2.21.0 (initially as a SNAPSHOT), adopts the new seekable APIs, and more.

This is a work in progress.

Before you push a pull request, review this list:

  • [x] Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request.
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best-practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

garydgregory and others added 30 commits October 1, 2025 17:51
This change replaces custom argument checks in all `InputStream` and `OutputStream` implementations with the common utility method `IOUtils.checkIndexFromSize`.
This ensures consistent validation logic across the codebase.

Some implementations are intentionally left unchanged: cases where the arguments are immediately delegated to methods such as `InputStream#read`, `OutputStream#write`, or `ByteBuffer#wrap`, which already perform equivalent checks internally.
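The unified-check pattern can be sketched with the JDK's own equivalent, `java.util.Objects.checkFromIndexSize` (JDK 9+), which plays the same role as the `IOUtils.checkIndexFromSize` utility mentioned above; the stream class here is hypothetical, not the actual Commons Compress code:

```java
import java.util.Objects;

public class BoundsCheckDemo {
    private final byte[] data;
    private int pos;

    public BoundsCheckDemo(byte[] data) {
        this.data = data;
    }

    public int read(byte[] b, int off, int len) {
        // One shared utility call replaces hand-rolled
        // `off < 0 || len < 0 || off + len > b.length` checks.
        Objects.checkFromIndexSize(off, len, b.length);
        if (pos >= data.length) {
            return -1;
        }
        int n = Math.min(len, data.length - pos);
        System.arraycopy(data, pos, b, off, n);
        pos += n;
        return n;
    }

    public static void main(String[] args) {
        BoundsCheckDemo in = new BoundsCheckDemo(new byte[] {1, 2, 3});
        byte[] buf = new byte[4];
        System.out.println(in.read(buf, 1, 3)); // reads 3 bytes
        try {
            in.read(buf, 3, 2); // off + len exceeds buf.length
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected");
        }
    }
}
```

Centralizing the check also guards against the classic `off + len` integer-overflow mistake that ad-hoc checks sometimes get wrong.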
* remove: make `TarUtils` final and clean up internal methods

`TarUtils` already had a private constructor and was only used within the `o.a.c.compress.archivers.tar` package. This PR makes the class explicitly final and simplifies its internal API:

* Mark `TarUtils` as `final`.
* Change `protected` methods to package-private (they were never usable outside the package due to the private constructor).
* Remove deprecated non-`public` methods.
* Simplify `parsePaxHeaders`:

  * Require a non-negative `headerSize`. The special value `-1` (“unlimited”) was only used in tests.
  * Update tests to supply a valid `headerSize`.
  * Standardize error handling for invalid PAX 0.0 sparse records: all parsing errors now throw `ArchiveException` (previously one case threw `IOException`).

A changelog entry is included even though these are internal changes, to give users a reference point in case any unintended side effects arise.
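For context, a PAX extended-header record has the shape `"<len> <key>=<value>\n"`, where the leading decimal length covers the entire record including itself. A minimal sketch of parsing one record (illustrative names, not the actual `TarUtils`/`parsePaxHeaders` internals):

```java
import java.nio.charset.StandardCharsets;

public class PaxRecordDemo {
    // Parses one record starting at `offset` and returns its "key=value" body.
    static String parseRecord(byte[] header, int offset) {
        int i = offset;
        int len = 0;
        while (header[i] != ' ') { // leading decimal length field
            len = len * 10 + (header[i] - '0');
            i++;
        }
        // The declared length includes the digits, the space, and the trailing newline.
        String body = new String(header, i + 1, offset + len - (i + 1) - 1, StandardCharsets.UTF_8);
        if (!body.contains("=")) {
            throw new IllegalArgumentException("Invalid PAX record");
        }
        return body;
    }

    public static void main(String[] args) {
        byte[] rec = "30 mtime=1462244289.883087700\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseRecord(rec, 0)); // mtime=1462244289.883087700
    }
}
```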

* fix: make `TarUtils` final

This change actually makes `TarUtils` final and introduces a JapiCmp configuration that ignores all `CLASS_NOW_FINAL` changes.

* fix: don't call tests with `Integer.MAX_VALUE`

* fix: bump `japicmp` to version `0.24.0`

This bumps `japicmp` to the new version `0.24.0`, which ignores changes to protected methods in non-extensible classes such as `TarUtils`.

* Update japicmp version from 0.24.0 to 0.24.1

* fix: clean up POM

* fix: enforcer rule to clean up temporary override

* Test `project.parent.version` resolution on CI

* Try ignoring white space

* Try newest Enforcer version

* Remove enforcer rule

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>
* Introduce `AbstractArchiveBuilder` for unified archiver support

This PR introduces a new `AbstractArchiveBuilder`, which serves as a shared foundation for all archive input streams, including `SevenZFile`, `TarFile`, and `ZipFile`.

This refactoring also leverages the new `ChannelOrigin` from Commons IO 2.21.0, streamlining the builder implementations for `SevenZFile`, `TarFile`, and `ZipFile`.

### Additional improvements

* **Constructor unification:** All constructors now delegate to a single constructor that accepts a builder.
* **Resource safety:** If an error occurs during construction (or within the builder’s `get()` method), the underlying resource is now always closed. Previously, resources provided via `SeekableByteChannel` were not closed on error, potentially leaving channels in an incoherent state.
* **Deprecation cleanup:** All references to deprecated constructors and methods have been removed. Compatibility is verified through `LegacyConstructorsTest`, which ensures the builder pattern behaves equivalently to the removed constructors.
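The resource-safety point can be sketched as follows; `FakeChannel` and `Archive` are illustrative stand-ins, not the real builder classes:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseOnFailureDemo {
    static class FakeChannel implements Closeable {
        boolean closed;
        @Override public void close() { closed = true; }
    }

    static class Archive implements Closeable {
        private final FakeChannel channel;

        Archive(FakeChannel channel, boolean failDuringInit) throws IOException {
            this.channel = channel;
            try {
                if (failDuringInit) {
                    throw new IOException("corrupt header");
                }
            } catch (IOException e) {
                // Previously, a caller-supplied SeekableByteChannel could leak here.
                channel.close();
                throw e;
            }
        }

        @Override public void close() { channel.close(); }
    }

    public static void main(String[] args) {
        FakeChannel ch = new FakeChannel();
        try {
            new Archive(ch, true);
        } catch (IOException expected) {
            // construction failed, but the resource was still released
        }
        System.out.println(ch.closed); // true
    }
}
```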

* fix: Checkstyle violation

* fix: remove unused members

* fix: Windows failure
All JDK 26-ea tests currently fail because PMD does not yet support this version (see [job 52084346962](https://github.com/apache/commons-compress/actions/runs/18292910119/job/52084346962) for an example).

While it’s valuable to stay ahead of upcoming JDK releases, having test jobs that are guaranteed to fail for reasons outside of Commons Compress or the JDK itself adds little benefit.

This PR disables JDK 26-ea tests until pmd/pmd#5871 is resolved, which will likely require upgrading to ASM 9.9 (released only a few days ago). Tests can be re-enabled once PMD adds support for Java 26.
No longer need to override JApiCmp
This PR fixes an issue in `CpioArchiveInputStream` where the checksum was computed using the wrong slice of the buffer. Previously, it always used bytes `0` through `len`, ignoring the specified `off` parameter. As a result, checksum verification could fail when `read()` was called with a non-zero offset.

### Changes

* Corrected checksum calculation to use the actual range `off` through `off + len`.
* Added a regression test to ensure checksum verification works correctly with non-zero offsets.
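The bug is easiest to see side by side; this is a simplified model of the checksum loop, not the actual `CpioArchiveInputStream` code:

```java
public class ChecksumSliceDemo {
    // The old behavior: always sums bytes 0..len, ignoring `off`.
    static long buggySum(byte[] buf, int off, int len) {
        long sum = 0;
        for (int i = 0; i < len; i++) {
            sum += buf[i] & 0xFF;
        }
        return sum;
    }

    // The fix: sum the slice that was actually read, off..off+len.
    static long fixedSum(byte[] buf, int off, int len) {
        long sum = 0;
        for (int i = off; i < off + len; i++) {
            sum += buf[i] & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] buf = {10, 20, 30, 40};
        System.out.println(buggySum(buf, 2, 2)); // 30: sums the wrong bytes
        System.out.println(fixedSum(buf, 2, 2)); // 70: bytes at offsets 2 and 3
    }
}
```

With a zero offset the two loops agree, which is why the bug only surfaced when `read()` was called with `off != 0`.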
* feat: introduce `ArchiveFile` abstraction for file-based archives

This PR adds a new `ArchiveFile` interface to unify the handling of file-based archive utilities such as `SevenZFile`, `TarFile`, and `ZipFile`.

Although these classes target different archive formats, they share several core characteristics:

* All are `Closeable`.
* Each provides the same method to open an `InputStream` for a given entry (`InputStream getInputStream(T)` where `T extends ArchiveEntry`).
* Historically, their `getEntries()` methods returned incompatible types. This PR introduces a common `List<? extends T> entries()` method, aligning with `java.util.zip.ZipFile` in name but offering a modern return value.
* The `ZipFile#stream()` method (added in 1.28.0) is now part of this abstraction.

This change establishes a consistent, format-agnostic API for working with archive files, reducing duplication and improving discoverability for users.
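Based on the description above, the abstraction could look roughly like this; the exact signatures in Commons Compress may differ, and `ListArchive` is a toy stand-in for `SevenZFile`/`TarFile`/`ZipFile`:

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.stream.Stream;

public class ArchiveFileDemo {
    interface ArchiveEntry {
        String getName();
    }

    // Closeable, a common entries() list, per-entry InputStream access,
    // and a default stream() view generalizing ZipFile#stream() from 1.28.0.
    interface ArchiveFile<T extends ArchiveEntry> extends Closeable {
        List<? extends T> entries();
        InputStream getInputStream(T entry) throws IOException;
        default Stream<? extends T> stream() {
            return entries().stream();
        }
    }

    record NamedEntry(String name) implements ArchiveEntry {
        @Override public String getName() { return name; }
    }

    static class ListArchive implements ArchiveFile<NamedEntry> {
        @Override public List<NamedEntry> entries() {
            return List.of(new NamedEntry("a.txt"), new NamedEntry("b.txt"));
        }
        @Override public InputStream getInputStream(NamedEntry entry) {
            return new ByteArrayInputStream(new byte[0]);
        }
        @Override public void close() { }
    }

    public static void main(String[] args) {
        try (ListArchive archive = new ListArchive()) {
            archive.stream().map(ArchiveEntry::getName).forEach(System.out::println);
        }
    }
}
```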

* fix: add `IOIterator<T>` interface

* fix: failing tests

* fix: comment on usage of `getNextTarEntry`

* empty commit to trigger CI

* empty commit to trigger CI (2)

* fix: use `asIterable()` to provide `unwrap()`
Use longer lines
* feat: Add configurable maximum entry name length

This change introduces a configurable limit on archive entry names.

Although formats like **AR**, **CPIO**, and **TAR** permit arbitrarily long path names, real operating systems and file systems impose much stricter limits:

* Individual path segments (file names) are typically limited to 255 bytes.
* Full paths are usually capped at a few KiB (e.g. 1024 bytes on macOS via `PATH_MAX`).

#### What’s new

* Added a new common builder, `AbstractArchiveBuilder`, inserted in the hierarchy between archive stream builders and `AbstractStreamBuilder`.
* Introduced a new configuration option: `setMaxEntryLength`.
* Default value is `Short.MAX_VALUE`, which is higher than any realistic OS limit.
* Enforced the limit across:

  * All `ArchiveInputStream` implementations
  * `SevenZFile`, `TarFile`, and `ZipFile`
* Added a dedicated test suite to verify:

  * Entry names up to `Short.MAX_VALUE` are handled correctly.
  * Entries exceeding a lowered limit result in an exception.
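A minimal sketch of the enforcement, assuming a configurable limit as described above (field and message wording are illustrative):

```java
public class MaxEntryLengthDemo {
    private final int maxEntryLength;

    MaxEntryLengthDemo(int maxEntryLength) {
        this.maxEntryLength = maxEntryLength;
    }

    // Rejects entry names longer than the configured limit.
    String checkEntryName(String name) {
        if (name.length() > maxEntryLength) {
            throw new IllegalArgumentException(
                "Entry name length " + name.length() + " exceeds limit " + maxEntryLength);
        }
        return name;
    }

    public static void main(String[] args) {
        // A lowered limit, e.g. the typical 255-byte single-segment cap.
        MaxEntryLengthDemo demo = new MaxEntryLengthDemo(255);
        System.out.println(demo.checkEntryName("ok.txt"));
        try {
            demo.checkEntryName("x".repeat(300));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```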

* fix: reformat

* fix: AbstractArchiveBuilder javadoc

* fix: extract PAX_NAME_KEY and PAX_LINK_NAME_KEY

* fix: merge errors

* Empty commit to trigger CI

* fix: replace `readRange` + size check with `IOUtils.toByteArray`

Simplifies PAX header parsing by using `IOUtils.toByteArray` from Commons IO instead of `IOUtils.readRange` followed by a manual size check.

The previous size check was effectively unreachable: oversized PAX key/value pairs are validated earlier, and truncated reads are already handled by the input stream. This change improves code clarity and test coverage without altering behavior.

* simplify file name parsing in `SevenZFile`

This change streamlines file name parsing by:

* Reading directly from the existing `ByteBuffer` rather than copying data into a temporary byte array.
* Replacing the temporary byte array with a `StringBuilder` to accumulate the decoded UTF-16 characters.

This reduces intermediate allocations and simplifies the code path without changing behavior.

* fix: PMD failure
This PR removes or hides several members that leaked into the public surface, were never meant for external use and were **never released**. There is **no functional behavior change** to the library; this only corrects API visibility and duplication.

## What changed

* **`SevenZFile.SOFT_MAX_ARRAY_LENGTH`**

  * **Change:** Removed.
  * **Rationale:** Duplicates the constant already available in Commons IO’s `IOUtils`.

* **`SevenZFile.toNonNegativeInt(...)`**

  * **Change:** Visibility reduced (internal helper).
  * **Rationale:** Not part of the supported API; only used internally.

* **`SeekableInMemoryByteChannel.getSize()`**

  * **Change:** Removed public alias.
  * **Rationale:** Only used in tests; behavior diverges from `size()` after channel closure and shouldn’t be exposed.

* **`ElementValue.BYTES`**

  * **Change:** Migrated to caller class.
  * **Rationale:** Had a single call site in another package; not a public contract.
- Use final
- Fix errant curly
- Sort members
- Reduce vertical space
> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
7z: enforce reference limits on `Folder` parsing

This change aligns `Folder` parsing with the limits defined in the official 7-Zip implementation
([`7zIn.cpp`](https://github.com/ip7z/7zip/blob/main/CPP/7zip/Archive/7z/7zIn.cpp)):

* Maximum coders per folder: **64**
* Maximum input streams per coder: **64**
* Maximum output streams per coder: **1**
* Maximum total input streams per folder: **64**

These bounds are consistent with the reference behavior and are safe because:

* Other 7z implementations use the same or stricter limits.
* No supported coder uses multiple inputs or outputs.
* Custom coder definitions are not supported in this implementation.

By enforcing these limits, the parser becomes simpler and more predictable,
and redundant dynamic size checks can be removed.
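Those four limits can be expressed as the kind of up-front validation the parser can now perform; constant and method names here are illustrative, not the real `Folder` parsing code:

```java
public class FolderLimitsDemo {
    static final int MAX_CODERS_PER_FOLDER = 64;
    static final int MAX_INPUT_STREAMS_PER_CODER = 64;
    static final int MAX_OUTPUT_STREAMS_PER_CODER = 1;
    static final int MAX_INPUT_STREAMS_PER_FOLDER = 64;

    static void validateFolder(int[] inputsPerCoder, int[] outputsPerCoder) {
        if (inputsPerCoder.length > MAX_CODERS_PER_FOLDER) {
            throw new IllegalArgumentException("Too many coders in folder");
        }
        long totalInputs = 0;
        for (int i = 0; i < inputsPerCoder.length; i++) {
            if (inputsPerCoder[i] > MAX_INPUT_STREAMS_PER_CODER) {
                throw new IllegalArgumentException("Too many input streams for coder " + i);
            }
            if (outputsPerCoder[i] > MAX_OUTPUT_STREAMS_PER_CODER) {
                throw new IllegalArgumentException("Too many output streams for coder " + i);
            }
            totalInputs += inputsPerCoder[i];
        }
        if (totalInputs > MAX_INPUT_STREAMS_PER_FOLDER) {
            throw new IllegalArgumentException("Too many input streams in folder");
        }
    }

    public static void main(String[] args) {
        validateFolder(new int[] {2, 1}, new int[] {1, 1}); // typical coder chain: fine
        try {
            validateFolder(new int[] {1, 1}, new int[] {1, 2}); // multi-output coder
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Fixed bounds like these let a parser allocate small fixed-size structures instead of sizing them from untrusted header fields.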
* Improve sparse file handling performance

Previously, sparse files were processed recursively. On highly fragmented files, this led to deep recursion and significant inefficiency.

This change replaces the recursive approach with an iterative strategy, which scales better for files with many fragments. It also introduces generated tests that simulate sparse files with very high fragmentation to ensure correctness and performance under stress.
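The recursive-to-iterative idea can be sketched with a simplified model of sparse regions (illustrative names, not the real implementation): a plain loop handles an arbitrary number of fragments, where one recursive call per fragment could overflow the stack on highly fragmented files.

```java
import java.util.List;
import java.util.stream.LongStream;

public class SparseIterationDemo {
    record Region(long offset, long length) {}

    // Iterative traversal: constant stack depth regardless of fragment count.
    static long totalDataBytes(List<Region> regions) {
        long total = 0;
        for (Region r : regions) {
            total += r.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Simulate a highly fragmented sparse file: 100_000 one-byte regions.
        List<Region> regions = LongStream.range(0, 100_000)
            .mapToObj(i -> new Region(i * 2, 1))
            .toList();
        System.out.println(totalDataBytes(regions)); // 100000
    }
}
```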

* fix: remove unused method

* fix: simplify input streams

* fix: error message

* Fix failing tests

* Sort members
* ARJ: correct byte accounting and truncation errors

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.

* fix: failing legacy test

* fix: checkstyle error

* fix: remove `EndianUtils` static import

The static import makes it harder to distinguish calls that need to count bytes from those that do not.
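The endianness point above is why `DataInputStream` was a poor fit: it is big-endian, while ARJ fields are little-endian. A hand-rolled little-endian reader with the consistent truncation handling described above looks roughly like this (illustrative, not the Commons IO `EndianUtils` implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class LittleEndianDemo {
    static int readUnsignedShortLE(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        if (b0 < 0 || b1 < 0) {
            throw new EOFException("Truncated field"); // fail loudly, not silently
        }
        return b0 | (b1 << 8); // low byte first
    }

    public static void main(String[] args) throws IOException {
        // 0x1234 stored little-endian as 0x34, 0x12.
        InputStream in = new ByteArrayInputStream(new byte[] {0x34, 0x12});
        System.out.println(Integer.toHexString(readUnsignedShortLE(in))); // 1234
    }
}
```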

* Fix failing test

* Sort methods

* Remove unused method

* ARJ: strict header validation and `selfExtracting` option

Today `ArjArchiveInputStream` keeps scanning past invalid headers, assuming self-extracting stubs. That can hide corruption.

This PR:

* Introduces a `selfExtracting` ARJ archive option (default **false**).

  * **false:** no scanning; parse strictly from the first byte. Any invalid/truncated header fails fast.
  * **true:** scan only to locate the Main Archive Header (AMH), then switch to **strict mode**. All subsequent headers must be contiguous and valid.

**Behavioral change**

Previously, we might “skip over” bad data. Now we **only** allow a discovery scan for AMH (when opted in); everything after must validate or fail.

* fix: simplify rethrowing

* Fix failing test

* Sort methods

* Remove unused method

* Sort members

* Fix remove unused method

* Extract `MAX_BASIC_HEADER_SIZE` constant

* fix: exception message
- Normalize error messages
- Use final
- Reduce vertical whitespace
ppkarwasz and others added 11 commits October 18, 2025 07:37
* 7z: unsigned number parsing and improved header validation

The 7z file format specification defines only **unsigned numbers** (`UINT64`, `REAL_UINT64`, `UINT32`). However, the current implementation allows parsing methods like `readUint64`, `getLong`, and `getInt` to return negative values and then handles those inconsistently in downstream logic.

This PR introduces a safer and more specification-compliant number parsing model.

### Key changes

* **Strict unsigned number parsing**

  * Parsing methods now *never* return negative numbers.
  * `readUint64`, `readUint64ToIntExact`, `readRealUint64`, and `readUint32` follow the terminology from `7zFormat.txt`.
  * Eliminates scattered negative-value checks that previously compensated for parsing issues.

* **Improved header integrity validation**

  * Before large allocations, the size is now validated against the **actual available data in the header** as well as the memory limit.
  * Prevents unnecessary or unsafe allocations when the archive is corrupted or truncated.

* **Correct numeric type usage**

  * Some fields represent 7z numbers as 64-bit values but are constrained internally to Java `int` limits.
  * These are now declared as `int` to signal real constraints in our implementation.

* **Consistent error handling**
  Parsing now throws only three well-defined exception types:

  | Condition                                                              | Exception                                    |
  | ---------------------------------------------------------------------- | -------------------------------------------- |
  | Declared structure exceeds `maxMemoryLimitKiB`                         | `MemoryLimitException`                       |
  | Missing data inside header (truncated or corrupt)                      | `ArchiveException("Corrupted 7z archive")`   |
  | Unsupported numeric values (too large for implementation) | `ArchiveException("Unsupported 7z archive")` |

  Note: `EOFException` is no longer used: a header with missing fields is not “EOF,” it is **corrupted**.

This PR lays groundwork for safer parsing and easier future maintenance by aligning number handling with the actual 7z specification and making header parsing behavior *predictable and robust*.
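The "never return negative" contract can be sketched for the 32-bit case plus an explicit narrowing check (method names mirror `readUint32`/`readUint64ToIntExact` from the description; the exception type is a stand-in for `ArchiveException`):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnsignedReadDemo {
    // Reads a 32-bit field as an unsigned value: result is 0 .. 2^32-1, never negative.
    static long readUint32(ByteBuffer buf) {
        return Integer.toUnsignedLong(buf.getInt());
    }

    // Narrows a non-negative 64-bit value to int, rejecting out-of-range values
    // instead of silently wrapping.
    static int toIntExact(long uint64) {
        if (uint64 < 0 || uint64 > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Unsupported 7z archive");
        }
        return (int) uint64;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer
            .wrap(new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF})
            .order(ByteOrder.LITTLE_ENDIAN);
        long v = readUint32(buf);
        System.out.println(v); // 4294967295, not -1
        try {
            toIntExact(v); // too large for this implementation's int fields
        } catch (IllegalArgumentException e) {
            System.out.println("unsupported");
        }
    }
}
```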

* fix: explain 3 bytes for Folder

* fix: comment memory check again

* fix: remove unused import

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>
This change simplifies `PackingUtils` by leveraging `IOUtils` from Commons IO for common I/O operations, reducing boilerplate and improving readability.
* 7z: optimize header loading

This change improves the efficiency of 7z header parsing:

* Reads the **Signature Header** in a single ByteBuffer instead of multiple small reads, reducing overhead.
* Uses a `MappedByteBuffer` to load the **Next Header** when the archive is backed by a `FileChannel`, improving performance for large headers by avoiding unnecessary copies.

No new tests are added, as the existing test suite already exercises the affected header loading paths sufficiently.

* fix: rename `computeChecksum` to `crc32`

* fix: improve error messages
* Deprecate `IOUtils.readFully` and `IOUtils.skip`

This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.

* fix: unnecessary qualifier

* fix: remove qualified name
Replace `IOUtils.skip(in, Long.MAX_VALUE)` with `IOUtils.consume(in)` for clarity and intent, removing the need for a magic constant.
@ppkarwasz (Contributor) left a comment
I reviewed the changes that didn’t come from PRs, and they look correct to me! 💯

Just a quick note on review overhead: style-only edits (sorting, Javadoc tweaks, whitespace, etc.) make it harder to focus on logic/API changes.

I’m pretty neutral on these stylistic choices, but how about we:

  • Document the style rules (punctuation, capitalization, final usage, etc.) so we have a clear review guideline: for example, is adding a . at the end of @param an improvement or a regression? Both patterns exist today.
  • Apply them once, repo-wide, in a dedicated commit to clean things up consistently.
  • Enforce them automatically (e.g., via Checkstyle or Spotless).

That would make reviews much simpler and less time-consuming.

Comment on lines -113 to +116

```diff
- private boolean hasHitEOF;
+ private boolean eof;
```
A contributor commented:

There is also TarArchiveInputStream.atEOF, which should be renamed if this one is.

@ppkarwasz ppkarwasz requested a review from Copilot November 7, 2025 09:58
Copilot AI left a comment

Pull Request Overview

This pull request adds synthetic test resources for testing long file names in various archive formats and updates test code to use modern API patterns while removing shaded dependencies.

Key Changes:

  • Added binary test resource files for testing long file name handling in multiple archive formats (tar, cpio, zip, arj, 7z, ar)
  • Migrated from shaded commons-io imports to unshaded imports
  • Updated ZipFile API usage to use the builder pattern instead of deprecated constructors
  • Disabled Zip64SupportIT test class
  • Updated documentation URL from HTTP to HTTPS

Reviewed Changes

Copilot reviewed 110 out of 124 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `src/test/resources/synthetic/long-name/*.tar` | New test resource files for testing long names in TAR formats (pax, gnu) |
| `src/test/resources/synthetic/long-name/*.cpio` | New test resource files for testing long names in CPIO formats (odc, newc, crc) |
| `src/test/resources/synthetic/long-name/*.zip` | New test resource file for testing long names in ZIP format |
| `src/test/resources/synthetic/long-name/*.arj` | New test resource file for testing long names in ARJ format |
| `src/test/resources/synthetic/long-name/*.7z` | New test resource file for testing long names in 7z format |
| `src/test/resources/synthetic/long-name/*.ar` | New test resource files for testing long names in AR format |
| `GzipCompressorOutputStreamTest.java` | Changed import from shaded to unshaded commons-io |
| `ZipMemoryFileSystemTest.java` | Updated ZipFile instantiation to use builder pattern, removed unused import |
| `ZipFileIgnoringLocalFileHeaderTest.java` | Migrated to ZipFile builder API, removed deprecated method |
| `ZipArchiveInputStreamTest.java` | Added IOException to constructor signature, simplified assertion |
| `Zip64SupportIT.java` | Changed import from shaded to unshaded commons-io, disabled entire test class |
| `examples.xml` | Updated LZ4 URL from HTTP to HTTPS |


```java
/**
 * Tests {@link ZipFile} Zip64 support.
 */
@Disabled
```
Copilot AI commented Nov 7, 2025:

The entire test class has been disabled with @Disabled annotation. This is a significant change that should be documented. Consider adding a reason parameter to explain why these tests are disabled, e.g., @Disabled("Reason for disabling"), to help future maintainers understand the context.

```diff
  read++;
  totalRead++;
- if (totalRead < 0 || headerSize >= 0 && totalRead >= headerSize) {
+ if (totalRead < 0 || totalRead >= headerSize) {
```
Copilot AI commented Nov 7, 2025:
Test is always false.

A contributor replied:

totalRead is an int, so it can become negative if an integer overflow occurs.

However, this comment reminds me that `headerSize` in `readPaxHeaders` is not necessary: `inputStream.read()` will return -1 after `headerSize` bytes have been consumed. I have an old local branch that simplifies this code; I will submit it after this PR is merged.
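The reviewer's point is that `totalRead` is a Java `int`, so on pathological input it can wrap past `Integer.MAX_VALUE` and become negative, which is exactly what the `totalRead < 0` guard catches:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int totalRead = Integer.MAX_VALUE;
        totalRead++;                       // silently wraps to Integer.MIN_VALUE
        System.out.println(totalRead < 0); // true: the guard fires
    }
}
```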

@garydgregory garydgregory merged commit 86c20cd into master Nov 7, 2025
24 of 40 checks passed
@garydgregory garydgregory deleted the commons_io_2_21_0 branch November 7, 2025 20:19
