Skip to content

Add toByteArray(InputStream input, int size, int bufferSize)#776

Merged
garydgregory merged 20 commits intomasterfrom
feat/incremental-reading
Sep 12, 2025
Merged

Add toByteArray(InputStream input, int size, int bufferSize)#776
garydgregory merged 20 commits intomasterfrom
feat/incremental-reading

Conversation

@ppkarwasz
Copy link
Contributor

This introduces toByteArray(InputStream input, int size, int bufferSize), which reads the stream in chunks of bufferSize instead of allocating the full array up front.

By reading incrementally, the method:

  • Validates that the stream actually contains size bytes before completing the allocation.
  • Prevents excessive memory usage if a corrupted or malicious size value is provided.
  • Offers safer handling for untrusted input compared to the direct-allocation variant.
  • I used AI to create the Javadoc.
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best-practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

This introduces `toByteArray(InputStream input, int size, int bufferSize)`, which reads the stream in chunks of `bufferSize` instead of allocating the full array up front.

By reading incrementally, the method:

* Validates that the stream actually contains `size` bytes before completing the allocation.
* Prevents excessive memory usage if a corrupted or malicious `size` value is provided.
* Offers safer handling for untrusted input compared to the direct-allocation variant.
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz

I have lots of comments! 😉

@ppkarwasz
Copy link
Contributor Author

ppkarwasz commented Sep 5, 2025

@garydgregory,

I removed a lot of details from the Javadoc and kept only these:

I believe that these behaviors will not change and should remain in the method contracts, what do you think?

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz ,
I have comments scattered about.
TY!

* Extends incremental (chunked) reading to all `toByteArray` variants when the requested size is unknown or exceeds 128 KiB.
* The 128 KiB threshold matches the default buffer size used in CPython.
* Updates Javadoc to emphasize that memory usage grows **proportionally** with the number of bytes actually **read**, making these methods suitable for large streams when sufficient memory is available.
@ppkarwasz
Copy link
Contributor Author

Hi @garydgregory,

I’ve refactored this proposal a bit further:

  • Extended chunked reading to the legacy toByteArray(InputStream, int/long) methods as well.
  • Revised the Javadoc to clarify the contract. As you mentioned earlier, we should not guide users based on how trusted the size parameter is. I’ve also removed explicit references to OutOfMemoryError, which can always occur. Instead, the docs now emphasize that memory allocation is proportional to the number of bytes actually read (previously it was de facto proportional to the size argument).

Open questions:

  • Chunking threshold: Currently set to 128 KiB, matching CPython. JDK 11+ uses 16 KiB. We could also consider raising our default buffer size (currently 8 KiB, and in some places as low as 1 KiB, which feels outdated).
  • Use of available(): Should we consult available() in toByteArray(InputStream, int) and toByteArray(InputStream) (see Add toByteArray(InputStream input, int size, int bufferSize) #776 (comment))? The toByteArray(InputStream, int, int) method explicitly promises chunks up to chunkSize, but the others do not.

@garydgregory
Copy link
Member

Hi @garydgregory,

I’ve refactored this proposal a bit further:

  • Extended chunked reading to the legacy toByteArray(InputStream, int/long) methods as well.
  • Revised the Javadoc to clarify the contract. As you mentioned earlier, we should not guide users based on how trusted the size parameter is. I’ve also removed explicit references to OutOfMemoryError, which can always occur. Instead, the docs now emphasize that memory allocation is proportional to the number of bytes actually read (previously it was de facto proportional to the size argument).

Open questions:

  • Chunking threshold: Currently set to 128 KiB, matching CPython. JDK 11+ uses 16 KiB. We could also consider raising our default buffer size (currently 8 KiB, and in some places as low as 1 KiB, which feels outdated).
  • Use of available(): Should we consult available() in toByteArray(InputStream, int) and toByteArray(InputStream) (see feat: Add incremental toByteArray method #776 (comment))? The toByteArray(InputStream, int, int) method explicitly promises chunks up to chunkSize, but the others do not.

CPython is irrelevant IMO. Java's Files.BUFFER_SIZE class in Java 21 is 8K, so 8K is consistent as the default.

JDK 11+ uses 16 KiB.

Where do you see that?

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @ppkarwasz

I added some comments.

@ppkarwasz
Copy link
Contributor Author

JDK 11+ uses 16 KiB.

Where do you see that?

In the source code of InputStream. This is the value used by readAllBytes/readNBytes.

@garydgregory garydgregory changed the title feat: Add incremental toByteArray method Add toByteArray(InputStream input, int size, int bufferSize) Sep 11, 2025
@garydgregory garydgregory merged commit 2330b08 into master Sep 12, 2025
37 of 39 checks passed
@garydgregory garydgregory deleted the feat/incremental-reading branch September 12, 2025 14:44
ppkarwasz added a commit that referenced this pull request Oct 3, 2025
The implementation of `IOUtils.toByteArray(InputStream, int, int)` added in #776 throws different exceptions depending on the requested size:

* For request sizes larger than the internal chunk size, it correctly throws an `EOFException`.
* For smaller requests, it incorrectly throws a generic `IOException`.

This PR makes the behavior consistent by always throwing an `EOFException` when the stream ends prematurely.

Note: This also affects `RandomAccessFiles.read`. Its previous truncation behavior was undocumented and inconsistent with `RandomAccessFile.read` (which reads as much as possible). The new behavior is not explicitly documented here either, since it is unclear whether throwing on truncation is actually desirable.
garydgregory pushed a commit that referenced this pull request Oct 4, 2025
The implementation of `IOUtils.toByteArray(InputStream, int, int)` added in #776 throws different exceptions depending on the requested size:

* For request sizes larger than the internal chunk size, it correctly throws an `EOFException`.
* For smaller requests, it incorrectly throws a generic `IOException`.

This PR makes the behavior consistent by always throwing an `EOFException` when the stream ends prematurely.

Note: This also affects `RandomAccessFiles.read`. Its previous truncation behavior was undocumented and inconsistent with `RandomAccessFile.read` (which reads as much as possible). The new behavior is not explicitly documented here either, since it is unclear whether throwing on truncation is actually desirable.
@MarkEWaite
Copy link

MarkEWaite commented Dec 3, 2025

As far as I can tell from git bisect with builds of commons-io bundled into Jenkins core, this is the pull request that caused a major regression in Jenkins 2.537. The major regression was:

Jenkins agents failed to connect with SSH until we reverted Apache Commons IO 2.21.0 and returned to Apache Commons IO 2.20.0. We released Jenkins 2.538 less than 24 hours after 2.537 due to the severity of the issue.

The stack trace reported in the failure is:

[SSH] Starting sftp client.
[SSH] Copying latest remoting.jar...
java.io.IOException: Could not copy remoting.jar into '/home/jenkins/.jenkins-cd-control' on agent
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.copyAgentJar(SSHLauncher.java:738)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.lambda$launch$0(SSHLauncher.java:462)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.IllegalArgumentException: invalid len argument
    at PluginClassLoader for trilead-api//com.trilead.ssh2.SFTPv3Client.read(SFTPv3Client.java:1250)
    at PluginClassLoader for trilead-api//com.trilead.ssh2.jenkins.SFTPClient$SFTPInputStream.read(SFTPClient.java:172)
    at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:337)
    at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:536)
    at org.apache.commons.io.output.AbstractByteArrayOutputStream.writeImpl(AbstractByteArrayOutputStream.java:405)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.write(UnsynchronizedByteArrayOutputStream.java:227)
    at org.apache.commons.io.IOUtils.copyToOutputStream(IOUtils.java:1958)
    at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:2918)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.readInputStreamIntoByteArrayAndClose(SSHLauncher.java:796)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.copyAgentJar(SSHLauncher.java:705)
    ... 5 more
Launch failed - cleaning up connection
[SSH] Connection closed.

I've attached a tar archive of the git repository where I've been able to duplicate the issue.

gh-16845.tar.gz

@ppkarwasz
Copy link
Contributor Author

ppkarwasz commented Dec 3, 2025

Hi @MarkEWaite,

I can’t reproduce it locally, but the stack trace points to SFTPInputStream#read (source):

@Override
public int read(byte[] b, int off, int len) throws IOException {
    int r = SFTPClient.this.read(h, offset, b, off, len);
    if (r < 0) return -1;
    offset += r;
    return r;
}

This method doesn’t fully follow the InputStream#read contract, particularly when len == 0 or when len exceeds the SFTP packet-size limit (2^15). The change in this PR alters the read pattern and exposes this bug.

Given that toByteArray uses an 8192-byte buffer, my guess is that a read() call with len == 0 triggers the exception.

This looks like an issue in Trilead rather than Commons IO. I submitted a minimal fix in jenkinsci/trilead-ssh2#273

MarkEWaite added a commit to MarkEWaite/trilead-api-plugin that referenced this pull request Dec 3, 2025
More details of the bug are available in:

* jenkinsci/jenkins#11314
* jenkinsci/jenkins#11312
* jenkinsci/trilead-ssh2#273

Details of the changes in the library are available in the library
release notes:

* https://github.com/jenkinsci/trilead-ssh2/releases/tag/build-217-jenkins-371.vc1d30dc5a_b_32

Testing done:

* apache/commons-io#776 (comment)
  includes the testing configuration I used to confirm that 2.537 fails
  to start SSH build agents before this change and starts SSH build
  agents successfully after this change

Testing to be done:

* Confirm that incremental build of the plugin passes BOM testing
MarkEWaite added a commit to jenkinsci/trilead-api-plugin that referenced this pull request Dec 4, 2025
More details of the bug are available in:

* jenkinsci/jenkins#11314
* jenkinsci/jenkins#11312
* jenkinsci/trilead-ssh2#273

Details of the changes in the library are available in the library
release notes:

* https://github.com/jenkinsci/trilead-ssh2/releases/tag/build-217-jenkins-371.vc1d30dc5a_b_32

Testing done:

* apache/commons-io#776 (comment)
  includes the testing configuration I used to confirm that 2.537 fails
  to start SSH build agents before this change and starts SSH build
  agents successfully after this change

Testing to be done:

* Confirm that incremental build of the plugin passes BOM testing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants