Add toByteArray(InputStream input, int size, int bufferSize) by ppkarwasz · Pull Request #776 · apache/commons-io

ppkarwasz · 2025-09-04T21:19:46Z

This introduces `toByteArray(InputStream input, int size, int bufferSize)`, which reads the stream in chunks of `bufferSize` instead of allocating the full array up front.

By reading incrementally, the method:

* Validates that the stream actually contains `size` bytes before completing the allocation.
* Prevents excessive memory usage if a corrupted or malicious `size` value is provided.
* Offers safer handling for untrusted input compared to the direct-allocation variant.
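The incremental approach described above can be sketched roughly as follows. This is a minimal illustration, not the Commons IO implementation: the class name `ChunkedReadSketch` and the method name `readExactly` are hypothetical, and throwing `EOFException` on a short stream is an assumption made for the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration only; not the Commons IO code. */
final class ChunkedReadSketch {

    private ChunkedReadSketch() {
    }

    /**
     * Reads exactly {@code size} bytes, requesting at most {@code bufferSize} bytes per read,
     * so that memory grows with the bytes actually read rather than with the declared size.
     */
    static byte[] readExactly(final InputStream input, final int size, final int bufferSize) throws IOException {
        if (size < 0 || bufferSize <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and bufferSize must be > 0");
        }
        final int chunkLength = Math.min(size, bufferSize);
        final ByteArrayOutputStream out = new ByteArrayOutputStream(chunkLength);
        final byte[] chunk = new byte[chunkLength];
        int remaining = size;
        while (remaining > 0) {
            final int n = input.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                // Assumption of this sketch: a premature end of stream is reported as EOFException.
                throw new EOFException("Expected " + size + " bytes but the stream ended after " + (size - remaining));
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

With this shape, a declared `size` of `Integer.MAX_VALUE` on a five-byte stream fails after allocating only a small chunk, instead of attempting a 2 GiB allocation first.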

I used AI to create the Javadoc.
Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best-practice.
Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.


garydgregory

Hi @ppkarwasz

I have lots of comments! 😉

ppkarwasz · 2025-09-05T14:42:54Z

@garydgregory,

I removed a lot of details from the Javadoc and kept only these:

For toByteArray(InputStream):

commons-io/src/main/java/org/apache/commons/io/IOUtils.java, lines 2643 to 2644 in c6c79a3:

    * <p>The method accumulates the data in temporary buffers and returns a single array
    * containing the entire contents once the end of the stream is reached.</p>

For toByteArray(InputStream, int/long):

commons-io/src/main/java/org/apache/commons/io/IOUtils.java, lines 2666 to 2667 in c6c79a3:

    * <p>The method allocates a single array of the requested size and fills it directly
    * from the stream.</p>

For toByteArray(InputStream, int, int):

commons-io/src/main/java/org/apache/commons/io/IOUtils.java, lines 2708 to 2709 in c6c79a3:

    * <p>The method accumulates the data in temporary buffers of size at most {@code bufferSize}
    * and returns a single array containing the entire contents once the end of the stream is reached.</p>

I believe that these behaviors will not change and should remain in the method contracts. What do you think?

garydgregory

Hi @ppkarwasz ,
I have comments scattered about.
TY!

* Extends incremental (chunked) reading to all `toByteArray` variants when the requested size is unknown or exceeds 128 KiB.
* The 128 KiB threshold matches the default buffer size used in CPython.
* Updates Javadoc to emphasize that memory usage grows **proportionally** with the number of bytes actually **read**, making these methods suitable for large streams when sufficient memory is available.

ppkarwasz · 2025-09-06T18:53:02Z

Hi @garydgregory,

I’ve refactored this proposal a bit further:

* Extended chunked reading to the legacy toByteArray(InputStream, int/long) methods as well.
* Revised the Javadoc to clarify the contract. As you mentioned earlier, we should not guide users based on how trusted the size parameter is. I’ve also removed explicit references to OutOfMemoryError, which can always occur. Instead, the docs now emphasize that memory allocation is proportional to the number of bytes actually read (previously it was de facto proportional to the size argument).

Open questions:

* Chunking threshold: Currently set to 128 KiB, matching CPython. JDK 11+ uses 16 KiB. We could also consider raising our default buffer size (currently 8 KiB, and in some places as low as 1 KiB, which feels outdated).
* Use of available(): Should we consult available() in toByteArray(InputStream, int) and toByteArray(InputStream) (see Add toByteArray(InputStream input, int size, int bufferSize) #776 (comment))? The toByteArray(InputStream, int, int) method explicitly promises chunks up to chunkSize, but the others do not.
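To make the threshold question concrete, here is a rough sketch of the dispatch being discussed: allocate directly for small requests, switch to chunked accumulation above a threshold. The class name `ThresholdSketch` and the constant name `CHUNK_THRESHOLD` are hypothetical, the 128 KiB value is simply the one under discussion at this point in the thread, and the exception choice is an assumption. It relies on `InputStream.readNBytes(byte[], int, int)`, available since Java 9.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration of a size threshold; not the merged Commons IO code. */
final class ThresholdSketch {

    /** Hypothetical constant name; 128 KiB is the value discussed in this comment. */
    static final int CHUNK_THRESHOLD = 128 * 1024;

    private ThresholdSketch() {
    }

    static byte[] read(final InputStream input, final int size) throws IOException {
        if (size < 0) {
            throw new IllegalArgumentException("size must not be negative: " + size);
        }
        if (size <= CHUNK_THRESHOLD) {
            // Small request: allocate the result once and fill it directly.
            final byte[] result = new byte[size];
            if (input.readNBytes(result, 0, size) < size) {
                throw new EOFException("Stream ended before " + size + " bytes were read");
            }
            return result;
        }
        // Large request: grow the output only as data actually arrives.
        final ByteArrayOutputStream out = new ByteArrayOutputStream(CHUNK_THRESHOLD);
        final byte[] chunk = new byte[CHUNK_THRESHOLD];
        int remaining = size;
        while (remaining > 0) {
            final int want = Math.min(chunk.length, remaining);
            final int got = input.readNBytes(chunk, 0, want);
            if (got < want) {
                // readNBytes only returns fewer bytes than requested at end of stream.
                throw new EOFException("Stream ended before " + size + " bytes were read");
            }
            out.write(chunk, 0, got);
            remaining -= got;
        }
        return out.toByteArray();
    }
}
```

A lower threshold bounds the wasted allocation for a bogus `size` more tightly, at the cost of more copying for legitimately large reads; that trade-off is what the choice between 8 KiB, 16 KiB, and 128 KiB comes down to.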

garydgregory · 2025-09-07T23:40:38Z

CPython is irrelevant IMO. Java's Files.BUFFER_SIZE constant in Java 21 is 8K, so 8K is consistent as the default.

> JDK 11+ uses 16 KiB.

Where do you see that?

garydgregory

Hello @ppkarwasz

I added some comments.

ppkarwasz · 2025-09-08T09:21:50Z

> > JDK 11+ uses 16 KiB.
>
> Where do you see that?

In the source code of InputStream. This is the value used by readAllBytes/readNBytes.
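For readers unfamiliar with the JDK methods named here, a small illustration of how they behave at end of stream: `readNBytes(int)` (Java 11+) returns a shorter array instead of throwing, and `readAllBytes()` (Java 9+) returns an empty array once the stream is exhausted. The class and method names below are hypothetical, and the sketch says nothing about the internal buffer size under discussion.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical demo class; it only exercises public JDK methods. */
final class JdkReadDemo {

    private JdkReadDemo() {
    }

    /** Returns the lengths of two successive readNBytes calls followed by readAllBytes. */
    static int[] readLengths(final byte[] data, final int first, final int second) throws IOException {
        try (InputStream in = new ByteArrayInputStream(data)) {
            final byte[] head = in.readNBytes(first);  // up to 'first' bytes
            final byte[] tail = in.readNBytes(second); // may be shorter if the stream ends
            final byte[] rest = in.readAllBytes();     // whatever is left, possibly nothing
            return new int[] {head.length, tail.length, rest.length};
        }
    }
}
```

For a five-byte input, `readLengths(data, 3, 10)` yields lengths 3, 2, and 0: the second call asks for ten bytes but silently returns the two that remain.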

The implementation of `IOUtils.toByteArray(InputStream, int, int)` added in #776 throws different exceptions depending on the requested size:

* For request sizes larger than the internal chunk size, it correctly throws an `EOFException`.
* For smaller requests, it incorrectly throws a generic `IOException`.

This PR makes the behavior consistent by always throwing an `EOFException` when the stream ends prematurely.

Note: This also affects `RandomAccessFiles.read`. Its previous truncation behavior was undocumented and inconsistent with `RandomAccessFile.read` (which reads as much as possible). The new behavior is not explicitly documented here either, since it is unclear whether throwing on truncation is actually desirable.

MarkEWaite · 2025-12-03T02:40:11Z

As far as I can tell from git bisect with builds of commons-io bundled into Jenkins core, this is the pull request that caused a major regression in Jenkins 2.537. The major regression was:

[JENKINS-76295] SSH agents stopped working on Jenkins 2.537 jenkinsci/jenkins#16845

Jenkins agents failed to connect with SSH until we reverted Apache Commons IO 2.21.0 and returned to Apache Commons IO 2.20.0. We released Jenkins 2.538 less than 24 hours after 2.537 due to the severity of the issue.

The stack trace reported in the failure is:

[SSH] Starting sftp client.
[SSH] Copying latest remoting.jar...
java.io.IOException: Could not copy remoting.jar into '/home/jenkins/.jenkins-cd-control' on agent
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.copyAgentJar(SSHLauncher.java:738)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.lambda$launch$0(SSHLauncher.java:462)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.IllegalArgumentException: invalid len argument
    at PluginClassLoader for trilead-api//com.trilead.ssh2.SFTPv3Client.read(SFTPv3Client.java:1250)
    at PluginClassLoader for trilead-api//com.trilead.ssh2.jenkins.SFTPClient$SFTPInputStream.read(SFTPClient.java:172)
    at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:337)
    at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:536)
    at org.apache.commons.io.output.AbstractByteArrayOutputStream.writeImpl(AbstractByteArrayOutputStream.java:405)
    at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.write(UnsynchronizedByteArrayOutputStream.java:227)
    at org.apache.commons.io.IOUtils.copyToOutputStream(IOUtils.java:1958)
    at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:2918)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.readInputStreamIntoByteArrayAndClose(SSHLauncher.java:796)
    at PluginClassLoader for ssh-slaves//hudson.plugins.sshslaves.SSHLauncher.copyAgentJar(SSHLauncher.java:705)
    ... 5 more
Launch failed - cleaning up connection
[SSH] Connection closed.

I've attached a tar archive of the git repository where I've been able to duplicate the issue.

gh-16845.tar.gz

ppkarwasz · 2025-12-03T08:51:59Z

Hi @MarkEWaite,

I can’t reproduce it locally, but the stack trace points to SFTPInputStream#read (source):

@Override
public int read(byte[] b, int off, int len) throws IOException {
    int r = SFTPClient.this.read(h, offset, b, off, len);
    if (r < 0) return -1;
    offset += r;
    return r;
}

This method doesn’t fully follow the InputStream#read contract, particularly when len == 0 or when len exceeds the SFTP packet-size limit (2^15). The change in this PR alters the read pattern and exposes this bug.

Given that toByteArray uses an 8192-byte buffer, my guess is that a read() call with len == 0 triggers the exception.

This looks like an issue in Trilead rather than Commons IO. I submitted a minimal fix in jenkinsci/trilead-ssh2#273.
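As an illustration of the two contract points mentioned above (a zero-length request and an oversized `len`), a defensive wrapper could look roughly like the following. This is a sketch only and not the change submitted in jenkinsci/trilead-ssh2#273: the class name `ContractSafeInputStream` and the constant `MAX_READ` are hypothetical, and the cap of 2^15 bytes is taken from the packet-size limit quoted in this comment.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;

/** Hypothetical wrapper for illustration; not the actual Trilead fix. */
final class ContractSafeInputStream extends FilterInputStream {

    /** Assumed per-read upper bound, based on the 2^15 limit mentioned in the discussion. */
    static final int MAX_READ = 1 << 15;

    ContractSafeInputStream(final InputStream in) {
        super(in);
    }

    @Override
    public int read(final byte[] b, final int off, final int len) throws IOException {
        // InputStream#read specifies IndexOutOfBoundsException for an invalid off/len pair.
        Objects.checkFromIndexSize(off, len, b.length);
        if (len == 0) {
            // A zero-length request returns 0 without touching the underlying stream.
            return 0;
        }
        // Returning fewer bytes than requested is allowed, so clamp instead of failing.
        return in.read(b, off, Math.min(len, MAX_READ));
    }
}
```

Because `InputStream#read(byte[], int, int)` may legally return fewer bytes than requested, clamping keeps the delegate within its limit while callers that loop until they have enough data continue to work.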

More details of the bug are available in:

* jenkinsci/jenkins#11314
* jenkinsci/jenkins#11312
* jenkinsci/trilead-ssh2#273

Details of the changes in the library are available in the library release notes:

* https://github.com/jenkinsci/trilead-ssh2/releases/tag/build-217-jenkins-371.vc1d30dc5a_b_32

Testing done:

* apache/commons-io#776 (comment) includes the testing configuration I used to confirm that 2.537 fails to start SSH build agents before this change and starts SSH build agents successfully after this change

Testing to be done:

* Confirm that incremental build of the plugin passes BOM testing

ppkarwasz added 2 commits September 4, 2025 23:17

Merge remote-tracking branch 'apache/master' into feat/incremental-reading

0f857f8

garydgregory requested changes Sep 5, 2025

ppkarwasz added 5 commits September 5, 2025 15:27

fix: move back positivity check to helper method

d748e99

fix: changelog entry

33a4fe1

fix: Javadoc details

7362d3e

fix: remove negative size check

fe39b77

fix: exception message

c6c79a3

ppkarwasz requested a review from garydgregory September 5, 2025 14:42

garydgregory requested changes Sep 5, 2025

ppkarwasz added 3 commits September 6, 2025 08:05

fix: restore parameter name

9439095

fix: remove details and add guidance

97d37a9

fix: simplify description

cbfa307

ppkarwasz commented Sep 6, 2025

garydgregory requested changes Sep 8, 2025

ppkarwasz added 2 commits September 8, 2025 11:36

fix: Javadoc of constants

38a6a2c

fix: Formatting

29f365b

vy reviewed Sep 8, 2025

ppkarwasz added 2 commits September 10, 2025 11:38

fix: restore previous toByteArray(InputStream, int) behavior

3afae07

fix: use default buffer size as chunk size

4aba097

garydgregory reviewed Sep 10, 2025

ecki reviewed Sep 10, 2025

ppkarwasz added 3 commits September 10, 2025 17:13

fix: possible NPE

15b249d

fix: remove unrelated change

624de9f

fix: toByteArray(InputStream) Javadoc

eb6c2bc

fix: Javadoc

e6e2a4d

ppkarwasz requested review from ecki, garydgregory and vy September 11, 2025 09:35

garydgregory reviewed Sep 11, 2025

ppkarwasz requested a review from garydgregory September 11, 2025 21:38

Fix comment formatting for SOFT_MAX_ARRAY_LENGTH

91636d3

garydgregory changed the title from "feat: Add incremental toByteArray method" to "Add toByteArray(InputStream input, int size, int bufferSize)" Sep 11, 2025

garydgregory merged commit 2330b08 into master Sep 12, 2025
37 of 39 checks passed

garydgregory deleted the feat/incremental-reading branch September 12, 2025 14:44

ppkarwasz mentioned this pull request Oct 3, 2025

IOUtils.toByteArray now throws EOFException when not enough data is available #796

Merged

MarkEWaite mentioned this pull request Dec 3, 2025

Investigate why commons-io 2.21.0 breaks SSH agents on upgrade jenkinsci/jenkins#11314

Closed

MarkEWaite mentioned this pull request Dec 3, 2025

Fix trilead-ssh2 bug exposed by Apache Commons IO 2.21.0 jenkinsci/trilead-api-plugin#259

Merged

2 tasks

5 participants
