Make preprocessing of input stream handle supplementary characters #385
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Update: I now think we should do #386 rather than making this change.
Fixes #383. When performing preprocessing of the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected — by only replacing surrogates with U+FFFD if they are lone (unpaired) surrogates, but not replacing surrogates that are part of surrogate pairs (a high surrogate followed by a low surrogate).
Otherwise, without this change, a parse error will occur when our implementation encounters supplementary characters in the input stream.
I’ve tested this both in the context of the CSS validator itself running standalone and in the context of the HTML checker and found that it works as expected as far as replacing not replacing surrogates that are part of surrogate pairs.
However, I’ve not found a way to test that this code actually replaces lone (unpaired) surrogates as expected — because in the case of most encodings I tried testing with, Java’s internal encoding handling replaces the lone surrogates with U+FFFD before our input-stream-preprocessing implementation is ever run.
So I don’t know of any way to have lone surrogates passed through as-is from an input stream in such a way that our input-stream-preprocessing code would have a chance to replace them. Java always replaces them before our code runs.