
Make preprocessing of input stream handle supplementary characters #385

Conversation

sideshowbarker
Member

@sideshowbarker sideshowbarker commented Oct 31, 2022

Update: I now think we should do #386 rather than making this change.

Fixes #383. When performing preprocessing of the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected — by only replacing surrogates with U+FFFD if they are lone (unpaired) surrogates, but not replacing surrogates that are part of surrogate pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our implementation encounters supplementary characters in the input stream.
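For illustration only, the surrogate handling described above could be sketched as follows. This is not the code from this pull request: the class name `SurrogatePreprocessSketch` and the method `replaceLoneSurrogates` are hypothetical, and the sketch covers only the surrogate step of input preprocessing, not the other replacements the specification describes.

```java
// Hypothetical sketch; names are illustrative and not taken from the validator source.
final class SurrogatePreprocessSketch {
    static final char REPLACEMENT = '\uFFFD';

    // Keeps well-formed surrogate pairs intact and replaces lone surrogates.
    static String replaceLoneSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // A high surrogate followed by a low surrogate: copy both unchanged.
                out.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Any other surrogate is unpaired: replace it.
                out.append(REPLACEMENT);
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // one supplementary character between 'a' and 'b'
        String lone = "a\uD83Db";         // a high surrogate with no low surrogate after it
        System.out.println(replaceLoneSurrogates(paired).equals(paired));
        System.out.println(replaceLoneSurrogates(lone).equals("a\uFFFDb"));
    }
}
```

The pair check comes first so that a supplementary character is never split; only code units that fail that check are treated as lone surrogates.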

I’ve tested this both with the CSS validator running standalone and within the HTML checker, and found that it works as expected: surrogates that are part of surrogate pairs are not replaced.

However, I’ve not found a way to test that this code actually replaces lone (unpaired) surrogates as expected — because in the case of most encodings I tried testing with, Java’s internal encoding handling replaces the lone surrogates with U+FFFD before our input-stream-preprocessing implementation is ever run.

So I don’t know of any way to have lone surrogates passed through as-is from an input stream in such a way that our input-stream-preprocessing code would have a chance to replace them. Java always replaces them before our code runs.
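One way to observe this behavior in isolation is the small sketch below. It assumes the input arrives as UTF-8 bytes and is decoded with `new String(bytes, StandardCharsets.UTF_8)`; the validator's actual decoding path may differ, and the class and method names are hypothetical. The byte sequence `ED A0 BD` is what a UTF-8 encoding of the lone high surrogate U+D83D would look like, which Java's UTF-8 decoder treats as malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; it does not reproduce the validator's own input handling.
final class DecoderReplacesLoneSurrogates {
    // Decodes bytes as UTF-8; malformed sequences are replaced by the decoder.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reports whether any UTF-16 code unit in the string is a surrogate.
    static boolean containsSurrogate(String s) {
        return s.chars().anyMatch(c -> Character.isSurrogate((char) c));
    }

    public static void main(String[] args) {
        // 'a', then the bytes a lone high surrogate would encode to, then 'b'.
        byte[] bytes = {'a', (byte) 0xED, (byte) 0xA0, (byte) 0xBD, 'b'};
        String decoded = decodeUtf8(bytes);
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
        System.out.println("contains surrogate: " + containsSurrogate(decoded));
    }
}
```

By the time such a string reaches later processing it no longer contains any surrogate, which is consistent with the difficulty described above in exercising the lone-surrogate branch through an input stream.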

@sideshowbarker sideshowbarker requested a review from ylafon October 31, 2022 18:37
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch 6 times, most recently from 6a4a0f4 to 243d644 Compare October 31, 2022 19:40
@sideshowbarker sideshowbarker marked this pull request as draft October 31, 2022 20:31
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch 3 times, most recently from baa0c6f to 4ecda5b Compare November 1, 2022 12:08
@sideshowbarker sideshowbarker marked this pull request as ready for review November 1, 2022 12:10
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch from 4ecda5b to b3a5695 Compare November 1, 2022 12:12
@sideshowbarker sideshowbarker marked this pull request as draft November 2, 2022 04:10
@ylafon
Member

ylafon commented Nov 28, 2022

Used #386 instead, as it is really the proper way (for now).

@ylafon ylafon closed this Nov 28, 2022
Development

Successfully merging this pull request may close these issues.

Parse error with Unicode supplementary characters in “style” element in HTML document
2 participants