Make preprocessing of input stream handle supplementary characters #385

sideshowbarker · 2022-10-31T18:36:20Z

Update: I now think we should do #386 rather than making this change.

Fixes #383. When performing preprocessing of the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected — by only replacing surrogates with U+FFFD if they are lone (unpaired) surrogates, but not replacing surrogates that are part of surrogate pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our implementation encounters supplementary characters in the input stream.
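
For illustration, here is a minimal sketch of the behavior described above: a surrogate pair (high surrogate followed by low surrogate) is passed through, and any other surrogate is replaced with U+FFFD. It is not the code from this PR; the class and method names are hypothetical, and it assumes the input is already available as a Java `String`.

```java
// Hypothetical sketch, not the validator's actual implementation.
final class SurrogatePreprocessingSketch {
    private static final char REPLACEMENT = '\uFFFD';

    // Replaces lone (unpaired) surrogates with U+FFFD and keeps
    // well-formed surrogate pairs (high followed by low) intact.
    static String replaceLoneSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1));
            if (startsPair) {
                // Well-formed pair: copy both code units unchanged.
                out.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate: replace it.
                out.append(REPLACEMENT);
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // U+1F600 as a surrogate pair
        String lone = "a\uD83Db";         // high surrogate with no low surrogate
        System.out.println(replaceLoneSurrogates(paired).equals(paired));
        System.out.println(replaceLoneSurrogates(lone).equals("a\uFFFDb"));
    }
}
```

A low surrogate that appears before a high surrogate is treated as two lone surrogates in this sketch, since only the high-then-low order forms a pair.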

I’ve tested this both with the CSS validator running standalone and in the context of the HTML checker, and found that it works as expected as far as not replacing surrogates that are part of surrogate pairs.

However, I’ve not found a way to test that this code actually replaces lone (unpaired) surrogates as expected — because in the case of most encodings I tried testing with, Java’s internal encoding handling replaces the lone surrogates with U+FFFD before our input-stream-preprocessing implementation is ever run.

So I don’t know of any way to have lone surrogates passed through as-is from an input stream in such a way that our input-stream-preprocessing code would have a chance to replace them. Java always replaces them before our code runs.
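
The effect can be reproduced in isolation. The sketch below assumes UTF-8 input decoded with `new String(bytes, StandardCharsets.UTF_8)`, which is documented to replace malformed input with the charset's replacement string; how the validator actually reads its input stream is not shown here, and the class and method names are hypothetical. The bytes `ED A0 BD` are what a lone U+D83D would look like if it were encoded as UTF-8, which Java's decoder treats as malformed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows Java's decoder replacing an encoded lone
// surrogate before any application-level preprocessing could see it.
final class DecoderReplacementSketch {
    // Decodes bytes as UTF-8; malformed sequences are replaced by the decoder.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if any UTF-16 code unit in the string is a surrogate.
    static boolean containsSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 'a', then the would-be UTF-8 encoding of lone U+D83D, then 'b'.
        byte[] bytes = {'a', (byte) 0xED, (byte) 0xA0, (byte) 0xBD, 'b'};
        String decoded = decodeUtf8(bytes);
        System.out.println("contains surrogate: " + containsSurrogate(decoded));
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
    }
}
```

Under these assumptions the decoded string never contains a surrogate, so a lone-surrogate branch in later preprocessing code would not be reached for this input.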

ylafon · 2022-11-28T10:14:37Z

Used #386 instead as it is really the proper way (for now)

sideshowbarker requested a review from ylafon October 31, 2022 18:37

sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch 6 times, most recently from 6a4a0f4 to 243d644 Compare October 31, 2022 19:40

sideshowbarker marked this pull request as draft October 31, 2022 20:31

sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch 3 times, most recently from baa0c6f to 4ecda5b Compare November 1, 2022 12:08

sideshowbarker marked this pull request as ready for review November 1, 2022 12:10

sideshowbarker force-pushed the sideshowbarker/preprocessing-input-stream-handle-surrogate-pairs branch from 4ecda5b to b3a5695 Compare November 1, 2022 12:12

sideshowbarker marked this pull request as draft November 2, 2022 04:10

sideshowbarker mentioned this pull request Nov 2, 2022

Don’t do surrogate replacement when preprocessing input stream #386

Merged

ylafon closed this Nov 28, 2022
