Parse error with Unicode supplementary characters in “style” element in HTML document #383

sideshowbarker · 2022-10-31T02:55:45Z

See https://jigsaw.w3.org/css-validator/validator?uri=https://sideshowbarker.net/tests/css-supplementary-code-point.html

The source of https://sideshowbarker.net/tests/css-supplementary-code-point.html has this:

<!doctype html><html lang="en"><title>s</title><meta charset="UTF-8">
<style>
h1::before { content: '🚧' }
</style>
<style>@charset "UTF-8";
h1::after { content: '🚧' }
</style>

In both cases — in the style elements both with and without @charset "UTF-8" — the 🚧 (U+1F6A7 CONSTRUCTION SIGN) character causes the CSS validator to report a parse error.
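For context, Java strings are UTF-16, so a supplementary character such as U+1F6A7 is stored as two `char` values: a high surrogate followed by a low surrogate. Any filter that inspects one `char` at a time therefore sees two surrogate code units rather than one code point. The sketch below only illustrates that representation; the class and method names are hypothetical and are not part of the validator sources.

```java
final class SupplementaryCharDemo {
    // U+1F6A7 CONSTRUCTION SIGN, written with escapes so the result does not
    // depend on the encoding of this source file.
    static final String SIGN = "\uD83D\uDEA7";

    /** Returns true if any UTF-16 code unit in s is a surrogate, paired or not. */
    static boolean containsSurrogateUnit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Number of UTF-16 code units versus number of code points.
        System.out.println(SIGN.length());
        System.out.println(SIGN.codePointCount(0, SIGN.length()));
        // The code point that the two units encode together.
        System.out.printf("U+%X%n", SIGN.codePointAt(0));
        System.out.println(Character.isHighSurrogate(SIGN.charAt(0)));
        System.out.println(Character.isLowSurrogate(SIGN.charAt(1)));
    }
}
```

A check like `containsSurrogateUnit` returns true for every well-formed supplementary character, which is why a per-`char` surrogate test cannot by itself tell valid input from malformed input.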

The parse error does not occur if the following patch is applied (patch --ignore-whitespace -p1 < patch) to the sources:

diff --git a/org/w3c/css/css/StyleSheetParser.java b/org/w3c/css/css/StyleSheetParser.java
index c6c3a97c..2b128dee 100644
--- a/org/w3c/css/css/StyleSheetParser.java
+++ b/org/w3c/css/css/StyleSheetParser.java
@@ -293,12 +293,7 @@ public final class StyleSheetParser

 //	    if (cssFouffa == null) {
             String charset = ac.getCharsetForURL(url);
-            if (ac.getCssVersion().compareTo(CssVersion.CSS2) >=0 ) {
-                cssFouffa = new CssFouffa(ac, new UnescapeFilterReader(new BufferedReader(reader)), url, lineno);
-            } else {
             cssFouffa = new CssFouffa(ac, reader, url, lineno);
-
-            }
             cssFouffa.addListener(this);
 //	    } else {
 //		cssFouffa.ReInit(ac, input, url, lineno);

So the cause would seem to be in https://github.com/w3c/css-validator/blob/main/org/w3c/css/util/UnescapeFilterReader.java, which appears to only be called on content that comes in from a style element in an HTML document (as opposed to being from a separate standalone stylesheet resource, or being entered from the validator’s direct-input textarea).

Related issue: validator/validator#1344

sideshowbarker · 2022-10-31T16:33:57Z

I haven’t tested this yet, but I suspect that for checking by URL, the following patch would cause the same parse error. (Scroll to see the full patch).

diff --git a/org/w3c/css/parser/CssFouffa.java b/org/w3c/css/parser/CssFouffa.java
index ef3580bf..a195bb7b 100644
--- a/org/w3c/css/parser/CssFouffa.java
+++ b/org/w3c/css/parser/CssFouffa.java
@@ -26,12 +26,14 @@ import org.w3c.css.util.ApplContext;
 import org.w3c.css.util.CssVersion;
 import org.w3c.css.util.HTTPURL;
 import org.w3c.css.util.InvalidParamException;
+import org.w3c.css.util.UnescapeFilterReader;
 import org.w3c.css.util.Util;
 import org.w3c.css.util.WarningParamException;
 import org.w3c.css.util.Warnings;
 import org.w3c.css.values.CssExpression;
 import org.w3c.css.values.CssValue;

+import java.io.BufferedReader;
 import java.io.FileNotFoundException;
 import java.io.IOException;
 import java.io.InputStream;
@@ -88,7 +90,7 @@ public final class CssFouffa extends CssParser {
      */
     public CssFouffa(ApplContext ac, Reader reader, URL file, int beginLine)
             throws IOException {
-        super(reader);
+        super(new UnescapeFilterReader(new BufferedReader(reader)));
         if (ac.getOrigin() == -1) {
             setOrigin(StyleSheetOrigin.AUTHOR); // default is user
         } else {

Fixes #383. When performing preprocessing of the input stream as specified in https://drafts.csswg.org/css-syntax/#input-preprocessing, this change makes our implementation handle non-BMP supplementary characters as expected: it replaces surrogates with U+FFFD only if they are lone (unpaired) surrogates, and leaves surrogates that are part of surrogate pairs (a high surrogate followed by a low surrogate) intact. Without this change, a parse error occurs when our implementation encounters supplementary characters in the input stream.
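The lone-versus-paired distinction described above can be sketched as follows. This is a minimal illustration operating on a `String`, not the actual change from the pull request; `LoneSurrogateFilter` and `replaceLoneSurrogates` are hypothetical names.

```java
final class LoneSurrogateFilter {
    /**
     * Replaces unpaired surrogates with U+FFFD and copies well-formed
     * surrogate pairs (high followed by low) through unchanged.
     */
    static String replaceLoneSurrogates(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < in.length()
                    && Character.isLowSurrogate(in.charAt(i + 1));
            if (startsPair) {
                // A valid pair encodes one supplementary code point: keep both units.
                out.append(c).append(in.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // A high surrogate with no low after it, or a stray low surrogate.
                out.append('\uFFFD');
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sign = "\uD83D\uDEA7"; // U+1F6A7 as a surrogate pair
        System.out.println(replaceLoneSurrogates(sign).equals(sign));
    }
}
```

A streaming `Reader` version would need one `char` of lookahead to make the same decision, since the high surrogate alone does not reveal whether a low surrogate follows.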

Fixes #383. This change drops the code for replacing surrogate code points from our implementation of “filter code points” in “Preprocessing the input stream” at https://drafts.csswg.org/css-syntax/#css-filter-code-points. w3c/csswg-drafts#3307 (comment) notes that the only way to produce a surrogate code point in CSS content is by directly assigning a DOMString with one in it via an OM operation; in other words, by using JavaScript to insert a surrogate code point into the document. Because the CSS validator doesn’t execute any JavaScript from a document, there’s no way for a document being checked by the CSS validator to contain any surrogate code points. It’s therefore unnecessary for our implementation to replace surrogate code points: it can still conform to the spec requirements without performing surrogate replacement.
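As a rough sketch of what the remaining preprocessing looks like once surrogate replacement is dropped: the spec's “filter code points” step turns CR, FF, and CR LF into a single LF and replaces NULL with U+FFFD. The code below assumes the input is already decoded into a `String`; `CssInputFilter` and `filterCodePoints` are hypothetical names and do not reflect how UnescapeFilterReader is actually written.

```java
final class CssInputFilter {
    /**
     * Normalizes newlines and replaces NULL, per "filter code points",
     * while deliberately leaving surrogate code units untouched.
     */
    static String filterCodePoints(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\r':
                    // CR followed by LF collapses into one LF; a bare CR also becomes LF.
                    if (i + 1 < in.length() && in.charAt(i + 1) == '\n') {
                        i++;
                    }
                    out.append('\n');
                    break;
                case '\f':
                    // Form feed is treated as a newline.
                    out.append('\n');
                    break;
                case '\0':
                    // NULL becomes the replacement character.
                    out.append('\uFFFD');
                    break;
                default:
                    // Everything else, including both halves of a surrogate pair, passes through.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String css = "h1::before { content: '\uD83D\uDEA7' }\r\n";
        System.out.println(filterCodePoints(css));
    }
}
```

Because surrogates are never inspected, a supplementary character such as U+1F6A7 reaches the parser as the same two code units it had in the source.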

sideshowbarker mentioned this issue Oct 31, 2022

Make preprocessing of input stream handle supplementary characters #385

Closed

sideshowbarker mentioned this issue Nov 2, 2022

Don’t do surrogate replacement when preprocessing input stream #386

Merged

ylafon closed this as completed in #386 Nov 28, 2022
