Parse error with Unicode supplementary characters in “style” element in HTML document #383

Closed
sideshowbarker opened this issue Oct 31, 2022 · 1 comment · Fixed by #386

See https://jigsaw.w3.org/css-validator/validator?uri=https://sideshowbarker.net/tests/css-supplementary-code-point.html

The source of https://sideshowbarker.net/tests/css-supplementary-code-point.html has this:

<!doctype html><html lang="en"><title>s</title><meta charset="UTF-8">
<style>
h1::before { content: '🚧' }
</style>
<style>@charset "UTF-8";
h1::after { content: '🚧' }
</style>

In both cases — in the style elements both with and without @charset "UTF-8" — the 🚧 (U+1F6A7 CONSTRUCTION SIGN) character causes the CSS validator to report a parse error.
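To illustrate why this particular character is a useful test case, here is a minimal Java sketch (the class and method names are made up for illustration and are not part of the validator). It shows that U+1F6A7 lies outside the BMP, so any Java `Reader` delivers it as two `char` values, a high surrogate followed by a low surrogate.

```java
final class SupplementaryDemo {
    /** Formats each UTF-16 code unit of s as four uppercase hex digits. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F6A7 CONSTRUCTION SIGN, built from its code point.
        String sign = new String(Character.toChars(0x1F6A7));
        System.out.println(sign.length());                         // 2 (UTF-16 code units)
        System.out.println(sign.codePointCount(0, sign.length())); // 1 (code point)
        System.out.println(codeUnits(sign));                       // D83D DEA7
    }
}
```

Code that inspects the stream one `char` at a time therefore sees two surrogate code units here, even though the document contains a single, valid character.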

The parse error does not occur if the following patch is applied (patch --ignore-whitespace -p1 < patch) to the sources:

diff --git a/org/w3c/css/css/StyleSheetParser.java b/org/w3c/css/css/StyleSheetParser.java
index c6c3a97c..2b128dee 100644
--- a/org/w3c/css/css/StyleSheetParser.java
+++ b/org/w3c/css/css/StyleSheetParser.java
@@ -293,12 +293,7 @@ public final class StyleSheetParser

 //	    if (cssFouffa == null) {
             String charset = ac.getCharsetForURL(url);
-            if (ac.getCssVersion().compareTo(CssVersion.CSS2) >=0 ) {
-                cssFouffa = new CssFouffa(ac, new UnescapeFilterReader(new BufferedReader(reader)), url, lineno);
-            } else {
             cssFouffa = new CssFouffa(ac, reader, url, lineno);
-
-            }
             cssFouffa.addListener(this);
 //	    } else {
 //		cssFouffa.ReInit(ac, input, url, lineno);

So the cause would seem to be in https://github.com/w3c/css-validator/blob/main/org/w3c/css/util/UnescapeFilterReader.java, which appears to only be called on content that comes in from a style element in an HTML document (as opposed to being from a separate standalone stylesheet resource, or being entered from the validator’s direct-input textarea).

Related issue: validator/validator#1344

sideshowbarker commented Oct 31, 2022

I haven’t tested this yet, but I suspect that, when checking by URL, the following patch would cause the same parse error:

diff --git a/org/w3c/css/parser/CssFouffa.java b/org/w3c/css/parser/CssFouffa.java
index ef3580bf..a195bb7b 100644
--- a/org/w3c/css/parser/CssFouffa.java
+++ b/org/w3c/css/parser/CssFouffa.java
@@ -26,12 +26,14 @@ import org.w3c.css.util.ApplContext;
 import org.w3c.css.util.CssVersion;
 import org.w3c.css.util.HTTPURL;
 import org.w3c.css.util.InvalidParamException;
+import org.w3c.css.util.UnescapeFilterReader;
 import org.w3c.css.util.Util;
 import org.w3c.css.util.WarningParamException;
 import org.w3c.css.util.Warnings;
 import org.w3c.css.values.CssExpression;
 import org.w3c.css.values.CssValue;

+import java.io.BufferedReader;
 import java.io.FileNotFoundException;
 import java.io.IOException;
 import java.io.InputStream;
@@ -88,7 +90,7 @@ public final class CssFouffa extends CssParser {
      */
     public CssFouffa(ApplContext ac, Reader reader, URL file, int beginLine)
             throws IOException {
-        super(reader);
+        super(new UnescapeFilterReader(new BufferedReader(reader)));
         if (ac.getOrigin() == -1) {
             setOrigin(StyleSheetOrigin.AUTHOR); // default is user
         } else {

sideshowbarker added a commit that referenced this issue Oct 31, 2022
Fixes #383

When performing preprocessing of the input stream as specified in
https://drafts.csswg.org/css-syntax/#input-preprocessing, this change
makes our implementation handle non-BMP supplementary characters as
expected — by only replacing surrogates with U+FFFD if they are lone
surrogates, but not replacing surrogates that are part of surrogate
pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our
implementation encounters supplementary characters in the input stream.
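
The distinction this commit message draws can be sketched in Java as follows. This is only an illustration under stated assumptions: `LoneSurrogates`, `replaceEvery`, and `replaceLone` are hypothetical names, and the code operates on a `String` rather than on a `Reader` as `UnescapeFilterReader` does. `replaceEvery` models the behavior the message implies was in place before (every surrogate code unit replaced), and `replaceLone` models the intended behavior (pairs preserved).

```java
final class LoneSurrogates {
    /** Replaces every surrogate code unit with U+FFFD, which also destroys valid pairs. */
    static String replaceEvery(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            out.append(Character.isSurrogate(c) ? '\uFFFD' : c);
        }
        return out.toString();
    }

    /** Replaces only unpaired surrogates with U+FFFD; high+low pairs pass through. */
    static String replaceLone(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < in.length()
                    && Character.isLowSurrogate(in.charAt(i + 1));
            if (startsPair) {
                // Valid supplementary character: copy both code units unchanged.
                out.append(c).append(in.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate.
                out.append('\uFFFD');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sign = new String(Character.toChars(0x1F6A7));
        System.out.println(replaceLone(sign).equals(sign));  // true
        System.out.println(replaceEvery(sign).equals(sign)); // false
    }
}
```

A reader-based implementation would additionally need to look one `char` ahead (or buffer the high surrogate) before deciding, since the two halves of a pair arrive in separate `read()` calls.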
sideshowbarker added a commit that referenced this issue Nov 1, 2022
Fixes #383

When performing preprocessing of the input stream as specified in
https://drafts.csswg.org/css-syntax/#input-preprocessing, this change
makes our implementation handle non-BMP supplementary characters as
expected — by only replacing surrogates with U+FFFD if they are lone
surrogates, but not replacing surrogates that are part of surrogate
pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our
implementation encounters supplementary characters in the input stream.
sideshowbarker added a commit that referenced this issue Nov 1, 2022
Fixes #383

When performing preprocessing of the input stream as specified in
https://drafts.csswg.org/css-syntax/#input-preprocessing, this change
makes our implementation handle non-BMP supplementary characters as
expected — by only replacing surrogates with U+FFFD if they are lone
(unpaired) surrogates, but not replacing surrogates that are part of
surrogate pairs (a high surrogate followed by a low surrogate).

Otherwise, without this change, a parse error will occur when our
implementation encounters supplementary characters in the input stream.
sideshowbarker added a commit that referenced this issue Nov 2, 2022
Fixes #383

This change drops the code for replacing surrogate code points from our
implementation of “filter code points” from “Preprocessing the input
stream” at https://drafts.csswg.org/css-syntax/#css-filter-code-points

w3c/csswg-drafts#3307 (comment)
notes that the only way to produce a surrogate code point in CSS content
is by directly assigning a DOMString with one in it via an OM operation;
in other words, by manipulating a document using JavaScript to insert
a surrogate code point into the document.

But because the CSS validator doesn’t execute any JavaScript from a
document, there’s no way for a document being checked by the CSS
validator to contain any surrogate code points. Therefore, it’s
unnecessary for our implementation to handle replacement of surrogate
code points. In other words, our implementation can still conform to the
spec requirements even if we don’t perform surrogate replacement.
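
As a supporting illustration only (not code from the validator, and with hypothetical class and method names), the following Java sketch shows one reason decoded input does not carry lone surrogates, assuming the input arrives as UTF-8 bytes: Java's UTF-8 decoder treats a byte sequence that would encode a surrogate code point as malformed and substitutes U+FFFD, while a properly encoded supplementary character decodes to a valid pair.

```java
import java.nio.charset.StandardCharsets;

final class Utf8DecodeCheck {
    /** Decodes bytes as UTF-8; malformed input is replaced with U+FFFD by the JDK. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns true if s contains a surrogate code unit that is not part of a pair. */
    static boolean containsLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++; // skip the low half of a valid pair
            } else if (Character.isSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ED A0 BD would encode U+D83D if UTF-8 permitted surrogates; it does not.
        byte[] encodedSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD};
        // F0 9F 9A A7 is the UTF-8 encoding of U+1F6A7.
        byte[] constructionSign = {(byte) 0xF0, (byte) 0x9F, (byte) 0x9A, (byte) 0xA7};

        System.out.println(containsLoneSurrogate(decode(encodedSurrogate))); // false
        System.out.println(containsLoneSurrogate(decode(constructionSign))); // false
    }
}
```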