[css-syntax] Rewrite representation to be the literal text consumed by the tokenizing algo, allowing it to instead be offsets into the input stream. Make the appropriate corrections to match this.
File: css-syntax/Overview.bs
Lines changed: 68 additions & 23 deletions
@@ -397,7 +397,7 @@ The input byte stream</h3>
397
397
<h3 id="input-preprocessing">
398
398
Preprocessing the input stream</h3>
399
399
400
-
The input stream consists of the <a>code points</a>
400
+
The <dfn>input stream</dfn> consists of the <a>code points</a>
401
401
pushed into it as the input byte stream is decoded.
402
402
403
403
Before sending the input stream to the tokenizer,
@@ -463,7 +463,7 @@ Tokenization</h2>
463
463
<<delim-token>> has a value composed of a single <a>code point</a>.
464
464
465
465
<li>
466
-
<<number-token>>, <<percentage-token>>, and <<dimension-token>> have a representation composed of one or more <a>code points</a>, and a numeric value.
466
+
<<number-token>>, <<percentage-token>>, and <<dimension-token>> have a numeric value.
467
467
<<number-token>> and <<dimension-token>> additionally have a type flag set to either "integer" or "number". The type flag defaults to "integer" if not otherwise set.
468
468
<<dimension-token>> additionally have a unit composed of one or more <a>code points</a>.
469
469
</ul>
@@ -727,22 +727,22 @@ Definitions</h3>
727
727
728
728
<dt><dfn>next input code point</dfn>
729
729
<dd>
730
-
The first <a>code point</a> in the input stream that has not yet been consumed.
730
+
The first <a>code point</a> in the <a>input stream</a> that has not yet been consumed.
731
731
732
732
<dt><dfn>current input code point</dfn>
733
733
<dd>
734
734
The last <a>code point</a> to have been consumed.
735
735
736
736
<dt><dfn>reconsume the current input code point</dfn>
737
737
<dd>
738
-
Push the <a>current input code point</a> back onto the front of the input stream,
738
+
Push the <a>current input code point</a> back onto the front of the <a>input stream</a>,
739
739
so that the next time you are instructed to consume the <a>next input code point</a>,
740
740
it will instead reconsume the <a>current input code point</a>.
741
741
742
742
<dt><dfn>EOF code point</dfn>
743
743
<dd>
744
-
A conceptual <a>code point</a> representing the end of the input stream.
745
-
Whenever the input stream is empty,
744
+
A conceptual <a>code point</a> representing the end of the <a>input stream</a>.
745
+
Whenever the <a>input stream</a> is empty,
746
746
the <a>next input code point</a> is always an EOF code point.
747
747
748
748
<dt><dfn export>digit</dfn>
@@ -817,6 +817,33 @@ Definitions</h3>
817
817
<<hash-token>> with the "id" type flag,
818
818
and the unit of <<dimension-token>>.
819
819
820
+
<dt><dfn>representation</dfn>
821
+
<dd>
822
+
The <a>representation</a> of a token
823
+
is the subsequence of the <a>input stream</a>
824
+
consumed by the invocation of the <a>consume a token</a> algorithm
825
+
that produced it.
826
+
This is preserved for a few algorithms that rely on subtle details of the input text,
827
+
which a simple "re-serialization" of the tokens might disturb.
828
+
829
+
The <a>representation</a> is only consumed by internal algorithms,
830
+
and never directly exposed,
831
+
so it's not actually required to preserve the exact text;
832
+
equivalent methods,
833
+
such as associating each token with offsets into the source text,
834
+
also suffice.
835
+
836
+
Note: In particular, the <a>representation</a> preserves details
837
+
such as whether .009 was written as ''.009'' or ''9e-3'',
838
+
and whether a character was written literally
839
+
or as a CSS escape.
840
+
The former is necessary to properly parse <<urange>> productions;
841
+
the latter is basically an accidental leak of the tokenizing abstraction,
842
+
but allowed because it makes the impl easier to define.
843
+
844
+
If a token is ever produced by an algorithm directly,
845
+
rather than thru the tokenization algorithm in this specification,
846
+
its representation is the empty string.
820
847
</dl>
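The "offsets into the source text" option mentioned in the new definition can be illustrated with a minimal sketch. This is not part of the commit; the `Token` class, its `start`/`end` fields, and the `representation` helper are hypothetical names chosen for illustration, and the offsets are written by hand rather than produced by a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token that stores offsets instead of a copied string."""
    kind: str
    start: int  # offset of the first code point consumed for this token
    end: int    # offset one past the last code point consumed


def representation(source: str, token: Token) -> str:
    """Recover the representation by slicing the original input."""
    return source[token.start:token.end]


source = ".009 9e-3"
literal = Token("number", 0, 4)    # written as .009
exponent = Token("number", 5, 9)   # written as 9e-3
synthetic = Token("number", 0, 0)  # produced directly, so nothing was consumed
```

Both number tokens carry the same numeric value, yet slicing recovers how each was written, and a token that consumed nothing yields the empty string, consistent with the last paragraph of the definition.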
821
848
822
849
<!--
@@ -1063,7 +1090,7 @@ Consume a numeric token</h4>
1063
1090
then:
1064
1091
1065
1092
<ol>
1066
-
<li>Create a <<dimension-token>> with the same representation, value, and type flag as the returned number,
1093
+
<li>Create a <<dimension-token>> with the same value and type flag as the returned number,
1067
1094
and a unit set initially to the empty string.
1068
1095
1069
1096
<li><a>Consume a name</a>.
@@ -1075,11 +1102,11 @@ Consume a numeric token</h4>
1075
1102
Otherwise,
1076
1103
if the <a>next input code point</a> is U+0025 PERCENTAGE SIGN (%),
1077
1104
consume it.
1078
-
Create a <<percentage-token>> with the same representation and value as the returned number,
1105
+
Create a <<percentage-token>> with the same value as the returned number,
1079
1106
and return it.
1080
1107
1081
1108
Otherwise,
1082
-
create a <<number-token>> with the same representation, value, and type flag as the returned number,
1109
+
create a <<number-token>> with the same value and type flag as the returned number,
1083
1110
and return it.
1084
1111
1085
1112
@@ -1431,10 +1458,8 @@ Consume a name</h4>
1431
1458
Consume a number</h4>
1432
1459
1433
1460
This section describes how to <dfn>consume a number</dfn> from a stream of <a>code points</a>.
1434
-
It returns a 3-tuple of
1435
-
a string representation,
1436
-
a numeric value,
1437
-
and a type flag which is either "integer" or "number".
1461
+
It returns a numeric |value|,
1462
+
and a |type| which is either "integer" or "number".
1438
1463
1439
1464
Note: This algorithm does not do the verification of the first few <a>code points</a>
1440
1465
that are necessary to ensure a number can be obtained from the stream.
@@ -1445,8 +1470,8 @@ Consume a number</h4>
1445
1470
1446
1471
<ol>
1447
1472
<li>
1448
-
Initially set <var>repr</var> to the empty string
1449
-
and <var>type</var> to "integer".
1473
+
Initially set <var>type</var> to "integer".
1474
+
Let |repr| be the empty string.
1450
1475
1451
1476
<li>
1452
1477
If the <a>next input code point</a> is U+002B PLUS SIGN (+) or U+002D HYPHEN-MINUS (-),
@@ -1487,7 +1512,7 @@ Consume a number</h4>
1487
1512
and set the <var>value</var> to the returned value.
1488
1513
1489
1514
<li>
1490
-
Return a 3-tuple of <var>repr</var>, <var>value</var>, and <var>type</var>.
1515
+
Return <var>value</var> and <var>type</var>.
1491
1516
</ol>
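As a rough sketch of the revised return shape, the following function returns a value and a type and never hands back a representation string. Several assumptions apply: the middle steps of the algorithm are elided in this diff, so the sign, fraction, and exponent handling follows a reading of the surrounding spec rather than the lines shown here; Python's `int` and `float` stand in for the spec's own string-to-number conversion; and the returned end position is an addition made so that a caller could derive the representation from offsets.

```python
def consume_number(text: str, pos: int = 0):
    """Sketch: return (value, type, end_pos) for a number starting at pos.

    Like the spec algorithm, this does not verify that a number can
    actually be obtained; the caller is assumed to have checked that.
    """
    def is_digit(i: int) -> bool:
        return i < len(text) and "0" <= text[i] <= "9"

    start = pos
    number_type = "integer"

    # Optional sign.
    if pos < len(text) and text[pos] in ("+", "-"):
        pos += 1
    # Integer digits.
    while is_digit(pos):
        pos += 1
    # Fractional part: a full stop that is followed by a digit.
    if pos < len(text) and text[pos] == "." and is_digit(pos + 1):
        pos += 2
        number_type = "number"
        while is_digit(pos):
            pos += 1
    # Exponent: e or E, an optional sign, then at least one digit.
    if pos < len(text) and text[pos] in ("e", "E"):
        has_sign = pos + 1 < len(text) and text[pos + 1] in ("+", "-")
        skip = 2 if has_sign else 1
        if is_digit(pos + skip):
            pos += skip + 1
            number_type = "number"
            while is_digit(pos):
                pos += 1

    repr_text = text[start:pos]  # local scratch value only, never returned
    value = int(repr_text) if number_type == "integer" else float(repr_text)
    return value, number_type, pos
```

The local `repr_text` mirrors the |repr| variable that the algorithm still builds internally; only the value and type leave the function.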
1492
1517
1493
1518
@@ -1967,8 +1992,8 @@ Parse something according to a CSS grammar</h4>
1967
1992
or the result of parsing the input according to the grammar,
1968
1993
which is an unspecified structure corresponding to the provided grammar specification.
1969
1994
The return value must only be interacted with by specification prose,
1970
-
where the representation ambiguity is not problematic
1971
-
if it is meant to be exposed outside of spec language,
1995
+
where the representation ambiguity is not problematic.
1996
+
If it is meant to be exposed outside of spec language,
1972
1997
the spec using the result must explicitly translate it into a well-specified representation,
1973
1998
such as, for example, by invoking a CSS serialization algorithm
1974
1999
(like "serialize as a CSS <<string>> value").
@@ -2660,8 +2685,8 @@ The <code><an+b></code> type</h3>
2660
2685
<li><dfn><code><ndashdigit-ident></code></dfn> is an <<ident-token>> whose value is an <a>ASCII case-insensitive</a> match for "n-*", where "*" is a series of one or more <a>digits</a>
2661
2686
<li><dfn><code><dashndashdigit-ident></code></dfn> is an <<ident-token>> whose value is an <a>ASCII case-insensitive</a> match for "-n-*", where "*" is a series of one or more <a>digits</a>
2662
2687
<li><dfn><code><integer></code></dfn> is a <<number-token>> with its type flag set to "integer"
2663
-
<li><dfn><code><signed-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose representation starts with "+" or "-"
2664
-
<li><dfn><code><signless-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose representation start with a <a>digit</a>
2688
+
<li><dfn><code><signed-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose <a>representation</a> starts with "+" or "-"
2689
+
<li><dfn><code><signless-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose <a>representation</a> starts with a <a>digit</a>
2665
2690
</ul>
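The two definitions above depend on the representation rather than the value, which a short sketch can make concrete. The `NumberToken` class and both predicate names are hypothetical, and the representation is stored as a plain string here purely for brevity.

```python
from dataclasses import dataclass

ASCII_DIGITS = tuple("0123456789")


@dataclass(frozen=True)
class NumberToken:
    """Hypothetical <number-token> carrying its representation."""
    value: float
    type_flag: str
    representation: str


def is_signed_integer(token: NumberToken) -> bool:
    # Integer type flag, and the source text began with "+" or "-".
    return token.type_flag == "integer" and token.representation[:1] in ("+", "-")


def is_signless_integer(token: NumberToken) -> bool:
    # Integer type flag, and the source text began with a digit.
    return token.type_flag == "integer" and token.representation[:1] in ASCII_DIGITS


plus_three = NumberToken(3, "integer", "+3")
three = NumberToken(3, "integer", "3")
direct = NumberToken(3, "integer", "")  # produced directly, empty representation
```

`plus_three` and `three` have equal values but fall into different categories, which is why the value alone cannot drive these definitions. Under this sketch, a token with an empty representation matches neither.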
2666
2691
2667
2692
<p id="anb-plus">
@@ -2779,6 +2804,28 @@ The <<urange>> type</h3>
2779
2804
in terms of existing CSS tokens,
2780
2805
and how to interpret it as a range of unicode codepoints.
2781
2806
2807
+
<details class=note>
2808
+
<summary>What are the confusing collisions?</summary>
2809
+
2810
+
For example, in the CSS <nobr>''u + a { color: green; }''</nobr>,
2811
+
the intended meaning is that an <code>a</code> element
2812
+
following a <code>u</code> element
2813
+
should be colored green.
2814
+
Whitespace is not normally required between combinators
2815
+
and the surrounding selectors,
2816
+
so it <em>should</em> be equivalent to minify it to
2817
+
<nobr>''u+a{color:green;}''</nobr>.
2818
+
2819
+
With any other combinator, the two pieces of CSS would be equivalent,
2820
+
but due to the previous existence of a specialized unicode-range token,
2821
+
the selector portion of the minified code now contains a unicode-range,
2822
+
not two idents and a combinator.
2823
+
It thus fails to match the Selectors grammar,
2824
+
and the rule is thrown out as invalid.
2825
+
2826
+
(This example is taken from a real-world bug reported to Firefox.)
2827
+
</details>
2828
+
2782
2829
Note: The syntax described here is intentionally very low-level,
2783
2830
and geared toward implementors.
2784
2831
Authors should instead read the informal syntax description in the previous section,
@@ -2808,9 +2855,7 @@ The <<urange>> type</h3>
2808
2855
execute the following steps in order:
2809
2856
2810
2857
1. Skipping the first ''u'' token,
2811
-
concatenate the representations of all the tokens in the production together
2812
-
(or, in the case of <<dimension-token>>s,
2813
-
the representation followed by the unit).
2858
+
concatenate the <a>representations</a> of all the tokens in the production together.
2814
2859
Let this be <var>text</var>.
2815
2860
2816
2861
2. If the first character of <var>text</var> is U+002B PLUS SIGN,