[css-syntax] Rewrite representation to be the literal text consumed b… · w3c/csswg-drafts@a313f90 · GitHub

Commit a313f90

[css-syntax] Rewrite representation to be the literal text consumed by the tokenizing algo, allowing it to instead be offsets into the input stream. Make the appropriate corrections to match this.
1 parent 4995eb6 commit a313f90

1 file changed

Lines changed: 68 additions & 23 deletions

css-syntax/Overview.bs

@@ -397,7 +397,7 @@ The input byte stream</h3>
397397
<h3 id="input-preprocessing">
398398
Preprocessing the input stream</h3>
399399

400-
The input stream consists of the <a>code points</a>
400+
The <dfn>input stream</dfn> consists of the <a>code points</a>
401401
pushed into it as the input byte stream is decoded.
402402

403403
Before sending the input stream to the tokenizer,
@@ -463,7 +463,7 @@ Tokenization</h2>
463463
<<delim-token>> has a value composed of a single <a>code point</a>.
464464

465465
<li>
466-
<<number-token>>, <<percentage-token>>, and <<dimension-token>> have a representation composed of one or more <a>code points</a>, and a numeric value.
466+
<<number-token>>, <<percentage-token>>, and <<dimension-token>> have a numeric value.
467467
<<number-token>> and <<dimension-token>> additionally have a type flag set to either "integer" or "number". The type flag defaults to "integer" if not otherwise set.
468468
<<dimension-token>> additionally have a unit composed of one or more <a>code points</a>.
469469
</ul>
@@ -727,22 +727,22 @@ Definitions</h3>
727727

728728
<dt><dfn>next input code point</dfn>
729729
<dd>
730-
The first <a>code point</a> in the input stream that has not yet been consumed.
730+
The first <a>code point</a> in the <a>input stream</a> that has not yet been consumed.
731731

732732
<dt><dfn>current input code point</dfn>
733733
<dd>
734734
The last <a>code point</a> to have been consumed.
735735

736736
<dt><dfn>reconsume the current input code point</dfn>
737737
<dd>
738-
Push the <a>current input code point</a> back onto the front of the input stream,
738+
Push the <a>current input code point</a> back onto the front of the <a>input stream</a>,
739739
so that the next time you are instructed to consume the <a>next input code point</a>,
740740
it will instead reconsume the <a>current input code point</a>.
741741

742742
<dt><dfn>EOF code point</dfn>
743743
<dd>
744-
A conceptual <a>code point</a> representing the end of the input stream.
745-
Whenever the input stream is empty,
744+
A conceptual <a>code point</a> representing the end of the <a>input stream</a>.
745+
Whenever the <a>input stream</a> is empty,
746746
the <a>next input code point</a> is always an EOF code point.
747747

748748
<dt><dfn export>digit</dfn>
@@ -817,6 +817,33 @@ Definitions</h3>
817817
<<hash-token>> with the "id" type flag,
818818
and the unit of <<dimension-token>>.
819819

820+
<dt><dfn>representation</dfn>
821+
<dd>
822+
The <a>representation</a> of a token
823+
is the subsequence of the <a>input stream</a>
824+
consumed by the invocation of the <a>consume a token</a> algorithm
825+
that produced it.
826+
This is preserved for a few algorithms that rely on subtle details of the input text,
827+
which a simple "re-serialization" of the tokens might disturb.
828+
829+
The <a>representation</a> is only consumed by internal algorithms,
830+
and never directly exposed,
831+
so it's not actually required to preserve the exact text;
832+
equivalent methods,
833+
such as associating each token with offsets into the source text,
834+
also suffice.
835+
836+
Note: In particular, the <a>representation</a> preserves details
837+
such as whether .009 was written as ''.009'' or ''9e-3'',
838+
and whether a character was written literally
839+
or as a CSS escape.
840+
The former is necessary to properly parse <<urange>> productions;
841+
the latter is basically an accidental leak of the tokenizing abstraction,
842+
but allowed because it makes the impl easier to define.
843+
844+
If a token is ever produced by an algorithm directly,
845+
rather than thru the tokenization algorithm in this specification,
846+
its representation is the empty string.
820847
</dl>
821848

822849
<!--
@@ -1063,7 +1090,7 @@ Consume a numeric token</h4>
10631090
then:
10641091

10651092
<ol>
1066-
<li>Create a <<dimension-token>> with the same representation, value, and type flag as the returned number,
1093+
<li>Create a <<dimension-token>> with the same value and type flag as the returned number,
10671094
and a unit set initially to the empty string.
10681095

10691096
<li><a>Consume a name</a>.
@@ -1075,11 +1102,11 @@ Consume a numeric token</h4>
10751102
Otherwise,
10761103
if the <a>next input code point</a> is U+0025 PERCENTAGE SIGN (%),
10771104
consume it.
1078-
Create a <<percentage-token>> with the same representation and value as the returned number,
1105+
Create a <<percentage-token>> with the same value as the returned number,
10791106
and return it.
10801107

10811108
Otherwise,
1082-
create a <<number-token>> with the same representation, value, and type flag as the returned number,
1109+
create a <<number-token>> with the same value and type flag as the returned number,
10831110
and return it.
10841111

10851112

@@ -1431,10 +1458,8 @@ Consume a name</h4>
14311458
Consume a number</h4>
14321459

14331460
This section describes how to <dfn>consume a number</dfn> from a stream of <a>code points</a>.
1434-
It returns a 3-tuple of
1435-
a string representation,
1436-
a numeric value,
1437-
and a type flag which is either "integer" or "number".
1461+
It returns a numeric |value|,
1462+
and a |type| which is either "integer" or "number".
14381463

14391464
Note: This algorithm does not do the verification of the first few <a>code points</a>
14401465
that are necessary to ensure a number can be obtained from the stream.
@@ -1445,8 +1470,8 @@ Consume a number</h4>
14451470

14461471
<ol>
14471472
<li>
1448-
Initially set <var>repr</var> to the empty string
1449-
and <var>type</var> to "integer".
1473+
Initially set <var>type</var> to "integer".
1474+
Let |repr| be the empty string.
14501475

14511476
<li>
14521477
If the <a>next input code point</a> is U+002B PLUS SIGN (+) or U+002D HYPHEN-MINUS (-),
@@ -1487,7 +1512,7 @@ Consume a number</h4>
14871512
and set the <var>value</var> to the returned value.
14881513

14891514
<li>
1490-
Return a 3-tuple of <var>repr</var>, <var>value</var>, and <var>type</var>.
1515+
Return <var>value</var> and <var>type</var>.
14911516
</ol>
14921517

14931518

@@ -1967,8 +1992,8 @@ Parse something according to a CSS grammar</h4>
19671992
or the result of parsing the input according to the grammar,
19681993
which is an unspecified structure corresponding to the provided grammar specification.
19691994
The return value must only be interacted with by specification prose,
1970-
where the representation ambiguity is not problematic
1971-
if it is meant to be exposed outside of spec language,
1995+
where the representation ambiguity is not problematic.
1996+
If it is meant to be exposed outside of spec language,
19721997
the spec using the result must explicitly translate it into a well-specified representation,
19731998
such as, for example, by invoking a CSS serialization algorithm
19741999
(like "serialize as a CSS <<string>> value").
@@ -2660,8 +2685,8 @@ The <code>&lt;an+b></code> type</h3>
26602685
<li><dfn><code>&lt;ndashdigit-ident></code></dfn> is an <<ident-token>> whose value is an <a>ASCII case-insensitive</a> match for "n-*", where "*" is a series of one or more <a>digits</a>
26612686
<li><dfn><code>&lt;dashndashdigit-ident></code></dfn> is an <<ident-token>> whose value is an <a>ASCII case-insensitive</a> match for "-n-*", where "*" is a series of one or more <a>digits</a>
26622687
<li><dfn><code>&lt;integer></code></dfn> is a <<number-token>> with its type flag set to "integer"
2663-
<li><dfn><code>&lt;signed-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose representation starts with "+" or "-"
2664-
<li><dfn><code>&lt;signless-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose representation start with a <a>digit</a>
2688+
<li><dfn><code>&lt;signed-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose <a>representation</a> starts with "+" or "-"
2689+
<li><dfn><code>&lt;signless-integer></code></dfn> is a <<number-token>> with its type flag set to "integer", and whose <a>representation</a> start with a <a>digit</a>
26652690
</ul>
26662691

26672692
<p id="anb-plus">
@@ -2779,6 +2804,28 @@ The <<urange>> type</h3>
27792804
in terms of existing CSS tokens,
27802805
and how to interpret it as a range of unicode codepoints.
27812806

2807+
<details class=note>
2808+
<summary>What are the confusing collisions?</summary>
2809+
2810+
For example, in the CSS <nobr>''u + a { color: green; }''</nobr>,
2811+
the intended meaning is that an <code>a</code> element
2812+
following a <code>u</code> element
2813+
should be colored green.
2814+
Whitespace is not normally required between combinators
2815+
and the surrounding selectors,
2816+
so it <em>should</em> be equivalent to minify it to
2817+
<nobr>''u+a{color:green;}''</nobr>.
2818+
2819+
With any other combinator, the two pieces of CSS would be equivalent,
2820+
but due to the previous existence of a specialized unicode-range token,
2821+
the selector portion of the minified code now contains a unicode-range,
2822+
not two idents and a combinator.
2823+
It thus fails to match the Selectors grammar,
2824+
and the rule is thrown out as invalid.
2825+
2826+
(This example is taken from a real-world bug reported to Firefox.)
2827+
</details>
2828+
27822829
Note: The syntax described here is intentionally very low-level,
27832830
and geared toward implementors.
27842831
Authors should instead read the informal syntax description in the previous section,
@@ -2808,9 +2855,7 @@ The <<urange>> type</h3>
28082855
execute the following steps in order:
28092856

28102857
1. Skipping the first ''u'' token,
2811-
concatenate the representations of all the tokens in the production together
2812-
(or, in the case of <<dimension-token>>s,
2813-
the representation followed by the unit).
2858+
concatenate the <a>representations</a> of all the tokens in the production together.
28142859
Let this be <var>text</var>.
28152860

28162861
2. If the first character of <var>text</var> is U+002B PLUS SIGN,
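
Illustration (not part of the commit): the new definition says a token's representation may be kept as offsets into the source text rather than as a copied string. The sketch below shows one way that could look for number tokens only. The names `NumberToken`, `consume_number_token`, `representation`, and `is_signed_integer` are hypothetical, and the regular expression is a simplified stand-in for the spec's "consume a number" algorithm, so treat this as a rough sketch under those assumptions rather than as the specified behavior.

```python
import re
from dataclasses import dataclass

# Simplified pattern for a CSS number: optional sign, digits and/or a
# fractional part, optional exponent. Assumption: escapes, units and
# percentages are out of scope for this sketch.
NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?")


@dataclass(frozen=True)
class NumberToken:
    value: float   # numeric value
    type: str      # "integer" or "number"
    start: int     # offset of the first consumed code point
    end: int       # offset just past the last consumed code point


def consume_number_token(source: str, pos: int = 0) -> NumberToken:
    """Consume a number starting at `pos` and record the offsets it covered."""
    match = NUMBER.match(source, pos)
    if match is None:
        raise ValueError(f"no number at offset {pos}")
    text = match.group()
    # A fractional part or an exponent makes the type "number".
    is_integer = not any(c in text for c in ".eE")
    return NumberToken(
        value=float(text),
        type="integer" if is_integer else "number",
        start=match.start(),
        end=match.end(),
    )


def representation(source: str, token: NumberToken) -> str:
    """Recover the representation lazily from the stored offsets."""
    return source[token.start:token.end]


def is_signed_integer(source: str, token: NumberToken) -> bool:
    """Rough analogue of <signed-integer>: integer type, representation starts with a sign."""
    return token.type == "integer" and representation(source, token)[0] in "+-"
```

With the source `"9e-3 .009 +5"`, the tokens consumed at offsets 0 and 5 have the same numeric value but different representations, which is the distinction the note in the new definition says <<urange>> parsing depends on.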
