[css2] Added that \0 is undefined.

bert-github · bert-github · commit 926790544b70 · 2004-03-25T18:09:51.000Z
--HG--
extra : convert_revision : svn%3A73dc7c4b-06e6-40f3-b4f7-9ed1dbc14bfc/trunk%402247
diff --git a/css2/syndata.src b/css2/syndata.src
@@ -1,7 +1,7 @@
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
         "http://www.w3.org/TR/1998/REC-html40-19980424/loose.dtd">
 <html lang="en">
-<!-- $Id: syndata.src,v 2.103 2004-03-08 18:40:14 bbos Exp $ -->
+<!-- $Id: syndata.src,v 2.104 2004-03-25 18:09:51 bbos Exp $ -->
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
 <title>Syntax and basic data types</title>
@@ -264,6 +264,8 @@ href="#parsing-errors">rules for handling parsing errors</a>. However, because t
     is followed by at most six hexadecimal digits (0..9A..F), which
     stand for the ISO 10646 ([[ISO10646]]) 
     character with that number, which must not be zero. 
+    (It is undefined in CSS&nbsp;2.1 what happens if a style sheet
+    <em>does</em> contain a zero.)
     If a character in the range [0-9a-fA-F] follows the hexadecimal number,
     the end of the number needs to be made clear. There are two ways
     to do that:
@@ -1203,10 +1205,10 @@ encoding</span> (from highest priority to lowest):
 <li>Assume UTF-8</li>
 </ol>
 
-<p>At most one @charset rule may appear in an external style sheet and
-it must appear at the very start of the style sheet or immediately
-after a Byte Order Mark (BOM, U+FEFF) that is at the very start of the
-style sheet. Any other @charset rules must be ignored by the UA.
+<p>Authors using an @charset rule must place the rule at the very
+beginning of the style sheet, preceded by no characters.  (If a byte
+order mark is appropriate for the encoding used, it may precede the
+@charset rule.)
 </p>
 
 <p>After "@charset", authors specify the name of a character encoding
@@ -1229,25 +1231,78 @@ registry.
 <p>This specification does not mandate which character encodings a
 user agent must support.
 
-<p>This specification does not specify what algorithm a UA must
-apply to derive the encoding from the BOM and the @charset. In
-particular, it does not specify the encoding to use if the BOM and the
-@charset conflict. This is expected to be defined in CSS3.
+<p>User agents must ignore any @charset rule not at the beginning of the
+style sheet.  When user agents detect the character encoding using the
+BOM and/or the @charset rule, they should apply the following rules:
 </p>
 
-<p class="note">Note that reliance on the @charset construct
-theoretically poses a
-problem since there is no <em>a priori</em> information on how it is
-encoded. In practice, however, the encodings in wide use on the
-Internet are either based on ASCII, UTF-16, UCS-4, or (rarely) on
-EBCDIC.  This means that in general, the initial byte values of a
-style sheet enable a user agent to detect the encoding family reliably,
-which provides enough information to decode the @charset rule, which
-in turn determines the exact character encoding.
-</p>
-<!-- More examples of good encodings to use? -IJ -->
+<ul>
+
+<li>Except as specified in these rules, all @charset rules are ignored.</li>
+
+<li>The encoding is detected based on the stream of bytes that begins
+the style sheet.  The following table gives a set of possibilities for
+initial byte sequences (written in hexadecimal).  The first row that
+matches the beginning of the style sheet gives the result of encoding
+detection based on the BOM and/or @charset rule.  If no rows match, the
+encoding cannot be detected based on the BOM and/or @charset rule.  The
+notation (...)* refers to repetition for which the best match is the one
+that repeats as few times as possible.  The bytes marked "XX" are those
+used to determine the name of the encoding, by treating them, in the
+order given, as a sequence of ASCII characters.  Bytes marked "YY" are
+similar, but need to be transcoded into ASCII as noted.  User agents may
+ignore entries in the table if they do not support any encodings
+relevant to the entry.
+
+<table border="1"
+  summary="Relationship between initial bytes of sheet and chosen encoding">
+<tr><th scope="col">Initial Bytes</th><th scope="col">Result</th></tr>
+<tr><td>EF BB BF 40 63 68 61 72 73 65 74 20 22 (XX)* 22 3B</td><td>as specified</td></tr>
+<tr><td>EF BB BF</td><td>UTF-8</td></tr>
+<tr><td>40 63 68 61 72 73 65 74 20 22 (XX)* 22 3B</td><td>as specified</td></tr>
+<tr><td>FE FF 00 40 00 63 00 68 00 61 00 72 00 73 00 65 00 74 00 20 00 22 (00 XX)* 00 22 00 3B</td><td>as specified (with BE endianness if not specified)</td></tr>
+<tr><td>00 40 00 63 00 68 00 61 00 72 00 73 00 65 00 74 00 20 00 22 (00 XX)* 00 22 00 3B</td><td>as specified (with BE endianness if not specified)</td></tr>
+<tr><td>FF FE 40 00 63 00 68 00 61 00 72 00 73 00 65 00 74 00 20 00 22 00 (XX 00)* 22 00 3B 00</td><td>as specified (with LE endianness if not specified)</td></tr>
+<tr><td>40 00 63 00 68 00 61 00 72 00 73 00 65 00 74 00 20 00 22 00 (XX 00)* 22 00 3B 00</td><td>as specified (with LE endianness if not specified)</td></tr>
+<tr><td>00 00 FE FF 00 00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 (00 00 00 XX)* 00 00 00 22 00 00 00 3B</td><td>as specified (with BE endianness if not specified)</td></tr>
+<tr><td>00 00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 (00 00 00 XX)* 00 00 00 22 00 00 00 3B</td><td>as specified (with BE endianness if not specified)</td></tr>
+<tr><td>00 00 FF FE 00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 (00 00 XX 00)* 00 00 22 00 00 00 3B 00</td><td>as specified (with 2143 endianness if not specified)</td></tr>
+<tr><td>00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 (00 00 XX 00)* 00 00 22 00 00 00 3B 00</td><td>as specified (with 2143 endianness if not specified)</td></tr>
+<tr><td>FE FF 00 00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 00 (00 XX 00 00)* 00 22 00 00 00 3B 00 00</td><td>as specified (with 3412 endianness if not specified)</td></tr>
+<tr><td>00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 00 (00 XX 00 00)* 00 22 00 00 00 3B 00 00</td><td>as specified (with 3412 endianness if not specified)</td></tr>
+<tr><td>FF FE 00 00 40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 00 00 (XX 00 00 00)* 22 00 00 00 3B 00 00 00</td><td>as specified (with LE endianness if not specified)</td></tr>
+<tr><td>40 00 00 00 63 00 00 00 68 00 00 00 61 00 00 00 72 00 00 00 73 00 00 00 65 00 00 00 74 00 00 00 20 00 00 00 22 00 00 00 (XX 00 00 00)* 22 00 00 00 3B 00 00 00</td><td>as specified (with LE endianness if not specified)</td></tr>
+<tr><td>00 00 FE FF</td><td>UTF-32-BE</td></tr>
+<tr><td>FF FE 00 00</td><td>UTF-32-LE</td></tr>
+<tr><td>00 00 FF FE</td><td>UTF-32-2143</td></tr>
+<tr><td>FE FF 00 00</td><td>UTF-32-3412</td></tr>
+<tr><td>FE FF</td><td>UTF-16-BE</td></tr>
+<tr><td>FF FE</td><td>UTF-16-LE</td></tr>
+<tr><td>7C 83 88 81 99 A2 85 A3 40 7F (YY)* 7F 5E</td><td>as specified, transcoded from EBCDIC to ASCII</td></tr>
+<tr><td>AE 83 88 81 99 A2 85 A3 40 FC (YY)* FC 5E</td><td>as specified, transcoded from IBM1026 to ASCII</td></tr>
+<tr><td>00 63 68 61 72 73 65 74 20 22 (YY)* 22 3B</td><td>as specified, transcoded from GSM 03.38 to ASCII</td></tr>
+<tr><td>analogous patterns</td><td>User agents may
+    support additional, analogous patterns if they support encodings
+    that are not handled by the patterns here</td></tr>
+</table>
+
+</li>
 
-<!-- Encodings not to use? (cf. HTML 4.0) -IJ -->
+<li>If the encoding is detected based on one of the entries in the table
+above marked "as specified", the user agent ignores the style sheet if it
+does not parse an appropriate @charset rule at the beginning of the
+stream of characters resulting from decoding in the chosen @charset.
+This ensures that:
+  <ul>
+    <li>@charset rules function only if they are in the
+    encoding of the style sheet,</li>
+    <li>byte order marks are ignored only
+    in encodings that support a byte order mark, and</li>
+    <li>encoding names cannot contain newlines.</li>
+  </ul>
+</li>
+
+</ul>
 
 <h3>Referring to characters not represented in a character encoding</h3>
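
The first-match lookup described by the table added in this commit can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not part of the change and not a normative algorithm: the names `widen`, `charset_row`, `bom_row`, `ROWS` and `detect_encoding` are hypothetical, only a subset of the rows is covered (the UTF-32 @charset rows and the EBCDIC, IBM1026 and GSM 03.38 rows are omitted), and neither the default-endianness rule nor the later check that the decoded style sheet really starts with an @charset rule is implemented.

```python
import re

PREFIX = b'@charset "'   # 40 63 68 61 72 73 65 74 20 22
SUFFIX = b'";'           # 22 3B


def widen(ascii_bytes, width, little_endian):
    """Spell each ASCII byte as a `width`-byte code unit padded with 00."""
    pad = b"\x00" * (width - 1)
    return b"".join(
        bytes([b]) + pad if little_endian else pad + bytes([b])
        for b in ascii_bytes
    )


def charset_row(bom, width, little_endian):
    """One "as specified" row: BOM, '@charset "', (XX)*, '";'."""
    pad = re.escape(b"\x00" * (width - 1))
    any_unit = b"." + pad if little_endian else pad + b"."
    pattern = re.compile(
        re.escape(bom + widen(PREFIX, width, little_endian))
        + b"((?:" + any_unit + b")*?)"   # non-greedy: as few repeats as possible
        + re.escape(widen(SUFFIX, width, little_endian)),
        re.DOTALL,                       # XX may be any byte value
    )
    offset = 0 if little_endian else width - 1

    def match(data):
        m = pattern.match(data)
        if m is None:
            return None
        # Keep only the XX byte of each code unit and read it as ASCII.
        return m.group(1)[offset::width].decode("ascii", "replace")

    return match


def bom_row(bom, name):
    """One BOM-only row with a fixed result."""
    return lambda data: name if data.startswith(bom) else None


# Subset of the table, kept in the table's order (the first match wins).
ROWS = [
    charset_row(b"\xef\xbb\xbf", 1, False),
    bom_row(b"\xef\xbb\xbf", "UTF-8"),
    charset_row(b"", 1, False),
    charset_row(b"\xfe\xff", 2, False),
    charset_row(b"", 2, False),
    charset_row(b"\xff\xfe", 2, True),
    charset_row(b"", 2, True),
    bom_row(b"\x00\x00\xfe\xff", "UTF-32-BE"),
    bom_row(b"\xff\xfe\x00\x00", "UTF-32-LE"),
    bom_row(b"\x00\x00\xff\xfe", "UTF-32-2143"),
    bom_row(b"\xfe\xff\x00\x00", "UTF-32-3412"),
    bom_row(b"\xfe\xff", "UTF-16-BE"),
    bom_row(b"\xff\xfe", "UTF-16-LE"),
]


def detect_encoding(data):
    """Return the result of the first matching row, or None if no row matches."""
    for row in ROWS:
        result = row(data)
        if result is not None:
            return result
    return None
```

For example, `detect_encoding(b'@charset "ISO-8859-1";\nbody {}')` yields the name taken from the XX bytes, while a style sheet that starts with only `EF BB BF` yields `"UTF-8"`. Because the UTF-32 BOM rows precede the UTF-16 BOM rows, the bytes `FF FE 00 00` are reported as UTF-32-LE rather than UTF-16-LE, which is why the row order matters.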