@@ -24,6 +24,7 @@ At Risk: the 'hanging-punctuation' property
2424
2525 <style type="text/css">
2626 img { vertical-align: middle; }
27+ span[lang] { font-size: 125%; line-height: 1; vertical-align: middle;}
2728
2829 /* Bidi & spaces example */
2930 .egbidiwsaA,.egbidiwsbB,.egbidiwsaB,.egbidiwsbC
@@ -154,69 +155,77 @@ Characters and Letters</h4>
154155 However, because writing systems are not always as simple as the basic English alphabet,
155156 what a <i> character</i> actually is depends on the context in which the term is used.
156157 For example, in Hangul (the Korean writing system),
157- each square representation of a syllable can be considered a <i> character</i> .
158- However, the square symbol is really composed of multiple symbols each representing a phoneme,
158+ each square representation of a syllable
159+ (e.g. <span lang=ko-hang title="Hangul syllable HAN"> 한</span> =<span lang=ko-Latn> Han</span> )
160+ can be considered a <i> character</i> .
161+ However, the square symbol is really composed of multiple letters each representing a phoneme
162+ (e.g. <span lang=ko-hang title="Hangul letter HIEUH"> ㅎ</span> =<span lang=ko-Latn> h</span> ,
163+ <span lang=ko-hang title="Hangul letter A"> ㅏ</span> =<span lang=ko-Latn> a</span> ,
164+ <span lang=ko-hang title="Hangul letter NIEUN"> ㄴ</span> =<span lang=ko-Latn> n</span> )
159165 and these also could each be considered a <i> character</i> .
160- A basic unit of computer text encoding, for any given encoding, is also called a <i> character</i> ,
161- and depending on the encoding, a single encoding <i> character</i> might correspond
162- either to a single phonemic <i> character</i>
163- or to a unitary pre-composed syllabic <i> character</i> .
164- In turn, a single encoding <i> character</i> can be represented in the data stream as one or more bytes;
165- and in programming environments one or a pair of such bytes is sometimes also called a <i> character</i> .
166-
167- <p> For text layout, the relevant unit is
168- the “user-perceived character”, also known as the <dfn>grapheme cluster</dfn> .
169- It is roughly equivalent to what a <em> language user</em> (as opposed to a computer programmer)
170- considers to be a <i> character</i> or basic unit of the script.
171- This term is described in detail in the Unicode Technical Report: Text Boundaries [[!UAX29]] .
172- Since even typesetting alone requires different notions of <i> grapheme clusters</i>
173- depending on the application, CSS introduces the following terms:
174-
175- <dl>
176- <dt> <dfn>semantically-perceived character</dfn>
177- <dd>
178- <p> Represents a unit of the writing system,
179- such as a Latin alphabetic letter (including its diacritics),
180- Hangul syllable,
181- Chinese ideographic character,
182- Myanmar syllable cluster,
183- that is indivisible with regards to segmentation
184- (line-breaking, first-letter effects, etc).
185-
186- <p> The UA must interpret this as an <em> extended grapheme cluster</em>
187- (not <em> legacy grapheme cluster</em> ) as defined in [[!UAX29]] .
188- However, the UA should tailor the definition as required by typographical tradition,
189- since the default rules are not always appropriate.
190-
191- <dt> <dfn>visually-perceived character</dfn>
192- <dd>
193- <p> Represents a unit of the writing system,
194- that is indivisible with regards to spacing separation
195- (letter-spacing, justification, etc).
196- </dl>
197166
198- <p> The UA must interpret both <i> visually-perceived characters</i> and <i> semantically-perceived characters</i>
199- as <em> extended grapheme clusters</em> (not <em> legacy grapheme clusters</em> )
200- as defined in [[!UAX29]] .
201- However, the UA may tailor the definitions as required by typographical tradition,
167+ <p> A basic unit of computer text encoding, for any given encoding,
168+ is also called a <i> character</i> ,
169+ and depending on the encoding,
170+ a single encoding <i> character</i> might correspond
171+ to the entire pre-composed syllabic <i> character</i> (e.g. <span lang=ko-hang title="Hangul syllable HAN"> 한</span> ),
172+ to the individual phonemic <i> character</i> (e.g. <span lang=ko-hang title="Hangul letter HIEUH"> ㅎ</span> ),
173+ or to smaller units such as
174+ a base letterform (e.g. <span lang=ko-hang title="Hangul letter IEUNG"> ㅇ</span> )
175+ and any combining marks that vary it (e.g. extra strokes that represent aspiration).
176+
177+ <p> In turn, a single encoding <i> character</i> can be represented in the data stream as one or more bytes;
178+ and in programming environments one byte is sometimes also called a <i> character</i> .
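
Illustrative sketch (not part of the commit): the layers of "character" described above can be observed with Python's standard `unicodedata` module. The sketch assumes Unicode normalization form NFD as the way to split the precomposed syllable; note that NFD yields conjoining jamo (U+1112, U+1161, U+11AB) rather than the compatibility jamo shown in the spec prose. The helper name `code_points` is hypothetical.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# One user-perceived character: the precomposed Hangul syllable HAN.
syllable = "\uD55C"

# The same syllable as a sequence of phonemic letters (conjoining jamo).
decomposed = unicodedata.normalize("NFD", syllable)

# One encoding character versus three encoding characters.
precomposed_points = code_points(syllable)
decomposed_points = code_points(decomposed)

# Each encoding character is in turn one or more bytes in the data stream.
precomposed_utf8_len = len(syllable.encode("utf-8"))
decomposed_utf8_len = len(decomposed.encode("utf-8"))

print(precomposed_points, precomposed_utf8_len)
print(decomposed_points, decomposed_utf8_len)
```

Both strings render as the same syllable, which is why the spec cannot use code points or bytes as its unit of layout.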
179+
180+ <p> For text layout, we will refer to the <dfn title="typographic character unit|typographic character">typographic character unit</dfn>
181+ as the basic unit of text.
182+ Even within the realm of text layout,
183+ the relevant <i> character</i> unit depends on the operation.
184+ For example, line-breaking and letter-spacing will segment
185+ a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently;
186+ or the behaviour of a conjunct consonant in a script such as Devanagari
187+ may depend on the font in use.
188+ So the <i> typographic character</i> represents a unit of the writing system—<!--
189+ -->such as a Latin alphabetic letter (including its diacritics),
190+ Hangul syllable,
191+ Chinese ideographic character,
192+ Myanmar syllable cluster—<!--
193+ -->that is indivisible with respect to a particular typographic operation
194+ (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).
195+
196+ <p> <a href="http://www.unicode.org/reports/tr29/">Unicode Standard Annex #29: Text Segmentation</a>
197+ defines a unit called the <dfn>grapheme cluster</dfn>
198+ which approximates the <i> typographic character</i> .
199+ A UA must use the <em> extended grapheme cluster</em>
200+ (not <em> legacy grapheme cluster</em> ), as defined in [[!UAX29]] ,
201+ as the basis for its <i> typographic character unit</i> .
202+ However, the UA should tailor the definitions
203+ as required by typographic tradition
202204 since the default rules are not always appropriate or ideal,
203205 and is expected to tailor them differently
204- for <i> visually-perceived characters</i> than <i> semantically-perceived characters</i>
205- as needed.
206-
206+ depending on the operation as needed.
207+
208+ <!--
209+ <p class="note">
210+ The rules for such tailorings are out of scope for CSS,
211+ however W3C currently maintains a wiki page
212+ where some known tailorings are collected.
213+ -->
214+
207215 <div class="example">
208216 <p> For example,
209217 in some scripts such as Myanmar or Devanagari,
210- the typographic unit for both justification and line-breaking
211- (<i> visually-perceived character</i> and <i> semantically-perceived characters</i> )
218+ the <i> typographic character unit</i> for both justification and line-breaking
212219 is an entire syllable,
213- which can include multiple [[!UAX29]] <i> grapheme clusters </i> .
220+ which can include more than one [[!UAX29]] <i> grapheme cluster </i> .
214221
215222 <p> In other scripts such as Thai or Lao,
216- even though a <i> semantically-perceived character</i> matches the default <i> grapheme cluster</i> definition,
217- a <i> visually-perceived character</i>
223+ even though for line-breaking the <i> typographic character</i>
224+ matches Unicode’s default <i> grapheme clusters</i> ,
225+ for letter-spacing the relevant unit
218226 is <em> less</em> than a [[!UAX29]] <i> grapheme cluster</i> ,
219- and may require decomposition or other substitutions before spacing can be inserted.
227+ and may require decomposition or other substitutions
228+ before spacing can be inserted.
220229
221230 <p> For instance,
222231 to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
@@ -229,25 +238,24 @@ Characters and Letters</h4>
229238 As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
230239 </div>
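
Illustrative sketch (not part of the commit): the Thai example can be approximated as a string transformation. It assumes that U+0E33 SARA AM is split into U+0E4D NIKHAHIT + U+0E32 SARA AA, that a preceding tone mark (U+0E48 to U+0E4B) is moved after the nikhahit, and that the extra letter-space goes before U+0E32. The function name `split_sara_am` and the `gap` marker are hypothetical; a real UA would do this at the glyph level rather than on strings.

```python
SARA_AM = "\u0E33"   # THAI CHARACTER SARA AM
NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0E32"   # THAI CHARACTER SARA AA
TONE_MARKS = {"\u0E48", "\u0E49", "\u0E4A", "\u0E4B"}

def split_sara_am(text: str, gap: str = "|") -> str:
    """Decompose SARA AM and mark where letter-spacing would be inserted.

    `gap` stands in for the extra letter-space (assumption: shown here
    as a visible marker so the insertion point is easy to inspect).
    """
    out: list[str] = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
            continue
        if out and out[-1] in TONE_MARKS:
            # Keep the tone mark above the nikhahit: emit nikhahit first.
            tone = out.pop()
            out.extend([NIKHAHIT, tone])
        else:
            out.append(NIKHAHIT)
        # The spacing opportunity falls before the SARA AA.
        out.extend([gap, SARA_AA])
    return "".join(out)

simple = split_sara_am("\u0E04\u0E33")           # U+0E04 + U+0E33
with_tone = split_sara_am("\u0E19\u0E49\u0E33")  # U+0E19 + U+0E49 + U+0E33
```

The point of the sketch is only that the unit for letter-spacing here is smaller than the grapheme cluster, so the cluster has to be taken apart before spacing can be applied.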
231240
232- <p> A <dfn>letter</dfn> for the purpose of this specification
233- is a <i> character</i> belonging to one of the Letter or Number general
241+ <p> A <dfn>typographic letter unit</dfn> or <dfn> letter</dfn> for the purpose of this specification
242+ is a <i> typographic character unit </i> belonging to one of the Letter or Number general
234243 categories in Unicode. [[!UAX44]]
235- To be more precise,
236- a <dfn>semantically-perceived letter</dfn> is a <i> semantically-perceived character</i>
237- belonging to one of the Letter or Number general categories
238- and a <dfn>visually-perceived letter</dfn> is likewise a <i> visually-perceived character</i>
239- belonging to one of the Letter or Number general categories.
244+ See <a href="#character-properties">Character Properties</a>
245+ for how to determine the Unicode properties of a <i> typographic character unit</i> .
240246
241- See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
242- for how to determine the Unicode properties of a <i> character</i> .
243-
244- <p> The rendering characteristics of a <i> character</i> divided
247+ <p> The rendering characteristics of a <i> typographic character unit</i> divided
245248 by an element boundary is undefined:
246249 it may be rendered as belonging to either side of the boundary,
247250 or as some approximation of belonging to both.
248251 Authors are forewarned that dividing <i> grapheme clusters</i>
249252 by element boundaries may give inconsistent or undesired results.
250253
254+ <p> <dfn>semantically-perceived character</dfn>
255+ <dfn>visually-perceived character</dfn>
256+ <dfn>semantically-perceived letter</dfn>
257+ <dfn>visually-perceived letter</dfn>
258+
251259<h4 id="languages">
252260Languages and Typesetting</h4>
253261
@@ -2492,6 +2500,40 @@ Appendix C: Scripts and Spacing</h2>
24922500 to handle as-yet-unencoded cursive scripts in future versions of Unicode,
24932501 and are encouraged to ask the CSSWG to update this spec accordingly.
24942502
2503+ <h2 id="character-properties" class="no-num">Appendix D.
2504+ Characters and Properties</h2>
2505+
2506+ <p> Unicode defines three codepoint-level properties that are referenced
2507+ in CSS Text:
2508+ <dl>
2509+ <dt> <a href="http://www.unicode.org/reports/tr11/#Definitions">East Asian width</a>
2510+ <dd> Defined in [[!UAX11]] and given as the East_Asian_Width property
2511+ in the Unicode Character Database [[!UAX44]] .
2512+ <dt> <a href="http://www.unicode.org/reports/tr44/#General_Category_Values">General Category</a>
2513+ <dd> Defined in [[!UAX44]] and given as the General_Category property
2514+ in the Unicode Character Database [[!UAX44]] .
2515+ <dt> <a href="http://www.unicode.org/reports/tr24/#Values">Script property</a>
2516+ <dd> Defined in [[!UAX24]] and given as the Script property
2517+ in the Unicode Character Database [[!UAX44]] . (UAs should
2518+ include any ScriptExtensions.txt assignments in this mapping.)
2519+ </dl>
2520+
2521+ <p> Unicode defines properties for individual codepoints, but sometimes
2522+ it is necessary to determine the properties of a <i> typographic character unit</i> .
2523+ For the purposes of CSS Text,
2524+ the properties of a <i> typographic character unit</i> are given by
2525+ the base character of its first <i> grapheme cluster</i> —except in two cases:
2526+ <ul>
2527+ <li><i> Grapheme clusters</i> formed with an Enclosing Mark (<code> Me</code> ) of the Common script
2528+ are considered to be Other Symbols (<code> So</code> ) in the Common script.
2529+ They are assumed to have the same Unicode properties as the Replacement Character U+FFFD.
2530+ <li><i> Grapheme clusters</i> formed with a Space Separator (<code> Zs</code> ) as the base
2531+ are considered to be Modifier Symbols (<code> Sk</code> ).
2532+ They are assumed to have the same East Asian Width property as the base,
2533+ but take their other properties from the first combining character in the sequence.
2534+ </ul>
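
Illustrative sketch (not part of the commit): the General Category rules above, and the "Letter or Number" test used to define a letter, can be approximated with Python's `unicodedata` module. Assumptions: the input is already a single grapheme cluster (the standard library does not segment text per UAX29), its first code point is the base character, and the script check in the Enclosing Mark rule is omitted because `unicodedata` does not expose the Script property. Both function names are hypothetical.

```python
import unicodedata

def unit_general_category(cluster: str) -> str:
    """General Category of a typographic character unit.

    `cluster` is assumed to be the unit's first grapheme cluster,
    with the base character first and any combining marks after it.
    """
    base, marks = cluster[0], cluster[1:]
    # Exception 1: clusters formed with an Enclosing Mark (Me) are
    # treated as Other Symbol (So). Script check omitted (see lead-in).
    if any(unicodedata.category(ch) == "Me" for ch in marks):
        return "So"
    # Exception 2: a Space Separator (Zs) base with combining marks
    # is treated as Modifier Symbol (Sk).
    if marks and unicodedata.category(base) == "Zs":
        return "Sk"
    # Default: the properties of the base character.
    return unicodedata.category(base)

def is_typographic_letter(cluster: str) -> bool:
    """True if the unit belongs to a Letter (L*) or Number (N*) category."""
    return unit_general_category(cluster)[0] in ("L", "N")
```

For instance, under these assumptions `"a\u0301"` is classified by its base letter, while `"1\u20DD"` (digit one with a combining enclosing circle) is classified as a symbol and is therefore not a letter.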
2535+ <p> The
2536+
24952537<h2 class="no-num" id="acknowledgements">
24962538Acknowledgements</h2>
24972539