Commit b19684b

[css-text] Revise grapheme cluster terminology section in response to <http://www.w3.org/mid/537E3D64.6020801@w3.org>. (Note the rest of the spec has not yet been updated to reflect this change. Saving that for tomorrow.)
1 parent 2f7ea66 commit b19684b

2 files changed

Lines changed: 268 additions & 222 deletions

css-text/Overview.bs

Lines changed: 105 additions & 63 deletions
@@ -24,6 +24,7 @@ At Risk: the 'hanging-punctuation' property
 
 <style type="text/css">
 img { vertical-align: middle; }
+span[lang] { font-size: 125%; line-height: 1; vertical-align: middle;}
 
 /* Bidi & spaces example */
 .egbidiwsaA,.egbidiwsbB,.egbidiwsaB,.egbidiwsbC
@@ -154,69 +155,77 @@ Characters and Letters</h4>
 However, because writing systems are not always as simple as the basic English alphabet,
 what a <i>character</i> actually is depends on the context in which the term is used.
 For example, in Hangul (the Korean writing system),
-each square representation of a syllable can be considered a <i>character</i>.
-However, the square symbol is really composed of multiple symbols each representing a phoneme,
+each square representation of a syllable
+(e.g. <span lang=ko-hang title="Hangul syllable HAN"></span>=<span lang=ko-Latn>Han</span>)
+can be considered a <i>character</i>.
+However, the square symbol is really composed of multiple letters each representing a phoneme
+(e.g. <span lang=ko-hang title="Hangul letter HIEUH"></span>=<span lang=ko-Latn>h</span>,
+<span lang=ko-hang title="Hangul letter HIEUH"></span>=<span lang=ko-Latn>a</span>,
+<span lang=ko-hang title="Hangul letter HIEUH"></span>=<span lang=ko-Latn>n</span>)
 and these also could each be considered a <i>character</i>.
-A basic unit of computer text encoding, for any given encoding, is also called a <i>character</i>,
-and depending on the encoding, a single encoding <i>character</i> might correspond
-either to a single phonemic <i>character</i>
-or to a unitary pre-composed syllabic <i>character</i>.
-In turn, a single encoding <i>character</i> can be represented in the data stream as one or more bytes;
-and in programming environments one or a pair of such bytes is sometimes also called a <i>character</i>.
-
-<p>For text layout, the relevant unit is
-the “user-perceived character”, also known as the <dfn>grapheme cluster</dfn>.
-It is roughly equivalent to what a <em>language user</em> (as opposed to a computer programmer)
-considers to be a <i>character</i> or basic unit of the script.
-This term is described in detail in the Unicode Technical Report: Text Boundaries [[!UAX29]].
-Since even typesetting alone requires different notions of <i>grapheme clusters</i>
-depending on the application, CSS introduces the following terms:
-
-<dl>
-<dt><dfn>semantically-perceived character</dfn>
-<dd>
-<p>Represents a unit of the writing system,
-such as a Latin alphabetic letter (including its diacritics),
-Hangul syllable,
-Chinese ideographic character,
-Myanmar syllable cluster,
-that is indivisible with regards to segmentation
-(line-breaking, first-letter effects, etc).
-
-<p>The UA must interpret this as an <em>extended grapheme cluster</em>
-(not <em>legacy grapheme cluster</em>) as defined in [[!UAX29]].
-However, the UA should tailor the definition as required by typographical tradition,
-since the default rules are not always appropriate.
-
-<dt><dfn>visually-perceived character</dfn>
-<dd>
-<p>Represents a unit of the writing system,
-that is indivisible with regards to spacing separation
-(letter-spacing, justification, etc).
-</dl>
 
-<p>The UA must interpret both <i>visually-perceived characters</i> and <i>semantically-perceived characters</i>
-as <em>extended grapheme clusters</em> (not <em>legacy grapheme clusters</em>)
-as defined in [[!UAX29]].
-However, the UA may tailor the definitions as required by typographical tradition,
+<p>A basic unit of computer text encoding, for any given encoding,
+is also called a <i>character</i>,
+and depending on the encoding,
+a single encoding <i>character</i> might correspond
+to the entire pre-composed syllabic <i>character</i> (e.g. <span lang=ko-hang title="Hangul syllable HAN"></span>),
+to the individual phonemic <i>character</i> (e.g. <span lang=ko-hang title="Hangul letter HIEUH"></span>),
+or to smaller units such as
+a base letterform (e.g. <span lang=ko-hang title="Hangul letter IEUNG"></span>)
+and any combining marks that vary it (e.g. extra strokes that represent aspiration).
+
+<p>In turn, a single encoding <i>character</i> can be represented in the data stream as one or more bytes;
+and in programming environments one byte is sometimes also called a <i>character</i>.
+
+<p>For text layout, we will refer to the <dfn title="typographic character unit|typographic character">typographic character unit</dfn>
+as the basic unit of text.
+Even within the realm of text layout,
+the relevant <i>character</i> unit depends on the operation.
+For example, line-breaking and letter-spacing will segment
+a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently;
+or the behaviour of a conjunct consonant in a script such as Devanagari
+may depend on the font in use.
+So the <i>typographic character</i> represents a unit of the writing system&mdash;<!--
+-->such as a Latin alphabetic letter (including its diacritics),
+Hangul syllable,
+Chinese ideographic character,
+Myanmar syllable cluster&mdash;<!--
+-->that is indivisible with respect to a particular typographic operation
+(line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).
+
+<a href="http://www.unicode.org/reports/tr29/">Unicode Standard Annex #29: Text Segmentation</a>
+defines a unit called the <dfn>grapheme cluster</dfn>
+which approximates the <i>typographic character</i>.
+A UA must use the <em>extended grapheme cluster</em>
+(not <em>legacy grapheme cluster</em>), as defined in [[!UAX29]],
+as the basis for its <i>typographic character unit</i>.
+However, the UA should tailor the definitions
+as required by typographic tradition
 since the default rules are not always appropriate or ideal,
 and is expected to tailor them differently
-for <i>visually-perceived characters</i> than <i>semantically-perceived characters</i>
-as needed.
-
+depending on the operation as needed.
+
+<!--
+<p class="note">
+The rules for such tailorings are out of scope for CSS,
+however W3C currently maintains a wiki page
+where some known tailorings are collected.
+-->
+
 <div class="example">
 <p>For example,
 in some scripts such as Myanmar or Devanagari,
-the typographic unit for both justification and line-breaking
-(<i>visually-perceived character</i> and <i>semantically-perceived characters</i>)
+the <i>typographic character unit</i> for both justification and line-breaking
 is an entire syllable,
-which can include multiple [[!UAX29]] <i>grapheme clusters</i>.
+which can include more than one [[!UAX29]] <i>grapheme cluster</i>.
 
 <p>In other scripts such as Thai or Lao,
-even though a <i>semantically-perceived character</i> matches the default <i>grapheme cluster</i> definition,
-a <i>visually-perceived character</i>
+even though for line-breaking the <i>typographic character</i>
+matches Unicode’s default <i>grapheme clusters</i>,
+for line-breaking the relevant unit
 is <em>less</em> than a [[!UAX29]] <i>grapheme cluster</i>,
-and may require decomposition or other substitutions before spacing can be inserted.
+and may require decomposition or other substitutions
+before spacing can be inserted.
 
 <p>For instance,
 to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
@@ -229,25 +238,24 @@ Characters and Letters</h4>
 As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
 </div>
 
-<p>A <dfn>letter</dfn> for the purpose of this specification
-is a <i>character</i> belonging to one of the Letter or Number general
+<p>A <dfn>typographic letter unit</dfn> or <dfn>letter</dfn> for the purpose of this specification
+is a <i>typographic character unit</i> belonging to one of the Letter or Number general
 categories in Unicode. [[!UAX44]]
-To be more precise,
-a <dfn>semantically-perceived letter</dfn> is a <i>semantically-perceived character</i>
-belonging to one of the Letter or Number general categories
-and a <dfn>visually-perceived letter</dfn> is likewise a <i>visually-perceived character</i>
-belonging to one of the Letter or Number general categories.
+See <a href="#character-properties">Character Properties</a>
+for how to determine the Unicode properties of a <i>typographic character unit</i>.
 
-See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
-for how to determine the Unicode properties of a <i>character</i>.
-
-<p>The rendering characteristics of a <i>character</i> divided
+<p>The rendering characteristics of a <i>typographic character unit</i> divided
 by an element boundary is undefined:
 it may be rendered as belonging to either side of the boundary,
 or as some approximation of belonging to both.
 Authors are forewarned that dividing <i>grapheme clusters</i>
 by element boundaries may give inconsistent or undesired results.
 
+<p><dfn>semantically-perceived character</dfn>
+<dfn>visually-perceived character</dfn>
+<dfn>semantically-perceived letter</dfn>
+<dfn>visually-perceived letter</dfn>
+
 <h4 id="languages">
 Languages and Typesetting</h4>
 
@@ -2492,6 +2500,40 @@ Appendix C: Scripts and Spacing</h2>
 to handle as-yet-unencoded cursive scripts in future versions of Unicode,
 and are encouraged to ask the CSSWG to update this spec accordingly.
 
+<h2 id="character-properties" class="no-num">Appendix D.
+Characters and Properties</h2>
+
+<p>Unicode defines three codepoint-level properties that are referenced
+in CSS Text:
+<dl>
+<dt><a href="http://www.unicode.org/reports/tr11/#Definitions">East Asian width</a>
+<dd>Defined in [[!UAX11]] and given as the East_Asian_Width property
+in the Unicode Character Database [[!UAX44]].
+<dt><a href="http://www.unicode.org/reports/tr44/#General_Category_Values">General Category</a>
+<dd>Defined in [[!UAX44]] and given as the General_Category property
+in the Unicode Character Database [[!UAX44]].
+<dt><a href="http://www.unicode.org/reports/tr24/#Values">Script property</a>
+<dd>Defined in [[!UAX24]] and given as the Script property
+in the Unicode Character Database [[!UAX44]]. (UAs should
+include any ScriptExtensions.txt assignments in this mapping.)
+</dl>
+
+<p>Unicode defines properties for individual codepoints, but sometimes
+it is necessary to determine the properties of a <i>typographic character unit</i>.
+For the purposes of CSS Text,
+the properties of a <i>typographic character unit</i> are given by
+the base character of its first <i>grapheme cluster</i>—except in two cases:
+<ul>
+<li><i>Grapheme clusters</i> formed with an Enclosing Mark (<code>Me</code>) of the Common script
+are considered to be Other Symbols (<code>So</code>) in the Common script.
+They are assumed to have the same Unicode properties as the Replacement Character U+FFFD.
+<li>Grapheme clusters</i> formed with a Space Separator (<code>Zs</code>) as the base
+are considered to be Modifier Symbols (<code>Sk</code>).
+They are assumed to have the same East Asian Width property as the base,
+but take their other properties from the first combining character in the sequence.
+</ul>
+<p>The
+
 <h2 class="no-num" id="acknowledgements">
 Acknowledgements</h2>
 
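The property lookup that the new Appendix D describes can be sketched roughly as follows. This is only an illustration in Python, not part of the commit or of the spec: the function name `unit_properties` and the returned dictionary keys are hypothetical, the input is assumed to be a single grapheme cluster that has already been segmented, and the "Common script" condition of the first exception is not checked because Python's standard library does not expose the Script property. It also assumes that the second exception applies only when at least one combining character follows the space.

```python
import unicodedata


def unit_properties(cluster: str) -> dict:
    """Approximate the Appendix D lookup for one pre-segmented grapheme cluster.

    Assumptions (not from the spec text): the caller has already split the
    text into grapheme clusters, and the Script property is ignored because
    the standard library does not provide it.
    """
    if not cluster:
        raise ValueError("expected a non-empty grapheme cluster")

    base, marks = cluster[0], cluster[1:]

    # Exception 1: a cluster formed with an Enclosing Mark (Me) is treated as
    # an Other Symbol (So) with the properties of U+FFFD.
    # The "Common script" condition is NOT checked here (assumption).
    if any(unicodedata.category(ch) == "Me" for ch in marks):
        return {
            "general_category": "So",
            "east_asian_width": unicodedata.east_asian_width("\uFFFD"),
        }

    # Exception 2: a Space Separator (Zs) base followed by combining
    # characters is treated as a Modifier Symbol (Sk); the East Asian Width
    # is taken from the base. A lone space stays Zs (assumption).
    if unicodedata.category(base) == "Zs" and marks:
        return {
            "general_category": "Sk",
            "east_asian_width": unicodedata.east_asian_width(base),
        }

    # Default: properties come from the base character of the cluster.
    return {
        "general_category": unicodedata.category(base),
        "east_asian_width": unicodedata.east_asian_width(base),
    }
```

For a unit made of several grapheme clusters, the appendix text takes the properties from the first cluster, so a caller would pass only that first cluster to this helper.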