@@ -140,59 +140,113 @@ Values</h3>
140140<h3 id="terms">
141141Terminology</h3>
142142
143- <p> <dfn>semantically-perceived character</dfn>
144- <p> <dfn>visually-perceived character</dfn>
145- <p> <dfn>semantically-perceived letters</dfn>
146- <p> <dfn>visually-perceived letters</dfn>
147-
148- <p id="grapheme-cluster"> A <dfn>grapheme cluster</dfn> is what
149- a language user considers to be a character or a basic unit of the
150- script. The term is described in detail in the Unicode Technical
151- Report: Text Boundaries [[!UAX29]] . This specification uses the
152- <em> extended grapheme cluster</em> definition in [[!UAX29]] (not
153- the <em> legacy grapheme cluster</em> definition).
154- The UA may further tailor the definition as required by typographical tradition.
143+ <p> In addition to the terms defined below,
144+ other terminology and concepts used in this specification are defined
145+ in [[!CSS21]] and [[!CSS3-WRITING-MODES]] .
146+
147+ <h4 id="characters">
148+ Characters and Letters</h4>
149+
150+ <p> The basic unit of typesetting is the <dfn>character</dfn> .
151+ However, because writing systems are not always as simple as the basic English alphabet,
152+ what a <i> character</i> actually is depends on the context in which the term is used.
153+ For example, in Hangul (the Korean writing system),
154+ each square representation of a syllable can be considered a <i> character</i> .
155+ However, the square symbol is really composed of multiple symbols each representing a phoneme,
156+ and these also could each be considered a <i> character</i> .
157+ A basic unit of computer text encoding, for any given encoding, is also called a <i> character</i> ,
158+ and depending on the encoding, a single encoding <i> character</i> might correspond
159+ either to a single phonemic <i> character</i>
160+ or to a unitary pre-composed syllabic <i> character</i> .
161+ In turn, a single encoding <i> character</i> can be represented in the data stream as one or more bytes;
162+ and in programming environments one or a pair of such bytes is sometimes also called a <i> character</i> .
163+
164+ <p> For text layout, the relevant unit is
165+ the “user-perceived character”, also known as the <dfn>grapheme cluster</dfn> .
166+ It is roughly equivalent to what a <em> language user</em> (as opposed to a computer programmer)
167+ considers to be a <i> character</i> or basic unit of the script.
168+ This term is described in detail in Unicode Standard Annex #29: Unicode Text Segmentation [[!UAX29]] .
169+ Since even typesetting alone requires different notions of <i> grapheme clusters</i>
170+ depending on the application, CSS introduces the following terms:
171+
172+ <dl>
173+ <dt> <dfn>semantically-perceived character</dfn>
174+ <dd>
175+ <p> Represents a unit of the writing system,
176+ such as a Latin alphabetic letter (including its diacritics),
177+ Hangul syllable,
178+ Chinese ideographic character,
179+ or Myanmar syllable cluster,
180+ that is indivisible with regard to segmentation
181+ (line-breaking, first-letter effects, etc.).
182+
183+ <p> The UA must interpret this as an <em> extended grapheme cluster</em>
184+ (not <em> legacy grapheme cluster</em> ) as defined in [[!UAX29]] .
185+ However, the UA should tailor the definition as required by typographical tradition,
186+ since the default rules are not always appropriate.
187+
188+ <dt> <dfn>visually-perceived character</dfn>
189+ <dd>
190+ <p> Represents a unit of the writing system
191+ that is indivisible with regard to spacing separation
192+ (letter-spacing, justification, etc.).
193+ </dl>
155194
195+ <p> The UA must interpret both <i> visually-perceived characters</i> and <i> semantically-perceived characters</i>
196+ as <em> extended grapheme clusters</em> (not <em> legacy grapheme clusters</em> )
197+ as defined in [[!UAX29]] .
198+ However, the UA may tailor the definitions as required by typographical tradition,
199+ since the default rules are not always appropriate or ideal,
200+ and is expected to tailor them differently
201+ for <i> visually-perceived characters</i> than for <i> semantically-perceived characters</i>
202+ as needed.
203+
156204 <div class="example">
157205 <p> For example,
158- in some scripts such as Myanmar or Devanagari,
159- the typographic unit for 'letter-spacing' is an entire syllable,
160- which may include multiple [[!UAX29]] <i> grapheme clusters</i> .
161- </div>
206+ in some scripts such as Myanmar or Devanagari,
207+ the typographic unit for both justification and line-breaking
208+ (<i> visually-perceived character</i> and <i> semantically-perceived character</i> )
209+ is an entire syllable,
210+ which can include multiple [[!UAX29]] <i> grapheme clusters</i> .
162211
163- <div class="example">
164212 <p> In other scripts such as Thai or Lao,
165- the typographic unit for 'letter-spacing' is more than a single Unicode codepoint,
166- but less than a [[!UAX29]] <i> grapheme cluster</i> ,
167- and may require decomposition or other substitutions.
213+ even though a <i> semantically-perceived character</i> matches the default <i> grapheme cluster</i> definition,
214+ a <i> visually-perceived character</i>
215+ is <em> less</em> than a [[!UAX29]] <i> grapheme cluster</i> ,
216+ and may require decomposition or other substitutions before spacing can be inserted.
168217
169- <p> For example ,
170- to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
171- the U+0E33 needs to be decomposed into U+0E4D + U+0E32,
172- and then the extra letter-space inserted before the U+0E32: คํ า.
218+ <p> For instance,
219+ to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
220+ the U+0E33 needs to be decomposed into U+0E4D + U+0E32,
221+ and then the extra letter-space inserted before the U+0E32: คํ า.
173222
174223 <p> A slightly more complex example is น้ำ (U+0E19 + U+0E49 + U+0E33).
175- In this case, normal Thai shaping will first decompose the U+0E33 into U+0E4D + U+0E32
176- and then swap the U+0E4D with the U+0E49, giving U+0E19 + U+0E4D + U+0E49 + U+0E32.
177- As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
224+ In this case, normal Thai shaping will first decompose the U+0E33 into U+0E4D + U+0E32
225+ and then swap the U+0E4D with the U+0E49, giving U+0E19 + U+0E4D + U+0E49 + U+0E32.
226+ As before, the extra letter-space is then inserted before the U+0E32: นํ้ า.
178227 </div>
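The two Thai cases in the example above can be sketched in Python. This is only an illustration of the decomposition and reordering steps as described there, not how any UA or shaping engine actually implements them. The function names `decompose_sara_am` and `spacing_units` are hypothetical, and the rule "a new spacing unit starts at every code point that is not a nonspacing mark" is a simplifying assumption made for this sketch.

```python
import unicodedata

SARA_AM = "\u0e33"    # THAI CHARACTER SARA AM
NIKHAHIT = "\u0e4d"   # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0e32"    # THAI CHARACTER SARA AA
# Thai tone marks MAI EK, MAI THO, MAI TRI, MAI CHATTAWA
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def decompose_sara_am(text):
    """Replace U+0E33 with U+0E4D + U+0E32, moving U+0E4D before a preceding tone mark."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
        elif out and out[-1] in TONE_MARKS:
            # Swap step: the nikhahit goes before the tone mark.
            tone = out.pop()
            out.extend([NIKHAHIT, tone, SARA_AA])
        else:
            out.extend([NIKHAHIT, SARA_AA])
    return "".join(out)


def spacing_units(text):
    """Split decomposed text into units between which letter-space may be inserted.

    Assumption for this sketch: nonspacing marks (general category Mn)
    attach to the preceding unit; every other code point starts a new unit.
    """
    units = []
    for ch in decompose_sara_am(text):
        if units and unicodedata.category(ch) == "Mn":
            units[-1] += ch
        else:
            units.append(ch)
    return units


# U+0E04 + U+0E33: the space goes before U+0E32.
print(spacing_units("\u0e04\u0e33"))
# U+0E19 + U+0E49 + U+0E33: decompose, swap, then split before U+0E32.
print(spacing_units("\u0e19\u0e49\u0e33"))
```

In both cases the last unit is the lone U+0E32, which is where the example above places the extra letter-space.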
179228
180- <p> Within this specification,
181- the ambiguous term <dfn>character</dfn> is used as a friendlier synonym
182- for <i> grapheme cluster</i> .
183- See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
184- for how to determine the Unicode properties of a character.
229+ <p> A <dfn>letter</dfn> for the purpose of this specification
230+ is a <i> character</i> belonging to one of the Letter or Number general
231+ categories in Unicode. [[!UAX44]]
232+ To be more precise,
233+ a <dfn>semantically-perceived letter</dfn> is a <i> semantically-perceived character</i>
234+ belonging to one of the Letter or Number general categories
235+ and a <dfn>visually-perceived letter</dfn> is likewise a <i> visually-perceived character</i>
236+ belonging to one of the Letter or Number general categories.
237+
238+ See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
239+ for how to determine the Unicode properties of a <i> character</i> .
185240
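As a rough illustration of the letter definition above, the following Python sketch tests whether a character falls in one of the Letter (L*) or Number (N*) general categories. The function name `is_letter` is hypothetical, and taking the properties of a multi-code-point character from its first code point is an assumption made for this sketch; the linked Characters and Properties section is the authority on how properties are actually determined.

```python
import unicodedata


def is_letter(character):
    """Return True if the character's first code point is in a Letter or Number category.

    Assumption: the first code point stands in for the whole character.
    """
    if not character:
        return False
    major_class = unicodedata.category(character[0])[0]
    return major_class in ("L", "N")


print(is_letter("A"))        # Latin capital letter
print(is_letter("e\u0301"))  # "e" followed by a combining acute accent
print(is_letter("5"))        # decimal digit
print(is_letter("!"))        # punctuation
```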
186- <p id="letter"> A <dfn>letter</dfn> for the purpose of this specification
187- is a <i> semantically-perceived character</i> belonging to one of the Letter or Number general
188- categories in Unicode. [[!UAX44]]
241+ <p> The rendering characteristics of a <i> character</i> divided
242+ by an element boundary are undefined:
243+ it may be rendered as belonging to either side of the boundary,
244+ or as some approximation of belonging to both.
245+ Authors are forewarned that dividing <i> grapheme clusters</i>
246+ by element boundaries may give inconsistent or undesired results.
189247
190- <p> The rendering characteristics of a <i> semantically-perceived character</i>
191- and a <i> visually-perceived character</i> divided by an
192- element boundary is undefined: it may be rendered as belonging to
193- either side of the boundary, or as some approximation of belonging
194- to both. Authors are forewarned that dividing grapheme clusters by
195- element boundaries may give inconsistent or undesired results.
248+ <h4 id="languages">
249+ Languages and Typesetting</h4>
196250
197251 <p> The <dfn>content language</dfn> of an element is the (human) language
198252 the element is declared to be in, according to the rules of the
@@ -208,11 +262,9 @@ Terminology</h3>
208262 <p class="note">
209263 Many typographic effects vary by linguistic context.
210264 In CSS, language-specific typographic tailorings
211- are only applied when the content language is known.
212- Authors should tag their content accurately for the best typographic behavior.
213-
214- <p> Other terminology and concepts used in this specification are defined
215- in [[!CSS21]] and [[!CSS3-WRITING-MODES]] .
265+ are only applied when the content language is known (declared).
266+
267+ <strong> Authors should tag their content accurately for the best typographic behavior.</strong>
216268
217269<h2 id="transforming">
218270 Transforming Text</h2>