Commit 10d4087

[css-text] Define various notions of “character”.
1 parent f5add48 commit 10d4087

2 files changed

Lines changed: 269 additions & 180 deletions

File tree

css-text/Overview.bs

Lines changed: 98 additions & 46 deletions
Original file line number | Diff line number | Diff line change
@@ -140,59 +140,113 @@ Values</h3>
140140
<h3 id="terms">
141141
Terminology</h3>
142142

143-
<p><dfn>semantically-perceived character</dfn>
144-
<p><dfn>visually-perceived character</dfn>
145-
<p><dfn>semantically-perceived letters</dfn>
146-
<p><dfn>visually-perceived letters</dfn>
147-
148-
<p id="grapheme-cluster">A <dfn>grapheme cluster</dfn> is what
149-
a language user considers to be a character or a basic unit of the
150-
script. The term is described in detail in the Unicode Technical
151-
Report: Text Boundaries [[!UAX29]]. This specification uses the
152-
<em>extended grapheme cluster</em> definition in [[!UAX29]] (not
153-
the <em>legacy grapheme cluster</em> definition).
154-
The UA may further tailor the definition as required by typographical tradition.
143+
<p>In addition to the terms defined below,
144+
other terminology and concepts used in this specification are defined
145+
in [[!CSS21]] and [[!CSS3-WRITING-MODES]].
146+
147+
<h4 id="characters">
148+
Characters and Letters</h4>
149+
150+
<p>The basic unit of typesetting is the <dfn>character</dfn>.
151+
However, because writing systems are not always as simple as the basic English alphabet,
152+
what a <i>character</i> actually is depends on the context in which the term is used.
153+
For example, in Hangul (the Korean writing system),
154+
each square representation of a syllable can be considered a <i>character</i>.
155+
However, the square symbol is really composed of multiple symbols each representing a phoneme,
156+
and these also could each be considered a <i>character</i>.
157+
A basic unit of computer text encoding, for any given encoding, is also called a <i>character</i>,
158+
and depending on the encoding, a single encoding <i>character</i> might correspond
159+
either to a single phonemic <i>character</i>
160+
or to a unitary pre-composed syllabic <i>character</i>.
161+
In turn, a single encoding <i>character</i> can be represented in the data stream as one or more bytes;
162+
and in programming environments one or a pair of such bytes is sometimes also called a <i>character</i>.
163+
164+
<p>For text layout, the relevant unit is
165+
the “user-perceived character”, also known as the <dfn>grapheme cluster</dfn>.
166+
It is roughly equivalent to what a <em>language user</em> (as opposed to a computer programmer)
167+
considers to be a <i>character</i> or basic unit of the script.
168+
This term is described in detail in the Unicode Technical Report: Text Boundaries [[!UAX29]].
169+
Since even typesetting alone requires different notions of <i>grapheme clusters</i>
170+
depending on the application, CSS introduces the following terms:
171+
172+
<dl>
173+
<dt><dfn>semantically-perceived character</dfn>
174+
<dd>
175+
<p>Represents a unit of the writing system,
176+
such as a Latin alphabetic letter (including its diacritics),
177+
Hangul syllable,
178+
Chinese ideographic character,
179+
Myanmar syllable cluster,
180+
that is indivisible with regard to segmentation
181+
(line-breaking, first-letter effects, etc.).
182+
183+
<p>The UA must interpret this as an <em>extended grapheme cluster</em>
184+
(not <em>legacy grapheme cluster</em>) as defined in [[!UAX29]].
185+
However, the UA should tailor the definition as required by typographical tradition,
186+
since the default rules are not always appropriate.
187+
188+
<dt><dfn>visually-perceived character</dfn>
189+
<dd>
190+
<p>Represents a unit of the writing system,
191+
that is indivisible with regard to spacing separation
192+
(letter-spacing, justification, etc.).
193+
</dl>
155194

195+
<p>The UA must interpret both <i>visually-perceived characters</i> and <i>semantically-perceived characters</i>
196+
as <em>extended grapheme clusters</em> (not <em>legacy grapheme clusters</em>)
197+
as defined in [[!UAX29]].
198+
However, the UA may tailor the definitions as required by typographical tradition,
199+
since the default rules are not always appropriate or ideal,
200+
and is expected to tailor them differently
201+
for <i>visually-perceived characters</i> than for <i>semantically-perceived characters</i>
202+
as needed.
203+
156204
<div class="example">
157205
<p>For example,
158-
in some scripts such as Myanmar or Devanagari,
159-
the typographic unit for 'letter-spacing' is an entire syllable,
160-
which may include multiple [[!UAX29]] <i>grapheme clusters</i>.
161-
</div>
206+
in some scripts such as Myanmar or Devanagari,
207+
the typographic unit for both justification and line-breaking
208+
(<i>visually-perceived characters</i> and <i>semantically-perceived characters</i>)
209+
is an entire syllable,
210+
which can include multiple [[!UAX29]] <i>grapheme clusters</i>.
162211

163-
<div class="example">
164212
<p>In other scripts such as Thai or Lao,
165-
the typographic unit for 'letter-spacing' is more than a single Unicode codepoint,
166-
but less than a [[!UAX29]] <i>grapheme cluster</i>,
167-
and may require decomposition or other substitutions.
213+
even though a <i>semantically-perceived character</i> matches the default <i>grapheme cluster</i> definition,
214+
a <i>visually-perceived character</i>
215+
is <em>less</em> than a [[!UAX29]] <i>grapheme cluster</i>,
216+
and may require decomposition or other substitutions before spacing can be inserted.
168217

169-
<p>For example,
170-
to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
171-
the U+0E33 needs to be decomposed into U+0E4D + U+0E32,
172-
and then the extra letter-space inserted before the U+0E32: คํ า.
218+
<p>For instance,
219+
to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
220+
the U+0E33 needs to be decomposed into U+0E4D + U+0E32,
221+
and then the extra letter-space inserted before the U+0E32: คํ า.
173222

174223
<p>A slightly more complex example is น้ำ (U+0E19 + U+0E49 + U+0E33).
175-
In this case, normal Thai shaping will first decompose the U+0E33 into U+0E4D + U+0E32
176-
and then swap the U+0E4D with the U+0E49, giving U+0E19 + U+0E4D + U+0E49 + U+0E32.
177-
As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
224+
In this case, normal Thai shaping will first decompose the U+0E33 into U+0E4D + U+0E32
225+
and then swap the U+0E4D with the U+0E49, giving U+0E19 + U+0E4D + U+0E49 + U+0E32.
226+
As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
178227
</div>
179228

180-
<p>Within this specification,
181-
the ambiguous term <dfn>character</dfn> is used as a friendlier synonym
182-
for <i>grapheme cluster</i>.
183-
See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
184-
for how to determine the Unicode properties of a character.
229+
<p>A <dfn>letter</dfn> for the purpose of this specification
230+
is a <i>character</i> belonging to one of the Letter or Number general
231+
categories in Unicode. [[!UAX44]]
232+
To be more precise,
233+
a <dfn>semantically-perceived letter</dfn> is a <i>semantically-perceived character</i>
234+
belonging to one of the Letter or Number general categories
235+
and a <dfn>visually-perceived letter</dfn> is likewise a <i>visually-perceived character</i>
236+
belonging to one of the Letter or Number general categories.
237+
238+
See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
239+
for how to determine the Unicode properties of a <i>character</i>.
185240

186-
<p id="letter">A <dfn>letter</dfn> for the purpose of this specification
187-
is a <i>semantically-perceived character</i> belonging to one of the Letter or Number general
188-
categories in Unicode. [[!UAX44]]
241+
<p>The rendering characteristics of a <i>character</i> divided
242+
by an element boundary are undefined:
243+
it may be rendered as belonging to either side of the boundary,
244+
or as some approximation of belonging to both.
245+
Authors are forewarned that dividing <i>grapheme clusters</i>
246+
by element boundaries may give inconsistent or undesired results.
189247

190-
<p>The rendering characteristics of a <i>semantically-perceived character</i>
191-
and a <i>visually-perceived character</i> divided by an
192-
element boundary is undefined: it may be rendered as belonging to
193-
either side of the boundary, or as some approximation of belonging
194-
to both. Authors are forewarned that dividing grapheme clusters by
195-
element boundaries may give inconsistent or undesired results.
248+
<h4 id="languages">
249+
Languages and Typesetting</h4>
196250

197251
<p>The <dfn>content language</dfn> of an element is the (human) language
198252
the element is declared to be in, according to the rules of the
@@ -208,11 +262,9 @@ Terminology</h3>
208262
<p class="note">
209263
Many typographic effects vary by linguistic context.
210264
In CSS, language-specific typographic tailorings
211-
are only applied when the content language is known.
212-
Authors should tag their content accurately for the best typographic behavior.
213-
214-
<p>Other terminology and concepts used in this specification are defined
215-
in [[!CSS21]] and [[!CSS3-WRITING-MODES]].
265+
are only applied when the content language is known (declared).
266+
267+
<strong>Authors should tag their content accurately for the best typographic behavior.</strong>
216268

217269
<h2 id="transforming">
218270
Transforming Text</h2>
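
The Thai letter-spacing example in the diff above can be sketched in a few lines of Python. This is only an illustration of the steps the example describes, not what any UA or shaping engine actually runs. The function name `letter_space_sara_am`, the `gap` parameter, and the choice to reorder only the four Thai tone marks U+0E48 to U+0E4B are assumptions made for this sketch.

```python
# Hypothetical helper illustrating the Thai example in the diff:
# decompose SARA AM, move NIKHAHIT in front of a preceding tone mark,
# then insert the letter-space before SARA AA.

SARA_AM = "\u0e33"   # THAI CHARACTER SARA AM
NIKHAHIT = "\u0e4d"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0e32"   # THAI CHARACTER SARA AA

# Assumption: only these four tone marks take part in the swap.
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def letter_space_sara_am(text: str, gap: str = " ") -> str:
    """Return text with each U+0E33 decomposed and a gap placed before U+0E32."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
            continue
        # Find where NIKHAHIT belongs: before any tone marks that
        # immediately precede the SARA AM.
        pos = len(out)
        while pos > 0 and out[pos - 1] in TONE_MARKS:
            pos -= 1
        out.insert(pos, NIKHAHIT)
        # The extra letter-space goes before the SARA AA.
        out.append(gap)
        out.append(SARA_AA)
    return "".join(out)


def code_points(text: str) -> str:
    """Format a string as a readable list of code points."""
    return " ".join("SPACE" if c == " " else f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    print(code_points(letter_space_sara_am("\u0e04\u0e33")))
    print(code_points(letter_space_sara_am("\u0e19\u0e49\u0e33")))
```

For the two words in the example, the helper yields U+0E04 U+0E4D, a space, U+0E32 and U+0E19 U+0E4D U+0E49, a space, U+0E32, matching the sequences given in the diff. A real implementation would add spacing during layout rather than insert a space character into the text.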

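
The new definition of "letter" in the diff (a character in one of the Letter or Number general categories) can be approximated with Python's standard `unicodedata` module. This is a rough sketch: the function name `is_letter` is hypothetical, and classifying a cluster by its first code point is an assumption made here. The diff itself defers to "Characters and Properties" in CSS Writing Modes for how the properties of a character are actually determined.

```python
import unicodedata


def is_letter(cluster: str) -> bool:
    """Rough check for the spec's notion of a 'letter'.

    Assumption: the cluster is classified by its first code point.
    General categories starting with 'L' are Letters, 'N' are Numbers.
    """
    if not cluster:
        return False
    return unicodedata.category(cluster[0])[0] in ("L", "N")


if __name__ == "__main__":
    for sample in ["a", "e\u0301", "7", "\u0e01", "!", " "]:
        print(repr(sample), is_letter(sample))
```

Note that this does not segment text into grapheme clusters. The caller is expected to pass one cluster at a time, already split according to [[!UAX29]] and any tailoring.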