xfq
diff --git a/css-text/Overview.bs b/css-text/Overview.bs
Lines changed: 105 additions & 63 deletions
@@ -24,6 +24,7 @@ At Risk: the 'hanging-punctuation' property
 
   <style type="text/css">
     img { vertical-align: middle; }
+    span[lang] { font-size: 125%; line-height: 1; vertical-align: middle;}
 
     /* Bidi & spaces example */
     .egbidiwsaA,.egbidiwsbB,.egbidiwsaB,.egbidiwsbC
@@ -154,69 +155,77 @@ Characters and Letters</h4>
   However, because writing systems are not always as simple as the basic English alphabet,
   what a <i>character</i> actually is depends on the context in which the term is used.
   For example, in Hangul (the Korean writing system),
-  each square representation of a syllable can be considered a <i>character</i>.
-  However, the square symbol is really composed of multiple symbols each representing a phoneme,
+  each square representation of a syllable
+  (e.g. <span lang=ko-hang title="Hangul syllable HAN">한</span>=<span lang=ko-Latn>Han</span>)
+  can be considered a <i>character</i>.
+  However, the square symbol is really composed of multiple letters each representing a phoneme
+  (e.g. <span lang=ko-hang title="Hangul letter HIEUH">ㅎ</span>=<span lang=ko-Latn>h</span>,
+   <span lang=ko-hang title="Hangul letter A">ㅏ</span>=<span lang=ko-Latn>a</span>,
+   <span lang=ko-hang title="Hangul letter NIEUN">ㄴ</span>=<span lang=ko-Latn>n</span>)
   and these also could each be considered a <i>character</i>.
-  A basic unit of computer text encoding, for any given encoding, is also called a <i>character</i>,
-  and depending on the encoding, a single encoding <i>character</i> might correspond
-  either to a single phonemic <i>character</i>
-  or to a unitary pre-composed syllabic <i>character</i>.
-  In turn, a single encoding <i>character</i> can be represented in the data stream as one or more bytes;
-  and in programming environments one or a pair of such bytes is sometimes also called a <i>character</i>.
-    
-  <p>For text layout, the relevant unit is
-  the “user-perceived character”, also known as the <dfn>grapheme cluster</dfn>.
-  It is roughly equivalent to what a <em>language user</em> (as opposed to a computer programmer)
-  considers to be a <i>character</i> or basic unit of the script.
-  This term is described in detail in the Unicode Technical Report: Text Boundaries [[!UAX29]]. 
-  Since even typesetting alone requires different notions of <i>grapheme clusters</i>
-  depending on the application, CSS introduces the following terms:
-
-  <dl>
-    <dt><dfn>semantically-perceived character</dfn>
-    <dd>
-      <p>Represents a unit of the writing system,
-      such as a Latin alphabetic letter (including its diacritics),
-      Hangul syllable,
-      Chinese ideographic character,
-      Myanmar syllable cluster,
-      that is indivisible with regards to segmentation
-      (line-breaking, first-letter effects, etc).
-    
-      <p>The UA must interpret this as an <em>extended grapheme cluster</em>
-      (not <em>legacy grapheme cluster</em>) as defined in [[!UAX29]].
-      However, the UA should tailor the definition as required by typographical tradition,
-      since the default rules are not always appropriate.
-
-    <dt><dfn>visually-perceived character</dfn>
-    <dd>
-      <p>Represents a unit of the writing system,
-      that is indivisible with regards to spacing separation
-      (letter-spacing, justification, etc).
-  </dl>
 
-  <p>The UA must interpret both <i>visually-perceived characters</i> and <i>semantically-perceived characters</i>
-  as <em>extended grapheme clusters</em> (not <em>legacy grapheme clusters</em>)
-  as defined in [[!UAX29]].
-  However, the UA may tailor the definitions as required by typographical tradition,
+  <p>A basic unit of computer text encoding, for any given encoding,
+  is also called a <i>character</i>,
+  and depending on the encoding,
+  a single encoding <i>character</i> might correspond
+  to the entire pre-composed syllabic <i>character</i> (e.g. <span lang=ko-hang title="Hangul syllable HAN">한</span>),
+  to the individual phonemic <i>character</i> (e.g. <span lang=ko-hang title="Hangul letter HIEUH">ㅎ</span>),
+  or to smaller units such as
+  a base letterform (e.g. <span lang=ko-hang title="Hangul letter IEUNG">ㅇ</span>)
+  and any combining marks that vary it (e.g. extra strokes that represent aspiration).
+
+  <p>In turn, a single encoding <i>character</i> can be represented in the data stream as one or more bytes;
+  and in programming environments one byte is sometimes also called a <i>character</i>.
+
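Reviewer sketch (not part of the patch): the levels of "character" described above can be observed with Python's standard `unicodedata` module. The helper name `describe_levels` is hypothetical. Note that canonical decomposition (NFD) yields conjoining jamo (U+1112, U+1161, U+11AB), not the compatibility jamo (such as U+314E ㅎ) shown in the prose, so this is only an approximation of the syllable/letter distinction.

```python
import unicodedata


def describe_levels(text):
    """Summarise one string at the code point, decomposed-letter, and byte levels."""
    decomposed = unicodedata.normalize("NFD", text)
    return {
        # Encoding characters (code points) as stored in the string.
        "code_points": [f"U+{ord(c):04X}" for c in text],
        # Names of the code points after canonical decomposition.
        "decomposed": [unicodedata.name(c) for c in decomposed],
        # Size of the same text in two common byte encodings.
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# U+D55C is the precomposed Hangul syllable HAN used in the example above.
levels = describe_levels("\ud55c")
for key, value in levels.items():
    print(key, value)
```

One precomposed syllable is a single code point, three phonemic letters after decomposition, and a different number of bytes depending on the encoding, which is why the spec needs its own term for the layout unit.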
+  <p>For text layout, we will refer to the <dfn title="typographic character unit|typographic character">typographic character unit</dfn>
+  as the basic unit of text.
+  Even within the realm of text layout,
+  the relevant <i>character</i> unit depends on the operation.
+  For example, line-breaking and letter-spacing will segment
+  a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently;
+  or the behaviour of a conjunct consonant in a script such as Devanagari
+  may depend on the font in use.
+  So the <i>typographic character</i> represents a unit of the writing system&mdash;<!--
+  -->such as a Latin alphabetic letter (including its diacritics),
+  Hangul syllable,
+  Chinese ideographic character,
+  Myanmar syllable cluster&mdash;<!--
+  -->that is indivisible with respect to a particular typographic operation
+  (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).
+
+  <p><a href="http://www.unicode.org/reports/tr29/">Unicode Standard Annex #29: Text Segmentation</a>
+  defines a unit called the <dfn>grapheme cluster</dfn>
+  which approximates the <i>typographic character</i>.
+  A UA must use the <em>extended grapheme cluster</em>
+  (not <em>legacy grapheme cluster</em>), as defined in [[!UAX29]],
+  as the basis for its <i>typographic character unit</i>.
+  However, the UA should tailor the definitions
+  as required by typographic tradition
   since the default rules are not always appropriate or ideal,
   and is expected to tailor them differently
-  for <i>visually-perceived characters</i> than <i>semantically-perceived characters</i>
-  as needed.
-  
+  depending on the operation as needed.
+
+<!--
+  <p class="note">
+  The rules for such tailorings are out of scope for CSS,
+  however W3C currently maintains a wiki page
+  where some known tailorings are collected.
+-->
+
   <div class="example">
     <p>For example,
     in some scripts such as Myanmar or Devanagari,
-    the typographic unit for both justification and line-breaking
-    (<i>visually-perceived character</i> and <i>semantically-perceived characters</i>)
+    the <i>typographic character unit</i> for both justification and line-breaking
     is an entire syllable,
-    which can include multiple [[!UAX29]] <i>grapheme clusters</i>.
+    which can include more than one [[!UAX29]] <i>grapheme cluster</i>.
 
     <p>In other scripts such as Thai or Lao,
-    even though a <i>semantically-perceived character</i> matches the default <i>grapheme cluster</i> definition,
-    a <i>visually-perceived character</i>
+    even though for line-breaking the <i>typographic character</i>
+    matches Unicode’s default <i>grapheme clusters</i>,
+    for letter-spacing the relevant unit
     is <em>less</em> than a [[!UAX29]] <i>grapheme cluster</i>,
-    and may require decomposition or other substitutions before spacing can be inserted.
+    and may require decomposition or other substitutions
+    before spacing can be inserted.
 
     <p>For instance,
     to properly letter-space the Thai word คำ (U+0E04 + U+0E33),
@@ -229,25 +238,24 @@ Characters and Letters</h4>
      As before the extra letter-space is then inserted before the U+0E32: นํ้ า.
   </div>
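Reviewer sketch (not part of the patch): the Thai example can be illustrated in Python, assuming the decomposition the example relies on is the Unicode compatibility decomposition of U+0E33 into U+0E4D + U+0E32. The function name `letter_space_thai`, the `gap` parameter, and the tone-mark set are hypothetical choices for illustration. The sketch only handles the SARA AM split and does not insert spacing between other units.

```python
import unicodedata

SARA_AM = "\u0e33"
# Assumption: these four tone marks are the ones that may precede SARA AM.
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def letter_space_thai(word, gap=" "):
    """Split each SARA AM and insert `gap` before the resulting SARA AA."""
    out = []
    for ch in word:
        if ch != SARA_AM:
            out.append(ch)
            continue
        # NFKD maps U+0E33 to U+0E4D (nikhahit) followed by U+0E32 (sara aa).
        nikhahit, sara_aa = unicodedata.normalize("NFKD", ch)
        if out and out[-1] in TONE_MARKS:
            # Keep the nikhahit below the tone mark, as in the second example.
            out.insert(len(out) - 1, nikhahit)
        else:
            out.append(nikhahit)
        out.append(gap)
        out.append(sara_aa)
    return "".join(out)


print(letter_space_thai("\u0e04\u0e33"))        # U+0E04 U+0E4D, gap, U+0E32
print(letter_space_thai("\u0e19\u0e49\u0e33"))  # U+0E19 U+0E4D U+0E49, gap, U+0E32
```

A real implementation would work on shaped glyphs, not on strings; the string form is used here only to make the code point order visible.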
 
-  <p>A <dfn>letter</dfn> for the purpose of this specification
-  is a <i>character</i> belonging to one of the Letter or Number general
+  <p>A <dfn>typographic letter unit</dfn> or <dfn>letter</dfn> for the purpose of this specification
+  is a <i>typographic character unit</i> belonging to one of the Letter or Number general
   categories in Unicode. [[!UAX44]]
-  To be more precise,
-  a <dfn>semantically-perceived letter</dfn> is a <i>semantically-perceived character</i>
-  belonging to one of the Letter or Number general categories
-  and a <dfn>visually-perceived letter</dfn> is likewise a <i>visually-perceived character</i>
-  belonging to one of the Letter or Number general categories.
+  See <a href="#character-properties">Character Properties</a>
+  for how to determine the Unicode properties of a <i>typographic character unit</i>.
 
-  See <a href="http://www.w3.org/TR/css3-writing-modes/#character-properties">Characters and Properties</a>
-  for how to determine the Unicode properties of a <i>character</i>.
-
-  <p>The rendering characteristics of a <i>character</i> divided
+  <p>The rendering characteristics of a <i>typographic character unit</i> divided
   by an element boundary is undefined:
   it may be rendered as belonging to either side of the boundary,
   or as some approximation of belonging to both.
   Authors are forewarned that dividing <i>grapheme clusters</i>
   by element boundaries may give inconsistent or undesired results.
 
+  <p><dfn>semantically-perceived character</dfn>
+  <dfn>visually-perceived character</dfn>
+  <dfn>semantically-perceived letter</dfn>
+  <dfn>visually-perceived letter</dfn>
+
 <h4 id="languages">
 Languages and Typesetting</h4>
 
@@ -2492,6 +2500,40 @@ Appendix C: Scripts and Spacing</h2>
     to handle as-yet-unencoded cursive scripts in future versions of Unicode,
     and are encouraged to ask the CSSWG to update this spec accordingly.
 
+<h2 id="character-properties" class="no-num">Appendix D.
+Characters and Properties</h2>
+
+  <p>Unicode defines three codepoint-level properties that are referenced
+    in CSS Text:
+  <dl>
+    <dt><a href="http://www.unicode.org/reports/tr11/#Definitions">East Asian width</a>
+    <dd>Defined in [[!UAX11]] and given as the East_Asian_Width property
+      in the Unicode Character Database [[!UAX44]].
+    <dt><a href="http://www.unicode.org/reports/tr44/#General_Category_Values">General Category</a>
+    <dd>Defined in [[!UAX44]] and given as the General_Category property
+      in the Unicode Character Database [[!UAX44]].
+    <dt><a href="http://www.unicode.org/reports/tr24/#Values">Script property</a>
+    <dd>Defined in [[!UAX24]] and given as the Script property
+      in the Unicode Character Database [[!UAX44]]. (UAs should
+      include any ScriptExtensions.txt assignments in this mapping.)
+  </dl>
+
+  <p>Unicode defines properties for individual codepoints, but sometimes
+    it is necessary to determine the properties of a <i>typographic character unit</i>.
+    For the purposes of CSS Text,
+    the properties of a <i>typographic character unit</i> are given by
+    the base character of its first <i>grapheme cluster</i>—except in two cases:
+  <ul>
+    <li><i>Grapheme clusters</i> formed with an Enclosing Mark (<code>Me</code>) of the Common script
+    are considered to be Other Symbols (<code>So</code>) in the Common script.
+    They are assumed to have the same Unicode properties as the Replacement Character U+FFFD.
+    <li><i>Grapheme clusters</i> formed with a Space Separator (<code>Zs</code>) as the base
+    are considered to be Modifier Symbols (<code>Sk</code>).
+    They are assumed to have the same East Asian Width property as the base,
+    but take their other properties from the first combining character in the sequence.
+  </ul>
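Reviewer sketch (not part of the patch): the rule above and its two exceptions could be approximated in Python as follows. The function names `unit_properties` and `is_typographic_letter` are hypothetical. Assumptions: the input is already a single grapheme cluster whose first code point is the base; the Script property is omitted because the standard `unicodedata` module does not expose it, so the "Common script" condition on enclosing marks is not checked.

```python
import unicodedata


def unit_properties(cluster):
    """Return (General_Category, East_Asian_Width) for one grapheme cluster."""
    base, marks = cluster[0], cluster[1:]
    # Exception 1: a cluster with an Enclosing Mark (Me) is treated as
    # Other Symbol (So) with the properties of U+FFFD.
    # Simplification: the mark's script is not checked here.
    if any(unicodedata.category(m) == "Me" for m in marks):
        return "So", unicodedata.east_asian_width("\ufffd")
    # Exception 2: a Space Separator (Zs) base with combining marks is
    # treated as Modifier Symbol (Sk), keeping the base's East Asian Width.
    if marks and unicodedata.category(base) == "Zs":
        return "Sk", unicodedata.east_asian_width(base)
    # Default: properties come from the base character.
    return unicodedata.category(base), unicodedata.east_asian_width(base)


def is_typographic_letter(cluster):
    """True if the unit falls in a Letter (L*) or Number (N*) category."""
    return unit_properties(cluster)[0][0] in "LN"


print(unit_properties("a\u0301"))        # base letter with an acute accent
print(unit_properties("1\u20dd"))        # digit inside an enclosing circle
print(unit_properties(" \u0301"))        # space carrying an acute accent
print(is_typographic_letter("1\u20dd"))  # enclosed digit no longer counts as a letter
```

The second exception also says the remaining properties come from the first combining character; that part would apply to Script, which this sketch does not model.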
+  <p>The 
+
 <h2 class="no-num" id="acknowledgements">
 Acknowledgements</h2>