[css-text-3] Initial draft of Unicode Block-based segment break transformation. #337

fantasai · fantasai · commit b223ecbd2a29 · 2020-04-12T15:22:34.000-07:00
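
The rule this commit adds can be read as a range lookup followed by a two-way decision. The following is a minimal Python sketch of that reading, not the specification's algorithm text. The names `SPACE_DISCARDING_RANGES`, `is_space_discarding`, and `transform_segment_break` are hypothetical, the ranges are transcribed from the table in the diff below, and the sketch assumes that the earlier cases of the segment break transformation list (not shown in this diff) have already been handled.

```python
# Hypothetical sketch: codepoint ranges copied from the table added in this commit.
SPACE_DISCARDING_RANGES = [
    (0x2E80, 0x2EFF), (0x2F00, 0x2FDF), (0x2FF0, 0x2FFF), (0x3000, 0x303F),
    (0x3040, 0x309F), (0x30A0, 0x30FF), (0x3100, 0x312F), (0x3130, 0x318F),
    (0x3190, 0x319F), (0x31C0, 0x31EF), (0x31F0, 0x31FF), (0x3200, 0x32FF),
    (0x3300, 0x33FF), (0x3400, 0x4DBF), (0x4DC0, 0x4DFF), (0x4E00, 0x9FFF),
    (0xA000, 0xA48F), (0xA490, 0xA4CF), (0xF900, 0xFAFF), (0xFE10, 0xFE1F),
    (0xFE30, 0xFE4F), (0xFE50, 0xFE6F), (0xFF00, 0xFFEF), (0x1B000, 0x1B0FF),
    (0x1B100, 0x1B12F), (0x1B130, 0x1B16F), (0x1D300, 0x1D35F),
    (0x1D360, 0x1D37F), (0x1F200, 0x1F2FF), (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0x2CEB0, 0x2EBEF), (0x2F800, 0x2FA1F), (0x30000, 0x3134F),
]


def is_space_discarding(ch: str) -> bool:
    """Return True if the single character falls in one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPACE_DISCARDING_RANGES)


def transform_segment_break(before: str, after: str) -> str:
    """Return the replacement for a segment break: '' if removed, ' ' otherwise.

    `before` and `after` are the characters adjacent to the segment break.
    Assumption: an empty string means there is no adjacent character, in
    which case this sketch falls through to the space conversion.
    """
    if before and after and is_space_discarding(before) and is_space_discarding(after):
        return ""
    return " "  # U+0020
```

Under this sketch, a segment break between two Han ideographs is dropped, while one between two Latin letters, or between a Han ideograph and a Latin letter, becomes a single space.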
diff --git a/css-text-3/Overview.bs b/css-text-3/Overview.bs
@@ -1825,6 +1825,11 @@ Text Processing</h3>
             segment-break-transformation-removable-1.html
             segment-break-transformation-removable-3.html
           </wpt>
+
+        <li>Otherwise, if the characters before and after the [=segment break=]
+          both belong to the [=space-discarding character set=] (see [[#space-discard-set]]),
+          then the [=segment break=] is removed.
+<!--
         <li>Otherwise, if the <a>East Asian Width property</a> [[!UAX11]] of both
           the character before and after the [=segment break=] is
           <code>Fullwidth</code>, <code>Wide</code>, or <code>Halfwidth</code>
@@ -1912,17 +1917,18 @@ Text Processing</h3>
           <wpt>
             writing-system/writing-system-segment-break-001.html
           </wpt>
+-->
         <li>Otherwise, the [=segment break=] is converted to a space (U+0020).
       </ul>
-
+<!--
       <p>
         For this purpose,
         Emoji (Unicode property <code>Emoji</code>)
         with an <a>East Asian Width property</a> of
         <code>Wide</code> or <code>Neutral</code>
         are treated as having an <a>East Asian Width property</a> of
         <code>Ambiguous</code>.
-
+-->
       <p class="note">Note: The white space processing rules have already
         removed any [=tabs=] and [=spaces=] after the [=segment break=] before these checks
         take place.</p>
@@ -5244,7 +5250,144 @@ Characters and Properties</h2>
     but take their other properties from the first combining character in the sequence.
   </ul>
 
-<h2 id="script-tagging" class="no-num">Appendix F.
+<h2 id="space-discard-set" class="no-num">Appendix F.
+Space-Discarding Unicode Characters</h2>
+
+  <p><em>This appendix is normative.</em></p>
+
+  Characters from the following blocks in Unicode 13.0 [[UNICODE]]
+  are considered part of the <dfn>space-discarding character set</dfn>
+  for the purpose of [[#line-break-transform]]:
+
+  <table class=data>
+    <caption>Space-discarding Unicode Blocks</caption>
+  <thead>
+    <tr>
+      <th>Codepoint Range
+      <th>Block Name
+  <tbody>
+    <tr>
+      <td>U+2E80..U+2EFF
+      <td>CJK Radicals Supplement
+    <tr>
+      <td>U+2F00..U+2FDF
+      <td>Kangxi Radicals
+    <tr>
+      <td>U+2FF0..U+2FFF
+      <td>Ideographic Description Characters
+    <tr>
+      <td>U+3000..U+303F
+      <td>CJK Symbols and Punctuation
+    <tr>
+      <td>U+3040..U+309F
+      <td>Hiragana
+    <tr>
+      <td>U+30A0..U+30FF
+      <td>Katakana
+    <tr>
+      <td>U+3130..U+318F
+      <td>Hangul Compatibility Jamo
+    <tr>
+      <td>U+3100..U+312F
+      <td>Bopomofo
+    <tr>
+      <td>U+3190..U+319F
+      <td>Kanbun
+    <tr>
+      <td>U+31C0..U+31EF
+      <td>CJK Strokes
+    <tr>
+      <td>U+31F0..U+31FF
+      <td>Katakana Phonetic Extensions
+    <tr>
+      <td>U+3200..U+32FF
+      <td>Enclosed CJK Letters and Months
+    <tr>
+      <td>U+3300..U+33FF
+      <td>CJK Compatibility
+    <tr>
+      <td>U+3400..U+4DBF
+      <td>CJK Unified Ideographs Extension A
+    <tr>
+      <td>U+4DC0..U+4DFF
+      <td>Yijing Hexagram Symbols
+    <tr>
+      <td>U+4E00..U+9FFF
+      <td>CJK Unified Ideographs
+    <tr>
+      <td>U+A000..U+A48F
+      <td>Yi Syllables
+    <tr>
+      <td>U+A490..U+A4CF
+      <td>Yi Radicals
+    <tr>
+      <td>U+F900..U+FAFF
+      <td>CJK Compatibility Ideographs
+    <tr>
+      <td>U+FE10..U+FE1F
+      <td>Vertical Forms
+    <tr>
+      <td>U+FE30..U+FE4F
+      <td>CJK Compatibility Forms
+    <tr>
+      <td>U+FE50..U+FE6F
+      <td>Small Form Variants
+    <tr>
+      <td>U+FF00..U+FFEF
+      <td>Halfwidth and Fullwidth Forms
+    <tr>
+      <td>U+1B000..U+1B0FF
+      <td>Kana Supplement
+    <tr>
+      <td>U+1B100..U+1B12F
+      <td>Kana Extended-A
+    <tr>
+      <td>U+1B130..U+1B16F
+      <td>Small Kana Extension
+    <tr>
+      <td>U+1D300..U+1D35F
+      <td>Tai Xuan Jing Symbols
+    <tr>
+      <td>U+1D360..U+1D37F
+      <td>Counting Rod Numerals
+    <tr>
+      <td>U+1F200..U+1F2FF
+      <td>Enclosed Ideographic Supplement
+    <tr>
+      <td>U+20000..U+2A6DF
+      <td>CJK Unified Ideographs Extension B
+    <tr>
+      <td>U+2A700..U+2B73F
+      <td>CJK Unified Ideographs Extension C
+    <tr>
+      <td>U+2B740..U+2B81F
+      <td>CJK Unified Ideographs Extension D
+    <tr>
+      <td>U+2B820..U+2CEAF
+      <td>CJK Unified Ideographs Extension E
+    <tr>
+      <td>U+2CEB0..U+2EBEF
+      <td>CJK Unified Ideographs Extension F
+    <tr>
+      <td>U+2F800..U+2FA1F
+      <td>CJK Compatibility Ideographs Supplement
+    <tr>
+      <td>U+30000..U+3134F
+      <td>CJK Unified Ideographs Extension G
+  </table>
+
+  ISSUE: Do we include Bopomofo?
+
+  ISSUE: Do we include enclosed ideographs?
+
+  ISSUE: Do we include symbol sets like Yijing Hexagrams / Counting Rods / etc.?
+
+  For future revisions of [[UNICODE]],
+  any new block in which at least 50% of the codepoints belong to the
+  Han, Hiragana, Katakana, or Yi script
+  shall also be considered part of the [=space-discarding character set=].
+
+<h2 id="script-tagging" class="no-num">Appendix G.
 Tagging Content by Writing System</h2>
 
 	<p><em>This appendix is normative.</em></p>
@@ -5339,7 +5482,7 @@ Tagging Content by Writing System</h2>
 	<a href="https://www.w3.org/International/articles/language-tags/">“Language tags in HTML and XML”</a>
 	and <a href="https://www.w3.org/International/questions/qa-choosing-language-tags">“Choosing a Language Tag”</a>.
 
-<h2 id="small-kana" class=no-num>Appendix G.
+<h2 id="small-kana" class=no-num>Appendix H.
 Small Kana Mappings</h2>
 <style>
 .pairs-table th {