@@ -1825,6 +1825,11 @@ Text Processing</h3>
18251825 segment-break-transformation-removable-1.html
18261826 segment-break-transformation-removable-3.html
18271827 </wpt>
1828+
1829+ <li> Otherwise, if the characters before and after the [=segment break=]
1830+ both belong to the [=space-discarding character set=] (see [[#space-discard-set]]),
1831+ then the [=segment break=] is removed.
1832+ <!--
18281833 <li> Otherwise, if the <a>East Asian Width property</a> [[!UAX11]] of both
18291834 the character before and after the [=segment break=] is
18301835 <code> Fullwidth</code> , <code> Wide</code> , or <code> Halfwidth</code>
@@ -1912,17 +1917,18 @@ Text Processing</h3>
19121917 <wpt>
19131918 writing-system/writing-system-segment-break-001.html
19141919 </wpt>
1920+ -->
19151921 <li> Otherwise, the [=segment break=] is converted to a space (U+0020).
19161922 </ul>
1917-
1923+ <!--
19181924 <p>
19191925 For this purpose,
19201926 Emoji (Unicode property <code> Emoji</code> )
19211927 with an <a>East Asian Width property</a> of
19221928 <code> Wide</code> or <code> Neutral</code>
19231929 are treated as having an <a>East Asian Width property</a> of
19241930 <code> Ambiguous</code> .
1925-
1931+ -->
19261932 <p class="note"> Note: The white space processing rules have already
19271933 removed any [=tabs=] and [=spaces=] after the [=segment break=] before these checks
19281934 take place.</p>
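The two rules that remain active after this change can be sketched as a small function. This is a minimal illustration, not the specification's algorithm: it covers only the space-discarding rule and the final fallback, it ignores any earlier rules in the list that are not shown in this hunk, and the names `transform_segment_break` and `is_space_discarding` are hypothetical.

```python
from typing import Callable


def transform_segment_break(
    before: str,
    after: str,
    is_space_discarding: Callable[[str], bool],
) -> str:
    """Return the text that replaces a segment break.

    `before` and `after` are the single characters on either side of the
    segment break (an empty string means there is no such character).
    `is_space_discarding` is a caller-supplied predicate (hypothetical name)
    that reports membership in the space-discarding character set.
    """
    # Rule: both neighbours are space-discarding, so the break is removed.
    if before and after and is_space_discarding(before) and is_space_discarding(after):
        return ""
    # Fallback rule: the break is converted to a space (U+0020).
    return " "
```

Passing the predicate in as a parameter keeps the rule separate from the block table in the appendix, so the table can change without touching the rule itself.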
@@ -5244,7 +5250,144 @@ Characters and Properties</h2>
52445250 but take their other properties from the first combining character in the sequence.
52455251 </ul>
52465252
5247- <h2 id="script-tagging" class="no-num">Appendix F.
5253+ <h2 id="space-discard-set" class="no-num">Appendix F.
5254+ Space-Discarding Unicode Characters</h2>
5255+
5256+ <p><em> This appendix is normative.</em></p>
5257+
5258+ Characters from the following blocks in Unicode 13.0 [[UNICODE]]
5259+ are considered part of the <dfn>space-discarding character set</dfn>
5260+ for the purpose of [[#line-break-transform]] :
5261+
5262+ <table class=data>
5263+ <caption> Space-discarding Unicode Blocks</caption>
5264+ <thead>
5265+ <tr>
5266+ <th> Codepoint Range
5267+ <th> Block Name
5268+ <tbody>
5269+ <tr>
5270+ <td> U+2E80..U+2EFF
5271+ <td> CJK Radicals Supplement
5272+ <tr>
5273+ <td> U+2F00..U+2FDF
5274+ <td> Kangxi Radicals
5275+ <tr>
5276+ <td> U+2FF0..U+2FFF
5277+ <td> Ideographic Description Characters
5278+ <tr>
5279+ <td> U+3000..U+303F
5280+ <td> CJK Symbols and Punctuation
5281+ <tr>
5282+ <td> U+3040..U+309F
5283+ <td> Hiragana
5284+ <tr>
5285+ <td> U+30A0..U+30FF
5286+ <td> Katakana
5287+ <tr>
5288+ <td> U+3130..U+318F
5289+ <td> Kanbun
5290+ <tr>
5291+ <td> U+3100..U+312F
5292+ <td> Bopomofo
5293+ <tr>
5294+ <td> U+3190..U+319F
5295+ <td> Kanbun
5296+ <tr>
5297+ <td> U+31C0..U+31EF
5298+ <td> CJK Strokes
5299+ <tr>
5300+ <td> U+31F0..U+31FF
5301+ <td> Katakana Phonetic Extensions
5302+ <tr>
5303+ <td> U+3200..U+32FF
5304+ <td> Enclosed CJK Letters and Months
5305+ <tr>
5306+ <td> U+3300..U+33FF
5307+ <td> CJK Compatibility
5308+ <tr>
5309+ <td> U+3400..U+4DBF
5310+ <td> CJK Unified Ideographs Extension A
5311+ <tr>
5312+ <td> U+4DC0..U+4DFF
5313+ <td> Yijing Hexagram Symbols
5314+ <tr>
5315+ <td> U+4E00..U+9FFF
5316+ <td> CJK Unified Ideographs
5317+ <tr>
5318+ <td> U+A000..U+A48F
5319+ <td> Yi Syllables
5320+ <tr>
5321+ <td> U+A490..U+A4CF
5322+ <td> Yi Radicals
5323+ <tr>
5324+ <td> U+F900..U+FAFF
5325+ <td> CJK Compatibility Ideographs
5326+ <tr>
5327+ <td> U+FE10..U+FE1F
5328+ <td> Vertical Forms
5329+ <tr>
5330+ <td> U+FE30..U+FE4F
5331+ <td> CJK Compatibility Forms
5332+ <tr>
5333+ <td> U+FE50..U+FE6F
5334+ <td> Small Form Variants
5335+ <tr>
5336+ <td> U+FF00..U+FFEF
5337+ <td> Halfwidth and Fullwidth Forms
5338+ <tr>
5339+ <td> U+1B000..U+1B0FF
5340+ <td> Kana Supplement
5341+ <tr>
5342+ <td> U+1B100..U+1B12F
5343+ <td> Kana Extended-A
5344+ <tr>
5345+ <td> U+1B130..U+1B16F
5346+ <td> Small Kana Extension
5347+ <tr>
5348+ <td> U+1D300..U+1D35F
5349+ <td> Tai Xuan Jing Symbols
5350+ <tr>
5351+ <td> U+1D360..U+1D37F
5352+ <td> Counting Rod Numerals
5353+ <tr>
5354+ <td> U+1F200..U+1F2FF
5355+ <td> Enclosed Ideographic Supplement
5356+ <tr>
5357+ <td> U+20000..U+2A6DF
5358+ <td> CJK Unified Ideographs Extension B
5359+ <tr>
5360+ <td> U+2A700..U+2B73F
5361+ <td> CJK Unified Ideographs Extension C
5362+ <tr>
5363+ <td> U+2B740..U+2B81F
5364+ <td> CJK Unified Ideographs Extension D
5365+ <tr>
5366+ <td> U+2B820..U+2CEAF
5367+ <td> CJK Unified Ideographs Extension E
5368+ <tr>
5369+ <td> U+2CEB0..U+2EBEF
5370+ <td> CJK Unified Ideographs Extension F
5371+ <tr>
5372+ <td> U+2F800..U+2FA1F
5373+ <td> CJK Compatibility Ideographs Supplement
5374+ <tr>
5375+ <td> U+30000..U+3134F
5376+ <td> CJK Unified Ideographs Extension G
5377+ </table>
5378+
5379+ ISSUE: Do we include Bopomofo?
5380+
5381+ ISSUE: Do we include enclosed ideographs?
5382+
5383+ ISSUE: Do we include symbol sets like Yijing Hexagrams / Counting Rods / etc.?
5384+
5385+ For future revisions of [[UNICODE]] ,
5386+ any new block in which at least 50% of the codepoints belong to the
5387+ Han, Hiragana, Katakana, or Yi script
5388+ shall also be considered part of the [=space-discarding character set=] .
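
A membership test for the [=space-discarding character set=] can be sketched as a range lookup. The sketch below is an assumption-laden illustration: it includes only an excerpt of the rows from the table above, and the names `SPACE_DISCARDING_RANGES` and `is_space_discarding` are hypothetical. A complete implementation would list every block in the table.

```python
# Excerpt of the table above as (first, last, block name) tuples.
# Only a few rows are included for illustration; this is not the full set.
SPACE_DISCARDING_RANGES = [
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xA000, 0xA48F, "Yi Syllables"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def is_space_discarding(char: str) -> bool:
    """Report whether a single character falls in one of the listed blocks."""
    codepoint = ord(char)
    return any(first <= codepoint <= last
               for first, last, _name in SPACE_DISCARDING_RANGES)
```

Because membership is defined by block rather than by per-character property, the check needs no Unicode property data beyond the range table.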
5389+
5390+ <h2 id="script-tagging" class="no-num">Appendix G.
52485391Tagging Content by Writing System</h2>
52495392
52505393 <p><em> This appendix is normative.</em></p>
@@ -5339,7 +5482,7 @@ Tagging Content by Writing System</h2>
53395482 <a href="https://www.w3.org/International/articles/language-tags/">“Language tags in HTML and XML”</a>
53405483 and <a href="https://www.w3.org/International/questions/qa-choosing-language-tags">“Choosing a Language Tag”</a> .
53415484
5342- <h2 id="small-kana" class=no-num>Appendix G .
5485+ <h2 id="small-kana" class=no-num>Appendix H .
53435486Small Kana Mappings</h2>
53445487<style>
53455488.pairs-table th {