Commit dfa8b4e

[css-text-4] Revisions to word-boundary-detection based on review by fantasai
1 parent e5cf69b commit dfa8b4e

1 file changed

Lines changed: 123 additions & 44 deletions

css-text-4/Overview.bs

@@ -23,6 +23,8 @@ spec: css-text-3; type: property
2323
text: word-spacing
2424
spec: css-text-3; type: dfn
2525
text: forced line break
26+
text: word-separator character
27+
text: other space separator
2628
</pre>
2729

2830
<pre class=biblio>
@@ -92,74 +94,71 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
9294
Animation type: discrete
9395
</pre>
9496

97+
This property allows the author to decide
98+
whether and how
99+
the User Agent must analyse the content
100+
to determine where word boundaries are,
101+
and to insert [=virtual word boundaries=] accordingly.
102+
103+
A <dfn data-dfn-for='' data-dfn-type=dfn>virtual word boundary</dfn> is similar to the presence
104+
of the ZERO WIDTH SPACE (U+200B) character:
105+
it introduces a [=soft wrap opportunity=]
106+
and is affected by the 'word-boundary-expansion' property;
107+
its presence has no effect on text shaping,
108+
nor on 'word-spacing'.
109+
However, its insertion must have no effect on the underlying content,
110+
and must not affect the content of a plain text copy &amp; paste operation.
111+
95112
<dl dfn-type=value dfn-for=word-boundary-detection>
96113
<dt><dfn>manual</dfn>
97114
<dd>
98-
This property has no effect.
115+
The User Agent must not insert [=virtual word boundaries=].
99116

100117
<dt><dfn>auto(<<lang>>)</dfn>
101118
<dd>
102119

103120
This value directs the User Agent to perform language-specific content analysis
104-
to determine where word boundaries are.
105-
The specific algorithm to be used is UA-dependent.
106-
However, inline element boundaries and out-of-flow elements must be ignored when determining word boundaries.
121+
to determine where to insert [=virtual word boundaries=].
107122

108123
<dfn dfn-type=type><<lang>></dfn> must be a valid CSS <<ident>> or <<string>>.
109124
It represents an IETF BCP 47 language range
110-
(See [[BCP47]]).
111-
If the UA does not support word-boundary detection for <em>all</em> languages represented by the specified range,
125+
(see [[BCP47]]).
126+
If the UA does not support word-boundary detection
127+
for <em>all</em> languages represented by the specified range,
112128
it must reject that value at parse-time.
113129

114-
<div class=example>
115-
If a User Agent has a word-boundary detection system for Cantonese
116-
that is not suitable for the broader set of Chinese languages,
117-
it must accept ''lang(yue)'', ''lang(zh-yue)'', or ''lang(zh-HK)'',
118-
but not ''lang(zh)'' or ''lang(zh-Hant)''.
119-
120-
However, if the User Agent supports a generic word-boundary detection system
121-
that is suitable for Chinese in general,
122-
it should accept the broad ''lang(zh)'' characterization,
123-
as well as any more specific ones,
124-
such as ''lang(zh-yue)'', ''lang(zh-Hant-HK)'', ''lang(zh-Hans-SG)'', or ''lang(zh-hak).
125-
</div>
130+
Note: Wildcards <em>in the language subtag</em> would imply
131+
support for detecting word boundaries in an undefined and effectively unlimited set of languages.
132+
As this is not possible,
133+
wildcards in the language subtag always result in the declaration
134+
being treated as invalid.
126135

127136
Note: Whether a word boundary detection system designed for one language
128137
is suitable for some or all dialects of that language is somewhat subjective,
129-
and this specifications leaves it at the appreciation of the User Agent.
138+
and this specification leaves it at the discretion of the User Agent.
130139
Even if a detection system is not able to cope with all nuances of a particular dialect,
131140
it may be reasonable to claim support
132141
if the detection correctly recognizes word boundaries most of the time.
133142
However, the User Agent would do a disservice to authors and users
134143
if it claimed support for languages
135-
where it fails to detect most word boundaries.
136-
137-
Note: Wildcards <em>in the language subtag</em> would imply
138-
support for detecting word boundaries in an undefined and effectively unlimited set of languages.
139-
As this this is not possible,
140-
wildcards in the language subtag must always be treated as invalid.
144+
where it fails to detect most word boundaries
145+
or has a high error rate.
141146

142147
If the element’s [=content language=],
143148
as represented in BCP 47 syntax [[BCP47]],
144-
does <em>not</em> matches the language range described by the computed value's <<lang>>
149+
does <em>not</em> match the language range described by the computed value's <<lang>>
145150
in an extended filtering operation
146151
per [[RFC4647]] <cite>Matching of Language Tags</cite> (section 3.3.2),
147-
then the [=used value=] is set to ''word-boundary-detection/manual'',
152+
then the [=used value=] is ''word-boundary-detection/manual'',
148153
and this property has no effect on this element.
149-
<span class=note>(This is the same maching logic as the one used for the '':lang()'' selector, negated.)</span>
150154
Otherwise,
151-
the User Agent must insert the ZERO WIDTH SPACE (U+200B) character
155+
the User Agent must insert a [=virtual word boundary=]
152156
at each detected word boundary
153157
within the [=text run=] children of this element.
154-
However, the UA must not insert U+200B:
155-
* at the beginning or end of a [=block container=]
156-
* at the beginning or end of an [=inline box=] whose parent box has a [=used value=] of ''word-boundary-detection/manual''
158+
Within the constraints set by this specification,
159+
the specific algorithm used is UA-dependent.
157160

158-
The insertion happens before layout,
159-
so all layout operations that depend on the characters in the content
160-
(such as [[CSS-TEXT-3#white-space-rules]], [=line breaking=], or [=intrinsic sizing=])
161-
must take the presence of that character into account.
162-
[=Selectors=] are not affected.
161+
Note: This is the same matching logic as the one used for the '':lang()'' selector.
163162

164163
Issue: Should we allow, or require, Canonicalization of language tags and ranges,
165164
as per [[RFC5646]] section 4.5,
@@ -172,11 +171,24 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
172171
for such mappings.
173172
</dl>
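As a reading aid for the content-language check described above, here is a minimal Python sketch of the extended filtering comparison from [[RFC4647]] section 3.3.2. The function name `extended_filter_match` is hypothetical and not part of the specification. The sketch compares subtags case-insensitively and does no canonicalization, which the Issue above leaves open.

```python
def extended_filter_match(lang_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (hypothetical helper)."""
    range_subtags = lang_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 1: the first subtags must match, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    # Steps 2-3: walk the remaining subtags of the range.
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            # A wildcard in the range is skipped.
            r += 1
        elif t >= len(tag_subtags):
            # The range still has subtags but the tag has run out.
            return False
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            # A singleton (such as "x") in the tag stops the search.
            return False
        else:
            # Skip a non-matching subtag in the tag and keep looking.
            t += 1

    # Step 4: every subtag of the range was matched.
    return True
```

Under this sketch, a range of `zh` matches a content language of `zh-Hant-HK`, while a range of `yue` does not match `zh-HK`. In the second case the used value would be ''word-boundary-detection/manual''.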
174173

175-
Note: Specifying the language for which the word boundary detection is to be performed
176-
is required in order to make this feature meaningfully testable with ''@supports''.
174+
<div class=example>
175+
If a User Agent has a word-boundary detection system for Cantonese
176+
that is not suitable for the broader set of Chinese languages,
177+
it is expected to accept ''auto(yue)'', ''auto(zh-yue)'', or ''auto(zh-HK)'',
178+
but not ''auto(zh)'' or ''auto(zh-Hant)''.
179+
180+
However, if the User Agent supports a generic word-boundary detection system
181+
that is suitable for Chinese in general,
182+
it is expected to accept the broad ''auto(zh)'' characterization,
183+
as well as any more specific ones,
184+
such as ''auto(zh-yue)'', ''auto(zh-Hant-HK)'', ''auto(zh-Hans-SG)'', or ''auto(zh-hak)''.
185+
</div>
177186
178187
<div class=example>
179-
Japanese text normally allows line breaking between letters of a word
188+
Specifying the language for which the word boundary detection is to be performed
189+
is required in order to make this feature meaningfully testable with ''@supports''.
190+
191+
For example, Japanese text normally allows line breaking between letters of a word
180192
(see ''word-break: normal'').
181193
The following code disables that in <code>h1</code> elements,
182194
and only allows line breaking at autodetected word boundaries instead,
@@ -195,6 +207,66 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
195207
</code></pre>
196208
</div>
197209
210+
[=Virtual word boundary=] insertion happens before [[CSS-TEXT-3#white-space-phase-1]]
211+
and before [[#word-boundary-expansion]].
212+
Later operations
213+
(including [[CSS-TEXT-3#white-space-rules]], [=line breaking=], and [=intrinsic sizing=])
214+
must take the presence of the [=virtual word boundary=] into account.
215+
[=Selectors=] are not affected.
216+
217+
Inline box boundaries
218+
and out-of-flow elements must be ignored
219+
when determining word boundaries.
220+
221+
If a word boundary is found at the same position as
222+
one or more inline box boundaries,
223+
the [=virtual word boundary=] must be inserted
224+
in the outermost element that participates in this inline box boundary.
225+
226+
<div class=example>
227+
In the following example,
228+
the red “<code><span style="color:red">|</span></code>” indicates
229+
reasonable positions for a User Agent to insert virtual word boundaries:
230+
<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ<span style="color:red">|</span>สวยงาม</code></pre>
231+
If that sentence had contained some inline markup,
232+
the following example shows the correct position to insert the virtual word boundaries:
233+
<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ<span style="color:red">|</span>&lt;em>สวยงาม&lt;/em></code></pre>
234+
The following example shows <em>incorrect</em> positions:
235+
<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ&lt;em><span style="color:red">|</span>สวยงาม&lt;/em></code></pre>
236+
The following shows the correct positions in a more contrived situation:
237+
<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>&lt;b>&lt;u>คือ&lt;/u><span style="color:red">|</span>&lt;em>สวยงาม&lt;/em>&lt;/b></code></pre>
238+
</div>
239+
240+
The User Agent may tailor its word boundary detection algorithm
241+
depending on whether 'line-break' is
242+
''loose''/''line-break/normal''/''line-break/strict''.
243+
244+
The User Agent must not insert a [=virtual word boundary=]:
245+
<ul>
246+
<li>
247+
at the beginning or end of any box
248+
(including [=inline boxes=])
249+
whose parent box has a [=used value=]
250+
of ''word-boundary-detection/manual''.
251+
252+
<li>
253+
immediately adjacent to a [=word-separator character=],
254+
or an [=other space separator=],
255+
or a ZERO WIDTH SPACE (U+200B) character.
256+
257+
Note: This implies that for languages such as English
258+
where words are separated by spaces or other separating characters,
259+
''word-boundary-detection/auto(<lang>)'' has no effect.
260+
261+
<li>
262+
between a [=typographic letter unit=]
263+
and a subsequent [=typographic character unit=] from the [[!UNICODE]] Pe or Pf classes,
264+
or between a [=typographic letter unit=]
265+
and a preceding [=typographic character unit=] from the [[!UNICODE]] Ps or Pi classes,
266+
or between a [=typographic letter unit=]
267+
and an adjacent [=typographic character unit=] from the [[!UNICODE]] Pc or Pd or Po classes.
268+
</ul>
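The character-based restrictions in the list above (the second and third items) can be sketched as a filter over candidate positions. This is a rough Python illustration only, under several assumptions: it inspects single code points rather than full typographic character units, it treats Unicode general categories L* and N* as an approximation of a typographic letter unit, and `WORD_SEPARATORS` is an illustrative and possibly incomplete set. The names `is_separator`, `is_letter`, and `may_insert` are hypothetical. The box-related restriction in the first item is not modelled.

```python
import unicodedata

ZWSP = "\u200b"

# Illustrative set only (assumption); the normative list lives in css-text-3.
WORD_SEPARATORS = {
    "\u0020", "\u00a0", "\u1361",
    "\U00010100", "\U00010101", "\U0001039f", "\U0001091f",
}


def is_separator(ch: str) -> bool:
    # Word separators, space separators (category Zs), or ZERO WIDTH SPACE.
    return ch == ZWSP or ch in WORD_SEPARATORS or unicodedata.category(ch) == "Zs"


def is_letter(ch: str) -> bool:
    # Approximation of a typographic letter unit (assumption: L* or N*).
    return unicodedata.category(ch)[0] in ("L", "N")


def may_insert(text: str, i: int) -> bool:
    """Whether a virtual word boundary may go between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        # No pair of characters to inspect; outside the scope of this sketch.
        return False
    before, after = text[i - 1], text[i]
    if is_separator(before) or is_separator(after):
        return False
    cat_before = unicodedata.category(before)
    cat_after = unicodedata.category(after)
    # Letter followed by closing punctuation (Pe, Pf).
    if is_letter(before) and cat_after in ("Pe", "Pf"):
        return False
    # Opening punctuation (Ps, Pi) followed by a letter.
    if is_letter(after) and cat_before in ("Ps", "Pi"):
        return False
    # Letter next to connector, dash, or other punctuation, on either side.
    either_side = ("Pc", "Pd", "Po")
    if is_letter(before) and cat_after in either_side:
        return False
    if is_letter(after) and cat_before in either_side:
        return False
    return True
```

A word-boundary detector would call `may_insert` on each position it proposes and drop the ones that return `False`. For space-separated text such as English, every proposed position adjacent to a space is rejected, which is consistent with the note in the list above.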
269+
198270
<h4 id=word-boundary-expansion>
199271
Making Word Boundaries Visible: the 'word-boundary-expansion' property</h4>
200272
@@ -221,26 +293,33 @@ Making Word Boundaries Visible: the 'word-boundary-expansion' property</h4>
221293
into other word-separating characters,
222294
to accommodate variant typesetting styles.
223295
224-
<dl dfn-for="zero-width-space-expansion" dfn-type="value">
296+
<dl dfn-for="word-boundary-expansion" dfn-type="value">
225297
<dt><dfn>none</dfn>
226298
<dd>This property has no effect.
227299
228300
<dt><dfn>space</dfn>
229301
<dd>
230-
All instances of U+200B ZERO WIDTH SPACE
302+
Instances of U+200B ZERO WIDTH SPACE
231303
within the [=text run=] children of this element
232304
are replaced by U+0020 SPACE.
233305
234306
<dt><dfn>ideographic-space</dfn>
235307
<dd>
236-
All instances of U+200B ZERO WIDTH SPACE
308+
Instances of U+200B ZERO WIDTH SPACE
237309
within the [=text run=] children of this element
238310
are replaced by U+3000 IDEOGRAPHIC SPACE.
239311
</dl>
240312
313+
The User Agent must not replace
314+
instances of U+200B immediately preceding or following
315+
a [=forced line break=]
316+
(ignoring any intervening inline box boundaries,
317+
and associated 'margin'/'border'/'padding').
318+
241319
Instances of <{wbr}> are considered equivalent to U+200B,
242320
and are also replaced,
243-
as are U+200B inserted by ''word-boundary-detection: auto()''.
321+
as are [=virtual word boundaries=] inserted by 'word-boundary-detection'.
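The replacement rule and its forced-line-break exception can be illustrated with a small Python sketch. It rests on strong simplifying assumptions: text is a flat string, a forced line break is modelled as a literal `"\n"`, and inline box boundaries, <{wbr}>, and virtual word boundaries are not modelled at all. The function name `expand_zwsp` is hypothetical.

```python
ZWSP = "\u200b"


def expand_zwsp(text: str, replacement: str = " ") -> str:
    """Replace U+200B with `replacement`, except next to a forced line break.

    Assumption: a forced line break is represented here by "\n".
    """
    out = []
    for i, ch in enumerate(text):
        if ch == ZWSP:
            prev_is_break = i > 0 and text[i - 1] == "\n"
            next_is_break = i + 1 < len(text) and text[i + 1] == "\n"
            if not (prev_is_break or next_is_break):
                out.append(replacement)
                continue
        # Everything else, including U+200B next to a break, is kept as-is.
        out.append(ch)
    return "".join(out)
```

Passing `"\u3000"` as `replacement` corresponds to the ''ideographic-space'' value, and the default corresponds to ''space''.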
322+
244323
Unlike 'text-transform',
245324
this substitution happens before [[CSS-TEXT-3#white-space-phase-1]]
246325
so that later operations that depend on the characters in the content
