[css-text-4] Revisions to word-boundary-detection based on review by fantasai

frivoal · frivoal · commit dfa8b4e9c2c4 · 2019-08-16T19:18:18.000+02:00
diff --git a/css-text-4/Overview.bs b/css-text-4/Overview.bs
@@ -23,6 +23,8 @@ spec: css-text-3; type: property
 	text: word-spacing
 spec: css-text-3; type: dfn
 	text: forced line break
+	text: word-separator character
+	text: other space separator
 </pre>
 
 <pre class=biblio>
@@ -92,74 +94,71 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
 	Animation type: discrete
 	</pre>
 
+	This property allows the author to decide
+	whether and how
+	the User Agent must analyse the content
+	to determine where word boundaries are,
+	and to insert [=virtual word boundaries=] accordingly.
+
+	A <dfn data-dfn-for='' data-dfn-type=dfn>virtual word boundary</dfn> is similar to the presence
+	of the ZERO WIDTH SPACE (U+200B) character:
+	it introduces a [=soft wrap opportunity=]
+	and is affected by the 'word-boundary-expansion' property;
+	its presence has no effect on text shaping,
+	nor on 'word-spacing'.
+	However, its insertion must have no effect on the underlying content,
+	and must not affect the content of a plain text copy &amp; paste operation.
+
 	<dl dfn-type=value dfn-for=word-boundary-detection>
 		<dt><dfn>manual</dfn>
 		<dd>
-			This property has no effect.
+			The User Agent must not insert [=virtual word boundaries=].
 
 		<dt><dfn>auto(<<lang>>)</dfn>
 		<dd>
 
 			This value directs the User Agent to perform language-specific content analysis
-			to determine where word boundaries are.
-			The specific algorithm to be used is UA-dependent.
-			However, inline element boundaries and out-of-flow elements must be ignored when determining word boundaries.
+			to determine where to insert [=virtual word boundaries=].
 
 			<dfn dfn-type=type><<lang>></dfn> must be a valid CSS <<ident>> or <<string>>.
 			It represents an IETF BCP 47 language range
-			(See [[BCP47]]).
-			If the UA does not support word-boundary detection for <em>all</em> languages represented by the specified range,
+			(see [[BCP47]]).
+			If the UA does not support word-boundary detection
+			for <em>all</em> languages represented by the specified range,
 			it must reject that value at parse-time.
 
-			<div class=example>
-				If a User Agent has a word-boundary detection system for Cantonese
-				that is not suitable for the broader set of Chinese languages,
-				it must accept ''lang(yue)'', ''lang(zh-yue)'', or ''lang(zh-HK)'',
-				but not ''lang(zh)'' or ''lang(zh-Hant)''.
-
-				However, if the User Agent supports a generic word-boundary detection system
-				that is suitable for Chinese in general,
-				it should accept the broad ''lang(zh)'' characterization,
-				as well as any more specific ones,
-				such as ''lang(zh-yue)'', ''lang(zh-Hant-HK)'', ''lang(zh-Hans-SG)'', or ''lang(zh-hak).
-			</div>
+			Note: Wildcards <em>in the language subtag</em> would imply
+			support for detecting word boundaries in an undefined and effectively unlimited set of languages.
+			As this is not possible,
+			wildcards in the language subtag always result in the declaration
+			being treated as invalid.
 
 			Note: Whether a word boundary detection system designed for one language
 			is suitable for some or all dialects of that language is somewhat subjective,
-			and this specifications leaves it at the appreciation of the User Agent.
+			and this specification leaves it to the discretion of the User Agent.
 			Even if a detection system is not able to cope with all nuances of a particular dialect,
 			it may be reasonable to claim support
 			if the detection correctly recognizes word boundaries most of the time.
 			However, the User Agent would do a disservice to authors and users
 			if it claimed support for languages
-			where it fails to detect most word boundaries.
-
-			Note: Wildcards <em>in the language subtag</em> would imply
-			support for detecting word boundaries in an undefined and effectively unlimited set of languages.
-			As this this is not possible,
-			wildcards in the language subtag must always be treated as invalid.
+			where it fails to detect most word boundaries
+			or has a high error rate.
 
 			If the element’s [=content language=],
 			as represented in BCP 47 syntax [[BCP47]],
-			does <em>not</em> matches the language range described by the computed value's <<lang>>
+			does <em>not</em> match the language range described by the computed value's <<lang>>
 			in an extended filtering operation
 			per [[RFC4647]] <cite>Matching of Language Tags</cite> (section 3.3.2),
-			then the [=used value=] is set to ''word-boundary-detection/manual'',
+			then the [=used value=] is ''word-boundary-detection/manual'',
 			and this property has no effect on this element.
-			<span class=note>(This is the same maching logic as the one used for the '':lang()'' selector, negated.)</span>
 			Otherwise,
-			the User Agent must insert the ZERO WIDTH SPACE (U+200B) character
+			the User Agent must insert a [=virtual word boundary=]
 			at each detected word boundary
 			within the [=text run=] children of this element.
-			However, the UA must not insert U+200B:
-			* at the beginning or end of a [=block container=]
-			* at the beginning or end of an [=inline box=] whose parent box has a [=used value=] of ''word-boundary-detection/manual''
+			Within the constraints set by this specification,
+			the specific algorithm used is UA-dependent.
 
-			The insertion happens before layout,
-			so all layout operations that depend on the characters in the content
-			(such as [[CSS-TEXT-3#white-space-rules]], [=line breaking=], or [=intrinsic sizing=])
-			must take the presence of that character into account.
-			[=Selectors=] are not affected.
+			Note: This is the same matching logic as the one used for the '':lang()'' selector.
 
 			Issue: Should we allow, or require, Canonicalization of language tags and ranges,
 			as per [[RFC5646]] section 4.5,
@@ -172,11 +171,24 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
 			for such mappings.
 	</dl>
 
-	Note: Specifying the language for which the word boundary detection is to be performed
-	is required in order to make this feature meaningfully testable with ''@supports''.
+	<div class=example>
+		If a User Agent has a word-boundary detection system for Cantonese
+		that is not suitable for the broader set of Chinese languages,
+		it is expected to accept ''auto(yue)'', ''auto(zh-yue)'', or ''auto(zh-HK)'',
+		but not ''auto(zh)'' or ''auto(zh-Hant)''.
+
+		However, if the User Agent supports a generic word-boundary detection system
+		that is suitable for Chinese in general,
+		it is expected to accept the broad ''auto(zh)'' characterization,
+		as well as any more specific ones,
+		such as ''auto(zh-yue)'', ''auto(zh-Hant-HK)'', ''auto(zh-Hans-SG)'', or ''auto(zh-hak)''.
+	</div>
 
 	<div class=example>
-		Japanese text normally allows line breaking between letters of a word
+		Specifying the language for which the word boundary detection is to be performed
+		is required in order to make this feature meaningfully testable with ''@supports''.
+
+		For example, Japanese text normally allows line breaking between letters of a word
 		(see ''word-break: normal'').
 		The following code disables that in <code>h1</code> elements,
 		and only allows line breaking at autodetected word boundaries instead,
@@ -195,6 +207,66 @@ Detecting Word Boundaries: the 'word-boundary-detection' property</h4>
 		</code></pre>
 	</div>
 
+	[=Virtual word boundary=] insertion happens before [[CSS-TEXT-3#white-space-phase-1]]
+	and before [[#word-boundary-expansion]].
+	Later operations
+	(including [[CSS-TEXT-3#white-space-rules]], [=line breaking=], and [=intrinsic sizing=])
+	must take the presence of the [=virtual word boundary=] into account.
+	[=Selectors=] are not affected.
+
+	Inline box boundaries
+	and out-of-flow elements must be ignored
+	when determining word boundaries.
+
+	If a word boundary is found at the same position as
+	one or more inline box boundaries,
+	the [=virtual word boundary=] must be inserted
+	in the outermost element that participates in this inline box boundary.
+
+	<div class=example>
+		In the following example,
+		the red “<code><span style="color:red">|</span></code>” indicates
+		reasonable positions for a User Agent to insert virtual word boundaries:
+		<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ<span style="color:red">|</span>สวยงาม</code></pre>
+		If that sentence had contained some inline markup,
+		the following example shows the correct position to insert the virtual word boundaries:
+		<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ<span style="color:red">|</span>&lt;em>สวยงาม&lt;/em></code></pre>
+		The following example shows <em>incorrect</em> positions:
+		<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>คือ&lt;em><span style="color:red">|</span>สวยงาม&lt;/em></code></pre>
+		The following shows the correct positions in a more contrived situation:
+		<pre><code highlight=html>กรุงเทพ<span style="color:red">|</span>&lt;b>&lt;u>คือ&lt;/u><span style="color:red">|</span>&lt;em>สวยงาม&lt;/em>&lt;/b></code></pre>
+	</div>
+
+	The User Agent may tailor its word boundary detection algorithm
+	depending on whether 'line-break' is
+	''loose''/''line-break/normal''/''line-break/strict''.
+
+	The User Agent must not insert a [=virtual word boundary=]:
+	<ul>
+		<li>
+			at the beginning or end of any box
+			(including [=inline boxes=])
+			whose parent box has a [=used value=]
+			of ''word-boundary-detection/manual''.
+
+		<li>
+			immediately adjacent to a [=word-separator character=],
+			or an [=other space separator=],
+			or a ZERO WIDTH SPACE (U+200B) character.
+
+			Note: This implies that for languages such as English
+			where words are separated by spaces or other separating characters,
+			''word-boundary-detection/auto(<lang>)'' has no effect.
+
+		<li>
+			between a [=typographic letter unit=]
+			and a subsequent [=typographic character unit=] from the [[!UNICODE]] Pe or Pf classes,
+			or between a [=typographic letter unit=]
+			and a preceding [=typographic character unit=] from the [[!UNICODE]] Ps or Pi classes,
+			or between a [=typographic letter unit=]
+			and an adjacent [=typographic character unit=] from the [[!UNICODE]] Pc or Pd or Po classes.
+	</ul>
+
 <h4 id=word-boundary-expansion>
 Makig Word Boundaries Visible: the 'word-boundary-expansion' property</h4>
 
@@ -221,26 +293,33 @@ Makig Word Boundaries Visible: the 'word-boundary-expansion' property</h4>
 	into other word-separating characters,
 	to accomodate variant typesetting styles.
 
-	<dl dfn-for="zero-width-space-expansion" dfn-type="value">
+	<dl dfn-for="word-boundary-expansion" dfn-type="value">
 		<dt><dfn>none</dfn>
 		<dd>This property has no effect.
 
 		<dt><dfn>space</dfn>
 		<dd>
-			All instances of U+200B ZERO WIDTH SPACE
+			Instances of U+200B ZERO WIDTH SPACE
 			within the [=text run=] children of this element
 			are replaced by U+0020 SPACE.
 
 		<dt><dfn>ideographic-space</dfn>
 		<dd>
-			All instances of U+200B ZERO WIDTH SPACE
+			Instances of U+200B ZERO WIDTH SPACE
 			within the [=text run=] children of this element
 			are replaced by U+3000 IDEOGRAPHIC SPACE.
 	</dl>
 
+	The User Agent must not replace
+	instances of U+200B immediately preceding or following
+	a [=forced line break=]
+	(ignoring any intervening inline box boundaries,
+	and associated 'margin'/'border'/'padding').
+
 	Instances of <{wbr}> are considered equivalent to U+200B,
 	and are also replaced,
-	as are U+200B inserted by ''word-boundary-detection: auto()''.
+	as are [=virtual word boundaries=] inserted by 'word-boundary-detection'.
+
 	Unlike 'text-transform',
 	this substitution happens before [[CSS-TEXT-3#white-space-phase-1]]
 	so that later operations that depend on the characters in the content