[css-text-3] Disentangle content language and writing system #3202

Merged 2 commits on Nov 15, 2018
4 changes: 3 additions & 1 deletion css-fonts-4/Overview.bs
@@ -3448,8 +3448,10 @@ traditions found in Spanish, Italian and French orthography:
If the content language of the element is known according to the
rules of the <a href="https://www.w3.org/TR/CSS21/conform.html#doclanguage">document language</a>,
user agents are required to infer the OpenType language system from
the content language and use that when selecting and positioning
the [=content language=] and use that when selecting and positioning
glyphs using an OpenType font.
If a [=writing system=] has been explicitly specified,
it must take precedence over the customary one implied by the [=content language=].

<!-- previously in level 3, now moved to Level 4 -->
For OpenType fonts, in some cases it may be necessary to explicitly
48 changes: 41 additions & 7 deletions css-text-3/Overview.bs
@@ -180,6 +180,16 @@ Languages and Typesetting</h3>
the universal <code>xml:lang</code> attribute in XML,
and the HTTP <code>Content-Language</code> header for content served over HTTP.

The [=content language=] an element is declared to be in
also identifies the specific written form of that language used in that element,
known as the <dfn export>writing system</dfn>.

Note: Depending on the [=document language=]'s facilities for identifying the [=content language=],
information about the [=writing system=] may only be carried implicitly.
That is typically the case with the [[BCP47]] language tag used in [[HTML]],
although it can optionally indicate the [=writing system=] explicitly
using a script subtag.
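
As a rough illustration of the note above, the following Python sketch pulls an explicit script subtag out of a [[BCP47]]-style tag and returns `None` when the [=writing system=] is only carried implicitly. The function name `script_subtag` is hypothetical, and the parsing is deliberately simplified: it does not validate the tag and ignores grandfathered tags.

```python
def script_subtag(tag):
    """Return the script subtag of a BCP 47-style tag, or None if absent.

    Simplified sketch: a script subtag is taken to be the first
    4-letter alphabetic subtag after the primary language subtag.
    """
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence;
            # no script subtag can appear after it.
            break
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            # Script subtags are conventionally written in title case.
            return sub.title()
    return None
```

Under these assumptions, `script_subtag("ja-Latn")` yields `"Latn"`, while `script_subtag("ja")` yields `None`, meaning the writing system is implied by the language rather than stated.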

Language and writing system conventions can affect
line breaking, hyphenation, justification, glyph selection,
and many other typographic effects.
@@ -391,6 +401,7 @@ Characters and Letters</h3>
replaced with a different set of mappings to their respective
undotted/dotted counterparts, which do not exist in English. This
mapping must only take effect if the <a>content language</a> is Turkish
written in its modern Latin-based <a>writing system</a>
(or another Turkic language that uses Turkish casing rules);
in other languages, the usual mapping of &ldquo;I&rdquo;
and &ldquo;i&rdquo; is required. This rule is thus conditionally
@@ -726,8 +737,8 @@ Characters and Letters</h3>
<code>W</code>, or <code>H</code> (not <code>A</code>),
and neither side is Hangul or Emoji (Unicode property <code>Emoji</code>),
then the segment break is removed.
<li>Otherwise, if the <a>content language</a> of the <a>segment break</a>
is Chinese, Japanese, or Yi,
<li>Otherwise, if the <a>writing system</a> of the <a>segment break</a>
is <a for=writing-system>Chinese</a>, <a for=writing-system>Japanese</a>, or Yi,
and the character before or after the segment break
is punctuation or a symbol (Unicode <a>general category</a> P* or S*)
and has an <a>East Asian Width property</a> of <code>A</code>
@@ -1166,7 +1177,7 @@ Line Breaking Details</h3>
</ul>
<li>
The following breaks are allowed for ''line-break/normal'' and ''loose'' line breaking
if the <a>content language</a> is Chinese or Japanese,
if the <a>writing system</a> is <a for=writing-system>Chinese</a> or <a for=writing-system>Japanese</a>,
and are otherwise forbidden:
<ul>
<li>breaks before hyphens:<br>
@@ -1187,7 +1198,7 @@ Line Breaking Details</h3>
</ul>
<li>
The following breaks are allowed for ''loose''
if the <a>content language</a> is Chinese or Japanese
if the <a>writing system</a> is <a for=writing-system>Chinese</a> or <a for=writing-system>Japanese</a>
and are otherwise forbidden:
<ul>
<li>breaks before certain centered punctuation marks:<br>
@@ -1208,7 +1219,7 @@ Line Breaking Details</h3>
<p class="note">In the requirements listed above,
no distinction is made among the levels of strictness in non-CJK text:
only CJK codepoints are affected,
unless the text is marked as Chinese or Japanese,
unless the text is marked as <a for=writing-system>Chinese</a> or <a for=writing-system>Japanese</a>,
in which case some additional common codepoints are affected.

<div class="example">
@@ -2506,7 +2517,7 @@ Appendix D: Scripts and Spacing</h2>
The following <a>Unicode scripts</a> are included:
Bopomofo, Han, Hangul, Hiragana, Katakana, and Yi.
Characters of the <a>East Asian Width property</a> <code>W</code> and <code>F</code> are also included,
but <code>A</code> characters are included only if the <a>content language</a> is Chinese, Korean, or Japanese.
but <code>A</code> characters are included only if the <a>writing system</a> is <a for=writing-system>Chinese</a>, <a for=writing-system>Korean</a>, or <a for=writing-system>Japanese</a>.
<dt><dfn>clustered scripts</dfn></dt>
<dd>Clustered scripts have discrete units
and break only at word boundaries,
@@ -2579,6 +2590,8 @@ Characters and Properties</h2>
<h2 id="script-tagging" class="no-num">Appendix F.
Tagging Content by Writing System</h2>

<p><em>This appendix is normative.</em></p>

While most languages have a preferred writing system,
many can also be transcribed into a different writing system.
As a common example, most languages have at least one Latin transcription,
@@ -2591,7 +2604,8 @@ Tagging Content by Writing System</h2>
does not use word spaces,
and should therefore be typeset as for Chinese.

Authors can indicate the use of an atypical writing system
In [[HTML]] or any other <a>document language</a> using [[BCP47]] to identify the [=content language=],
authors can indicate the use of an atypical writing system
with script subtags.
For example, to indicate use of the Latin writing system
for languages which don't natively use it,
@@ -2629,6 +2643,26 @@ Tagging Content by Writing System</h2>
not the conventions of that language in a different writing system,
which would be inappropriate to the writing system used in this case.

The full correspondence between languages and their most common writing system
is out of scope for this document.
However, User Agents must assume at least the following:

* If the [=content language=] is Chinese and the [=writing system=] is unspecified,
or for any [=content language=] if the [=writing system=] is specified to be one of the ''Hant'', ''Hans'', ''Hani'', ''Hanb'', or ''Bopo'' [[ISO15924]] codes,
then the [=writing system=] is <dfn no-export for=writing-system>Chinese</dfn>.
* If the [=content language=] is Japanese and the [=writing system=] is unspecified,
or for any [=content language=] if the [=writing system=] is specified to be one of the ''Jpan'', ''Hrkt'', ''Hira'', or ''Kana'' [[ISO15924]] codes,
then the [=writing system=] is <dfn no-export for=writing-system>Japanese</dfn>.
* If the [=content language=] is Korean and the [=writing system=] is unspecified,
or for any [=content language=] if the [=writing system=] is specified to be one of the ''Kore'', ''Hang'', or ''Jamo'' [[ISO15924]] codes,
then the [=writing system=] is <dfn no-export for=writing-system>Korean</dfn>.
* The [=writing system=] is only considered to be <dfn for=writing-system lt='known | unknown'>unknown</dfn>
if the [=content language=] itself is unknown,
or if it explicitly indicates an unknown writing system.

Note: Mere omission of the [=writing system=] information when the [=content language=] is specified
means that the [=writing system=] is implied, not unknown.
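
The minimum requirements listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm a user agent is required to use: the function name `resolve_writing_system`, the `CUSTOMARY` table keyed by primary language subtag, the `"other"` return value, and the choice of codes in `EXPLICITLY_UNKNOWN` are all hypothetical, since the text does not name which codes indicate an unknown writing system.

```python
# Script codes listed in the rules above.
CHINESE_SCRIPTS = {"Hant", "Hans", "Hani", "Hanb", "Bopo"}
JAPANESE_SCRIPTS = {"Jpan", "Hrkt", "Hira", "Kana"}
KOREAN_SCRIPTS = {"Kore", "Hang", "Jamo"}

# Assumption: customary writing system, keyed by primary language subtag.
CUSTOMARY = {"zh": "Chinese", "ja": "Japanese", "ko": "Korean"}

# Assumption: codes treated as explicitly indicating an unknown writing system.
EXPLICITLY_UNKNOWN = {"Zyyy", "Zzzz"}


def resolve_writing_system(language, script=None):
    """Classify the writing system from a language and optional script code.

    `language` is a primary language subtag, or None if the content
    language is unknown. `script` is an ISO 15924 code, or None if the
    writing system is not explicitly specified.
    """
    if script in EXPLICITLY_UNKNOWN:
        return "unknown"
    if script is not None:
        # An explicitly specified script takes precedence over the
        # customary one implied by the language.
        if script in CHINESE_SCRIPTS:
            return "Chinese"
        if script in JAPANESE_SCRIPTS:
            return "Japanese"
        if script in KOREAN_SCRIPTS:
            return "Korean"
        return "other"
    if language is None:
        # No content language at all: the writing system is unknown.
        return "unknown"
    # Script omitted: the writing system is implied, not unknown.
    return CUSTOMARY.get(language, "other")
```

With this sketch, `resolve_writing_system("ja", "Latn")` returns `"other"` rather than `"Japanese"`, and `resolve_writing_system("ko", "Hani")` returns `"Chinese"`, reflecting that an explicit script subtag overrides the language's customary writing system.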

More advice on language tagging can be found in
the <a href="https://www.w3.org/International/core/">Internationalization Working Group</a>’s
<a href="https://www.w3.org/International/articles/language-tags/">“Language tags in HTML and XML”</a>