
[css-fonts] Handling of Standardized Variation Sequences #1710


Open
hfhchan opened this issue Aug 10, 2017 · 19 comments
Labels
css-fonts-4 (Current Work), i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response)

Comments

@hfhchan

hfhchan commented Aug 10, 2017

In Step 2a of Section 5.3 Cluster matching of CSS Fonts 3:

If c1 is a variation selector, system fallback must be used to find
a font that supports the full sequence of b + c1.

I've tested Chrome and Firefox, and neither does any system font fallback when the given font contains a glyph for b but not for b + c1.

I tested with 齋󠄁齋 (the first has the variation selector U+E0101 appended to it) and a CSS declaration of font-family: SimSun, HanaMinA;, where SimSun does not support the variation sequence but HanaMinA does:

Result in Firefox: [screenshot]

Result in Chrome: [screenshot]
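
For reference, a minimal reduction of the test might look like the sketch below (assuming both fonts are installed locally, as in the report above; per the Cluster Matching rules, the 齋 + U+E0101 cluster should be rendered with HanaMinA rather than with SimSun's base glyph):

```css
/* Sketch of the reported test: SimSun does not support the 齋 + U+E0101
   variation sequence but HanaMinA does (per the report above), so step 2a
   of Cluster Matching implies the sequence should not simply fall back to
   SimSun's base glyph. */
.test {
  font-family: SimSun, HanaMinA;
}
```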

The spec should either be amended to reflect the behavior implemented by browsers, or the browsers' behavior should be changed. Unicode Variation Selectors involve a cmap format 14 subtable lookup, and it would be understandable if performing such complex table lookups during the font-selection phase were considered prohibitively expensive.

Due to the nature of the Han script, it is often hard to objectively determine what counts as the same character and what does not. Different people have different expectations. CJK Unification in ISO 10646 was a very controversial decision and continues to be controversial today. Reliably rendering Unicode Variation Sequences is necessary and may have legal ramifications.

The fallback to b behavior is problematic because it may not be what the author intended, and the user has no idea at all. More often, the preferred behavior is that a "tofu" (.notdef) is displayed instead.

In addition, China and TCA are likely to be using Unicode Variation Selectors to encode historic variants of CJK Unified Ideographs (assuming the decision by the IRG is approved by WG2 in the coming meeting in September). Variant characters with visually significant differences will be approved for unification with their more common character, provided that the variant is similar in structure, rare in modern use, and attested to be exactly equivalent to the base character in semantics. In these cases, getting .notdef is usually preferred over getting the base character's glyph if a given font doesn't have that specific glyph variant.

At the same time, it may be useful in historical text digitization projects to dynamically switch between showing characters with the glyphs as they appear in the books and with the glyphs used in the modern day. This could be accomplished by stripping all the variation selectors out via regex and innerHTML, or, preferably, activated via a CSS property or feature/flag.

To cater for such behavior, I suggest that a new CSS property and/or OpenType feature/flag be introduced.

These behaviors could be implemented via a new CSS property such as font-variation-sequences with the following values (a usage sketch follows the list):

  • auto (fall back to b for VS-16 and below if b + c1 is missing, .notdef for VS-17 and above if b + c1 is missing)
  • ignore-missing (fall back to b if b + c1 is missing),
  • tofu-missing (fall back to .notdef if b + c1 is missing),
  • ignore-all (ignore all Unicode Variation Selectors).
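
A purely hypothetical usage sketch (neither the property name nor the values exist in CSS today; they are only the suggestion above):

```css
/* Hypothetical sketch only: font-variation-sequences is the property
   proposed in this issue, not an existing CSS property. */
.default-behavior { font-variation-sequences: auto; }           /* per-VS-range default described above */
.modernized       { font-variation-sequences: ignore-all; }     /* drop every variation selector when rendering */
.lenient          { font-variation-sequences: ignore-missing; } /* fall back to b when b + c1 is unsupported */
.strict           { font-variation-sequences: tofu-missing; }   /* show .notdef when b + c1 is unsupported */
```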

It could also be piggybacked by introducing a new OpenType flag, maybe named "tofu", so the different behaviors could be activated directly via font-variation-settings.
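
If the flag route were taken, it might be written as below (again hypothetical; "tofu" is merely the name suggested above, not a registered OpenType axis or feature):

```css
/* Hypothetical: activate the strict (.notdef) behavior via a made-up "tofu"
   flag, piggybacking on the existing font-variation-settings syntax. */
.strict {
  font-variation-settings: "tofu" 1;
}
```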

@litherum litherum added the css-fonts-4 Current Work label Aug 10, 2017
@litherum
Contributor

litherum commented Mar 6, 2018

I've tested Chrome and Firefox, and neither does any system font fallback when the given font contains a glyph for b but not for b + c1.

Yep, browser support for variation selectors is quite lacking right now. I don't know of any browser which handles them correctly.

The spec should either be amended to reflect the behavior implemented by browsers, or the browsers' behavior should be changed.

Yep, the browsers' behavior should be changed.

The fallback to b behavior is problematic because it may not be what the author intended, and the user has no idea at all. More often, the preferred behavior is that a "tofu" (.notdef) is displayed instead.

Yep, this is why I wrote the spec to describe this behavior as required.

In these cases, getting .notdef is usually preferred over getting the base character's glyph if a given font doesn't have that specific glyph variant.

Yep.

These behaviors could be implemented via a new CSS property such as font-variation-sequences

I don't think this is worth a whole new property. Control over these facilities already exists in the form of variation selectors. The niche applications that need to perform their own handling of these characters can use JavaScript to implement their desired behavior.

@Crissov
Contributor

Crissov commented Mar 6, 2018

JFTR, this is related to text-transform-variant proposed in #2166 and #1144.

@hfhchan
Author

hfhchan commented Mar 6, 2018

Ignoring all variation selectors (showing the modern form) and honoring all variation selectors (showing the historical forms) are extremely common use cases. The main controversy of Han Unification was that there wasn't an easy way to switch between the two.

That could be implemented in JavaScript, yes. But choosing whether to show historical text in its modern forms or in the historical forms found in the source is purely a presentation/style issue, which would be better accomplished via CSS.

Implementing a dynamic override in JavaScript would entail traversing the DOM and storing the original string of every text node.

@js-choi

js-choi commented Mar 6, 2018

@hfhchan As far as I recall, the controversy over Unicode Han Unification was over the very idea that a Han ideograph written in one style is fundamentally the same plain-text abstract character as the ideograph written in another style—as well as, to a lesser extent, the increased difficulty of writing two different styles of the same Han ideograph, in the same run of plain text without rich-text formatting. The latter problem was ameliorated by the introduction of ideographic variation selectors.

But it is easy for an author to switch between two styles of ideographs in non-VS-using text: just switch the font. The point of Han Unification is that the choice of an ideograph’s style is not a concern of its fundamental plain-text semantics but rather of its rich-text formatting.

However, variation selectors are part of the fundamental semantics of text: the plain text level. If an author uses variation selectors in their plain text’s ideographs, that means the author believes that the particular style of those ideographs is a part of their text’s fundamental, plain-text semantics, rather than merely rich-text formatting. The author therefore expects their ideographs to always render with that specific style, as long as that style can be rendered. It’s the same rule with variation selectors for emoji characters and mathematical symbols.

At least, that’s my understanding of the Unicode Standard. Default rendering styles are one thing, but overriding explicitly-included-by-the-author variation selectors strikes me as a strange feature: one that makes the standard semantics of plain text less stable. The feature would, for instance, cause web text with variation selectors to unexpectedly differ, when copied and pasted into some destination (as well as when the text is interpreted by crawlers and other plain-text-processing applications), from how it is visually rendered by a web browser. Why not actually delete the variation selectors, when they are no longer a desirable part of the text’s fundamental plain-text semantics?

@hfhchan
Author

hfhchan commented Mar 7, 2018

The feature would, for instance, cause web text with variation selectors to unexpectedly differ, when copied and pasted into some destination (as well as when the text is interpreted by crawlers and other plain-text-processing applications), from how it is visually rendered by a web browser.

That happens with text-transform: uppercase too.

The author therefore expects their ideographs to always render with that specific style, as long as that style can be rendered.

Not exactly. In China some users will prefer a more accurate representation of the text but more will prefer the characters to be rendered in the modern orthography. For many digitization projects that means building two different strings, one often using PUA (because ideographic variation selectors are not well supported).

The point is, Han variants could indicate both gloss and semantic intention. The person doing the digitization may not be able to make a good guess, so they have to preserve it in some way and provide a client-side toggle, much like choosing between serif and sans-serif in a browser's reading mode.

@hfhchan
Author

hfhchan commented Mar 7, 2018

As far as I recall, the controversy over Unicode Han Unification was over the very idea that a Han ideograph written in one style is fundamentally the same plain-text abstract character as the ideograph written in another style

The idea that a character can have a dozen glyph forms and still be the same character has been the majority opinion reflected in dictionaries since as early as the Tang dynasty. The mainstream linguistic opinion in modern China has also long concluded that multiple glyphs can be grouped into the same character, and standardized orthographies are national/regional policy in China. The Japanese also extensively unified glyphs in the earlier Japanese Industrial Standards. It is rather obvious that the initial pushback against Han unification was about far more than linguistic or technical issues.

The current pushback is largely based on the inability to accurately reproduce a certain glyph the author intends.

@js-choi

js-choi commented Mar 7, 2018

I will definitely defer to your expertise. I just have a couple more questions as to the use case of the proposal; my apologies, and thank you for your patience.

The feature would, for instance, cause web text with variation selectors to unexpectedly differ, when copied and pasted into some destination (as well as when the text is interpreted by crawlers and other plain-text-processing applications), from how it is visually rendered by a web browser.

That happens with text-transform: uppercase too.

This is indeed true.


The author therefore expects their ideographs to always render with that specific style, as long as that style can be rendered.

Not exactly. In China some users will prefer a more accurate representation of the text but more will prefer the characters to be rendered in the modern orthography. For many digitization projects that means building two different strings, one often using PUA (because ideographic variation selectors are not well supported).

The point is, Han variants could indicate both gloss and semantic intention. The person doing the digitization may not be able to make a good guess, so they have to preserve it in some way and provide a client-side toggle, much like choosing between serif and sans-serif in a browser's reading mode.


When a stylesheet author writes text-transform: uppercase;, they risk modifying the fundamental plain-text semantics intended by the authors of the text content.

This risk is generally low, at least for Latin scripts, which do not change meaning that much depending on case (though there are certain significant exceptions).

However, the risks of text transformation increase as the significance of characters increases. A hypothetical text transformation (from styled Latin mathematical letters to plain Latin characters and from superscript numbers to regular numbers) risks rendering “ℋ = ∫ 𝑑𝜏(𝜖𝐸² + 𝜇𝐻²)” as the quite semantically different “H = ∫ dτ(ϵE2 + µH2)”. Another hypothetical text transformation (from colored symbols to regular symbols) risks changing “He had difficulty distinguishing 🔴 and 💚” to the semantically different “He had difficulty distinguishing ⚫️ and 🖤”. And a hypothetical text transformation (from ideograph variation sequences to ideographs all of one variation) would have similar risks.

“The current pushback [against Han Unification] is largely based on the inability to accurately reproduce a certain glyph the author intends,” but here “author” presumably refers to the authors of text content, rather than the authors of stylesheets. Ideographic variation sequences already give text authors the power to precisely control what glyph is rendered on a per-character basis, so I am uncertain why pushback would still be occurring. Yet this proposal would take some of that power away from text authors, by giving stylesheets the ability to override the text author’s intent.

What I’m wondering about here is the conflict between the desires of text authors and the desires of stylesheet authors for websites that display the text. For whose benefit would this proposal be? As far as I can tell, this proposal would weaken the ability of text authors to have their content rendered using the correct glyphs, giving more of that power to the stylesheet author instead.

The point is, Han variants could indicate both gloss and semantic intention. The person doing the digitization may not be able to make a good guess, so they have to preserve it in some way and provide a client-side toggle, much like choosing between serif and sans-serif in a browser's reading mode.

Basically—why would the person doing the digitization not simply use the right variation selector, instead of relying on rich-text markup? I am actually very interested in this, since all of this does seem to be a real-world occurrence.

Of note is that Unicode Technical Report 51: Unicode Emoji recommends that emoji/text variation selectors always be respected, regardless of the environment’s default rendering style. This is presumably because of the same reason: the very existence of variation selectors as plain text is because they are part of the intrinsic semantics of the plain text.

But again, I am almost certainly much less of an expert here compared to you. My apologies for the questions, and thank you for your patience.

@hfhchan
Author

hfhchan commented Mar 7, 2018

In the use case of the digitization of old Han manuscripts, most of the authors have long since passed away. The entity that is converting the manuscript into digital text is also the stylesheet author. In this case, the text is converted into digital text with all the variation selectors, to reproduce the form that was used in the historical documents.

Take the example of the Analects 論語. Sinologists who are studying the text may prefer a very close approximation of the exact glyphs used, since these may include important information on when that particular version was published or amended. The general Chinese reader, however, would not care about the exact orthography used and cares more about the semantic content. They may prefer all variants to be canonicalized to their modern orthography.

In this case, the stylesheet author only provides a mode (say, input.normalize:checked+p) that strips the variants, while the reader is the person who activates the stripping at his/her convenience. The reader is responsible for his/her actions if there is any loss in semantic meaning. (For phrases where the exact style should never be stripped, they could be further marked up with HTML+CSS.)
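
A sketch of such a toggle, assuming the hypothetical font-variation-sequences property proposed earlier in this thread (today the normalization itself would have to be done in script):

```css
/* Reader-activated toggle (sketch): when the checkbox is checked, variation
   selectors are ignored for display only; the underlying text keeps its VSes.
   font-variation-sequences is the hypothetical property from this issue. */
input.normalize:checked + p {
  font-variation-sequences: ignore-all;
}
```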

This risk is generally low, at least for Latin scripts, which do not change meaning that much depending on case (though there are certain significant exceptions).

IRG member bodies are supposed to verify that the variants they submit to be encoded via variation selectors are indeed unifiable and not non-cognate. If two variants are recognized as semantically different, they would be disunified under the non-cognate rule.

@js-choi

js-choi commented Mar 7, 2018

@hfhchan: Thanks for the explanation; this use case makes sense to me now.

If this proposal handles standardized variation sequences, then it probably needs to address how it applies to emoji/text, mathematical, and Mongolian variation sequences too.

@kojiishi
Contributor

kojiishi commented Mar 7, 2018

/cc @drott

@Crissov
Contributor

Crissov commented Mar 7, 2018

I’m afraid we have two closely related but different issues at hand.

  1. Ignoring all VSs.
  2. Specifying fallback in case the VS sequence has no glyph in the current font but the base character does.

The first issue, which I proposed independently, is in scope for [css-text]; the second indeed sounds more like [css-fonts] is the right place.

Unicode does not concern itself with font selection much, so while it clearly mandates that the default glyph of the base character should be shown rather than no glyph (or a .notdef glyph), it does not say whether choosing an alternative font that does have a glyph for the SVS would be preferred over that. Unless I missed something, of course.

@js-choi

js-choi commented Mar 7, 2018

@Crissov Good points. Also, the question in that last paragraph may be worth bringing up to the Unicode mailing lists; I think it could be within Unicode’s scope.

@hfhchan
Author

hfhchan commented Mar 7, 2018

@kenlunde

@jfkthame
Contributor

jfkthame commented Mar 8, 2018

I find the second part of this statement

The fallback to b behavior is problematic because it may not be what the author intended, and the user has no idea at all. More often, the preferred behavior is that a "tofu" (.notdef) is displayed instead.

somewhat surprising. Displaying a .notdef glyph makes the intended character completely opaque to the reader; there is no clue as to what the author intended. It's hard to imagine a situation where this would be desirable for a reader.

ISTM that a more useful fallback, in the case where no available font supports the sequence b + c1, might be to render b, .notdef, or even b, <visual representation of VS code>; i.e. the reader would see the base glyph, and therefore be able to read the character, but would also see some kind of additional mark -- even if only a tofu -- which provides a clue that there is "something special" about the text at this point.

This would, however, conflict with Unicode's classification of the variation selectors as Default Ignorable codepoints, which implies that when not supported, they should have no visible rendering: "When such a special-use character is not supported by an implementation, it should not be displayed with a visible fallback glyph, but instead simply not be rendered at all." (Unicode, §5.3.)

This suggests that fallback to b (with no visible indication of the unsupported variation selector) is the correct default behavior when no available font supports b + c1; a mode that makes the variation selector visible would be a special, non-default rendering comparable to a word processor's "show invisibles" mode that adds visible marks to spaces, tabs, carriage return, etc.

@hfhchan
Author

hfhchan commented Mar 8, 2018

Hmm, that behavior makes sense for emoji and mathematical operators, but probably would not align with expectations for CJK users.

Luckily, a quick check seems to indicate that default ignorable is not covered by the Unicode stability policy.

@kenlunde
Member

kenlunde commented Mar 8, 2018

Default Ignorable in the context of UVSes (Unicode Variation Sequences, which can be SVSes, EVSes, or IVSes) simply means that if an implementation cannot resolve a UVS to its intended glyph, which typically means that the selected font does not support the specified UVS, the base character should instead be rendered, and the VS should not be displayed (but not discarded). The extent to which an implementation makes an attempt to resolve a UVS—especially during font fallback—is completely up to the implementation.

@Crissov
Contributor

Crissov commented Mar 8, 2018

JFTR, from the Unicode FAQ (with my highlighting):

Q: How should variation sequences be displayed?

A: When they are valid variation sequences, they should be displayed as illustrated in the Unicode code charts, the emoji charts, or in the Ideographic Variation Database. When a variation sequence is not valid or its display is not supported, the base character is displayed as usual, and the variation selector is invisible. See Display of Unsupported Characters.

Q: What changes does a browser developer need to make to support variation sequences?

A: Browsers generally use a font fallback mechanism to display web pages. This allows users to read text when the font specified in the web page is unavailable or doesn't support all the characters that are referenced on that web page. A simple but insufficient mechanism is to display characters in a font up until the first character that can't be displayed. Such a mechanism fails with variation sequences. A better mechanism is to treat a combining character sequence as a single entity for the purpose of font substitution. Because variation selectors have the General_Category property value of Nonspacing_Mark, this treatment allows variation sequences to be handled correctly.

@drott
Collaborator

drott commented Mar 12, 2018

I would prefer not to introduce an additional property. I think the best way to address this is for authors to specify at least one font known to have coverage, to avoid going into system fallback for variation selectors, plus improvements to browsers so that they follow the Cluster Matching rules more closely.
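
As a sketch of the author-side part of that suggestion (assuming, as in the original report, that HanaMinA covers the variation sequences used on the page):

```css
/* List a font known to cover the needed variation sequences in the stack,
   so the UA can satisfy b + c1 without resorting to system fallback. */
.ivs-text {
  font-family: SimSun, HanaMinA, serif;
}
```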

While Chrome does take the whole grapheme cluster into account for font fallback, for variation selectors our shaping library considers a grapheme successfully shaped if b is found already in the "first pass", i.e. before doing a fallback pass to find a glyph for the full b + c1 sequence. I agree that this should be improved to follow the Cluster Matching rules; filed Chromium issue 820929.

I've tested Chrome and Firefox, and neither does any system font fallback when the given font contains a glyph for b but not for b + c1.

I tested with 齋󠄁齋 (the first has the variation selector U+E0101 appended to it) and a CSS declaration of font-family: SimSun, HanaMinA;, where SimSun does not support the variation sequence but HanaMinA does:

Can you clarify whether you were intending to test system fallback or fallback within the font stack? If HanaMinA has coverage, no system fallback is needed, but the browser should fall back from SimSun to HanaMinA.

@hfhchan
Author

hfhchan commented Mar 12, 2018

Fallback within the font stack. I suppose system fallback should be tested too, but that would be less crucial.
