[css-fonts] Handling of Standardized Variation Sequences #1710
Yep, browser support for variation selectors is quite lacking right now. I don't know of any browser which handles them correctly.
Yep, the browser's behavior should be changed.
Yep, this is why I wrote the spec to describe this behavior as required.
Yep.
I don't think this is worth a whole new property. Control over these facilities already exists in the form of variation selectors. The niche applications which need to perform their own handling of these characters can use JavaScript to implement their desired behavior.
Ignoring all variation selectors (showing the modern form) and honoring all variation selectors (showing the historical forms) are extremely common use cases. The main controversy of Han Unification was that there wasn't an easy way to switch between the two. That could be implemented in JavaScript, yes. But the choice between showing historical text in its modern forms and showing it in its historical forms as written is a purely presentational/style issue, which would be better accomplished via CSS. Implementing a dynamic override in JavaScript would entail traversing the DOM and storing the original string of every text node.
@hfhchan As far as I recall, the controversy over Unicode Han Unification was over the very idea that a Han ideograph written in one style is fundamentally the same plain-text abstract character as the ideograph written in another style, as well as, to a lesser extent, the increased difficulty of writing two different styles of the same Han ideograph in the same run of plain text without rich-text formatting. The latter problem was ameliorated by the introduction of ideographic variation selectors. But it is easy for an author to switch between two styles of ideographs in non-VS-using text: just switch the font. The point of Han Unification is that the choice of an ideograph’s style is not a concern of its fundamental plain-text semantics but rather of its rich-text formatting.

However, variation selectors are part of the fundamental semantics of text: the plain-text level. If an author uses variation selectors in their plain text’s ideographs, that means the author believes that the particular style of those ideographs is a part of their text’s fundamental, plain-text semantics, rather than merely rich-text formatting. The author therefore expects those ideographs to always render with that specific style, as long as that style is able to be rendered. The same rule applies to variation selectors for emoji characters and mathematical symbols. At least, that’s my understanding of the Unicode Standard.

Default rendering styles are one thing, but overriding variation selectors explicitly included by the author strikes me as a strange feature: one that destabilizes the standard semantics of plain text. The feature would, for instance, cause web text with variation selectors to unexpectedly differ, when copied and pasted into some destination (as well as when the text is interpreted by crawlers and other plain-text-processing applications), from how it is visually rendered by a web browser.
Why not actually delete the variation selectors, when they are no longer a desirable part of the text’s fundamental plain-text semantics?
That happens with text-transform: uppercase too.
Not exactly. In China some users will prefer a more accurate representation of the text, but more will prefer the characters to be rendered in the modern orthography. For many digitization projects that means building two different strings, one often using PUA (because ideographic variation selectors are not well supported). The point is, Han variants can indicate gloss as well as semantic intention. The person doing the digitization may not be able to make a good guess, so they have to preserve the variant in some way and provide a client-side toggle, much like choosing between serif and sans-serif in a browser's reading mode.
The idea that a character can have a dozen glyphs and still be the same character has been the majority opinion reflected in dictionaries as early as the Tang dynasty. The mainstream linguistic opinion in modern China has also long concluded that multiple glyphs can be grouped into the same character, and standardized orthographies are also the national/regional policies in China. Also, the Japanese extensively unified glyphs in the earlier Japanese Industrial Standards. It is rather obvious that the initial pushback against Han unification was about far more than linguistic or technical issues. The current pushback is largely based on the inability to accurately reproduce a certain glyph the author intends.
I will definitely defer to your expertise. I just have a couple more questions as to the use case of the proposal; my apologies, and thank you for your patience.
This is indeed true. In that case the stylesheet author does risk modifying the fundamental plain-text semantics intended by the author of the text content.

This risk is generally low, at least for Latin scripts, which do not change meaning that much depending on case (though there are certain significant exceptions). However, the risks of text transformation increase as the significance of the transformed characters increases. A hypothetical text transformation (from styled Latin mathematical letters to plain Latin characters and from superscript numbers to regular numbers) risks rendering “ℋ = ∫ 𝑑𝜏(𝜖𝐸² + 𝜇𝐻²)” as the quite semantically different “H = ∫ dτ(ϵE2 + µH2)”. Another hypothetical text transformation (from colored symbols to regular symbols) risks changing “He had difficulty distinguishing 🔴 and 💚” to the semantically different “He had difficulty distinguishing ⚫️ and 🖤”. And a hypothetical text transformation (from ideographic variation sequences to ideographs all of one variation) would carry similar risks.

“The current pushback [against Han Unification] is largely based on the inability to accurately reproduce a certain glyph the author intends,” but here “author” presumably refers to the authors of text content, rather than the authors of stylesheets. Ideographic variation sequences already give text authors the power to precisely control which glyph is rendered on a per-character basis, so I am uncertain why pushback would still be occurring. Yet this proposal would take some of that power away from text authors, by giving stylesheets the ability to override the text author’s intent.

What I’m wondering about here is the conflict between the desires of text authors and the desires of stylesheet authors for websites that display the text. For whose benefit would this proposal be?
As far as I can tell, this proposal would weaken the ability of text authors to have their content rendered using the correct glyphs, giving more of that power to the stylesheet author instead.
Basically: why would the person doing the digitization not simply use the right variation selector, instead of relying on rich-text markup? I am actually very interested in this, since all of this seems to be a real-world occurrence. Of note is that Unicode Technical Report 51: Unicode Emoji recommends that emoji/text variation selectors always be respected, regardless of the environment’s default rendering style. This is presumably for the same reason: variation selectors exist as plain text precisely because they are part of the intrinsic semantics of the plain text. But again, I am almost certainly much less of an expert here than you. My apologies for the questions, and thank you for your patience.
In the use case of the digitization of old Han manuscripts, most of the authors have long since passed away. The entity that is converting the manuscript into digital text is also the stylesheet author. In this case, the text is converted into digital text with all the variation selectors needed to reproduce the forms used in the historical documents. Take the example of The Analects 論語. Sinologists who are studying the text may prefer a very close approximation of the exact glyphs used, since they may include important information on when that particular version was published or amended. However, general Chinese readers would not care about the exact orthography used, and care more about the semantic content. They may prefer all variants to be canonicalized to their modern orthography. In this case, the stylesheet author only provides a mode (say, `input.normalize:checked + p`) which strips the variants, while the reader is the person who activates the stripping at his/her convenience. The reader is responsible for his/her actions if there is any loss in semantic meaning. (For phrases where the exact style should never be stripped, the text could be further marked up with HTML+CSS.)
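As a sketch, such a reader-activated toggle might look like the following. Note that this is purely hypothetical: `font-variation-sequences` is the property proposed in this issue, not an existing CSS feature.

```css
/* Hypothetical: while the checkbox is checked, the adjacent paragraph is
   rendered as if all variation selectors had been stripped, without
   modifying the underlying text. `font-variation-sequences` and its
   `ignore-all` value are the proposal from this issue, not shipped CSS. */
input.normalize:checked + p {
  font-variation-sequences: ignore-all;
}
```

Because the underlying text is untouched, copying it would still yield the original variation sequences, unlike the regex-and-innerHTML approach.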
IRG member bodies are supposed to verify that the variants they submit to be encoded via variation selectors are indeed unifiable and not non-cognate. If two variants are recognized as semantically different, they would be disunified by the non-cognate rule.
@hfhchan: Thanks for the explanation; this use case makes sense to me now. If this proposal handles standardized variation sequences, then it probably needs to address how it applies to emoji/text, mathematical, and Mongolian variation sequences too.
/cc @drott
I’m afraid we have two closely related but different issues at hand.
The first issue, which I proposed independently, is in scope of [css-text]; for the second, [css-fonts] indeed sounds like the right place. Unicode does not concern itself with font selection much, so while it clearly mandates that the default glyph of the base character should be shown instead of no glyph (other than […])
@Crissov Good points. Also, the question in that last paragraph may be worth bringing up to the Unicode mailing lists; I think it could be within Unicode’s scope. |
I find the second part of this statement
somewhat surprising. Displaying a .notdef glyph makes the intended character completely opaque to the reader; there is no clue as to what the author intended. It's hard to imagine a situation where this would be desirable for a reader. ISTM that a more useful fallback, in the case where no available font supports the sequence, would be the base character's default glyph.

This would, however, conflict with Unicode's classification of the variation selectors as Default Ignorable code points, which implies that when not supported, they should have no visible rendering: "When such a special-use character is not supported by an implementation, it should not be displayed with a visible fallback glyph, but instead simply not be rendered at all." (Unicode, §5.3.) This suggests that fallback to […]
Hmm, that behavior makes sense for emoji and mathematical operators, but probably would not align with the expectations of CJK users. Luckily, a quick check seems to indicate that Default Ignorable is not covered by the Unicode stability policy.
Default Ignorable in the context of UVSes (Unicode Variation Sequences, which can be SVSes, EVSes, or IVSes) simply means that if an implementation cannot resolve a UVS to its intended glyph (typically because the selected font does not support the specified UVS), the base character should be rendered instead, and the VS should not be displayed (but not discarded). The extent to which an implementation makes an attempt to resolve a UVS, especially during font fallback, is completely up to the implementation.
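A minimal sketch of that resolution order, assuming a toy font object; the `cmap`/`uvsMap` maps and glyph names are hypothetical stand-ins, not a real font API:

```javascript
// Resolve a base character + variation selector pair as described above:
// consult the font's cmap format 14 (UVS) data first; if the exact
// sequence is unsupported, fall back to the base character's glyph.
// The selector is simply not drawn, but remains in the underlying text.
function resolveCluster(font, base, selector) {
  const uvsGlyph = font.uvsMap.get(`${base},${selector}`); // cmap format 14 lookup
  if (uvsGlyph !== undefined) return uvsGlyph;             // exact variant supported
  return font.cmap.get(base);                              // default glyph of the base
}

// Toy font: supports the base character U+9F4B but no variation sequences.
const mockFont = {
  cmap: new Map([[0x9f4b, "glyph.base"]]),
  uvsMap: new Map(),
};
console.log(resolveCluster(mockFont, 0x9f4b, 0xe0101)); // "glyph.base"
```

If the implementation instead reported .notdef at the final step, the reader would see tofu, which is the alternative behavior debated in this thread.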
JFTR, from the Unicode FAQ (with my highlighting):
> […]
I would prefer not to introduce an additional property. I think the best way to address this is to specify at least one font known to have coverage for variation selectors, to avoid going into system fallback, plus improvements to browsers in following the Cluster Matching rules more closely. While Chrome does take the whole grapheme cluster into account for font fallback, for variation selectors our shaping library considers a grapheme successfully shaped if […]
Can you clarify whether you were intending to test system fallback or fallback within the font stack? If […]
Fallback within font stack. I suppose system fallback should be tested too, but that would be less crucial. |
In Step 2a of Section 5.3 Cluster matching of CSS Fonts 3:
I've tested Chrome and Firefox, and neither does any system font fallback when the given font contains a glyph for the `b` but not for `b + c1`.

I tested with `齋󠄁齋` (the first has the variation selector U+E0101 appended after it), and with a CSS declaration of `font-family: SimSun, HanaMinA;`, of which SimSun does not contain a glyph for the variation sequence, but HanaMinA does.

Result in Firefox:

Result in Chrome:

The spec should either be amended to reflect the behavior implemented by browsers, or the browsers' behavior should be changed. Unicode Variation Selectors involve a cmap format 14 (Unicode Variation Sequences) subtable lookup, and it would be understandable if reordering complex table lookups in the font-selection phase were prohibitively expensive.

Due to the nature of the Han script, it is often hard to objectively quantify what is the same character and what is not. Different people have different expectations. CJK Unification in ISO/IEC 10646 was a very controversial decision and continues to be controversial today. Reliably rendering Unicode variation sequences is necessary and may have legal ramifications.
The fallback-to-`b` behavior is problematic because it may not be what the author intended, and the user has no way of knowing. More often, the preferred behavior is that a "tofu" (.notdef) is displayed instead.

In addition, China and TCA are likely to use Unicode Variation Selectors to encode historic variants of CJK Unified Ideographs (assuming the decision by the IRG is approved by WG2 in the coming meeting in September). Variant characters with visually significant differences will be approved for unification with their more common character, provided that the variant is similar in structure, rare in modern use, and attested to be exactly equivalent to the base character in semantics. In these cases, getting .notdef is usually preferred over getting the base character's glyph if a given font doesn't have that specific glyph variant.
At the same time, it may be useful in historical-text digitization projects to dynamically switch between showing characters with the glyphs as they appear in the books, and the glyphs that are used in the modern day. This could be accomplished by stripping all the variation selectors out via regex and innerHTML, or preferably activated via a CSS property or feature/flag.
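A sketch of the regex approach just mentioned (the function name is mine; the ranges cover VS1–VS16, the ideographic VS17–VS256, and the Mongolian free variation selectors):

```javascript
// Remove all Unicode variation selectors from a string:
//   U+FE00–U+FE0F    VS1–VS16 (standardized/emoji variation sequences)
//   U+E0100–U+E01EF  VS17–VS256 (ideographic variation sequences)
//   U+180B–U+180D    Mongolian free variation selectors
const VARIATION_SELECTORS = /[\u{FE00}-\u{FE0F}\u{E0100}-\u{E01EF}\u{180B}-\u{180D}]/gu;

function stripVariationSelectors(text) {
  return text.replace(VARIATION_SELECTORS, "");
}

console.log(stripVariationSelectors("\u{9F4B}\u{E0101}")); // "齋"
```

On a live page this would still mean walking every text node (e.g. with a TreeWalker) and keeping the originals around for toggling back, which is exactly the overhead a CSS-level switch would avoid.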
To cater for such behavior, I suggest that a new CSS property and/or OpenType feature/flag be introduced.
These behaviors could be implemented via a new CSS property such as `font-variation-sequences` with the following values:

- `auto`: fall back to `b` for VS-16 and below if `b + c1` is missing; fall back to .notdef for VS-17 and above if `b + c1` is missing
- `ignore-missing`: fall back to `b` if `b + c1` is missing
- `tofu-missing`: fall back to .notdef if `b + c1` is missing
- `ignore-all`: ignore all Unicode Variation Selectors

It could also be piggybacked by introducing a new OpenType flag, maybe named "tofu", so that the different behaviors could be activated directly via `font-variation-settings`.
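For illustration, usage of the proposed property might look like the following. This is hypothetical: none of these values exist in CSS today, and "tofu" is not a registered OpenType feature.

```css
/* Scholarly view: demand the exact historical glyph; show tofu when the
   font lacks the specific variant. (Hypothetical property and values.) */
.historical { font-variation-sequences: tofu-missing; }

/* Reader view: always render the base character's glyph, ignoring all
   variation selectors in the text. */
.modernized { font-variation-sequences: ignore-all; }

/* Or, via the suggested (hypothetical) OpenType flag piggyback: */
.historical-alt { font-variation-settings: "tofu" 1; }
```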