[css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters #5017
Comments
In adapting our J-specific layout to be more universal, I am finding it easier to support several conflicting conventions by thinking of them as language-specific and introducing the idea of a mode, often at the paragraph level. Examples: embox-based line heights and leading versus ascent-descent; space character widths between Korean words versus in English; how to treat justification between Latin and CJK for JIS X 4051 versus for English; and these ambiguous Unicode characters, where the font (default glyph) used or the paragraph convention preferred causes a conflict in desired behavior. There are two dimensions to this conflict, and I am not convinced that we only need to look at the code points around these characters to derive the desired behavior. I suspect that in one paragraph mode there is one default for these characters in the context of their surroundings, but that in another mode the answer is different, even with the same surrounding characters. |
Most fonts I know have Latin double quotes by default, so I think the space is desired. It becomes fullwidth only when author sets |
After discussion with @kojiishi, particularly about cases where the quotation mark is used adjacent to spaces intentionally, I agree that we should not make double quotes Ambiguous here. The question still stands about whether we should do this with Symbols generally, such as Emoji and characters in the Enclosed Ideographics block. |
Which aspect of ambiguity are we talking about here? Whichever applies, i suspect that my comments at #4992 (comment) about using common-sense or internationalised apps for line breaking may apply. But Chinese & Japanese use quite a lot of characters that are not in the discard blocks, not just quote marks. I've actually been trying to make lists of what characters are used by what language, and fwiw currently i have the following from non-discard blocks: Chinese Japanese |
fwiw, just added the following to my lists above: ‼⁇⁈⁉ (sorry about the emojification) |
@fantasai I know it is a CSS novice question, but may I ask? Even in English, there are cases where you would not want a space inserted between two characters. Examples include between quotation marks or parentheses and their contents, around an EM DASH depending on the style, before a colon, etc. Does the rule expect that authors will not fold a line at certain places, i.e. where they do not want a space inserted? |
A more interesting case in English is a semi-compound word connected by a hyphen, e.g. "well-known". As the parts look like two separate words, one might insert a line break after the hyphen, exactly as with hyphenated words. If the Segment Break Transformation Rules insert a space after the hyphen, the result would differ from the naive user's expectation. What I am trying to understand is what we can naturally expect from the segment break transformation rules. |
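To make the hyphen concern concrete, here is a minimal sketch of the historical behavior, under the assumption that every segment break (together with the spaces and tabs around it) simply collapses into a single space. The function name `collapse_segment_breaks` is hypothetical, and the regex is a simplification of CSS white space processing, not the spec's algorithm.

```python
import re


def collapse_segment_breaks(source: str) -> str:
    """Replace each run of line breaks, plus the spaces/tabs around it, with one space.

    Simplified model of the historical rule: a break in the source always
    becomes a space, regardless of the characters on either side.
    """
    return re.sub(r"[ \t]*(?:\r?\n[ \t]*)+", " ", source)


# A source line folded after the hyphen of a semi-compound word.
print(collapse_segment_breaks("a well-\nknown example"))  # a well- known example

# The same rule applied to Japanese text folded mid-word.
print(collapse_segment_breaks("日本\n語"))  # 日本 語
```

In both cases the author gets a space they probably did not want, which is why where authors fold their source lines matters.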
Yes, this is indeed possible. Here's a test (you may need to adjust the width of the test box in your browser, because the result may be different depending on your font). We may be able to add some informative notes/examples (for line breaks on hyphens, line breaks in mixed Chinese/Japanese and Western text, etc.) to the css-text spec, and point authors and authors of authoring tools to them. |
FWIW, I just tried the hard line wrap functionality in some editors (both in HTML mode and in plain text mode) and here is the result. The text is:
In Vim with textwidth=80, gq produces the following result:
That is, Vim won't break on hyphens. (Relevant code) In Sublime Text 2 and 3 with Sublime Wrap Plus, there is a
That is, no break on hyphens. In Visual Studio Code with Rewrap (with
In Emacs with
|
Editors oriented at programmers may possibly be more protective of hyphens because of their use in certain types of identifiers. That would be different from regular text. |
I think most editors just treat the hyphen as part of the word, and won't explicitly protect it from line-breaking (i.e., won't treat hyphens specially, unless the user changes the config and/or provides a customized function for hard wrapping). But regardless of what the editors do, my point (as in #5017 (comment), and I believe it's also @kidayasuo's point) is that we should document where we expect the authors (and authoring tools) to break lines in the source code, maybe as informative notes with examples, in addition to the existing examples in https://drafts.csswg.org/css-text-3/#line-break-transform |
This issue is about trying to address a) without introducing problems with b). @kidayasuo: CSS3 is trying to make this a bit more sophisticated, so that Chinese and Japanese can benefit from being able to break within paragraphs in their source code also. Since these languages don't use spaces at all, under the historical rules they cannot break at all, otherwise it introduces a space. Because historically breaks became spaces, the new rules are biased to be "conservative" in that they will apply the historical behavior of turning the break into a space when it's not clear whether to discard or not. The currently proposed rule (which is similar to what Firefox currently implements) thus says, if both sides of the break are CJ, discard. If either side is not, use a space. This issue is about fine-tuning those new rules, specifically about whether to introduce a concept of an "ambiguous" character, which defers to the opposite side whether CJ or not (and which defaults to space if the other side is also ambiguous). The primary use case for this would be Symbols. |
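The two-class rule described above (discard the break only when both sides are CJ, otherwise use a space) can be sketched as follows. This is an illustration only: `CJ_RANGES` is a small hypothetical subset of Unicode blocks chosen for the example, not the list in the draft, and `is_cj` and `join_segments` are made-up names.

```python
# Hypothetical subset of "discard" blocks, for illustration only.
CJ_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def is_cj(ch: str) -> bool:
    """True if the character falls in one of the example 'discard' ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in CJ_RANGES)


def join_segments(before: str, after: str) -> str:
    """Join two source lines that were separated by a segment break.

    The break is discarded only when the characters on both sides are CJ;
    in every other case it falls back to the historical space.
    """
    if before and after and is_cj(before[-1]) and is_cj(after[0]):
        return before + after
    return before + " " + after


print(join_segments("日本", "語"))      # 日本語
print(join_segments("日本", "HTML"))    # 日本 HTML
print(join_segments("well-", "known"))  # well- known
```

Note that the conservative fallback means the hyphen case still produces a space; only CJ-to-CJ breaks change behavior.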
Agenda+ to propose deferring this issue to L4. |
When I argued for using Unicode blocks in #337, it was exactly for the reason that we wouldn't have to do this sort of codepoint-by-codepoint analysis in the spec and in UAs. I think this entire feature is not worth the complexity it seems to require. |
+1 to @litherum, but let me add another reason. Adding more heuristic rules has both pros and cons for authors. I think this case is net-negative for authors. As @r12a pointed out in his comment in #4992:
For authors to perform the adjustments, predictability is critical. Authors must run the rules in their heads and predict whether a space is inserted or not whenever they see new lines in their HTML files or when reviewing someone else's HTML files. Adding more heuristic rules has a negative impact on this process. If the heuristic rules were accurate enough that authors did not have to worry about the adjustments, adding rules would be good for them, but by now I think we agree that this is not technically possible. If we expect authors to make the adjustments, keeping the rules simple is critical for them. |
The CSS Working Group just discussed
The full IRC log of that discussion
<dael> Topic: [css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters
<dael> github: https://github.com//issues/5017
<dael> fantasai: Prop was defer to L4
<dael> Rossen_: That's an easy proposal
<dael> Rossen_: Any reason not to defer?
<dael> Rossen_: Not hearing any reasons
<dael> Rossen_: Objections to defer the behavior of Discarding Line Breaks Adjacent to Ambiguous Characters to L4
<dael> RESOLVED: defer the behavior of Discarding Line Breaks Adjacent to Ambiguous Characters to L4 of css-text
<dael> Rossen_: Issue has a lot more to discuss so I encourage reengagement |
The discussion in #337 has veered off in a wide variety of directions, but @hax originally filed the issue to bring up the question of "ambiguous" characters, i.e. those which are commonly used both within and outside Chinese and Japanese contexts:
We decided to switch to a Unicode Block listing instead of relying on the East Asian Width property (in particular due to some backwards-incompatible changes on Unicode's side). The current draft does not have a concept of ambiguous characters: all characters are strong "discard" or "don't discard", with discarding behavior requiring both sides of the line break to be "discard".
We might want to consider classifying some characters as "ambiguous", particularly symbols and maybe also the few common punctuation marks used in Chinese (double quotes, specifically). These could defer to the character on the other side, and if both are ambiguous, default to "don't discard".
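As a rough sketch of how such a three-way classification might behave: an ambiguous character takes on the class of the character across the break, and two ambiguous characters fall back to "don't discard". The character sets below (`AMBIGUOUS_CHARS` and the ranges in `classify`) are hypothetical placeholders for illustration, not a proposed list.

```python
from enum import Enum


class BreakClass(Enum):
    DISCARD = "discard"
    KEEP = "don't discard"
    AMBIGUOUS = "ambiguous"


# Hypothetical examples only: a symbol and the curly double quotes.
AMBIGUOUS_CHARS = {"\u2605", "\u201C", "\u201D"}  # ★ “ ”


def classify(ch: str) -> BreakClass:
    """Classify one character; the CJ ranges are an illustrative subset."""
    if ch in AMBIGUOUS_CHARS:
        return BreakClass.AMBIGUOUS
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:  # kana, CJK ideographs
        return BreakClass.DISCARD
    return BreakClass.KEEP


def discards_break(left: str, right: str) -> bool:
    """Decide whether the segment break between two characters is discarded."""
    a, b = classify(left), classify(right)
    if a is BreakClass.AMBIGUOUS and b is BreakClass.AMBIGUOUS:
        return False  # both ambiguous: default to "don't discard"
    if a is BreakClass.AMBIGUOUS:
        a = b  # defer to the character on the other side
    if b is BreakClass.AMBIGUOUS:
        b = a
    return a is BreakClass.DISCARD and b is BreakClass.DISCARD


print(discards_break("語", "★"))  # True: the symbol defers to the CJ side
print(discards_break("a", "★"))   # False: the symbol defers to the Latin side
print(discards_break("★", "★"))   # False: both ambiguous
```

A language-dependent variant would presumably make `classify` take the content language as an extra input, which is the open question here.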
Do we want to do this? If so, should it be language-dependent or universal?