[css‑fonts‑4] Create keywords for unicode‑range
#4573
Comments
emoji
keyword to unicode‑range
emoji
as a keyword to unicode‑range
Alternatively, a list of ISO 15924 script codes could be allowed for the PS: #1744 was somewhat similar, requesting a |
Adding script names or language names is a recurring request, and we do need to address it at some point. I agree with @Crissov that scripts make more sense than languages. For example, this is both cumbersome and fragile against future additions:
I agree with @Crissov that using an existing list, provided it is well maintained, well documented and readily available, is much better than getting into the business of script or language registries. It isn't clear to me that ISO 15924:2004 defines which characters are included in each script. I wasn't keen to spend the CHF 68 to find out. Anyone know? Or is that all contained in the registry? Unicode® Standard Annex #24 Unicode Script Property is online, and freely available, and appears to be a superset of ISO 15924. I'm happy that the registry is online and that Unicode is the registration authority. That at least means that ISO and Unicode are striving to be in alignment here (with a few exceptions, like Fraktur and Gaelic being distinct in ISO 15924 and unified in Unicode UAX 24). I plan to reach out to the maintainers of the registry to confirm the exact status. |
Hmm. From the registry
That says that Hiragana exists, but not which code points are covered. |
The complete list is in the Scripts file. For example
|
Using the Unicode Script property as shorthand for ranges is a really, really good idea. More intuitive, less error prone, less verbose, and it's a public list that's already baked into CSS implementations - the Unicode Script property is already referenced by css-text-3. Thumbs up to this whole issue. |
One issue with the Unicode Script property is the characters that have Script=Inherited (generally diacritics) or Script=Common (mostly punctuation)... authors might be surprised at things that don't get included by a naïve Script code because they're actually shared by a couple of scripts and so ended up being assigned Script=Common instead of the "expected" script. As a trivial example: Script=Devanagari would (perhaps unexpectedly) exclude the punctuation marks DEVANAGARI DANDA and DEVANAGARI DOUBLE DANDA, despite their apparently script-specific names, because Scripts.txt has
So perhaps ranges should also take account of whatever appears in the Unicode ScriptExtensions list, which would handle this:
This would be more useful than using just the simple Script property, IMO. |
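The difference between the two properties can be checked in any engine that supports Unicode property escapes. This is only a minimal sketch, assuming a JavaScript runtime with the `u` regex flag; the variable names are illustrative and nothing here is proposed CSS syntax.

```javascript
// U+0964 DEVANAGARI DANDA has Script=Common in Scripts.txt, but its
// Script_Extensions value lists Devanagari among the scripts that use it.
const danda = '\u0964';
const letterKa = '\u0915'; // DEVANAGARI LETTER KA, Script=Devanagari

const byScript = /^\p{Script=Devanagari}$/u;
const byScriptExtensions = /^\p{Script_Extensions=Devanagari}$/u;

console.log(byScript.test(letterKa));           // true
console.log(byScript.test(danda));              // false
console.log(byScriptExtensions.test(letterKa)); // true
console.log(byScriptExtensions.test(danda));    // true
```

A keyword defined in terms of Script_Extensions would therefore pick up the danda, while one defined in terms of Script alone would not.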
|
The CSS Working Group just discussed
The full IRC log of that discussion
<stantonm> topic: Add ISO 15924 script codes to unicode-range
<astearns> github: https://github.com//issues/4573
<stantonm> myles: unicode-range takes bunch of code-points
<dbaron> the addition of those two agenda items was https://wiki.csswg.org/planning/galicia-2020?do=diff&rev2%5B0%5D=1569210305&rev2%5B1%5D=1570141384&difftype=sidebyside
<stantonm> ... bad for a couple reasons, lots of numbers and not clear what they mean
<stantonm> ... also when adding some like emoji, you can list all unicode points - but it changes over time
<stantonm> ... proposal to add keyword that lets the browsers define the code points
<stantonm> florian: what are the keywords
<stantonm> myles: issue says use pull keywords from ISO
<stantonm> hober: we shouldn't define these things, reference something in unicode
<stantonm> myles: different languages use some common code points
<stantonm> ... keywords shouldn't be a partition, there will be overlaps
<stantonm> ... space character will be in most of them
<stantonm> fantasai: two factors, script extensions list - some of these are assigned to common script
<stantonm> ... we should be looking up script extensions
<stantonm> ... other case is super common things - numbers, space, etc
<stantonm> ... a lot of things assigned to common script
<stantonm> ... probably makes sense to include common by default, but have opt out
<stantonm> myles: we should resolve that we would like keywords, but not resolve on the actual keywords
<stantonm> fantasai: we should rely on iso
<stantonm> faceless2: rely on existing registry
<stantonm> astearns: should we have everything in the registry
<stantonm> heycam: do the names in the registry match normal css conventions?
<stantonm> TabAtkins: looks like no?
<stantonm> fantasai: should be a list of keywords 4 chars long
<faceless2> https://www.unicode.org/Public/12.1.0/ucd/Scripts.txt
<astearns> `Zsye 993: Emoji`
<stantonm> TabAtkins: if we're confident they are 4 letters, we can take directly
<stantonm> fantasai: think that should be fine, they need to maintain compat
<faceless2> example values : "Hebrew", "Devanagari", "Common"
<stantonm> myles: we may get it wrong, can we tentatively resolve to try something out first
<stantonm> florian: go with 4 letter name of long name? or not deciding
<stantonm> faceless2: where did four letter name come from?
<stantonm> florian: long name has hyphens, 4 letter is defined somewhere else
<stantonm> TabAtkins: casing shouldn't be important
<dbaron> The 4 letter script codes are always letters and come from ISO15924: https://tools.ietf.org/html/rfc5646#section-2.2.3
<stantonm> astearns: leave it to the fonts editors to define what keywords we pull, don't need to resolve on that now
<stantonm> myles: I'll also contact unicode
<stantonm> jfkthame: should there also be exclusion values?
<stantonm> hober: if you could exclude a range, you could exclude common range
<stantonm> myles: be careful we don't turn this into a full language
<stantonm> chris: even if you do a good job, when unicode adds new values you may unintentionally exclude things
<stantonm> ... shift burden of defining onto external body
<dbaron> also see https://unicode.org/iso15924/iso15924-codes.html
<stantonm> RESOLVED: we are going to create keywords for unicode ranges
<dbaron> "Zsye" is for Emoji, I think :-/
<dbaron> I think that's a little unfortunate. |
/cc @markusicu |
Hi, I got cc'ed here... As I think you found, ISO 15924 does not define which characters have which script. Use the Unicode properties sc=Script and scx=Script_Extensions for that. scx=Deva should be implemented as "set of code points whose Script_Extensions contain Deva", see UTS 18 (regex spec). For emoji, there are several properties you could look at: http://www.unicode.org/reports/tr51/#Emoji_Properties Elsewhere in UTS 51 you can also find regexes for well-formed emoji sequences. ICU has API to get the emoji character properties (per code point, or as a UnicodeSet). FYI I work on Unicode/CLDR/ICU and am the current 15924 registrar. |
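To make the "set of code points whose Script_Extensions contain Deva" definition concrete, here is a rough sketch that expands a script code into explicit ranges by scanning every code point. It assumes a JavaScript engine with Unicode property escapes; `rangesForScriptExtension` and `toUnicodeRange` are hypothetical helper names, and the exact result depends on the Unicode version the engine ships with.

```javascript
// Hypothetical helper: expand a script code such as "Deva" into the
// code point ranges whose Script_Extensions contain that script.
function rangesForScriptExtension(scriptCode) {
  const matcher = new RegExp(`^\\p{Script_Extensions=${scriptCode}}$`, 'u');
  const ranges = [];
  let start = -1;
  // 0x110000 is a sentinel one past the last code point; it closes any open range.
  for (let cp = 0; cp <= 0x110000; cp++) {
    const inSet = cp <= 0x10FFFF && matcher.test(String.fromCodePoint(cp));
    if (inSet && start < 0) start = cp;
    if (!inSet && start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  return ranges;
}

// Format [start, end] pairs in the existing unicode-range syntax.
function toUnicodeRange(ranges) {
  const hex = (n) => n.toString(16).toUpperCase();
  return ranges
    .map(([a, b]) => (a === b ? `U+${hex(a)}` : `U+${hex(a)}-${hex(b)}`))
    .join(', ');
}

console.log(toUnicodeRange(rangesForScriptExtension('Deva')));
```

The printed list is what an author would otherwise have to maintain by hand, and it changes whenever Unicode adds characters to the script.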
If this is about ranges, it may make sense to consider blocks instead of (or in addition to) scripts. Blocks don't have an ISO standard, they are directly defined by Unicode. There are some overlaps between script name and block names; some regexp engines use e.g. 'hiragana' for the hiragana script, and 'in_hiragana' for the hiragana block. In many cases, there is more than one block for a script. Block data is available in the Blocks file. There are e.g. 8 blocks with the term 'Latin' in their name. There are also cases where characters are not in a block that carries the name of their script. For example, the three blocks |
Unicode blocks are usually not very useful. They are an artifact of the character assignment process and history and are not designed to fit any other purpose. Multiple blocks for one script is one problem (and growing). Blocks also include unassigned code points, and sometimes unrelated characters. That's why the Script and Script_Extensions properties are generally recommended and used.
FYI Outside of the ISO script code, "Kana" refers to both Hiragana and Katakana. https://en.wikipedia.org/wiki/Kana |
Title changed to reflect the CSS WG resolution and clarify that emoji is just one of the use cases. This would make a great topic for a TPAC joint session with I18n Core WG |
I'm not sure that the
See: /^\p{Emoji}$/v.test('👨❤️👨')
// → false ❌
/^\p{RGI_Emoji}$/v.test('👨❤️👨')
// → true ✅ |
The Emoji property is a property of characters.
You are testing whether a five-character sequence matches a single-character property. |
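The point can be illustrated by splitting a sequence into its code points. This is a sketch only: it assumes the sequence under test is "couple with heart: man, man", written with explicit escapes so the zero width joiners are visible, and it uses the `u` flag rather than `v`. The count printed depends on the input actually used.

```javascript
// Assumed input: "couple with heart: man, man", spelled out with escapes so
// the zero width joiners (U+200D) are not lost in copy and paste.
const sequence = '\u{1F468}\u200D\u2764\uFE0F\u200D\u{1F468}';

// Spreading a string yields one entry per code point.
const codePoints = [...sequence];
console.log(codePoints.length); // 6

// \p{Emoji} is a property of individual code points, so anchoring it with
// ^...$ can only match a string that is exactly one code point long.
const isEmojiCodePoint = /^\p{Emoji}$/u;
console.log(isEmojiCodePoint.test(sequence));      // false
console.log(isEmojiCodePoint.test(codePoints[0])); // true (U+1F468 MAN)

// Per-code-point view of the same sequence.
for (const ch of codePoints) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase();
  console.log(`U+${hex}`, isEmojiCodePoint.test(ch));
}
```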
The CSS Working Group just discussed
The full IRC log of that discussion
<TabAtkins> astearns: We had a resolution, but there was continued discussion. Chris, take it away
<TabAtkins> ChrisL: I was getting an idea of what we had consensus on, that was to use unicode properties Script and Script Extension
<TabAtkins> ChrisL: so if Script Extension says its "deva", you'd get all the characters with that extension from the keyword `deva`
<TabAtkins> ChrisL: and you'd get the Common block by default, but with a way to exclude perhaps
<TabAtkins> ChrisL: Also, the resolution covered a way to add some ranges, but people might want to exclude some ranges.
<TabAtkins> addison: so use the 15924 script code to include scripts
<TabAtkins> addison: with maybe special handling for common?
<TabAtkins> ChrisL: common would always be included if not listed, but you could exclude it
<TabAtkins> addison: this is in addition to do codepoints?
<TabAtkins> ChrisL: yes
<fantasai> TabAtkins: basically a shorthand for one or more codepoint ranges
<TabAtkins> addison: have you looked at - this isn't regex, but at the regex unicode categories?
<TabAtkins> addison: people might want character classes
<TabAtkins> addison: also, CLDR's sets of characters by locale or by language
<TabAtkins> addison: maybe a source
<TabAtkins> addison: just trying to think of why people would be using this
<TabAtkins> addison: a common thing I've seen is people only wanting to accept certain chars, so only the ones actually used by finnish or hungarian. that's a bigger list than just the alphabet. unicode has a list like that.
<TabAtkins> addison: otherwise this seems, well, not unreasonable
<TabAtkins> addison: don't want to sound reticent, no shade
<TabAtkins> addison: just want to suggest other places to potentially look
<florian> q?
<fantasai> https://www.w3.org/TR/css-text-4/#character-properties
<TabAtkins> fantasai: we have precedent for including script extensions
<astearns> ack fantasai
<TabAtkins> fantasai: we generically include it in Appendix E of Text, it's the right thing to do pretty much everywhere we reference the Script property
<TabAtkins> fantasai: including common makes sense; ability to exclude common seems interesting but tricky, especially with combining marks
<TabAtkins> ChrisL: yeah, couldn't think of a use-case for it
<TabAtkins> fantasai: yeah, having a hyphen or something probably doesn't want to use a different font
<TabAtkins> fantasai: so my suggestion is not have the common-exclusion ability unless people ask
<TabAtkins> astearns: So do you still want to exclude other keywords?
<TabAtkins> fantasai: seems reasonable, yes
<florian> q+
<TabAtkins> astearns: Big reason to exclude common is if you have a stack, the first font is for Korean, the rest of the stack is for everything else. You'd exclude common from the Korean font. But you can also do that by flipping the font stack and excluding Korean, instead
<astearns> ack florian
<TabAtkins> florian: Yes, but also affects which fonts line-sizing and units takes from. If it's predominantly from Korean, you might want to take from that font even if there's fallback
<TabAtkins> fantasai: you're more likely to want to exclude punctuation than Common, like combining marks are in Common. You don't want base characters from one font and combining from another.
<TabAtkins> TabAtkins: yeah, like addison said about the regex unicode categories, they have Punctuation
<TabAtkins> fantasai: not full power, you can match on like east-asian width, doesn't seem useful. just some things.
<TabAtkins> addison: Yeah, just looking at it for a few suggestions, not necessarily all. I'm spitballing.
<TabAtkins> fantasai: I think we really only need Script and General category.
<TabAtkins> astearns: So we'd only need Script, is that excluding script Extensions?
<TabAtkins> fantasai: No, including that. No use of CSS wouldn't want Script Extensions. Our *definition* of the Script property includes that by default.
<TabAtkins> astearns: So can we resolve on using Script and Script Extensions to create keywords?
<TabAtkins> xfq: Also General?
<TabAtkins> fantasai: We can start from the Script and add a few others as needed
<TabAtkins> astearns: Any objections?
<TabAtkins> RESOLVED: Add a set of keywords from Script and Script Extensions
<TabAtkins> astearns: now about General
<TabAtkins> fantasai: Yeah, I'm not as sure about Common, if they're trying to include letters but don't get combining marks. But excluding General is okay
<fantasai> s/Common/General Category/
<TabAtkins> xfq: Yeah, and they can always generate a codepoint list if they need
<dbaron> fantasai: (clarifying) including General is bad, excluding General is ok
<TabAtkins> fantasai: *Including* General seems footgun-y, but *excluding* General seems reasonable.
<TabAtkins> astearns: that's fine by me
<TabAtkins> astearns: anyone want to argue for something more than that now, rather than waiting until it's justified later by requests?
<TabAtkins> RESOLVED: Punting on General category for now.
<TabAtkins> astearns: switching to the question of whether we do "exclusion" as well as inclusion
<TabAtkins> ChrisL: Yes, let's get a resolution
<TabAtkins> fantasai: When excluding, don't want to exclude Common alongside the others (but including it alongside the specified value is okay)
<TabAtkins> proposed: script and and script exclusions can be excluded (except for Common)
<fantasai> s/and and script exclusions/categories/
<TabAtkins> RESOLVED: script categories can be excluded (except for Common)
<TabAtkins> ChrisL: So extending the grammar will break current impls
<TabAtkins> fantasai: So you declare it twice?
<fantasai> TabAtkins: just like normal
<TabAtkins> ChrisL: Ah so last one that's valid
<fantasai> scribe+
<fantasai> TabAtkins: existing unicode grammar is the worst
<fantasai> TabAtkins: I tried, it cannot reasonably be described with CSS tokenization rules
<fantasai> TabAtkins: options are special tokenization (which breaks selector u+a)
<fantasai> TabAtkins: or do custom parsing of unicode-range, which is what we're doing now
<fantasai> TabAtkins: I suggest keeping to that, and add functional form that expresses with numbers
<fantasai> TabAtkins: and build on that
<kbabbitt> +1
<fantasai> florian: so unicode(\d\d\d\d)
<fantasai> TabAtkins: you can't directly express hex values because might be ident or dimension or number
<fantasai> TabAtkins: but you could do xHHHH
<fantasai> astearns: so you're proposing using the same descriptor that has the current unicode-range syntax
<fantasai> astearns: or a new functional value syntax that is cleaner and does what we want
<fantasai> astearns: is that better or worse than having an entirely separate descriptor?
<fantasai> TabAtkins: no opinion
<fantasai> astearns: I think it's probably better to re-use the name; invalidity interactions are more obvious
<fantasai> ChrisL: agreed
<fantasai> TabAtkins: actually, i change my mind. I have a very strong opinion which is to agree with you
<fantasai> TabAtkins: Right now unicode-range descriptor is special magic syntax
<fantasai> TabAtkins: so sure apply them both
<fantasai> astearns: what do we call the function?
<fantasai> florian: unicode()
<fantasai> fantasai: u()
<fantasai> ChrisL: what about negation?
<fantasai> TabAtkins: a "not" keyword to prefix
<fantasai> dbaron: maybe be more explicit about subsetting the font to only characters in that range?
<fantasai> TabAtkins: maybe just "codepoints()"
<fantasai> florian: I like u()
<fantasai> [some mixup]
<fantasai> s/mixup/mixup about parsing weirdness/
<fantasai> TabAtkins: Oh, actually I mean I disagree with astearns
<fantasai> TabAtkins: we should use a new property
<fantasai> s/property/descriptor name/
<fantasai> ?: Then how do they interact?
<fantasai> TabAtkins: then let's intersect them. Initial value is 'all'
<fantasai> ChrisL: Could also reset unicode-range to all when encountering the new thing
<fantasai> dbaron: then you have a weird ordering dependency
<TabAtkins> fantasai: it would be weird if you set the new thing, then unicode-range
<TabAtkins> fantasai: they're both setting the same thing, it's weird if one invalidates the other
<fantasai> fantasai: Maybe unicode-range and unicode-set, and take tab's suggestion to intersect them
<TabAtkins> addison: table this, since i18n isn't helpful? you're on the right track.
<TabAtkins> addison: A related topic
<addison> https://unicode.org/cldr/charts/45/summary/kam.html
<TabAtkins> addison: you can see that it has sets of chars in use for a language
<TabAtkins> addison: for a locale you can see what's pretty commonly used, if you use that as a range it's similar to what you'd want in a font
<addison> https://unicode.org/cldr/charts/45/summary/ks.html
<TabAtkins> addison: not *as* exhaustive as some things
<TabAtkins> addison: but it kinda describes what your font should support if it's rendering locale=ks, etc
<astearns> ack fantasai
<TabAtkins> fantasai: I think this is more restrictive than what you usually want. you might include words from another lang, and you've dropped chars you wouldn't otherwise drop
<TabAtkins> astearns: Any other comments before line-clamp? |
I've been doing some research on this recently, and from what I can tell, Unicode tr51 doesn't solve the problem very well, so I'll excerpt a quote from the
We need a new proposal for a definitive solution, and proposal L2/22-160 (RGI_Emoji_Qualification, by @macchiati) looks promising. @mathiasbynens has been working on improving regular expressions for Emoji matching, and he presents a similar idea here. |
Back to the question posed by @jfkthame:
In emoji-test.txt However, ❤🔥(
|
I'm not sure I follow what you're saying here:
If we exclude U+2764 from |
It sounds like we might need a little bit of an algorithm spec on top of the RGI_Emoji_Qualification property. Maybe it should say something about matching an RGI_Emoji_Qualification prefix of a longer emoji sequence (validity regex), and consider emoji variation selectors following the prefix even if they don't continue another RGI_Emoji_Qualification sequence? |
I don't think we can put anything that is context-dependent on unicode-range. Unicode range can really just be a simple codepoint filter. |
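As an illustration of what "a simple codepoint filter" means in practice, here is a small sketch that parses the existing unicode-range syntax and answers a context-free yes/no per code point. The function names `parseUnicodeRange` and `matchesRange` are made up for this example; it is not how any browser implements the descriptor.

```javascript
// Parse a unicode-range value such as "U+26, U+0-7F, U+4??" into
// an array of [start, end] code point pairs (both inclusive).
function parseUnicodeRange(value) {
  return value.split(',').map((token) => {
    const text = token.trim();
    const m = /^u\+([0-9a-f?]{1,6})(?:-([0-9a-f]{1,6}))?$/i.exec(text);
    if (!m) throw new SyntaxError(`Invalid unicode-range token: ${text}`);
    const [, first, last] = m;
    let start;
    let end;
    if (first.includes('?')) {
      // Wildcards must be trailing and cannot be combined with an explicit end.
      if (last !== undefined || !/^[0-9a-f]*\?+$/i.test(first)) {
        throw new SyntaxError(`Invalid wildcard range: ${text}`);
      }
      start = parseInt(first.replace(/\?/g, '0'), 16);
      end = parseInt(first.replace(/\?/g, 'F'), 16);
    } else {
      start = parseInt(first, 16);
      end = last === undefined ? start : parseInt(last, 16);
    }
    if (end < start || end > 0x10FFFF) {
      throw new RangeError(`Range out of bounds: ${text}`);
    }
    return [start, end];
  });
}

// The filter itself: each code point is judged on its own, with no
// knowledge of the characters before or after it.
function matchesRange(ranges, codePoint) {
  return ranges.some(([start, end]) => codePoint >= start && codePoint <= end);
}

const ranges = parseUnicodeRange('U+26, U+0-7F, U+4??');
console.log(matchesRange(ranges, 0x41));   // true
console.log(matchesRange(ranges, 0x450));  // true
console.log(matchesRange(ranges, 0x2764)); // false
```

Because `matchesRange` sees one code point at a time, it has no way to treat U+2764 differently depending on whether a variation selector or ZWJ follows it.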
So we have:
and a Needs Edits label. But subsequent discussion makes me believe this isn't as ready-to-go as we believed in September 2024:
I think this needs to be fleshed out a bit more, in terms of the exact functionality and also how it integrates with the existing |
The CSS Working Group just discussed
The full IRC log of that discussion
<TabAtkins> ChrisL: I put this on the agenda because i thought i18n people would be here. We have a resolution to do it but it's not enough to even produce a candidate spec text.
<TabAtkins> ChrisL: propose we push it, since they're no longer here
<kbabbitt> TabAtkins: ChrisL do you want to arrange something with i18n wg?
<kbabbitt> ChrisL: yes |
https://drafts.csswg.org/css-fonts/#unicode-range-desc
Inspired by @Crissov’s comment in #2855 (comment):
emoji would be a new <unicode‑range> keyword equivalent to enumerating all the Unicode codepoints where emoji reside.