[selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code #8720

Open
jfkthame opened this issue Apr 14, 2023 · 4 comments
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. selectors-4 Current Work

Comments

@jfkthame
Contributor

According to https://www.w3.org/TR/selectors-4/#the-lang-pseudo,

An element’s content language matches a language range if, when represented in BCP 47 syntax [BCP47], it matches that language range in an extended filtering operation per [RFC4647] Matching of Language Tags (section 3.3.2).

The text also goes on to mention that

The language range does not need to be a valid language code to perform this comparison.

[my emphasis] which implies, as I understand it, that something like :lang("qq") will match content tagged with lang="qq" even though qq is not a valid language tag (as listed in the IANA registry).
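For reference, the extended filtering steps from RFC 4647 section 3.3.2 can be sketched in a few lines of Python. This is only an illustration: the function name `extended_filter` is made up here, and the sketch compares subtags purely as strings, without checking that either argument is well-formed or registered.

```python
def extended_filter(lang_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 (extended filtering).

    Hypothetical helper: purely textual, no well-formedness checks.
    """
    # Step 1: split on hyphens; subtags compare case-insensitively.
    range_subtags = lang_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 2: the first subtags must match (or the range starts with "*").
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    # Step 3: walk the remaining range subtags.
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            # 3A: a wildcard in the range is skipped.
            r += 1
        elif t >= len(tag_subtags):
            # 3B: the tag ran out of subtags.
            return False
        elif range_subtags[r] == tag_subtags[t]:
            # 3C: subtags match, advance both lists.
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            # 3D: a singleton in the tag stops the search.
            return False
        else:
            # 3E: skip this tag subtag and keep looking.
            t += 1

    # Step 4: every range subtag was consumed.
    return True
```

With this sketch, `extended_filter("qq", "qq")` and `extended_filter("de-DE", "de-Latn-DE")` both return `True`, while `extended_filter("de-DE", "de-x-DE")` returns `False` because the singleton `x` stops the search.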

However, the Selectors spec does not specifically address how ill-formed (not merely invalid) tags should be handled.

According to the language tag syntax given in https://www.rfc-editor.org/rfc/rfc5646#section-2.1, a tag like åå (containing non-ASCII characters) would be ill-formed ("the language tags described in this document are sequences of characters from the US-ASCII [ISO646] repertoire"), as would a tag like en--- (the various subtags following the primary language subtag are optional, but the grammar does not allow for them to be empty; if they're not present, the corresponding hyphen delimiters should also be omitted).
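To make the well-formed versus ill-formed distinction concrete, here is a rough Python check based on the `langtag` and `privateuse` productions of RFC 5646 section 2.1. It is a simplified sketch: the name `is_wellformed_tag` is hypothetical, grandfathered tags are deliberately left out, and nothing is looked up in the IANA registry, so it tests syntax only and not validity.

```python
import re

# Simplified RFC 5646 section 2.1 grammar (assumption: grandfathered tags omitted).
_LANGTAG = re.compile(
    r"""
    (?:
        (?: [a-z]{2,3} (?:-[a-z]{3}){0,3}    # language, optional extlang subtags
          | [a-z]{4}                         # reserved 4-letter language
          | [a-z]{5,8}                       # registered 5-8 letter language
        )
        (?:-[a-z]{4})?                       # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?          # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*       # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                 # private use suffix
      | x(?:-[a-z0-9]{1,8})+                 # tag made only of private use
    )
    """,
    # re.ASCII keeps case-insensitive matching restricted to US-ASCII letters.
    re.VERBOSE | re.IGNORECASE | re.ASCII,
)


def is_wellformed_tag(tag: str) -> bool:
    """Return True if `tag` fits the simplified grammar above."""
    return _LANGTAG.fullmatch(tag) is not None
```

Under this sketch `qq` is accepted (well-formed, even if not registered), while `åå` and `en---` are rejected.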

So how does :lang() matching work in the presence of ill-formed codes? It seems to me that a literal reading of the spec requires that such codes never match, because its definition of "matches" depends on "when represented in BCP 47 syntax", and such ill-formed codes cannot be represented in BCP 47 at all; they conflict with its basic grammar.

A possible alternative interpretation might be that the handling of ill-formed codes is simply undefined (because the spec only addresses what it means to "match" for codes "represented in BCP 47 syntax").

I'm not aware of any compelling use case for ill-formed language codes. So in the interests of clarity and interoperability I would like to ask the WG to confirm (and explicitly note in the spec) that :lang() matching is based strictly on BCP 47 and RFC4647, and as such, ill-formed codes never match.

(Note that the current implementation in WebKit does allow ill-formed tags to match. Thus if content is tagged with lang="SomeRandomCode-Latn-US", which is ill-formed because the primary language subtag is too long, it is nevertheless matched by :lang(SomeRandomCode), :lang("*-US"), etc. I think this should be considered a bug in the implementation.)

@zcorpan zcorpan added the selectors-4 Current Work label Apr 14, 2023
@zcorpan zcorpan changed the title [css-selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code [selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code Apr 14, 2023
@zcorpan zcorpan added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Apr 14, 2023
@tabatkins
Member

Right, this is undefined in the spec right now. If the :lang() argument can't be interpreted in BCP-47 syntax, it should definitely just fail to match straight away.
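One way to read "can't be interpreted in BCP-47 syntax" for the argument side is to test it against the `extended-language-range` grammar in RFC 4647 section 2.2. The Python sketch below does that; `is_wellformed_range` is a hypothetical name, and the input is assumed to be the range text after CSS parsing has already resolved any quoting or escapes.

```python
import re

# RFC 4647 section 2.2:
#   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
_EXTENDED_RANGE = re.compile(
    r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*",
    re.ASCII,
)


def is_wellformed_range(lang_range: str) -> bool:
    """Return True if `lang_range` is a syntactically valid extended range."""
    return _EXTENDED_RANGE.fullmatch(lang_range) is not None
```

Under this reading, `*-US` and `en` are usable ranges, while `SomeRandomCode` (first subtag longer than eight letters), `en---`, and any range containing a backslash would fail to match before filtering even starts.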

@zcorpan
Member

zcorpan commented Apr 19, 2023

@tabatkins what if the :lang() argument is a well-formed language range but the language tag for an element isn't in BCP-47 syntax? e.g. lang="SomeRandomCode-Latn-US"

@jfkthame
Contributor Author

lang="SomeRandomCode-Latn-US"

That tag cannot be parsed according to BCP47 syntax, and therefore we have no basis to assume anything about any part of it, such as the meaning of the Latn or US substrings in it, or even that the hyphens delimit "subtags" at all.

Therefore, IMO it should not match any :lang() pseudo, not even :lang("*").

@Crissov
Contributor

Crissov commented Apr 19, 2023

Twenty years ago, I briefly argued that implementations should be free to use external knowledge about the data format in order to map random language information to BCP-47.

https://lists.w3.org/Archives/Public/www-style/2003Oct/0234.html

mgol added a commit to mgol/jquery that referenced this issue Jun 26, 2023
Firefox 114+ no longer match on backslashes in `:lang()`, even when escaped.
It is an intentional change as `:lang()` parameters are supposed to be valid
BCP 47 strings. Therefore, we won't attempt to patch it.
We'll keep this test here until other browsers match the behavior.

Fixes jquerygh-5271
Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1839747#c1
Ref w3c/csswg-drafts#8720 (comment)
mgol added a commit to jquery/jquery that referenced this issue Jun 27, 2023
Fixes gh-5271
Closes gh-5277
Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1839747#c1
Ref w3c/csswg-drafts#8720 (comment)
mgol added a commit to jquery/jquery that referenced this issue Jun 27, 2023
Fixes gh-5271
Closes gh-5277
Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1839747#c1
Ref w3c/csswg-drafts#8720 (comment)

(cherry picked from commit 62b9a25)
mgol added a commit to jquery/sizzle that referenced this issue Sep 7, 2023
Ref jquery/jquery#5271
Ref jquery/jquery#5277
Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1839747#c1
Ref w3c/csswg-drafts#8720 (comment)