According to https://www.w3.org/TR/selectors-4/#the-lang-pseudo,
"An element’s content language matches a language range if, when represented in BCP 47 syntax [BCP47], it matches that language range in an extended filtering operation per [RFC4647] Matching of Language Tags (section 3.3.2)."
The text also goes on to mention that "The language range does not need to be a valid language code to perform this comparison" [my emphasis], which implies, as I understand it, that something like :lang("qq") will match content tagged with lang="qq" even though qq is not a valid language tag (as listed in the IANA registry).
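For illustration, here is a minimal Python sketch of the extended filtering steps from RFC 4647 section 3.3.2, as I read them. The function name extended_filter_match is my own, and lowercasing with str.lower() is a simplification of the RFC's case-insensitive comparison. Note that the sketch does no validity or well-formedness checking at all; it only compares hyphen-separated subtags.

```python
def extended_filter_match(language_range: str, language_tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (hypothetical helper)."""
    # Split both inputs into subtags; comparison is case-insensitive.
    range_subtags = language_range.lower().split("-")
    tag_subtags = language_tag.lower().split("-")

    # Step 1: the first subtags must match, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    # Steps 2-3: walk the remaining range subtags against the tag subtags.
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # a wildcard matches zero or more subtags
        elif t >= len(tag_subtags):
            return False                # the tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton such as "x"
        else:
            t += 1                      # skip a non-matching tag subtag

    # Step 4: every range subtag was consumed, so the tag matches.
    return True


print(extended_filter_match("qq", "qq"))            # True
print(extended_filter_match("en", "en-Latn-US"))    # True
print(extended_filter_match("*-US", "en-Latn-US"))  # True
print(extended_filter_match("en-US", "en-x-US"))    # False
```

Under this reading, qq matches qq simply because the subtags compare equal; nothing in the filtering steps themselves consults the IANA registry.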
However, the Selectors spec does not specifically address how ill-formed (not merely invalid) tags should be handled.
According to the language tag syntax given in https://www.rfc-editor.org/rfc/rfc5646#section-2.1, a tag like åå (containing non-ASCII characters) would be ill-formed ("the language tags described in this document are sequences of characters from the US-ASCII [ISO646] repertoire"), as would a tag like en--- (the various subtags following the primary language subtag are optional, but the grammar does not allow for them to be empty; if they're not present, the corresponding hyphen delimiters should also be omitted).
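To make the distinction between "invalid" and "ill-formed" concrete, here is a rough Python sketch of a well-formedness check based on my transcription of the langtag and privateuse productions in RFC 5646 section 2.1. The name is_well_formed is hypothetical, and the sketch deliberately omits the grandfathered tags, so a tag such as i-klingon would be wrongly reported as ill-formed. It checks syntax only and never consults the IANA registry.

```python
import re

# Approximation of the RFC 5646 "langtag" and "privateuse" productions.
# Assumption: grandfathered tags are intentionally left out of this sketch.
_LANGTAG = re.compile(
    r"""
    (?:
        (?: [a-z]{2,3} (?: -[a-z]{3} ){0,3}   # 2-3 letter language, up to 3 extlangs
          | [a-z]{4}                          # reserved for future use
          | [a-z]{5,8} )                      # registered language subtag
        (?: -[a-z]{4} )?                      # script
        (?: -(?: [a-z]{2} | [0-9]{3} ) )?     # region
        (?: -(?: [a-z0-9]{5,8} | [0-9][a-z0-9]{3} ) )*   # variants
        (?: -[0-9a-wy-z] (?: -[a-z0-9]{2,8} )+ )*        # extensions
        (?: -x (?: -[a-z0-9]{1,8} )+ )?                  # private use
      |
        x (?: -[a-z0-9]{1,8} )+               # private-use-only tag
    )
    """,
    # re.ASCII keeps case-insensitive matching restricted to ASCII letters.
    re.VERBOSE | re.IGNORECASE | re.ASCII,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the whole tag fits the (simplified) BCP 47 grammar."""
    return _LANGTAG.fullmatch(tag) is not None


for tag in ["qq", "en-Latn-US", "åå", "en---", "SomeRandomCode-Latn-US"]:
    print(tag, is_well_formed(tag))
```

With this check, qq and en-Latn-US are well-formed (qq is merely unregistered), while åå, en--- and SomeRandomCode-Latn-US are rejected on syntax alone.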
So how does :lang() matching work in the presence of ill-formed codes? It seems to me that a literal reading of the spec requires that such codes never match, because its definition of "matches" depends on "when represented in BCP 47 syntax", and such ill-formed codes cannot be represented in BCP 47 at all; they conflict with its basic grammar.
A possible alternative interpretation might be that the handling of ill-formed codes is simply undefined (because the spec only addresses what it means to "match" for codes "represented in BCP 47 syntax").
I'm not aware of any compelling use case for ill-formed language codes. So in the interests of clarity and interoperability I would like to ask the WG to confirm (and explicitly note in the spec) that :lang() matching is based strictly on BCP 47 and RFC 4647, and as such, ill-formed codes never match.
(Note that the current implementation in WebKit does allow ill-formed tags to match. Thus if content is tagged with lang="SomeRandomCode-Latn-US", which is ill-formed because the primary language subtag is too long, it is nevertheless matched by :lang(SomeRandomCode), :lang("*-US"), etc. I think this should be considered a bug in the implementation.)