[selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code #8720

Open
@jfkthame

According to https://www.w3.org/TR/selectors-4/#the-lang-pseudo,

An element’s content language matches a language range if, when represented in BCP 47 syntax [BCP47], it matches that language range in an extended filtering operation per [RFC4647] Matching of Language Tags (section 3.3.2).

The text also goes on to mention that

The language range does not need to be a valid language code to perform this comparison.

[my emphasis] which implies, as I understand it, that something like :lang("qq") will match content tagged with lang="qq" even though qq is not a valid language tag (it is well-formed, but it is not listed in the IANA registry).
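
For reference, the extended filtering operation from RFC 4647 section 3.3.2 can be sketched roughly as follows. This is a minimal illustration, not code from any browser; the function name `extended_filter` is hypothetical, and the sketch deliberately performs no validity or well-formedness checking of either argument.

```python
def extended_filter(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (hypothetical helper)."""
    # Step 1: split into subtags; comparison is case-insensitive.
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")

    # Step 2: the first subtags must match, unless the range starts with "*".
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False

    # Step 3: walk the remaining range subtags against the tag subtags.
    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # wildcard: skip to the next range subtag
        elif t >= len(tag_parts):
            return False                # the tag ran out of subtags
        elif range_parts[r] == tag_parts[t]:
            r += 1                      # matched: advance both
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # a singleton in the tag blocks skipping
        else:
            t += 1                      # skip an unmatched tag subtag

    # Step 4: every range subtag was consumed, so the tag matches.
    return True
```

Under this sketch, `extended_filter("qq", "qq")` returns `True` because nothing in the operation itself consults the registry.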

However, the Selectors spec does not specifically address how ill-formed (not merely invalid) tags should be handled.

According to the language tag syntax given in https://www.rfc-editor.org/rfc/rfc5646#section-2.1, a tag like åå (containing non-ASCII characters) would be ill-formed ("the language tags described in this document are sequences of characters from the US-ASCII [ISO646] repertoire"), as would a tag like en--- (the various subtags following the primary language subtag are optional, but the grammar does not allow for them to be empty; if they're not present, the corresponding hyphen delimiters should also be omitted).
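
To make the valid/well-formed distinction concrete, here is a rough sketch of a well-formedness check based on the `langtag` and `privateuse` productions in RFC 5646 section 2.1. The name `is_well_formed` is hypothetical, and the sketch is an approximation: it omits the grandfathered tags (so, for example, `en-GB-oed` would be rejected) and does not check for duplicate variants or extension singletons.

```python
import re

# Approximation of the RFC 5646 section 2.1 grammar (langtag / privateuse only).
# re.ASCII keeps case-insensitive matching restricted to US-ASCII letters.
_LANGTAG = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, optional extlang
        (?:-[a-z]{4})?                                        # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                            # trailing private use
      |
        x(?:-[a-z0-9]{1,8})+                                  # private-use-only tag
    )
    """,
    re.VERBOSE | re.IGNORECASE | re.ASCII,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag fits the (approximated) BCP 47 grammar."""
    return _LANGTAG.fullmatch(tag) is not None
```

With this sketch, `qq` is accepted (well-formed, though not registered), while `åå` and `en---` are rejected.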

So how does :lang() matching work in the presence of ill-formed codes? It seems to me that a literal reading of the spec requires that such codes never match, because its definition of "matches" depends on "when represented in BCP 47 syntax", and such ill-formed codes cannot be represented in BCP 47 at all; they conflict with its basic grammar.

A possible alternative interpretation might be that the handling of ill-formed codes is simply undefined (because the spec only addresses what it means to "match" for codes "represented in BCP 47 syntax").

I'm not aware of any compelling use case for ill-formed language codes. So in the interests of clarity and interoperability I would like to ask the WG to confirm (and explicitly note in the spec) that :lang() matching is based strictly on BCP 47 and RFC4647, and as such, ill-formed codes never match.

(Note that the current implementation in WebKit does allow ill-formed tags to match. Thus if content is tagged with lang="SomeRandomCode-Latn-US", which is ill-formed because the primary language subtag is too long, it is nevertheless matched by :lang(SomeRandomCode), :lang("*-US"), etc. I think this should be considered a bug in the implementation.)
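
The difference between the two behaviors can be sketched as below. This is an illustration only, not WebKit's code: all function names are hypothetical, the well-formedness test is a deliberately coarse necessary condition (non-empty subtags of 1 to 8 ASCII alphanumerics) rather than the full RFC 5646 grammar, and applying the check to the range as well as to the content language is an assumption about how the proposal might be read.

```python
import re

# Coarse necessary condition only: each subtag is 1 to 8 ASCII alphanumerics.
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def roughly_well_formed(tag: str, allow_wildcard: bool = False) -> bool:
    return all(
        (allow_wildcard and part == "*") or _SUBTAG.fullmatch(part) is not None
        for part in tag.split("-")
    )


def extended_filter(language_range: str, tag: str) -> bool:
    """RFC 4647 section 3.3.2 extended filtering, with no well-formedness check."""
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False
    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1
        elif t >= len(tag_parts):
            return False
        elif range_parts[r] == tag_parts[t]:
            r += 1
            t += 1
        elif len(tag_parts[t]) == 1:
            return False
        else:
            t += 1
    return True


def lang_matches(language_range: str, tag: str, strict: bool) -> bool:
    """strict=True models the proposed rule: ill-formed codes never match."""
    if strict and not (
        roughly_well_formed(language_range, allow_wildcard=True)
        and roughly_well_formed(tag)
    ):
        return False
    return extended_filter(language_range, tag)
```

With `strict=False`, `lang_matches("*-US", "SomeRandomCode-Latn-US", strict=False)` returns `True`, mirroring the lenient behavior described above; with `strict=True` it returns `False`, while a well-formed tag such as `en-Latn-US` still matches `*-US`.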

Labels: i18n-tracker, selectors-4
