Lang attribute values

  • Neal

    Lang attribute values


    Been searching around, and found
    http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
    looking for a guide to what codes are acceptable.

    I see stuff like lang="en-us" - that extension, where is that from? Is
    there a codification somewhere?
  • Harlan Messinger

    #2
    Re: Lang attribute values

    Neal <neal413@spamrcn.com> wrote:
    >
    >Been searching around, and found
    >http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
    >looking for a guide to what codes are acceptable.
    >
    >I see stuff like lang="en-us" - that extension, where is that from? Is
    >there a codification somewhere?

    RFC 1766: http://www.ietf.org/rfc/rfc1766.txt

    --
    Harlan Messinger
    Remove the first dot from my e-mail address.
    Veuillez ôter le premier point de mon adresse de courriel.

    • Jukka K. Korpela

      #3
      Re: Lang attribute values

      Harlan Messinger <hmessinger.removethis@comcast.net> wrote:
      >>I see stuff like lang="en-us" - that extension, where is that from?
      >>Is there a codification somewhere?
      >
      > RFC 1766: http://www.ietf.org/rfc/rfc1766.txt

      RFC 1766 has been superseded by RFC 3066 and RFC 3282.

      For more info on language codes see
      Information on the use of different human languages on Web sites.



      I'm afraid there's only one detailed survey of language codes in HTML,
      and it's in Finnish ( http://www.cs.tut.fi/~jkorpela/kielimerkkaus/ )
      and I have no plans for translating it. But don't worry. For the most
      part, language markup is an exercise in writing theoretically
      correct markup, and even the W3C doesn't take that job seriously on
      their own pages (including the pages on language markup).

      In particular, it's best to use lang="en". The country specifier hardly
      helps anyone in the present world.

      --
      Yucca, http://www.cs.tut.fi/~jkorpela/
      Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

      • Neal

        #4
        Re: Lang attribute values

        On Wed, 21 Jan 2004 07:13:34 -0500, Harlan Messinger
        <hmessinger.removethis@comcast.net> wrote:
        > Neal <neal413@spamrcn.com> wrote:
        >
        >>
        >> Been searching around, and found
        >> http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
        >> looking for a guide to what codes are acceptable.
        >>
        >> I see stuff like lang="en-us" - that extension, where is that from? Is
        >> there a codification somewhere?
        >
        > RFC 1766: http://www.ietf.org/rfc/rfc1766.txt
        >

        Thanks. The subtag - what I'm getting from all this is that we basically
        make up something. I'm not confident that's accurate. Are there set limits
        on what the subtag can actually be, aside from the broad types listed in
        the document you linked? I'm imagining there must be a list of those
        floating around... but I'm not finding them.

        • Jukka K. Korpela

          #5
          Re: Lang attribute values

          Neal <neal413@spamrcn.com> wrote:
          > The subtag - what I'm getting from all this is that we basically
          > make up something.

          Well, all the language codes have been made up by some people. The
          correct way to define a new subcode for a language code is to register
          it at IANA. But two-letter subcodes are reserved for use as country
          codes.

          > I'm not confident that's accurate.

          If you register a subcode, it's mostly up to you how accurate your
          definition is.

          > Are there set
          > limits on what the subtag can actually be, aside from the broad
          > types listed in the document you linked?

          See RFC 3066.

          On the other hand, why would you use a subcode? Given the fact that
          most software that _could_ make use of language markup (such as
          browsers, search engines, and page editing tools) make almost no use of
          it, and make _wrong_ use at times, even for the most basic and common
          language codes like "en" or "de", is there any reason to play with
          anything that isn't even registered yet? (I don't expect registration
          to do much good per se.)


          • Tim

            #6
            Re: Lang attribute values

            Neal <neal413@spamrcn.com> wrote:
            >>> I see stuff like lang="en-us" - that extension, where is that from? Is
            >>> there a codification somewhere?

            Harlan Messinger <hmessinger.removethis@comcast.net> wrote:
            >> RFC 1766: http://www.ietf.org/rfc/rfc1766.txt

            Neal <neal413@spamrcn.com> wrote:
            > Thanks. The subtag - what I'm getting from all this is that we basically
            > make up something. I'm not confident that's accurate. Are there set limits
            > on what the subtag can actually be, aside from the broad types listed in
            > the document you linked? I'm imagining there must be a list of those
            > floating around... but I'm not finding them.

            Well, unless you're inventing something new, they're a country code
            (e.g. en-us for U.S.A. English, en-au for Australian English, etc.).
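
For anyone who wants to check the shape of a tag mechanically, here is a minimal Python sketch of the syntax as RFC 3066 describes it: a primary subtag of 1 to 8 letters, then hyphen-separated subtags of 1 to 8 letters or digits, with a two-letter second subtag read as a country code. The function name split_language_tag is made up for illustration, and the code checks syntax only; it does not look anything up in the ISO or IANA lists.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, then zero or more
# subtags of 1-8 letters or digits, separated by hyphens.
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def split_language_tag(tag):
    """Split a tag such as 'en-us' into (primary, country, rest).

    'country' is set only when the second subtag is exactly two letters,
    which RFC 3066 reserves for ISO 3166 country codes. This is a shape
    check only (hypothetical helper, not part of any library).
    """
    if not _TAG_RE.fullmatch(tag):
        raise ValueError("not a well-formed language tag: %r" % tag)
    parts = tag.split("-")
    primary = parts[0].lower()
    country = None
    rest = parts[1:]
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        country = rest[0].upper()
        rest = rest[1:]
    return primary, country, rest


print(split_language_tag("en-us"))      # ('en', 'US', [])
print(split_language_tag("de"))         # ('de', None, [])
print(split_language_tag("i-klingon"))  # ('i', None, ['klingon'])
```

Tags are case-insensitive, so the sketch normalises the primary subtag to lower case and the country code to upper case purely for readability.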

            --
            My "from" address is totally fake. The reply-to address is real, but
            may be only temporary. Reply to usenet postings in the same place as
            you read the message you're replying to.

            This message was sent without a virus, please delete some files yourself.

            • Alan J. Flavell

              #7
              Re: Lang attribute values

              On Wed, 21 Jan 2004, Neal wrote:
              > Thanks. The subtag - what I'm getting from all this is that we basically
              > make up something.

              Oh no we don't!!!

              > I'm not confident that's accurate. Are there set limits
              > on what the subtag can actually be,

              They're country codes per the appropriate ISO specification (ISO 3166).

              But as usual, the major vendor's dirty tricks department have made
              sure that the specified interworking protocol will fail. For example,
              someone who has installed their operating system component in Austria
              will be presenting "Accept-language: de-AT", as I've seen in our
              server logs (no, we don't have any Austrian German pages on our
              server, sorry), which is supposed to mean that they accept only
              Austrian German. So, even generic German documents appear to be
              unacceptable to them, unless they know enough about it to override the
              installation defaults.
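
To make the failure concrete, here is a minimal Python sketch of the prefix-matching rule from RFC 2616, section 14.4. The function name range_matches is hypothetical; it only shows why a client that sends nothing but "de-AT" does not match a document labelled plain "de", while the reverse does match.

```python
def range_matches(language_range, language_tag):
    """Return True if an Accept-Language range covers a document's tag.

    RFC 2616, section 14.4: a range matches when it equals the tag, or
    equals a prefix of the tag that is immediately followed by '-'.
    The special range '*' matches any tag. (Hypothetical helper name.)
    """
    r = language_range.lower()
    t = language_tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


# A reader who asks for generic German is served Austrian German pages...
print(range_matches("de", "de-AT"))   # True
# ...but a reader who asks only for Austrian German is not served generic German.
print(range_matches("de-AT", "de"))   # False
```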

              I'm sure that's a part of why Jukka advised you that the mechanism
              isn't practical for use (no, I can't read Finnish and I don't trust
              the babelfish, so I can only guess what's on his page). He's entitled
              to his view, but with a bit of pragmatism (all multilingual web pages
              should offer _some_[1] way to access alternative languages explicitly)
              I'd say it's usable, with a bit of care.

              As I've recently discovered: at least it isn't as hopelessly broken as
              that same operating system component's implementation of content-type
              negotiation. I'd say: aim at the users of any protocol-conforming WWW
              browser, while making appropriate provision to pander tolerably to the
              operating system component. In that sense, accept-language
              negotiation is workable, given a bit of care and attention.

              Would this page be of any use? The Apache supporters were kind enough
              to cite it, so I suppose it's not too bad, at least as a starting
              point: http://ppewww.ph.gla.ac.uk/~flavell/www/lang-neg.html

              have fun

              [1] No "flags of nations" as markers of language, please. Only
              recently I landed on a French web page that insisted on me clicking
              the Stars and Stripes to get English. Wibble.

              • Andreas Prilop

                #8
                Re: Lang attribute values

                On Wed, 21 Jan 2004, Jukka K. Korpela wrote:
                > Given the fact that
                > most software that _could_ make use of language markup (such as
                > browsers, search engines, and page editing tools) make almost no use of
                > it, and make _wrong_ use at times, even for the most basic and common
                > language codes like "en" or "de", [ ... ]

                Mozilla/Netscape uses the value of the LANG attribute to determine
                the typeface in which the corresponding text is displayed.
                <http://ppewww.ph.gla.ac.uk/~flavell/charset/browsers-fonts.html>

                • Jukka K. Korpela

                  #9
                  Re: Lang attribute values

                  Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:
                  > Mozilla/Netscape uses the value of the LANG attribute to determine
                  > the typeface in which the corresponding text is displayed.

                  That's an example of what I meant by _wrong_ use.

                  If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
                  name to appear in a fancy font just because a browser makes foolish
                  guesses. That's why I recommend that lang markup be not used for
                  transliterated texts. (This violates WAI requirements, since the
                  language of the text is surely not changed in transliteration. But WAI
                  pages themselves violate the rule of marking up _all_ language changes,
                  which they present as a Priority 1 requirement.)


                  • Jukka K. Korpela

                    #10
                    Re: Lang attribute values

                    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

                    [ regarding poor defaults in browsers, making a particular dialect
                    the only alternative declared in Accept-Language: ]
                    > I'm sure that's a part of why Jukka advised you that the mechanism
                    > isn't practical for use (no, I can't read Finnish and I don't trust
                    > the babelfish, so I can only guess what's on his page).

                    No, my Finnish-only page on language markup doesn't really discuss
                    content negotiation - which is discussed at

                    which is available in English too, via content negotiation.

                    > He's
                    > entitled to his view, but with a bit of pragmatism (all
                    > multilingual web pages should offer _some_[1] way to access
                    > alternative languages explicitly) I'd say it's usable, with a bit
                    > of care.

                    That's my view too, actually. But content negotiation, based on
                    language preferences, is independent of language markup. Content
                    negotiation works for all media types, not just HTML, and if used for
                    HTML, it does not make any use of lang markup.
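
A rough Python sketch of that independence, under stated assumptions: the server looks only at the Accept-Language request header and at its own list of variant tags, never at lang attributes inside the documents. The names parse_accept_language and choose_variant are invented for illustration, the parsing is simplified (it understands only an optional q= weight), and real servers such as Apache use more elaborate rules.

```python
def parse_accept_language(header):
    """Parse 'de-AT, de;q=0.8, en;q=0.5' into [(range, q), ...], best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        q = 1.0  # a range without a weight counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # assumption: treat an unparsable weight as unacceptable
        prefs.append((lang.strip().lower(), q))
    # sorted() is stable, so equal weights keep their order from the header
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_variant(header, available, default):
    """Return the first available variant tag covered by the client's ranges."""
    for lang_range, q in parse_accept_language(header):
        if q <= 0:
            continue
        for tag in available:
            t = tag.lower()
            if lang_range == "*" or t == lang_range or t.startswith(lang_range + "-"):
                return tag
    return default


# The installation default discussed earlier in the thread: only "de-AT" is sent.
print(choose_variant("de-AT", ["de", "en"], default="en"))                      # en
# A header that also lists generic German gets the German variant.
print(choose_variant("de-AT, de;q=0.8, en;q=0.5", ["en", "de"], default="en"))  # de
```

The same function would work unchanged for PDF or plain-text variants, since nothing in it reads the documents themselves.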


                    • Henri Sivonen

                      #11
                      Re: Lang attribute values

                      In article <Xns9478C780DD053jkorpelacstutfi@193.229.0.31>,
                      "Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:
                      > Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:
                      >
                      > > Mozilla/Netscape uses the value of the LANG attribute to determine
                      > > the typeface in which the corresponding text is displayed.
                      >
                      > That's an example of what I meant by _wrong_ use.

                      Tim Bray mentions "Things you can't do properly in a language-oblivious
                      way include: Render it on a screen or on paper [...]" as one reason for
                      including xml:lang in XML.
                      (http://www.xml.com/axml/notes/WhyLangs.html)

                      > If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
                      > name to appear in a fancy font just because a browser makes foolish
                      > guesses.

                      In the absence of *script* identification, is Mozilla's behavior really
                      that foolish? How do you suggest the font heuristics should work with
                      UTF-8 (that is, when the dominant script can't be guessed from the
                      encoding)? O(N) character counting over the entire document is not a
                      good solution as it would interfere with incremental display.

                      --
                      Henri Sivonen
                      hsivonen@iki.fi

                      Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

                      • Alan J. Flavell

                        #12
                        Re: Lang attribute values

                        On Thu, 22 Jan 2004, Jukka K. Korpela wrote:
                        > > He's entitled to his view, but with a bit of pragmatism (all
                        > > multilingual web pages should offer _some_[1] way to access
                        > > alternative languages explicitly) I'd say it's usable, with a bit
                        > > of care.
                        >
                        > That's my view too, actually. But content negotiation, based on
                        > language preferences, is independent of language markup. Content
                        > negotiation works for all media types, not just HTML, and if used for
                        > HTML, it does not make any use of lang markup.

                        Fully agreed; and I can see now that I was getting the two issues
                        somewhat tangled. Apologies for any confusion caused.

                        • Andreas Prilop

                          #13
                          Re: Lang attribute values

                          "Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:
                          >> Mozilla/Netscape uses the value of the LANG attribute to determine
                          >> the typeface in which the corresponding text is displayed.
                          >
                          > That's an example of what I meant by _wrong_ use.

                          It may be unintuitive but I don't consider it wrong. If you have a
                          document with "charset=UTF-8", both Mozilla/Netscape and Internet
                          Explorer would display it in the typeface you chose for West European
                          Latin. (Silly idea BTW.) However, if you have some text with "LANG=ar"
                          this will be displayed in your preferred Arabic typeface in Mozilla,
                          which will probably give better results. If you have a document with
                          "charset=ISO-8859-6", text marked with "LANG=en" will nevertheless
                          be displayed in your preferred Latin typeface.

                          The difference is most notable on Mac OS 9, where Arabic and Hebrew
                          typefaces do *not* contain glyphs for ASCII characters. These are
                          taken from other [West European] typefaces.

                          You might inspect these two identical (!) documents
                          <http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html>
                          <http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html6>
                          I haven't used LANG markup for characters of the Arabic script
                          in order to see the difference between "charset=UTF-8" (*.html)
                          and "charset=ISO-8859-6" (*.html6). Text marked with "LANG=en" is
                          always displayed the same in Mozilla/Netscape.

                          > If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
                          > name to appear in a fancy font just because a browser makes foolish
                          > guesses.

                          On the other hand, you might welcome that "Достоевский" written in
                          Cyrillic letters is displayed in your preferred Cyrillic typeface.
                          Anyway, it doesn't make "foolish guesses" but uses *your* preferred
                          typeface for Cyrillic and ASCII Latin.

                          > That's why I recommend that lang markup be not used for
                          > transliterated texts.

                          Good idea! I second that.

                          • Alan J. Flavell

                            #14
                            Re: Lang attribute values

                            On Thu, 22 Jan 2004, Henri Sivonen wrote:
                            > > If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
                            > > name to appear in a fancy font just because a browser makes foolish
                            > > guesses.
                            >
                            > In the absence of *script* identification,

                            Well, the writing system is determined by which Unicode characters are
                            used (unless you're interested in disambiguating the Han unification
                            for CJK languages, about which I know rather little...). What did you
                            mean by "in the absence..."? That string "Dostoyevsky" consists
                            unambiguously of Latin characters! There's no ambiguity about the
                            "script".
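
As a rough illustration of reading the writing system off the characters themselves, here is a small Python sketch. Python's standard library has no direct "script" property, so the sketch assumes a heuristic: it takes the first word of each letter's Unicode character name (for example LATIN or CYRILLIC). The function name scripts_used is made up, and the heuristic says nothing useful about the Han unification case.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script-like name prefixes among the letters of text.

    Heuristic only: 'LATIN CAPITAL LETTER D' gives 'LATIN',
    'CYRILLIC SMALL LETTER O' gives 'CYRILLIC'. Non-letters are skipped.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


print(scripts_used("Dostoyevsky"))   # {'LATIN'}
print(scripts_used("Достоевский"))   # {'CYRILLIC'}
```

This works one character at a time, so it does not require counting over the whole document before anything can be displayed.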

                            If those characters were Arabic, then it would be useful to choose,
                            say, a Persian font if it were known that the language is Farsi.

                            I don't know whether there's a similar desire, if the characters were
                            Cyrillic, of choosing a Russian font as opposed to any other "Cyrillic
                            language" font. But as they aren't Cyrillic characters, that
                            consideration doesn't matter anyway.

                            > is Mozilla's behavior really that foolish?

                            Yes, in this detail I would have to say it is. Those characters are
                            clearly Latin characters, per the HTML character model; it makes no
                            particular sense to display the Latin letters with a Russian flavour -
                            unless you thought it cosmetically appropriate to do so, but
                            then you'd suggest font(s) via CSS if that's what you wanted, surely?

                            > How do you suggest the font heuristics should work with UTF-8

                            What's wrong with displaying Latin characters using the selected Latin
                            font? And so on.

                            OK, if the browser had been configured in a perverse way, maybe the
                            Latin and the Cyrillic fonts would look so massively different that
                            mixed texts would look silly. But that's a configuration option IMHO.

                            Remember that in principle in HTML, language and writing system are
                            meant to be separate attributes. Japanese is still Japanese
                            (language) when transliterated into Roman characters; conversely
                            English is still English (language) when transliterated into Japanese
                            (characters). AFAICT the only exception to this comes indirectly via
                            Unicode and its Han unification (but I'll stop there).

                            • Andreas Prilop

                              #15
                              Re: Lang attribute values

                              "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
                              >> is Mozilla's behavior really that foolish?
                              >
                              > Yes, in this detail I would have to say it is.

                              I do not regard Mozilla's behaviour as foolish. And I think it's
                              a lot better than IE's behaviour.

                              > Those characters are
                              > clearly Latin characters, per the HTML character model; it makes no
                              > particular sense to display the Latin letters with a Russian flavour -

                              What do you mean by "Russian flavour"? Is, e.g., Verdana "Russian
                              flavoured"? Even if some typeface has specific Russian-looking
                              Cyrillic characters, the ASCII characters can still look quite
                              ordinary.

                              <span lang="el">Andreas</span> oops :-)
