Transcoding HTML

  • David Dorward

    Transcoding HTML

    I'm sure that I read somewhere that an HTML document might be
    transcoded to a different character set at some stage in its journey,
    so while it might start out as (for example) ISO-8859-15, by the time
    it is actually viewed it's been converted to UTF-8. Maybe by whatever
    the author used to upload the document to the server, maybe by a proxy,
    maybe by the user agent (if it saves it to disk), maybe by the httpd
    in some content negotiation.

    Does anybody have any information on systems that do this in practice?

    --
    David Dorward
    David Dorward's mostly neglected blog

  • Alan J. Flavell

    #2
    Re: Transcoding HTML

    On Tue, 28 Oct 2003, David Dorward wrote:
    > I'm sure that I read somewhere that an HTML document might be
    > transcoded to a different character set at some stage in its journey,
    > so while it might start out as (for example) ISO-8859-15, by the time
    > it is actually viewed it's been converted to UTF-8.

    In theory this is true. In practice the use of such transcoding
    features in servers or proxies seems to be confined to particular
    communities where, for whatever reason, several incompatible character
    codings are in use. I heard of Japanese transcoding proxies, but the
    only ones I met directly were Russian ones, see Russian Apache for
    details.

    There's a URL here http://apache.lexa.ru/english/meta-http-eng.html
    (with a rather remarkable figurehead ;-) but I suspect it may be out
    of date. Still, it'll give you the flavour of the thing, I guess.


    • Andrew Graham

      #3
      Re: Transcoding HTML

      David Dorward wrote:
      > I'm sure that I read somewhere that an HTML document might be
      > transcoded to a different character set at some stage in its journey,
      > so while it might start out as (for example) ISO-8859-15, by the time
      > it is actually viewed it's been converted to UTF-8. Maybe by whatever
      > the author used to upload the document to the server, maybe by a proxy,
      > maybe by the user agent (if it saves it to disk), maybe by the httpd
      > in some content negotiation.
      >
      > Does anybody have any information on systems that do this in practice?

      IE6 will often do this when saving a document locally. The FileSave
      dialog box lets the user choose an encoding, and an appropriate element
      like
      <META http-equiv=Content-Type content="text/html; charset=utf-8">
      is added or changed depending on whether the document had the element
      originally.
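
      For illustration only, the "re-encode and fix up the META element"
      step can be sketched in Python. This is not how IE6 does it
      internally; the function name transcode_html and the regex are
      assumptions made for this sketch, and a regex is a simplification
      rather than a real HTML parser.

```python
import re

# Matches the charset value inside a <meta ... charset=...> declaration.
# A simplification for illustration, not a real HTML parser.
META_CHARSET = re.compile(
    r'(<meta\b[^>]*\bcharset\s*=\s*["\']?)([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)
HEAD_OPEN = re.compile(r'(<head\b[^>]*>)', re.IGNORECASE)


def transcode_html(data: bytes, source: str, target: str) -> bytes:
    """Decode `data` from `source`, update the meta charset, encode as `target`."""
    text = data.decode(source)
    # Change an existing declaration so that it names the new encoding.
    text, count = META_CHARSET.subn(lambda m: m.group(1) + target, text, count=1)
    if count == 0:
        # No declaration found: add one right after <head>, if present.
        meta = ('<meta http-equiv="Content-Type" '
                'content="text/html; charset=%s">' % target)
        text = HEAD_OPEN.sub(lambda m: m.group(1) + meta, text, count=1)
    return text.encode(target)
```

      Characters that the target encoding cannot represent make the
      final encode() raise UnicodeEncodeError, which is where a real
      tool would have to fall back to character references.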

      Other changes that are made:
      - <!DOCTYPE...> (HTML4.0 trans.) is added if it wasn't there.
      - <META content="MSHTML 6.00.2800.1264" name=GENERATOR> is added
      - All the elements are capitalized.
      - Line breaks are adjusted.
      - Quotes around attribute values are stripped where not required.
      - Numeric character references like &#169; may be rewritten as the
      actual character if supported by the encoding.
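
      The last item in the list can be sketched in Python as follows.
      The name expand_ncrs and the choice to leave markup-significant
      characters escaped are assumptions of this sketch, not observed
      IE6 behaviour.

```python
import re

# Decimal (&#169;) and hexadecimal (&#xA9;) numeric character references.
NCR = re.compile(r'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')

# Characters left as references because they are significant in markup.
KEEP_ESCAPED = {'<', '>', '&', '"'}


def expand_ncrs(text: str, encoding: str) -> str:
    """Replace each reference with its character when `encoding` can encode it."""
    def replace(match):
        if match.group(1):
            code = int(match.group(1), 16)
        else:
            code = int(match.group(2))
        try:
            char = chr(code)
            char.encode(encoding)
        except (ValueError, OverflowError):
            # Out of range or not representable: keep the reference.
            return match.group(0)
        if char in KEEP_ESCAPED:
            return match.group(0)
        return char
    return NCR.sub(replace, text)
```

      With ISO-8859-1 as the target, &#169; becomes the literal
      copyright sign, while &#8364; stays a reference because that
      encoding has no euro sign.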

      I'm sure more changes are made, but I noticed these in a quick
      examination.

      I'll speculate that IE6 creates the new document from its internal
      representation without reference to the original source.

      Even more oddly, sometimes the document is saved as a verbatim copy of
      the source. Perhaps this only happens when the declared encoding and the
      user's chosen encoding are identical.

      Andrew Graham



      • Jim Ley

        #4
        Re: Transcoding HTML

        On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
        <andrewgraham.at.att.net@nospam.invalid> wrote:

        >I'll speculate that IE6 creates the new document from its internal
        >representation without reference to the original source.

        Yes it's a representation of the document tree, and bears no relation
        to the original source.

        >Even more oddly, sometimes the document is saved as a verbatim copy of
        >the source. Perhaps this only happens when the declared encoding and the
        >user's chosen encoding are identical.

        It normally depends on whether you say "save web page complete" or
        "save web page html only": the first is a normalised source, the
        second the actual source.

        Jim.
        --
        comp.lang.javascript FAQ - http://jibbering.com/faq/


        • Alan J. Flavell

          #5
          Re: Transcoding HTML

          On Tue, 28 Oct 2003, Andrew Graham wrote:
          > IE6 will often do this when saving a document locally.

          Good point. Mozilla Composer can also do this when one chooses an
          encoding and then saves the edited document.

          I thought the questioner was more interested in automated transcoding
          in servers and proxies...?


          • David Dorward

            #6
            Re: Transcoding HTML

            Alan J. Flavell wrote:
            > I thought the questioner was more interested in automated transcoding
            > in servers and proxies...?

            No no, any system that does it is of interest.

            --
            David Dorward http://dorward.me.uk/


            • Alan J. Flavell

              #7
              Re: Transcoding HTML

              On Tue, 28 Oct 2003, David Dorward wrote:
              > > I thought the questioner was more interested in automated transcoding
              > > in servers and proxies...?
              >
              > No no, any system that does it is of interest.

              Well, you're in the best position to know what you're interested in
              ;-) so please excuse me for assuming. Can't think of any other
              examples at the moment though.


              • Nick Kew

                #8
                Re: Transcoding HTML

                In article <caa3f16.0310280618.6f4001eb@posting.google.com>, one of infinite monkeys
                at the keyboard of dorward@yahoo.com (David Dorward) wrote:
                > I'm sure that I read somewhere that an HTML document might be
                > transcoded to a different character set at some stage in its journey,
                > so while it might start out as (for example) ISO-8859-15, by the time
                > it is actually viewed it's been converted to UTF-8.

                Yes, there are certainly reasons why that might happen.

                Most markup parsers work internally with a selected charset, and
                transcode documents at input. They can transcode back on output, but this
                is then an extra overhead. Several of my modules generate all output
                as UTF-8, leaving you the option to filter it through a transcoding
                module if you want something else. XSLT of course has its own rules,
                but will typically be fastest if you use the processor's internal
                charset for output.
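
                As a rough illustration of the "filter the output through
                a transcoding module" idea, here is a Python sketch using
                the standard codecs module. It is not one of the Apache
                modules mentioned above; the name transcode_chunks is
                made up for this example.

```python
import codecs
from typing import Iterable, Iterator


def transcode_chunks(chunks: Iterable[bytes], source: str, target: str) -> Iterator[bytes]:
    """Yield `chunks` re-encoded from `source` to `target`, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(source)()
    encoder = codecs.getincrementalencoder(target)()
    for chunk in chunks:
        # The decoder buffers an incomplete multi-byte sequence until the
        # next chunk arrives, so a chunk boundary may fall mid-character.
        text = decoder.decode(chunk)
        if text:
            yield encoder.encode(text)
    # Flush anything still buffered; raises if the input ended mid-character.
    tail = decoder.decode(b'', final=True)
    out = encoder.encode(tail, final=True)
    if out:
        yield out
```

                The incremental decoder is what lets this run as a
                streaming filter: the generator can sit between whatever
                produces UTF-8 and whatever sends bytes to the client
                without holding the whole document in memory.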

                > Does anybody have any information on systems that do this in practice?

                Come and see my talk at ApacheCon!

                --
                Nick Kew

                In urgent need of paying work - see http://www.webthing.com/~nick/cv.html


                • Henri Sivonen

                  #9
                  Re: Transcoding HTML

                  In article <3f9eb19a.99017038@news.cis.dfn.de>,
                  jim@jibbering.com (Jim Ley) wrote:

                  > On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
                  > <andrewgraham.at.att.net@nospam.invalid> wrote:
                  >
                  > >I'll speculate that IE6 creates the new document from its internal
                  > >representation without reference to the original source.
                  >
                  > Yes it's a representation of the document tree, and bears no relation
                  > to the original source.

                  However, if the document is reparsed, the new tree is not necessarily
                  the same due to whitespace introduced by pretty printing, which may
                  affect scripts. Also, due to the doctype change, the layout mode may be
                  different after reparse.
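
                  The whitespace point can be seen with a small Python
                  sketch. Python's html.parser stands in for a browser's
                  parser here, which is an assumption of the example; the
                  class and function names are made up for it.

```python
from html.parser import HTMLParser


class TextNodeCounter(HTMLParser):
    """Count the text runs a parser reports, including whitespace-only ones."""

    def __init__(self):
        super().__init__()
        self.text_nodes = 0

    def handle_data(self, data):
        self.text_nodes += 1


def count_text_nodes(markup: str) -> int:
    counter = TextNodeCounter()
    counter.feed(markup)
    counter.close()
    return counter.text_nodes


# The same list, as originally written and after pretty printing.
compact = '<ul><li>one</li><li>two</li></ul>'
pretty = '<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>'

print(count_text_nodes(compact), count_text_nodes(pretty))
```

                  The pretty-printed version reports extra
                  whitespace-only text runs between the tags, so a script
                  that walks child nodes by position can land on a
                  different node after the document is saved and
                  reparsed.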

                  --
                  Henri Sivonen
                  hsivonen@iki.fi

                  Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


                  • Andreas Prilop

                    #10
                    Re: Transcoding HTML

                    "Andrew Graham" <andrewgraham.at.att.net@nospam.invalid> wrote:

                    > IE6 will often do this when saving a document locally.

                    Don't do this then. Rather choose "View source" and save in your text
                    editor.
