CSSOM uses WebIDL’s DOMString type for all string parameters and return values. It corresponds to JavaScript strings: arbitrary sequences of 16-bit code units. These are usually interpreted as UTF-16, but they’re not necessarily well-formed in UTF-16: they can contain unpaired surrogate code units. I sometimes call this encoding WTF-16.
(Character encoding decoders never emit surrogates when decoding bytes from the network, even when decoding UTF-16BE or UTF-16LE. So surrogates can’t end up in a string that way, only through JS.)
WebIDL also defines USVString, which is a Unicode string: a sequence of Unicode scalar values, that is, code points excluding surrogates. When converting to it from a JavaScript string, unpaired surrogates are replaced with the replacement character U+FFFD.
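To make the conversion concrete, here is a minimal sketch of it in plain JavaScript. The function name `toUSVString` is made up for illustration; it is not a WebIDL or platform API, only an approximation of what the conversion does to a string.

```javascript
// Hypothetical helper: approximates the DOMString -> USVString conversion
// by replacing every unpaired surrogate code unit with U+FFFD.
function toUSVString(input) {
  let result = '';
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Lead surrogate: well-formed only if a trail surrogate follows.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result += input[i] + input[i + 1];
        i++; // Skip the trail surrogate that was just consumed.
      } else {
        result += '\uFFFD';
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Trail surrogate with no preceding lead surrogate.
      result += '\uFFFD';
    } else {
      result += input[i];
    }
  }
  return result;
}
```

With this sketch, `toUSVString('a\uD800b')` yields `'a\uFFFDb'`, while a well-formed pair such as `'\uD83D\uDE00'` passes through unchanged.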
As far as I know all major browser engines currently use WTF-16 internally, so they preserve unpaired surrogates "by default" when strings go through various browser components where no code is actively looking for those.
In Firefox, we’re working on a new style system (Stylo, a.k.a. Quantum CSS) where strings are represented with Rust’s native &str type. &str uses UTF-8 bytes for its in-memory representation of Unicode and guarantees (as part of the type’s contract) that these bytes are well-formed in UTF-8. Unicode designed UTF-8 to specifically exclude surrogate code points, in order to be compatible with (well-formed) UTF-16. As a consequence, well-formed UTF-8 (and &str) cannot represent all JavaScript strings without some sort of escape sequence mechanism.
Stylo currently replaces unpaired surrogates with U+FFFD when converting JS strings to UTF-8. This is equivalent to defining WebIDL interfaces with USVString instead of DOMString. This is a deviation from specified and currently-interoperable behavior.
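The same lossy conversion can be observed from script with `TextEncoder`, whose `encode()` method is specified to take a USVString and produce UTF-8. This is only an illustration of the behavior, not Stylo's code, and the helper name `utf8Hex` is made up for this sketch.

```javascript
// TextEncoder always produces well-formed UTF-8, so lone surrogates
// are replaced with U+FFFD before encoding.
const encoder = new TextEncoder();

// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
function utf8Hex(s) {
  return Array.from(encoder.encode(s), (b) => b.toString(16).padStart(2, '0')).join(' ');
}

const loneD800 = utf8Hex('\uD800');     // unpaired lead surrogate
const loneD801 = utf8Hex('\uD801');     // a different unpaired lead surrogate
const paired = utf8Hex('\uD83D\uDE00'); // well-formed pair (U+1F600)

console.log(loneD800); // ef bf bd (U+FFFD)
console.log(loneD801); // ef bf bd (U+FFFD)
console.log(paired);   // f0 9f 98 80
```

Two distinct unpaired surrogates become indistinguishable once converted, which is the only way the difference becomes observable.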
It would be possible to make Stylo preserve surrogates (for example by moving everything to WTF-8). However, we’re inclined not to. Preserving surrogates is a historical accident, not a feature. I argue that any occurrence of surrogates in a JS string is likely an error, and that any example where not preserving them in CSSOM makes an observable difference is extremely convoluted. For example:
http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5012
<!DOCTYPE html>
<style></style>
<script>
document.documentElement.classList.add('\uD800');
document.styleSheets[0].insertRule('.\uD800:before { content: "Surrogates can be used in class names." }', 0);
document.styleSheets[0].insertRule('.\uD801:before { content: "Surrogates seem to be mapped to U+FFFD." }', 1);
</script>

So I would like to propose changing CSSOM and other CSS specifications that declare WebIDL interfaces to use USVString instead of DOMString. This makes CSS syntax “Unicode-clean”, and enables implementations to use UTF-8 internally.
In 2014, the CSSWG discussed and rejected a proposal that was effectively the same. However, neither USVString nor Stylo existed at the time. What has changed is that WebIDL now gives us the tool to easily specify this change, and one major implementation is on a path likely to make this change.