Skip to content

Utf 16 columns #192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 1, 2017
Merged

Utf 16 columns #192

merged 3 commits into from
Sep 1, 2017

Conversation

tromey
Copy link
Contributor

@tromey tromey commented Aug 31, 2017

This series changes columns numbers to be reported in units of UTF-16 characters.


This change is Reviewable

@SimonSapin
Copy link
Member

Looks good! A couple details:


Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 3 unresolved discussions.


src/tokenizer.rs, line 371 at r1 (raw file):

        debug_assert!(byte == b'\r' || byte == b'\n' || byte == b'\x0C');
        self.position += 1;
        if byte == b'\r' && self.next_byte() == Some(/* LF */ b'\n') {

Nit: this "LF" comment was to mirror the former is_cr argument, but now that we have \r instead, \n is consistent and the comment can be removed.


src/tokenizer.rs, line 705 at r2 (raw file):

            b'\xF0'...b'\xFF' => { tokenizer.consume_4byte_intro(); }
            _ => {
                tokenizer.advance(1);

Please add a comment here and in 4 other places with similar code below to indicate that the remaining cases are ASCII or leading bytes.


src/tokenizer.rs, line 378 at r3 (raw file):

        // This takes two UTF-16 characters to represent, so we
        // actually have an undercount.
        self.current_line_start_position = self.current_line_start_position.wrapping_sub(1);

This uses wrapping_sub in case the very first byte of the input starts a 4-bytes code point, right? If so, shouldn’t every other operation with current_line_start_position use wrapping arithmetic? (Or maybe replacing += 1 in consume_continuation_byte with wrapping_add is enough, since it will bring the position back over zero with input that is known to be well-formed UTF-8?) Either way, please add a test for that.


Comments from Reviewable

@tromey
Copy link
Contributor Author

tromey commented Aug 31, 2017

This uses wrapping_sub in case the very first byte of the input starts a 4-bytes code point, right? If so, shouldn’t every other operation with current_line_start_position use wrapping arithmetic? (Or maybe replacing += 1 in consume_continuation_byte with wrapping_add is enough, since it will bring the position back over zero with input that is known to be well-formed UTF-8?) Either way, please add a test for that.

Yes, nice catch. I believe the places where it is incremented must use wrapping_add; however, there's no need to use wrapping operations in current_source_location, because that should not be able to observe an "underflow" value (due to the UTF-8 encoding observation you point out).

@SimonSapin
Copy link
Member

@bors-servo r+


Reviewed 1 of 2 files at r5, 2 of 2 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@bors-servo
Copy link
Contributor

📌 Commit f721368 has been approved by SimonSapin

bors-servo pushed a commit that referenced this pull request Sep 1, 2017
Utf 16 columns

This series changes columns numbers to be reported in units of UTF-16 characters.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/rust-cssparser/192)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

⌛ Testing commit f721368 with merge 7560c3a...

@bors-servo
Copy link
Contributor

☀️ Test successful - status-travis
Approved by: SimonSapin
Pushing 7560c3a to master...

@bors-servo bors-servo merged commit f721368 into servo:master Sep 1, 2017
@tromey tromey deleted the utf-16-columns branch September 1, 2017 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants