
Function tokenizer::preprocess 3x faster #66


Merged
5 commits merged into servo:master on Dec 18, 2014


Conversation

@ayosec (Contributor) commented on Dec 18, 2014:

I added a benchmark to measure the difference:

#[bench]  // requires the test crate, run with `cargo bench`
fn bench_preprocess(b: &mut test::Bencher) {
    // Input mixing ASCII, tabs, newlines, and multi-byte characters.
    let source = "Lorem\n\t\u{FFFD}ipusm\ndoror\u{FFFD}á\n";
    b.iter(|| {
        let _ = super::preprocess(source);
    });
}

With the old function, which used three replace() calls (commit ea6c1b7):

test tokenizer::bench_preprocess::bench_preprocess  ... bench:       440 ns/iter (+/- 24)

With the new function:

test tokenizer::bench_preprocess::bench_preprocess  ... bench:       142 ns/iter (+/- 6)

I'm new to Rust, so I expect there is still plenty of room for improvement.
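
For context, a minimal sketch of what a three-replace() preprocess could have looked like (an assumption based on the description above and the CSS Syntax spec's input preprocessing; the exact code at commit ea6c1b7 may differ):

fn preprocess(input: &str) -> String {
    // Normalize CRLF, CR, and FF to LF, per the CSS Syntax spec.
    // Each replace() rescans the input and allocates a new String,
    // so this is three passes and up to three allocations.
    input.replace("\r\n", "\n")
         .replace("\r", "\n")
         .replace("\x0C", "\n")
}

A single-pass loop over the bytes avoids the repeated scans and intermediate allocations, which is where the 3x win comes from.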

Review thread on this hunk of the new single-pass loop:

b'\0' => result.push_all("\u{FFFD}".as_bytes()),
_ if byte < 128 => result.push(byte),
_ => {
    // Multi-byte character
@SimonSapin (Member):

I think this whole block is unnecessary. The UTF-8 bytes of a multi-byte code point are all in 0x80..0xFF, so copying each byte separately is fine.
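
For illustration, a minimal sketch of the loop with that block removed, written in current Rust syntax (the PR itself targets 2014-era APIs such as push_all, so the details here are an assumption, not the exact PR code):

pub fn preprocess(input: &str) -> String {
    let mut result = Vec::with_capacity(input.len());
    let mut bytes = input.bytes().peekable();
    while let Some(byte) = bytes.next() {
        match byte {
            // CRLF and lone CR both normalize to LF.
            b'\r' => {
                if bytes.peek() == Some(&b'\n') {
                    bytes.next();
                }
                result.push(b'\n');
            }
            // Form feed normalizes to LF.
            b'\x0C' => result.push(b'\n'),
            // NULL is replaced by U+FFFD.
            b'\0' => result.extend_from_slice("\u{FFFD}".as_bytes()),
            // Everything else is copied verbatim. This includes bytes in
            // 0x80..=0xFF from multi-byte sequences: since every byte of a
            // multi-byte code point is >= 0x80, none of the arms above can
            // match inside one.
            _ => result.push(byte),
        }
    }
    // Only whole ASCII bytes were removed or replaced, so the buffer is
    // still valid UTF-8.
    String::from_utf8(result).unwrap()
}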

@ayosec (Contributor, author):

I added it because the byte \x0C is replaced by \n in a previous pattern, and I wasn't sure whether any multi-byte character could contain \x0C.

For instance, a character encoded as FA 0C 10 would be corrupted to FA 0A 10.

I'm not a Unicode expert, though.

@SimonSapin (Member):

UTF-8 is designed this way on purpose: bytes in 0x00 to 0x7F can only occur in single-byte code points.
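
A tiny self-contained check of that property (illustrative, not part of the PR): every byte of a multi-byte UTF-8 sequence, lead byte and continuation bytes alike, is in 0x80..=0xFF.

fn main() {
    // 2-, 3-, and 4-byte code points.
    for ch in "á€😀".chars() {
        let mut buf = [0u8; 4];
        let bytes = ch.encode_utf8(&mut buf).as_bytes();
        assert!(bytes.len() > 1);
        // No byte of a multi-byte sequence falls in the ASCII range,
        // so a stray \x0C can never appear inside one.
        assert!(bytes.iter().all(|&b| b >= 0x80));
        println!("U+{:04X} -> {:02X?}", ch as u32, bytes);
    }
}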

@ayosec (Contributor, author):

Done! Thanks for the review. The code looks much better now.

@SimonSapin (Member):

Looks great, thanks for the patch!

SimonSapin added a commit referencing this pull request on Dec 18, 2014:
Function tokenizer::preprocess 3x faster

@SimonSapin merged commit eb75a1b into servo:master on Dec 18, 2014.
@SimonSapin mentioned this pull request on Jun 25, 2017.