
Function tokenizer::preprocess 3x faster #66


Merged
5 commits merged into servo:master on Dec 18, 2014


Conversation

@ayosec (Contributor) commented on Dec 18, 2014:

I added a benchmark to measure the difference:

#[bench]  // requires the test crate, run with `cargo bench`
fn bench_preprocess(b: &mut test::Bencher) {
    // Input mixing ASCII, tabs, newlines, and multi-byte characters.
    let source = "Lorem\n\t\u{FFFD}ipusm\ndoror\u{FFFD}á\n";
    b.iter(|| {
        let _ = super::preprocess(source);
    });
}

With the old function, which used three replace() calls (commit ea6c1b7):

test tokenizer::bench_preprocess::bench_preprocess  ... bench:       440 ns/iter (+/- 24)

With the new function:

test tokenizer::bench_preprocess::bench_preprocess  ... bench:       142 ns/iter (+/- 6)

I'm new to Rust, so I expect there is still plenty of room for improvement.
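
For context, a minimal sketch of what a three-replace() preprocess could have looked like (an assumption based on the description above and the CSS Syntax spec's input preprocessing; the exact code at commit ea6c1b7 may differ):

fn preprocess(input: &str) -> String {
    // Normalize CRLF, CR, and FF to LF, per the CSS Syntax spec.
    // Each replace() rescans the input and allocates a new String,
    // so this is three passes and up to three allocations.
    input.replace("\r\n", "\n")
         .replace("\r", "\n")
         .replace("\x0C", "\n")
}

A single-pass loop over the bytes avoids the repeated scans and intermediate allocations, which is where the 3x win comes from.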

Review thread on this hunk of the new single-pass loop:

b'\0' => result.push_all("\u{FFFD}".as_bytes()),
_ if byte < 128 => result.push(byte),
_ => {
    // Multi-byte character
@SimonSapin (Member):

I think this whole block is unnecessary. The UTF-8 bytes of a multi-byte code point are all in 0x80..0xFF, so copying each byte separately is fine.
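
For illustration, a minimal sketch of the loop with that block removed, written in current Rust syntax (the PR itself targets 2014-era APIs such as push_all, so the details here are an assumption, not the exact PR code):

pub fn preprocess(input: &str) -> String {
    let mut result = Vec::with_capacity(input.len());
    let mut bytes = input.bytes().peekable();
    while let Some(byte) = bytes.next() {
        match byte {
            // CRLF and lone CR both normalize to LF.
            b'\r' => {
                if bytes.peek() == Some(&b'\n') {
                    bytes.next();
                }
                result.push(b'\n');
            }
            // Form feed normalizes to LF.
            b'\x0C' => result.push(b'\n'),
            // NULL is replaced by U+FFFD.
            b'\0' => result.extend_from_slice("\u{FFFD}".as_bytes()),
            // Everything else is copied verbatim. This includes bytes in
            // 0x80..=0xFF from multi-byte sequences: since every byte of a
            // multi-byte code point is >= 0x80, none of the arms above can
            // match inside one.
            _ => result.push(byte),
        }
    }
    // Only whole ASCII bytes were removed or replaced, so the buffer is
    // still valid UTF-8.
    String::from_utf8(result).unwrap()
}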

@ayosec (Contributor, author):

I added it because the byte \x0C is replaced by \n in a previous pattern, and I wasn't sure whether any multi-byte character could contain \x0C.

For instance, a character encoded as FA 0C 10 would be corrupted to FA 0A 10.

I'm not a Unicode expert, though.

@SimonSapin (Member):

UTF-8 is designed this way on purpose: bytes in 0x00 to 0x7F can only occur in single-byte code points.
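
A tiny self-contained check of that property (illustrative, not part of the PR): every byte of a multi-byte UTF-8 sequence, lead byte and continuation bytes alike, is in 0x80..=0xFF.

fn main() {
    // 2-, 3-, and 4-byte code points.
    for ch in "á€😀".chars() {
        let mut buf = [0u8; 4];
        let bytes = ch.encode_utf8(&mut buf).as_bytes();
        assert!(bytes.len() > 1);
        // No byte of a multi-byte sequence falls in the ASCII range,
        // so a stray \x0C can never appear inside one.
        assert!(bytes.iter().all(|&b| b >= 0x80));
        println!("U+{:04X} -> {:02X?}", ch as u32, bytes);
    }
}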

@ayosec (Contributor, author):

Done! Thanks for the review. The code looks much better now.

@SimonSapin (Member):

Looks great, thanks for the patch!

SimonSapin added a commit referencing this pull request on Dec 18, 2014:
Function tokenizer::preprocess 3x faster

@SimonSapin merged commit eb75a1b into servo:master on Dec 18, 2014.
@SimonSapin mentioned this pull request on Jun 25, 2017.