-
Notifications
You must be signed in to change notification settings - Fork 136
Function tokenizer::preprocess 3x faster #66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b'\0' => result.push_all("\u{FFFD}".as_bytes()), | ||
_ if byte < 128 => result.push(byte), | ||
_ => { | ||
// Multi-byte character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whole block is unnecessary. UTF-8 bytes for a multi-byte code point all in 0x80..0xFF, so copying each byte separately is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it because the byte \x0C
is replaced by \n
in a previous pattern, and I'm not sure if there is any multibyte character that contains \x0c
.
For instance, a character like FA 0C 10
can be replaced with FA 0A 10
.
I'm not a Unicode expert, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-8 is well designed this way: bytes in 0x00 to 0x7F can only happen in single-byte code points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Thanks for the review. The code looks much better now.
Looks great, thanks for the patch! |
Function tokenizer::preprocess 3x faster
I added a bench to test the difference:
With the old function, with 3
replace()
(commit ea6c1b7):With the new function:
I'm a newbie in Rust, so I guess that there are a lot of improvements to be done.