Skip to content

Drastic performance increases in the CSS parser #68

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Drastic performance increases in the CSS parser #68

wants to merge 1 commit into from

Conversation

slusarz
Copy link
Contributor

@slusarz slusarz commented Sep 23, 2013

Recently we were alerted of extremely slow performance of the CSS parser:

http://lists.horde.org/archives/imp/Week-of-Mon-20130916/055480.html

I couldn't reproduce 40 second runtimes, but there was a definite lag. Running a xdebug() profiler produced an almost 1 GB output, with certain internal CSS parsing functions being called 1,000,000 times(!). And slow functions like substr() and strtolower() were being run 400,000 times each.

Making a quick pass through the parser, the low-hanging fruit was the code dealing with comparisons on a single byte. In almost every case, consume(), peek() and comes() is absolute overkill when scanning for a single byte that is known NOT to be a multibyte token.

Just changing these calls to a much simpler currentByte() (just grabbing the string index at the current position) and consumeByte() (just incrementing the current pointer) slashed the runtime by more than 50%. Rough example: the xdebug trace file dropped from about 968 MB to 436 MB.

The parser should really be rewritten into a stream parser/tokenizer to achieve maximum speed, but this at least is a good start for very simple changes.

@sabberworm
Copy link
Collaborator

Thanks for this. It’s a change I’ve thought about making as well. Initially, I’d wanted the parser to work with charsets that don’t have the ASCII characters on a single byte (like UCS-2) but since this would need some additional work either way (and no-one, in the three years that this project has existed, has requested support for this), I’m willing to drop that plan and go with your solution.

I know this parser is no race horse and but I’ve never had any applications that needed to be this performant and I’ve always cached the output of expensive parsing operations.

@slusarz
Copy link
Contributor Author

slusarz commented Sep 24, 2013

Well, if this ever becomes an issue (i.e. UCS-2), the easiest solution will be just to convert the charset of the CSS string to UTF-8 before beginning parsing. That would eliminate any issue with ASCII parsing characters.

But as you point out, as a practical matter is anyone really using anything other than UTF-8 (or maybe iso-8859-1) for a style sheet. So optimizing for practical uses rather than theoretical makes much more sense.

$this->consumeWhiteSpace();
$this->parseList($oResult);
return $oResult;
} else if ($sIdentifier === 'namespace') {
$sPrefix = null;
$mUrl = $this->parsePrimitiveValue();
if (!$this->comes(';')) {
if (!$this->currentByte() == ';') {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!$this->currentByte() is evaluated before == ';' so this is NOT the same thing as !$this->comes(';')

@sabberworm
Copy link
Collaborator

I’ve added a few notes but even when I do change these things, the unit tests fail for files that contain utf-8 encoded characters inside string literals or comments (the offending files are unicode.css and functions.css in the tests/files/ directory. So now, somewhere a single byte gets consumed when it should be a multibyte character. It’s probably my fault as I am using the byte-based $this->iCurrentPosition directly for mb_substr and mb_strpos (instead of truncating the string first using byte-level functions and then passing them to the multibyte functions), but I don’t have the time right now to investigate further so I’ll have to put this on hold for now (unless you want to do some digging).

@slusarz
Copy link
Contributor Author

slusarz commented Sep 25, 2013

Closing. Starting over with a lesser scope in a different branch.

@slusarz slusarz closed this Sep 25, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants