diff --git a/.gitignore b/.gitignore
index db7c4cf..6d8d72b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -67,3 +67,4 @@ gumbocy.html
venv/
*.rst
gumbo-parser
+/tests/_benchmark_fixture.html
diff --git a/.travis.yml b/.travis.yml
index 4fbcdbc..dc63436 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -11,7 +11,8 @@ before_install:
- docker ps
- docker info
- docker version
-# - docker pull commonsearch/gumbocy
+ - ./scripts/git-set-file-times
+ - docker pull commonsearch/gumbocy
- make docker_build
script:
diff --git a/Dockerfile b/Dockerfile
index a6b57f4..b1b1bc5 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -44,3 +44,11 @@ RUN curl -L 'https://bitbucket.org/squeaky/portable-pypy/downloads/pypy-5.3.1-li
RUN /opt/pypy/bin/pypy -m ensurepip
RUN /opt/pypy/bin/pip install -r /requirements.txt
RUN /opt/pypy/bin/pip install -r /requirements-benchmark.txt
+
+# Install RE2
+RUN mkdir -p /tmp/re2 && \
+ curl -L 'https://github.com/google/re2/archive/636bc71728b7488c43f9441ecfc80bdb1905b3f0.tar.gz' -o /tmp/re2/re2.tar.gz && \
+ cd /tmp/re2 && tar zxvf re2.tar.gz --strip-components=1 && \
+ make && make install && \
+ rm -rf /tmp/re2 && \
+ ldconfig
diff --git a/README.md b/README.md
index 603558b..f80a1c9 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
# gumbocy
+[Build status](https://travis-ci.org/commonsearch/gumbocy) [License](LICENSE)
+
**gumbocy** is an alternative Python binding for the excellent [Gumbo](https://github.com/google/gumbo-parser) HTML5 parser, originally written for [Common Search](http://about.commonsearch.org).
It differs from the [official Python binding](https://github.com/google/gumbo-parser/tree/master/python/gumbo) in a few ways:
@@ -7,7 +9,8 @@ It differs from the [official Python binding](https://github.com/google/gumbo-pa
- It is optimized for performance by using [Cython](http://cython.org/).
- It has a smaller feature set and doesn't aim to be a general-purpose binding.
- Its `listnodes()` API just returns nodes as a flat list of tuples.
- - It is generally restrictive: attributes have to be whitelisted.
+ - Its `analyze()` API traverses the HTML tree and returns high-level data like groups of words and lists of hyperlinks.
+ - It is generally restrictive. For instance, attributes have to be whitelisted.
## Installation
@@ -58,31 +61,41 @@ make test
```
import gumbocy
-parser = gumbocy.HTMLParser("""<html><head><title>Hello</title></head><body>world!</body></html>""")
-parser.parse()
-print parser.listnodes(options={})
+parser = gumbocy.HTMLParser(options={})
+parser.parse("""<html><head><title>Hello</title></head><body>world!</body></html>""")
+print parser.listnodes()
=> [(0, "html"), (1, "head"), (2, "title"), (3, None, "Hello"), (1, "body"), (2, None, "world!")]
+
+print parser.analyze()
+
+=> {'word_groups': [('world!', 'body')], 'external_hyperlinks': [], 'internal_hyperlinks': [], 'title': 'Hello'}
+
```
-For more usage examples, see the [tests](https://github.com/commonsearch/gumbocy/blob/master/tests/test_basic.py).
+For more usage examples, see the [tests](https://github.com/commonsearch/gumbocy/blob/master/tests/).
## Options reference
- - **attributes_whitelist**: a set of attributes which, if present, will be returned in a dict as the 3rd element of a node tuple. Note that "class" is returned as a frozenset. Defaults to `set()`.
+ - **attributes_whitelist**: a set of attributes which, if present, will be returned in a dict as the 3rd element of a node tuple by `listnodes()`. Note that "class" is returned as a frozenset. Defaults to `set()`.
- **nesting_limit**: an integer to specify the maximum nesting level that will be returned. Defaults to `999`.
- **head_only**: a boolean that will make gumbocy return only the elements in the `<head>` of the document. Useful for parsing only `<meta>` tags for instance. Defaults to `False`.
- **tags_ignore**: a list of tag names that won't be returned (as well as their children).
- - **ids_ignore**: a list of IDs for which matching elements (and their children) won't be returned. "id" needs to be in `attributes_whitelist` for this to work.
- - **classes_ignore**: a list of classes for which matching elements (and their children) won't be returned. "class" needs to be in `attributes_whitelist` for this to work.
+ - **ids_ignore**: a list of IDs for which matching elements (and their children) won't be returned.
+ - **classes_ignore**: a list of classes for which matching elements (and their children) won't be returned.
## Contributing
If you are using Sublime Text, we recommend installing [Cython support](https://github.com/NotSqrt/sublime-cython).
-All contributions are welcome! Feel free to use the Issues tab or send us your Pull Requests.
+All contributions are welcome! Feel free to use the [Issues tab](https://github.com/commonsearch/gumbocy/issues) or send us your Pull Requests.
## Changelog
-### 0.1: Initial public release
\ No newline at end of file
+### 0.2
+ - New `analyze()` API, moving most of the tree traversal that was happening in `cosr-back` to Cython, which makes indexing roughly 3x faster.
+ - More tests
+
+### 0.1
+ - Initial public release
diff --git a/gumbocy.cpp b/gumbocy.cpp
index 999d227..f3ac9fa 100644
--- a/gumbocy.cpp
+++ b/gumbocy.cpp
@@ -270,12 +270,18 @@ static CYTHON_INLINE float __PYX_NAN() {
#define __PYX_HAVE__gumbocy
#define __PYX_HAVE_API__gumbocy
#include "gumbo.h"
-#include
+#include "string.h"
+#include
#include "ios"
#include "new"
#include "stdexcept"
#include "typeinfo"
+#include "re2/stringpiece.h"
+#include "re2/re2.h"
+#include
#include
+#include
+#include