Commit 158fd99

Update README for more details
1 parent 2840ab1 commit 158fd99

1 file changed

Lines changed: 11 additions & 5 deletions

README.md

@@ -1,15 +1,21 @@
 # gzipstream
 
 `gzipstream` allows Python to process multi-part gzip files from a streaming source.
-Primarily intended for use with the [warc library](http://warc.readthedocs.org/en/latest/) for processing [Common Crawl](http://commoncrawl.org/) and other web archive data.
+The library is originally intended for use with the Python [warc library](http://warc.readthedocs.org/en/latest/) for processing [Common Crawl](http://commoncrawl.org/) and other web archive data.
 
-As an example of usage, `examples / streaming_commoncrawl_from_s3.py` shows how `gzipstream` can be used with `boto` and `warc` to process a randomly selected gzip web archive (WARC) from the 2014-15 Common Crawl dataset.
-Without `gzipstream`, processing of the file would only be possible by fully downloading it.
-This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files and (b) the WARC file is hunderds of megabytes in size.
+# Installation
+
+If you are using pip, simply run the command `pip install -e git+https://github.com/commoncrawl/gzipstream.git#egg=gzipstream`.
+You can also install using `python setup.py install` if so desired.
 
 # Usage
 
-For detailed usage, see the examples folder, but minimally...
+As an example of usage, `examples/streaming_commoncrawl_from_s3.py` shows how `gzipstream` can be used to incrementally process a gzipped web archive (WARC) file.
+The file is almost a gigabyte in size, selected randomly from the 2014-15 Common Crawl dataset and hosted on Amazon S3.
+Without `gzipstream`, processing of the file would only be possible by fully downloading it.
+This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files and (b) the WARC file is hunderds of megabytes in size.
+
+For minimal usage however...
 
 ```python
 from gzipstream import GzipStreamFile
