# gzipstream

`gzipstream` allows Python to process multi-part gzip files from a streaming source.
The library is originally intended for use with the Python [warc library](http://warc.readthedocs.org/en/latest/) for processing [Common Crawl](http://commoncrawl.org/) and other web archive data.

# Installation

If you are using pip, simply run the command `pip install -e git+https://github.com/commoncrawl/gzipstream.git#egg=gzipstream`.
You can also install using `python setup.py install` if so desired.

# Usage

As an example of usage, `examples/streaming_commoncrawl_from_s3.py` shows how `gzipstream` can be used to incrementally process a gzipped web archive (WARC) file.
The file is almost a gigabyte in size, selected randomly from the 2014-15 Common Crawl dataset and hosted on Amazon S3.
Without `gzipstream`, processing of the file would only be possible by fully downloading it.
This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files and (b) the WARC file is hunderds of megabytes in size.

For minimal usage however...

```python
from gzipstream import GzipStreamFile
f = open('huge_file.gz') # Any streaming file object that supports `read`
gz = GzipStreamFile(f)
```

# License

MIT License, as per `LICENSE`