README.md: 11 additions & 5 deletions
@@ -1,15 +1,21 @@
1
1
# gzipstream
2
2
3
3
`gzipstream` allows Python to process multi-part gzip files from a streaming source.
4
-
Primarily intended for use with the [warc library](http://warc.readthedocs.org/en/latest/) for processing [Common Crawl](http://commoncrawl.org/) and other web archive data.
4
+
The library was originally intended for use with the Python [warc library](http://warc.readthedocs.org/en/latest/) for processing [Common Crawl](http://commoncrawl.org/) and other web archive data.
5
5
6
-
As an example of usage, `examples / streaming_commoncrawl_from_s3.py` shows how `gzipstream` can be used with `boto` and `warc` to process a randomly selected gzip web archive (WARC) from the 2014-15 Common Crawl dataset.
7
-
Without `gzipstream`, processing of the file would only be possible by fully downloading it.
8
-
This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files and (b) the WARC file is hundreds of megabytes in size.
6
+
# Installation
7
+
8
+
If you are using pip, simply run the command `pip install -e git+https://github.com/commoncrawl/gzipstream.git#egg=gzipstream`.
9
+
You can also install using `python setup.py install` if so desired.
9
10
10
11
# Usage
11
12
12
-
For detailed usage, see the examples folder, but minimally...
13
+
As an example of usage, `examples/streaming_commoncrawl_from_s3.py` shows how `gzipstream` can be used to incrementally process a gzipped web archive (WARC) file.
14
+
The file is almost a gigabyte in size, selected randomly from the 2014-15 Common Crawl dataset and hosted on Amazon S3.
15
+
Without `gzipstream`, processing of the file would only be possible by fully downloading it.
16
+
This is highly inefficient as (a) a gzipped WARC file is composed of multiple independent gzip files and (b) the WARC file is hundreds of megabytes in size.
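To illustrate why a multi-part gzip file can be processed incrementally, here is a minimal sketch that uses only the Python standard library. It is not `gzipstream`'s implementation or API. The function name `iter_multimember_gzip` and the in-memory sample data are assumptions made for illustration. The sketch reads a stream in small chunks and starts a fresh decompressor each time one gzip member ends, so it never needs the whole file at once.

```python
import gzip
import io
import zlib


def iter_multimember_gzip(stream, chunk_size=64 * 1024):
    """Yield decompressed bytes from a stream of concatenated gzip members.

    Hypothetical helper for illustration; not part of gzipstream.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decoder.decompress(chunk)
            if data:
                yield data
            if decoder.eof:
                # The current member ended inside this chunk. Any leftover
                # bytes belong to the next member, so feed them to a new
                # decompressor.
                chunk = decoder.unused_data
                decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                # The whole chunk was consumed; read more from the stream.
                chunk = b""


# Two independent gzip members concatenated, as in a gzipped WARC file.
# The sample records are made up for this sketch.
members = [gzip.compress(b"record one\n"), gzip.compress(b"record two\n")]
stream = io.BytesIO(b"".join(members))

# A tiny chunk size forces member boundaries to fall mid-chunk.
result = b"".join(iter_multimember_gzip(stream, chunk_size=7))
print(result)
```

Any file-like object with a `read(size)` method could stand in for the `io.BytesIO` stream here, for example a network response that is read as it downloads.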