You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+1
Original file line number
Diff line number
Diff line change
@@ -207,6 +207,7 @@ Some differences between the warcio and FastWARC APIs are hidden from the user i
207
207
208
208
However, it's recommended that you carefully verify that your custom job implementation works in combination with FastWARC. There are subtle differences between the warcio and FastWARC APIs, including the underlying classes (WARC/HTTP headers and stream implementations). In addition, FastWARC does not support for legacy ARC files and does not automatically decode HTTP content and transfer encodings (see [Resiliparse HTTP Tools](https://resiliparse.chatnoir.eu/en/latest/man/parse/http.html#read-chunked-http-payloads)). While content and transfer encodings are already decoded in Common Crawl WARC files, this may not be the case for WARC files from other sources. See also [WARC 1.1 specification, http/https response records](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#http-and-https-schemes).
209
209
210
+
210
211
## Credits
211
212
212
213
Examples are originally ported from Stephen Merity's [cc-mrjob](https://github.com/commoncrawl/cc-mrjob/) with the following changes and upgrades:
Copy file name to clipboardExpand all lines: requirements.txt
+5
Original file line number
Diff line number
Diff line change
@@ -18,3 +18,8 @@ lxml
18
18
#fastwarc
19
19
# (tested with)
20
20
#fastwarc==0.14.1
21
+
22
+
# to parse HTML (used in cc_index_word_count.py) using Resiliparse (https://pypi.org/project/Resiliparse/). Resiliparse requires compatible fastwarc version.
encoding (str, optional): Specific character encoding to use. If None, auto-detection is attempted
28
+
**kwargs: Additional arguments passed to extract_plain_text:
29
+
Refer here https://resiliparse.chatnoir.eu/en/latest/api/extract/html2text.html#resiliparse.extract.html2text.extract_plain_text for accepted arguments.
0 commit comments