
Commit dc23ca9

docs(README): update supported Python/Spark versions
- add Python 3.12 and 3.13, and Spark 3.5.5 as supported versions
- improve the description of how to run cc-pyspark jobs and how to pass a list of input files
1 parent: 7f50c49

File tree

1 file changed: +3 -1 lines changed

README.md

Lines changed: 3 additions & 1 deletion
@@ -39,7 +39,7 @@ pip install -r requirements.txt
 
 ## Compatibility and Requirements
 
-Tested with with Spark 3.2.3, 3.3.2 and 3.4.1 in combination with Python 3.8, 3.9 and 3.10. See the branch [python-2.7](/commoncrawl/cc-pyspark/tree/python-2.7) if you want to run the job on Python 2.7 and older Spark versions.
+Tested with Spark 3.2.3, 3.3.2, 3.4.1 and 3.5.5 in combination with Python 3.8, 3.9, 3.10, 3.12 and 3.13. See the branch [python-2.7](/commoncrawl/cc-pyspark/tree/python-2.7) if you want to run the job on Python 2.7 and older Spark versions.
 
 
 ## Get Sample Data
@@ -58,6 +58,8 @@ Note that the sample data is from an older crawl (`CC-MAIN-2017-13` run in March
 
 ## Process Common Crawl Data on Spark
 
+CC-PySpark reads the list of input files from a manifest file. Typically, these are Common Crawl WARC, WAT or WET files, but it could be any other type of file, as long as it is supported by the class implementing [CCSparkJob](./sparkcc.py). The files can be given as absolute URLs or as paths relative to a base URL (option `--input_base_url`). The URL can point to a local file (`file://`) or to a remote location (typically below `s3://commoncrawl/` or `https://data.commoncrawl.org/`). For development and testing, you'd start with local files.
+
 ### Running locally
 
 First, point the environment variable `SPARK_HOME` to your Spark installation.
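To make the added paragraph about input manifests concrete, here is a rough sketch of a local run. It uses the `server_count.py` example job from this repository together with hypothetical manifest entries and output table names; adjust the paths to whatever WARC files you actually have.

```sh
# The manifest is a plain-text file listing one input file per line.
# These entries are placeholders; point them at local WARC files for testing.
cat > input.txt <<'EOF'
file:///tmp/sample1.warc.gz
file:///tmp/sample2.warc.gz
EOF

# Run an example job on the manifest; the two positional arguments are the
# manifest file and the name of the output table.
$SPARK_HOME/bin/spark-submit ./server_count.py input.txt servernames

# Alternatively, keep relative paths in the manifest and resolve them
# against a base URL via --input_base_url (option named in the README change).
$SPARK_HOME/bin/spark-submit ./server_count.py \
    --input_base_url https://data.commoncrawl.org/ \
    input.txt servernames
```

Starting with `file://` inputs keeps the feedback loop fast; switching the manifest or base URL to the remote Common Crawl locations is then only a configuration change.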
