feat: Adding project boilerplate with classify-warc CLI and notebook #20

Merged
malteos merged 3 commits into main from feat/classify-warc
May 27, 2026

@malteos malteos commented May 26, 2026

Collaborator

This PR adds the project boilerplate and CLI, including the classify-warc command, which runs fastText classifiers (such as GneissWeb) on WARC files from S3. It also comes with a notebook that compares the classifier scores from two different crawls (main vs. focus crawl).
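
As a rough illustration of the scoring and comparison steps described above, here is a minimal sketch. It is not the code from this PR: `score_documents`, `mean_score`, `toy_predict`, and the `hq`/`lq` labels are hypothetical names chosen for the example, and reading WARC records from S3 is left out entirely. The classifier is passed in as a callable so the sketch runs without a model file; with the fastText Python bindings, one would presumably wrap `model.predict(line)` to return the first label and its probability.

```python
from statistics import mean
from typing import Callable, Iterable

# Hypothetical signature: takes one line of text, returns (label, probability).
# Assumption: a fastText model could be adapted with something like
#   labels, probs = model.predict(line); return labels[0], float(probs[0])
Predict = Callable[[str], tuple[str, float]]


def score_documents(docs: Iterable[tuple[str, str]], predict: Predict) -> list[dict]:
    """Score (url, text) pairs with a fastText-style classifier."""
    rows = []
    for url, text in docs:
        # fastText's predict() rejects newlines, so collapse all whitespace first.
        line = " ".join(text.split())
        if not line:
            continue  # skip records with no extractable text
        label, prob = predict(line)
        rows.append({"url": url, "label": label.removeprefix("__label__"), "score": prob})
    return rows


def mean_score(rows: list[dict], label: str) -> float:
    """Mean score over rows carrying the given label; 0.0 if there are none."""
    scores = [r["score"] for r in rows if r["label"] == label]
    return mean(scores) if scores else 0.0


def toy_predict(line: str) -> tuple[str, float]:
    """Stand-in for a real model, purely for illustration."""
    return ("__label__hq", 0.9) if "research" in line else ("__label__lq", 0.6)


# Made-up documents standing in for text extracted from two crawls.
main_crawl = [("https://example.com/a", "cat\npictures"), ("https://example.com/b", "research notes")]
focus_crawl = [("https://example.org/a", "research\npaper"), ("https://example.org/b", "   ")]

main_rows = score_documents(main_crawl, toy_predict)
focus_rows = score_documents(focus_crawl, toy_predict)
```

Calling `mean_score(main_rows, "hq")` and `mean_score(focus_rows, "hq")` then gives one number per crawl to compare. Injecting the classifier as a plain callable keeps the scoring logic testable without downloading a model.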

Comment thread data/seeds/out_of_scope.csv Outdated
@malteos malteos merged commit 4a4deb2 into main May 27, 2026
1 check passed
@wumpus wumpus deleted the feat/classify-warc branch June 4, 2026 19:32
2 participants