feat: Adding tokenize CLI command #37

Merged
malteos merged 2 commits into main from feat/tokenize-cli on Jun 3, 2026

@malteos malteos commented Jun 3, 2026

Collaborator

The `tokenize` CLI command uses a Hugging Face tokenizer to convert text (from the `classify` WARC text cache) into token IDs and a token count.

This is used to estimate the total token count of the focus crawl.
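For readers unfamiliar with the idea, here is a minimal sketch of what such a step could look like. It is not the code from this PR: the function names (`load_hf_encoder`, `tokenize_texts`, `estimate_total_tokens`) are hypothetical, and the extrapolation from a sample of documents to the whole crawl is an assumption about how the estimate might be made. The only library call used is `AutoTokenizer.from_pretrained(...).encode(...)` from `transformers`.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder maps one text to its list of token IDs.
Encoder = Callable[[str], List[int]]


def load_hf_encoder(model_name: str) -> Encoder:
    """Build an encoder from a Hugging Face tokenizer (hypothetical helper).

    Tokenizer files are downloaded on first use.
    """
    # Imported lazily so the other functions work without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Special tokens are skipped so the count reflects the text only.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)


def tokenize_texts(
    texts: Iterable[str], encode: Encoder
) -> Tuple[List[List[int]], int]:
    """Return the token IDs of each text and the total token count."""
    token_ids = [encode(text) for text in texts]
    total = sum(len(ids) for ids in token_ids)
    return token_ids, total


def estimate_total_tokens(
    sample_tokens: int, sample_docs: int, total_docs: int
) -> float:
    """Extrapolate a sample's token count to the whole crawl (assumed approach).

    Uses the mean tokens per sampled document times the total document count.
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_tokens / sample_docs * total_docs
```

A possible usage, with `"gpt2"` as an arbitrary example model name: `encode = load_hf_encoder("gpt2")`, then `ids, total = tokenize_texts(cached_texts, encode)`. Passing the encoder as a plain callable keeps the counting logic testable without downloading a tokenizer.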

malteos added 2 commits June 1, 2026 21:18
…ils/io/)

The previous commit edited cli.py to import TokenizeCommand and updated
pyproject.toml / summary.py, but the new files themselves were never
staged — so the published wheel failed to import with
`ModuleNotFoundError: No module named 'ccoa.commands.tokenize'`.
@malteos malteos merged commit 59b677a into main Jun 3, 2026
1 check passed
@wumpus wumpus deleted the feat/tokenize-cli branch June 15, 2026 03:23