Discovery of sites with Hungarian content#61

Merged: wumpus merged 3 commits into main from site-discovery-hungarian on Jun 21, 2026
sebastian-nagel (Contributor) commented May 6, 2026

See the comments in site-discovery-hungarian.sql.

To run the query over multiple crawls:

python src/util/athena_query_multiple_crawls.py \
   --database ccindex \
   --s3_staging_dir s3://mybucket/site-discovery-by-language/hungarian \
   src/sql/examples/cc-index/site-discovery-hungarian.sql CC-MAIN-2018-34 ...
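
For readers who want a feel for what "one query, many crawls" means, here is a minimal sketch of the general idea: render the same SQL once per crawl label. It is not the code of `athena_query_multiple_crawls.py`. The `$crawl` placeholder, the simplified query text, the `render_queries` helper, and the second crawl label are all assumptions made for illustration.

```python
from string import Template

# Hypothetical, simplified stand-in for the real query file. The '$crawl'
# placeholder is an assumption, not necessarily what the utility substitutes.
QUERY_TEMPLATE = Template(
    "SELECT url_host_name, COUNT(*) AS n_pages\n"
    "FROM ccindex\n"
    "WHERE crawl = '$crawl'\n"
    "  AND subset = 'warc'\n"
    "  AND content_languages LIKE 'hun%'\n"
    "GROUP BY url_host_name"
)


def render_queries(template, crawls):
    """Return a dict mapping each crawl label to its rendered SQL text."""
    return {crawl: template.substitute(crawl=crawl) for crawl in crawls}


# Example crawl labels; the second one is only here to show repetition.
queries = render_queries(QUERY_TEMPLATE, ["CC-MAIN-2018-34", "CC-MAIN-2018-39"])
for crawl, sql in queries.items():
    print(f"-- {crawl}\n{sql}\n")
```

Each rendered string would then be submitted as a separate query, with results landing under the staging directory.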

wumpus requested a review from malteos May 7, 2026 04:48

malteos left a comment

LGTM!

Just out of curiosity:

What's your experience with content_languages like 'hun%' vs like '%hun%'? Does the latter introduce a lot of noise?

wumpus (Member) commented May 8, 2026

Seb was trying to match the primary language, so the LIKE is anchored on the left side. A three-language page has a content_languages value like 'hun,fra,eng'. '%hun%' would match Hungarian as the primary, secondary, or tertiary language. Obviously there's more noise on the tertiary side.
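
To make the difference concrete, here is a small sketch that emulates the two patterns on a handful of made-up `content_languages` values. The sample strings and both helper functions are illustrative assumptions, not data from the crawl index.

```python
def primary_language_matches(content_languages: str, code: str) -> bool:
    """Emulate `content_languages LIKE '<code>%'` (left-anchored match)."""
    return content_languages.startswith(code)


def any_language_matches(content_languages: str, code: str) -> bool:
    """Emulate `content_languages LIKE '%<code>%'` (match at any position)."""
    return code in content_languages


# Made-up values in the comma-separated form described above.
samples = ["hun", "hun,eng", "eng,hun", "deu,eng,hun", "eng"]

primary = [s for s in samples if primary_language_matches(s, "hun")]
anywhere = [s for s in samples if any_language_matches(s, "hun")]

print(primary)   # ['hun', 'hun,eng']
print(anywhere)  # ['hun', 'hun,eng', 'eng,hun', 'deu,eng,hun']
```

The anchored pattern keeps only pages where Hungarian comes first, while the unanchored one also pulls in pages where it is detected as a secondary or tertiary language.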

- Include sites below the .hu top-level domain
Add utility to run a query over multiple crawls.
wumpus merged commit da58252 into main Jun 21, 2026
7 checks passed
wumpus deleted the site-discovery-hungarian branch June 21, 2026 03:25
3 participants