Discovery of sites with Hungarian content #61

Merged

wumpus merged 3 commits into main from site-discovery-hungarian

Jun 21, 2026

README.md

@@ ... @@
        More details to run the converter are given below.
     Note that the Dockerfile defines the conversion tool as entry point.
-    Overriding the entrypoint woulld allow to inspect the container using an interactive shell:
+    Overriding the entrypoint would allow to inspect the container using an interactive shell:
     ```
     $> docker run --rm --entrypoint=/bin/bash -ti cc-index-table
@@ ... @@
     - count pairs of top-level domain and content language: [count-language-tld.sql](src/sql/examples/cc-index/count-language-tld.sql)
     - find correlations between TLD and content language using the log-likelihood ratio: [loglikelihood-language-tld.sql](src/sql/examples/cc-index/loglikelihood-language-tld.sql)
     - ... and similar for correlations between content language and character encoding: [correlation-language-charset.sql](src/sql/examples/cc-index/correlation-language-charset.sql)
-    - discover sites hosting content of specific language(s): [site-discovery-by-language.sql](src/sql/examples/cc-index/site-discovery-by-language.sql)
-    - discover non-English sites: [discovery-of-non-english-sites](src/sql/examples/cc-index/discovery-of-non-english-sites.sql)
-    - find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
+    - site discovery by content language:
+      - specific language(s): [site-discovery-by-language.sql](src/sql/examples/cc-index/site-discovery-by-language.sql)
+      - non-English sites: [discovery-of-non-english-sites](src/sql/examples/cc-index/discovery-of-non-english-sites.sql)
+      - Hungarian sites: [site-discovery-hungarian.sql](src/sql/examples/cc-index/site-discovery-hungarian.sql)
+      - find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
     - extract robots.txt records for a list of sites: [get-records-robotstxt.sql](src/sql/examples/cc-index/get-records-robotstxt.sql)
     Athena creates results in CSV format. E.g., for the last example, the mining of multi-lingual domains we get:

pyproject.toml

@@ -8,4 +8,5 @@ dependencies = [
         "pyarrow>=22.0.0",
         "pytest>=8.4.2",
         "tqdm>=4.67.1",
+        "pyathena>=3.31.0",
     ]

src/sql/examples/cc-index/site-discovery-hungarian.sql

@@ -0,0 +1,74 @@
+    -- Discover sites (hosts or domains) hosting Hungarian content.
+    --
+    -- See also
+    --    get-records-for-language.sql
+    -- for restrictions and limitations regarding the
+    -- automatic detection of the content language(s).
+    --
+    -- For similar queries for discovery of language communities, see
+    --    site-discovery-by-language.sql
+    --    loglikelihood-language-tld.sql
+    --    discovery-of-non-english-sites.sql
+    --
+    -- A "site" is defined by the host name part of a URL.
+    --
+    -- We group by site/host name, but also by top-level domain
+    -- and registered domain, which are both determined by the
+    -- host name.
+    --
+    -- For every group (site) we extract
+    --
+    -- - the total number of pages using Hungarian as primary language
+    --
+    -- - the number of pages in other languages or where the language could not be identified
+    --
+    -- - the total number of pages for this host and domain name
+    --
+    -- - the coverage (percentage) of content in Hungarian by this host
+    --
+    -- - a histogram of the detected primary content languages on the host
+    --   (sorted by decreasing frequency, taking only the top 10 most frequent languages)
+    --
+    -- To reduce the noise we limit the result to include only sites which
+    -- host at least 50 pages and 10% of the content in Hungarian.
+    --
+    with tmp as (
+    select count(*) as pages_host_total,
+           sum(count(*)) over(partition by url_host_registered_domain) as pages_domain_total,
+           -- count pages where the primary content language is Hungarian
+           sum(case when content_languages like 'hun%' then 1 else 0 end) as pages_hungarian,
+           sum(case when content_languages like 'hun%' then 0 else case when content_languages is null then 0 else 1 end end) as pages_language_other,
+           sum(case when content_languages is null then 1 else 0 end) as pages_language_unknown,
+           count(distinct regexp_extract(content_languages, '^([a-z]{3})')) as num_distinct_primary_languages,
+           url_host_tld,
+           url_host_registered_domain,
+           url_host_name,
+           cast(
+             slice(
+               array_sort(
+                 cast(map_entries(histogram(regexp_extract(content_languages, '^([a-z]{3})')))
+                      as array(row(lang varchar, freq bigint))),
+                 (a, b) -> if(a.freq < b.freq, 1, if(a.freq = b.freq, 0, -1))),
+               1, 10)
+             as JSON) as top_10_primary_languages
+    from ccindex.ccindex
+    where crawl = 'CC-MAIN-2026-17'
+      and subset = 'warc'
+    group by url_host_tld,
+             url_host_registered_domain,
+             url_host_name)
+    select pages_domain_total,
+           pages_host_total,
+           pages_hungarian,
+           pages_language_other,
+           pages_language_unknown,
+           cast(round(100.0*pages_hungarian/pages_host_total, 0) as int) as pages_hungarian_pct,
+           num_distinct_primary_languages,
+           url_host_tld,
+           url_host_registered_domain,
+           url_host_name,
+           top_10_primary_languages
+    from tmp
+    where pages_hungarian >= 50
+      and (1.0*pages_hungarian/pages_host_total) >= .1
+    order by pages_hungarian desc;
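
The per-host aggregation and the noise filter described in the query comments can be checked on a small sample outside Athena. The following is a minimal Python sketch, not part of this PR: the function name `summarize_host` and its parameters are hypothetical, and it assumes the input is the list of `content_languages` values (or `None`) of all pages of a single host.

```python
import re
from collections import Counter

# Primary language = leading three-letter code of content_languages,
# mirroring regexp_extract(content_languages, '^([a-z]{3})') in the query.
PRIMARY_LANG = re.compile(r'^([a-z]{3})')


def summarize_host(content_languages, min_pages=50, min_share=0.1):
    """Summarize one host (hypothetical helper, for illustration only)."""
    total = len(content_languages)
    hungarian = sum(1 for v in content_languages
                    if v is not None and v.startswith('hun'))
    unknown = sum(1 for v in content_languages if v is None)
    other = total - hungarian - unknown
    primary = Counter()
    for value in content_languages:
        match = PRIMARY_LANG.match(value) if value is not None else None
        if match:
            primary[match.group(1)] += 1
    return {
        'pages_host_total': total,
        'pages_hungarian': hungarian,
        'pages_language_other': other,
        'pages_language_unknown': unknown,
        # Note: Python rounds exact halves to even, SQL round() does not,
        # so the percentage may differ by one for exact .5 values.
        'pages_hungarian_pct': round(100.0 * hungarian / total) if total else 0,
        'num_distinct_primary_languages': len(primary),
        'top_10_primary_languages': primary.most_common(10),
        # Same thresholds as the final WHERE clause of the query.
        'keep': total > 0 and hungarian >= min_pages
                and hungarian / total >= min_share,
    }


if __name__ == '__main__':
    sample = ['hun'] * 60 + ['hun,eng'] * 15 + ['eng'] * 20 + [None] * 5
    print(summarize_host(sample))
```

The defaults `min_pages=50` and `min_share=0.1` correspond to the thresholds in the query; a host with many pages but only a handful in Hungarian is dropped by the second condition.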

src/util/athena_query_multiple_crawls.py

@@ -0,0 +1,82 @@
+    """Execute an Athena query over multiple crawls."""
+    import argparse
+    import logging
+    import os
+    import re
+    import sys
+    from collections import Counter, defaultdict
+    from urllib.parse import urljoin, urlparse
+    import pandas as pd
+    from pyathena import connect
+    from pyathena.util import RetryConfig
+    logging.basicConfig(level='INFO',
+                        format='%(asctime)s %(levelname)s %(name)s: %(message)s')
+    # ISO week numbers range up to 53; weeks 50 and 53 must match as well
+    rx_crawl_id = re.compile(r'CC-MAIN-20[123][0-9]-(?:[0-4][0-9]|5[0-3])')
+    def query_execute(crawl, cursor, args):
+        with open(args.query_template, encoding='UTF-8') as template_file:
+            query_template = template_file.read()
+        if '{crawl}' in query_template:
+            query = query_template.format(crawl=crawl)
+        else:
+            logging.info("The query template does not contain the placeholder '{crawl}'")
+            logging.info("Trying to match crawl identifiers...")
+            query = rx_crawl_id.sub(crawl, query_template)
+        logging.info("Athena query:\n%s", query)
+        cursor.execute(query)
+        logging.info("Query execution finished: %s", cursor.result_set.state)
+        logging.info("Athena query ID %s: %s",
+                     cursor.query_id,
+                     cursor.result_set.state)
+        logging.info("       data_scanned_in_bytes: %d",
+                     cursor.result_set.data_scanned_in_bytes)
+        logging.info("       total_execution_time_in_millis: %d",
+                     cursor.result_set.total_execution_time_in_millis)
+    def parse_arguments(args):
+        parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+        parser.add_argument('--database',
+                            help='Name of the database', default='ccindex')
+        parser.add_argument('--output_base_name',
+                            help='Base name of output files, without the suffix .csv.gz', default='results')
+        parser.add_argument('--s3_staging_dir', required=True,
+                            help='Staging directory on S3 used for query results and metadata. '
+                            'A subdirectory for each crawl dataset is created.')
+        parser.add_argument('query_template',
+                            help='Query template. '
+                            'The query template should contain a placeholder for the crawls to be processed: {crawl}. '
+                            'Alternatively, any crawl identifiers (CC-MAIN-YYYY-WW) are replaced by the actual crawl ID.')
+        parser.add_argument('crawl_data_set', nargs='+',
+                            help='Common Crawl crawl dataset(s) to process, e.g., CC-MAIN-2022-33')
+        args = parser.parse_args(args)
+        # remove trailing slash, a slash is added again before the crawl subdirectory
+        args.s3_staging_dir = args.s3_staging_dir.rstrip('/')
+        return args
+    def main(args):
+        args = parse_arguments(args)
+        retry_config = RetryConfig(attempt=3)
+        for crawl in args.crawl_data_set:
+            cursor = connect(s3_staging_dir=args.s3_staging_dir + '/' + crawl.lower(),
+                             retry_config=retry_config,
+                             region_name="us-east-1").cursor()
+            query_execute(crawl, cursor, args)
+    if __name__ == '__main__':
+        main(sys.argv[1:])
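
The two substitution modes of `query_execute` (explicit `{crawl}` placeholder, or replacing any literal crawl identifier) can be tried without an Athena connection. The sketch below is an illustration, not the code merged in this PR: `render_query` and `RX_CRAWL_ID` are hypothetical names, the pattern is assumed to accept weeks 00 to 53, and the placeholder is filled with `str.replace` instead of `str.format`.

```python
import re

# Assumption: crawl identifiers have the form CC-MAIN-YYYY-WW, weeks 00-53.
RX_CRAWL_ID = re.compile(r'CC-MAIN-20[0-9]{2}-(?:[0-4][0-9]|5[0-3])')


def render_query(template: str, crawl: str) -> str:
    """Fill a query template for one crawl (hypothetical helper)."""
    if '{crawl}' in template:
        # Plain replacement leaves other braces, such as the regex
        # quantifier in '^([a-z]{3})', untouched.
        return template.replace('{crawl}', crawl)
    # No placeholder: swap every literal crawl ID for the requested one.
    return RX_CRAWL_ID.sub(crawl, template)


if __name__ == '__main__':
    with_placeholder = ("select regexp_extract(content_languages, '^([a-z]{3})') "
                        "from ccindex.ccindex where crawl = '{crawl}'")
    with_literal_id = "select count(*) from ccindex.ccindex where crawl = 'CC-MAIN-2023-50'"
    for template in (with_placeholder, with_literal_id):
        print(render_query(template, 'CC-MAIN-2026-17'))
```

The choice of `str.replace` matters for queries like site-discovery-hungarian.sql: `str.format` interprets `{3}` in the regular expression as a positional field and raises an `IndexError` once a `{crawl}` placeholder is added to such a template.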
