Merged
10 changes: 6 additions & 4 deletions README.md
@@ -42,7 +42,7 @@ A [Dockerfile](./Dockerfile) is provided to compile the project and run the Spar
More details to run the converter are given below.

Note that the Dockerfile defines the conversion tool as entry point.
Overriding the entrypoint woulld allow to inspect the container using an interactive shell:
Overriding the entrypoint would allow to inspect the container using an interactive shell:

```
$> docker run --rm --entrypoint=/bin/bash -ti cc-index-table
@@ -172,9 +172,11 @@ A couple of sample queries are also provided (for the flat schema):
- count pairs of top-level domain and content language: [count-language-tld.sql](src/sql/examples/cc-index/count-language-tld.sql)
- find correlations between TLD and content language using the log-likelihood ratio: [loglikelihood-language-tld.sql](src/sql/examples/cc-index/loglikelihood-language-tld.sql)
- ... and similar for correlations between content language and character encoding: [correlation-language-charset.sql](src/sql/examples/cc-index/correlation-language-charset.sql)
- discover sites hosting content of specific language(s): [site-discovery-by-language.sql](src/sql/examples/cc-index/site-discovery-by-language.sql)
- discover non-English sites: [discovery-of-non-english-sites](src/sql/examples/cc-index/discovery-of-non-english-sites.sql)
- find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
- site discovery by content language:
  - specific language(s): [site-discovery-by-language.sql](src/sql/examples/cc-index/site-discovery-by-language.sql)
  - non-English sites: [discovery-of-non-english-sites.sql](src/sql/examples/cc-index/discovery-of-non-english-sites.sql)
  - Hungarian sites: [site-discovery-hungarian.sql](src/sql/examples/cc-index/site-discovery-hungarian.sql)
- find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
- extract robots.txt records for a list of sites: [get-records-robotstxt.sql](src/sql/examples/cc-index/get-records-robotstxt.sql)

Athena creates results in CSV format. E.g., for the mining of multi-lingual domains we get:
1 change: 1 addition & 0 deletions pyproject.toml
@@ -8,4 +8,5 @@ dependencies = [
"pyarrow>=22.0.0",
"pytest>=8.4.2",
"tqdm>=4.67.1",
"pyathena>=3.31.0",
]
74 changes: 74 additions & 0 deletions src/sql/examples/cc-index/site-discovery-hungarian.sql
@@ -0,0 +1,74 @@
-- Discover sites (hosts or domains) hosting Hungarian content.
--
-- See also
-- get-records-for-language.sql
-- for restrictions and limitations regarding the
-- automatic detection of the content language(s).
--
-- For similar queries for discovery of language communities, see
-- site-discovery-by-language.sql
-- loglikelihood-language-tld.sql
-- discovery-of-non-english-sites.sql
--
-- A "site" is defined by the host name part of a URL.
--
-- We group by site/host name, and also by top-level domain and
-- registered domain, which are both determined by the host name.
--
-- For every group (site) we extract
--
-- - the total number of pages using Hungarian as primary language
--
-- - the number of pages in other languages or where the language could not be identified
--
-- - the total number of pages for this host and domain name
--
-- - the coverage (percentage) of content in Hungarian by this host
--
-- - a histogram of the detected primary content languages on the host
-- (sorted by decreasing frequency, taking only the top 10 most frequent languages)
--
-- To reduce the noise we limit the result to sites which host
-- at least 50 pages in Hungarian and where Hungarian pages make up
-- at least 10% of all pages of the host.
--
with tmp as (
select count(*) as pages_host_total,
       sum(count(*)) over(partition by url_host_registered_domain) as pages_domain_total,
       -- count pages where the primary content language is Hungarian
       sum(case when content_languages like 'hun%' then 1 else 0 end) as pages_hungarian,
       sum(case when content_languages like 'hun%' then 0 else case when content_languages is null then 0 else 1 end end) as pages_language_other,
       sum(case when content_languages is null then 1 else 0 end) as pages_language_unknown,
       count(distinct regexp_extract(content_languages, '^([a-z]{3})')) as num_distinct_primary_languages,
       url_host_tld,
       url_host_registered_domain,
       url_host_name,
       cast(
         slice(
           array_sort(
             cast(map_entries(histogram(regexp_extract(content_languages, '^([a-z]{3})')))
                  as array(row(lang varchar, freq bigint))),
             (a, b) -> if(a.freq < b.freq, 1, if(a.freq = b.freq, 0, -1))),
           1, 10)
         as JSON) as top_10_primary_languages
from ccindex.ccindex
where crawl = 'CC-MAIN-2026-17'
  and subset = 'warc'
group by url_host_tld,
         url_host_registered_domain,
         url_host_name)
select pages_domain_total,
       pages_host_total,
       pages_hungarian,
       pages_language_other,
       pages_language_unknown,
       cast(round(100.0*pages_hungarian/pages_host_total, 0) as int) as pages_hungarian_pct,
       num_distinct_primary_languages,
       url_host_tld,
       url_host_registered_domain,
       url_host_name,
       top_10_primary_languages
from tmp
where pages_hungarian >= 50
  and (1.0*pages_hungarian/pages_host_total) >= .1
order by pages_hungarian desc;
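
The per-host aggregation and the noise filter of this query can be illustrated in plain Python. The following is only a sketch under stated assumptions, not part of the change set: the function names `primary_language`, `summarize_host` and `keep_host` are hypothetical, the input is a list holding the `content_languages` value of every page of one host, and the percentage is rounded half up, which may differ from Athena's `round` in edge cases.

```python
import math
import re
from collections import Counter

# Same pattern as in the SQL: the first three-letter code of the
# comma-separated content_languages value is the primary language.
RX_PRIMARY = re.compile(r'^([a-z]{3})')


def primary_language(content_languages):
    """Return the primary language code, or None if none was detected."""
    if content_languages is None:
        return None
    match = RX_PRIMARY.match(content_languages)
    return match.group(1) if match else None


def summarize_host(content_languages_per_page):
    """Compute per-host counts for a non-empty list of content_languages values."""
    primary = [primary_language(v) for v in content_languages_per_page]
    total = len(primary)
    hungarian = sum(1 for p in primary if p == 'hun')
    unknown = sum(1 for v in content_languages_per_page if v is None)
    histogram = Counter(p for p in primary if p is not None)
    return {
        'pages_host_total': total,
        'pages_hungarian': hungarian,
        'pages_language_other': total - hungarian - unknown,
        'pages_language_unknown': unknown,
        # rounded half up (an assumption of this sketch)
        'pages_hungarian_pct': math.floor(100.0 * hungarian / total + 0.5),
        # most frequent primary languages first, at most 10 entries
        'top_10_primary_languages': histogram.most_common(10),
    }


def keep_host(summary, min_pages=50, min_share=0.1):
    """Apply the noise filter: enough Hungarian pages and a sufficient share."""
    share = summary['pages_hungarian'] / summary['pages_host_total']
    return summary['pages_hungarian'] >= min_pages and share >= min_share
```

A host with many Hungarian pages but a share below 10% is dropped, and so is a host that is mostly Hungarian but has fewer than 50 Hungarian pages.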
82 changes: 82 additions & 0 deletions src/util/athena_query_multiple_crawls.py
@@ -0,0 +1,82 @@
"""Execute an Athena query over multiple crawls."""

import argparse
import logging
import re
import sys

from pyathena import connect
from pyathena.util import RetryConfig


logging.basicConfig(level='INFO',
                    format='%(asctime)s %(levelname)s %(name)s: %(message)s')


# Matches crawl identifiers of the form CC-MAIN-YYYY-WW (weeks 00-53).
rx_crawl_id = re.compile(r'CC-MAIN-20[123][0-9]-(?:[0-4][0-9]|5[0-3])')


def query_execute(crawl, cursor, args):
    """Fill the crawl ID into the query template and run the query on Athena."""
    with open(args.query_template, encoding='UTF-8') as template_file:
        query_template = template_file.read()
    if '{crawl}' in query_template:
        # Plain replacement instead of str.format(): the SQL may contain
        # other braces, e.g. the regex quantifier in '^([a-z]{3})'.
        query = query_template.replace('{crawl}', crawl)
    else:
        logging.info("The query template does not contain the placeholder '{crawl}'")
        logging.info("Trying to match crawl identifiers...")
        query = rx_crawl_id.sub(crawl, query_template)

    logging.info("Athena query for %s:\n%s", crawl, query)

    cursor.execute(query)

    logging.info("Athena query ID %s: %s",
                 cursor.query_id,
                 cursor.result_set.state)
    logging.info(" data_scanned_in_bytes: %d",
                 cursor.result_set.data_scanned_in_bytes)
    logging.info(" total_execution_time_in_millis: %d",
                 cursor.result_set.total_execution_time_in_millis)

def parse_arguments(args):
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--database',
                        help='Name of the database', default='ccindex')
    parser.add_argument('--output_base_name',
                        help='Base name of output files, without the suffix .csv.gz',
                        default='results')
    parser.add_argument('--s3_staging_dir', required=True,
                        help='Staging directory on S3 used for query results and metadata. '
                             'A subdirectory for each crawl dataset is created.')
    parser.add_argument('query_template',
                        help='Query template. '
                             'The query template should contain a placeholder for the '
                             'crawls to be processed: {crawl}. '
                             'Alternatively, any crawl identifiers (CC-MAIN-YYYY-WW) are '
                             'replaced by the actual crawl ID.')
    parser.add_argument('crawl_data_set', nargs='+',
                        help='Common Crawl crawl dataset(s) to process, e.g., CC-MAIN-2022-33')
    args = parser.parse_args(args)

    # remove trailing slashes, the per-crawl subdirectory is appended with '/'
    args.s3_staging_dir = args.s3_staging_dir.rstrip('/')

    return args

def main(args):
    args = parse_arguments(args)

    retry_config = RetryConfig(attempt=3)

    for crawl in args.crawl_data_set:
        cursor = connect(s3_staging_dir=args.s3_staging_dir + '/' + crawl.lower(),
                         retry_config=retry_config,
                         region_name="us-east-1").cursor()
        query_execute(crawl, cursor, args)


if __name__ == '__main__':
    main(sys.argv[1:])
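
The two ways a crawl ID gets into the query (the `{crawl}` placeholder, or rewriting an existing crawl identifier) and the per-crawl staging directory can be tried out without an Athena connection. This is a minimal sketch, not the script itself: the helper names `fill_crawl` and `staging_dir_for` and the bucket path in the usage note are hypothetical, the week range 00-53 in the pattern is an assumption of the sketch, and it uses `str.replace` on purpose so that other braces in the SQL (such as the `{3}` regex quantifier) are left untouched.

```python
import re

# Crawl identifiers look like CC-MAIN-YYYY-WW; the accepted week range
# (00-53) is an assumption of this sketch.
RX_CRAWL_ID = re.compile(r'CC-MAIN-20[123][0-9]-(?:[0-4][0-9]|5[0-3])')


def fill_crawl(query_template: str, crawl: str) -> str:
    """Return the query for one crawl.

    If the template contains the placeholder {crawl}, only that placeholder
    is replaced. Otherwise every crawl identifier found in the template is
    rewritten to the requested crawl.
    """
    if '{crawl}' in query_template:
        return query_template.replace('{crawl}', crawl)
    return RX_CRAWL_ID.sub(crawl, query_template)


def staging_dir_for(s3_staging_dir: str, crawl: str) -> str:
    """Build the per-crawl staging subdirectory (lower-cased crawl ID)."""
    return s3_staging_dir.rstrip('/') + '/' + crawl.lower()
```

For example, `fill_crawl("where crawl = 'CC-MAIN-2026-17'", 'CC-MAIN-2022-33')` rewrites the crawl identifier in place, and `staging_dir_for('s3://example-bucket/athena/', 'CC-MAIN-2022-33')` yields a subdirectory named `cc-main-2022-33` below the (hypothetical) staging location.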