Turning 30,000 Arabic domains into a better crawl

Code and data accompanying the work described in this blog post, which filters, geolocates and categorises a donated list of Arabic seed domains.

ArabicDomainQuality.xlsx: The original data received from QCRI.
arabic_seeds.ipynb: A notebook detailing the data processing and analysis.
crawl_lang_info.tsv: Summarised language information for the domains found in the CC-MAIN-2026-{21,17,12} archives.
DomainQuality_Dashboard.ipynb: Additional analysis of the quality of the pre-filtered domains, carried out by researchers at QCRI.
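
For a quick look at crawl_lang_info.tsv without opening the notebook, something like the following may help. It is a minimal sketch, assuming the file is tab-separated, UTF-8 encoded and starts with a header row. The helper name `peek_tsv` is hypothetical and not part of this repository, and no column names are assumed.

```python
import csv
from pathlib import Path


def peek_tsv(path, n_rows=5):
    """Return (header, first n_rows data rows) from a tab-separated file.

    Hypothetical helper for illustration; assumes the first line is a header.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        # zip stops after n_rows, so the rest of the file is never read
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    tsv_path = Path("crawl_lang_info.tsv")
    if tsv_path.exists():
        header, rows = peek_tsv(tsv_path)
        print("columns:", header)
        for row in rows:
            print(row)
    else:
        print(f"{tsv_path} not found; run this from the repository root")
```

Only the standard library is used here, so the snippet runs before any dependencies are installed. The full processing and analysis remain in arabic_seeds.ipynb.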

We use uv to manage Python dependencies.

Acknowledgements

Thank you to Hamdy S. Hussein, Dr. Kareem M. Darwish and Dr. Mohamed Ahmed Yassin Eltabakh of the Qatar Computing Research Institute for providing the initial seed list, quality annotations and exploratory visualisations. These were created as part of the Fanar Project, an Arabic generative AI platform.