
commoncrawl/cc-citations

 
 

Common Crawl Citations – BibTeX Database

BibTeX files are in bib/

Note: this is a work in progress and still contains only a fraction of recent articles.

Fields Specific for Common Crawl

The following non-standard fields record how a publication relates to Common Crawl:

cc-author-affiliation
affiliation of the authors
cc-class
classification of the publication: domain of research, topics, keywords
cc-snippet
snippet citing Common Crawl
cc-dataset-used
subset of Common Crawl used, e.g., CC-MAIN-2016-07
cc-derived-dataset-about
the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-used
a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
cc-derived-dataset-cited
a derived dataset is cited but not used
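
As an illustration of how these fields might be read back out of an entry, here is a minimal Python sketch. The sample entry is a made-up placeholder, not a real publication, and its `cc-class` value is invented for the example. The helper name `extract_cc_fields` is hypothetical and not part of this repository. The regular expression assumes simple `name = {value}` fields without nested braces, so it is not a full BibTeX parser.

```python
import re

# Hypothetical entry for illustration only; not a real publication.
SAMPLE_ENTRY = """
@article{placeholder2016example,
  title = {Placeholder title},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {nlp/word-embeddings},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches `cc-name = {value}` where the value contains no nested braces.
CC_FIELD = re.compile(r"^\s*(cc-[a-z-]+)\s*=\s*\{([^{}]*)\}", re.MULTILINE)


def extract_cc_fields(entry: str) -> dict[str, str]:
    """Return the Common Crawl specific fields of one BibTeX entry."""
    return {name: value.strip() for name, value in CC_FIELD.findall(entry)}


if __name__ == "__main__":
    for name, value in extract_cc_fields(SAMPLE_ENTRY).items():
        print(f"{name}: {value}")
```

Standard fields such as `title` and `year` are skipped because only names starting with `cc-` match.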

Formatting and Export of Citations

The Makefile contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.

(Do not confuse bibclean with the PyPI package of the same name, which is an entirely different tool. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
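
Before running the Makefile, it can help to confirm that the three tools are on the `PATH`. The following is a small sketch, not part of this repository. It assumes the executables carry the same names as the tools listed above, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumption: executable names match the tool names given in the README.
REQUIRED_TOOLS = ["bibtex2html", "bibclean", "bibtool"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing BibTeX tools:", ", ".join(missing))
    else:
        print("All required BibTeX tools found.")
```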

Citations from Google Scholar Alerts

As an initial step, and to get higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.

Plotting the Data

A Python script for plotting citations over time is included in this repository.
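
The repository's own script is not reproduced here. As a rough, standard-library-only sketch of the underlying idea, the following tallies BibTeX `year` fields and renders a text histogram as a stand-in for a plot. It assumes the counts come from `year` fields in BibTeX entries, which may differ from the data source the actual script uses. The sample entries are placeholders, and `count_by_year` and `text_histogram` are hypothetical names.

```python
import re
from collections import Counter

# Placeholder entries for illustration only; not real publications.
SAMPLE = """
@misc{placeholder-a,
  year = {2016},
}
@misc{placeholder-b,
  year = 2017,
}
@misc{placeholder-c,
  year = {2017},
}
"""

# Matches `year = {2016}`, `year = "2016"` or `year = 2016` at the start of a line.
YEAR_FIELD = re.compile(r'^\s*year\s*=\s*[{"]?(\d{4})', re.MULTILINE | re.IGNORECASE)


def count_by_year(bibtex_text: str) -> Counter:
    """Count entries per publication year."""
    return Counter(int(year) for year in YEAR_FIELD.findall(bibtex_text))


def text_histogram(counts: Counter) -> str:
    """Render one line per year, with one '#' per counted entry."""
    return "\n".join(f"{year} {'#' * counts[year]}" for year in sorted(counts))


if __name__ == "__main__":
    print(text_histogram(count_by_year(SAMPLE)))
```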

Fig 1: Plot of Common Crawl citations by year in Google Scholar as of July 29th, 2024

About

Scientific articles using or citing Common Crawl data
