C4Corpus

C4Corpus is a Common Crawl data set preprocessed with DKPro C4CorpusTools. The preprocessing includes Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Contents