configs:
  - config_name: '2016'
    data_files:
      - split: train
        path: 2016.jsonl
  - config_name: '2017'
    data_files:
      - split: train
        path: 2017.jsonl
  - config_name: '2018'
    data_files:
      - split: train
        path: 2018.jsonl
  - config_name: '2019'
    data_files:
      - split: train
        path: 2019.jsonl
  - config_name: '2020'
    data_files:
      - split: train
        path: 2020.jsonl
  - config_name: '2021'
    data_files:
      - split: train
        path: 2021.jsonl
  - config_name: '2022'
    data_files:
      - split: train
        path: 2022.jsonl
  - config_name: '2023'
    data_files:
      - split: train
        path: 2023.jsonl
  - config_name: '2024'
    data_files:
      - split: train
        path: 2024.jsonl

Common Crawl Citations Overview

This dataset contains citations that reference the Common Crawl Foundation and its datasets, collected from Google Scholar.

Please note that these citations are not curated, so they will include some false positives. An annotated subset of these citations with additional fields can be found at cc-citations.
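
The configs declared at the top of this README map each year from 2016 to 2024 to a single JSONL file exposed as a `train` split. As a minimal sketch of how one year could be read from a local copy of those files, the snippet below uses only the Python standard library. The helper names `year_to_path` and `load_year` and the `root` parameter are illustrative assumptions, not part of the dataset, and no record fields are assumed because this README does not document them.

```python
import json
from pathlib import Path

# Config names declared in this README: one per year, 2016 through 2024.
YEARS = [str(y) for y in range(2016, 2025)]


def year_to_path(year, root="."):
    """Return the JSONL path for a config name such as '2016'.

    `root` is a hypothetical local directory holding the downloaded files.
    """
    year = str(year)
    if year not in YEARS:
        raise ValueError(f"no config for year {year!r}")
    return Path(root) / f"{year}.jsonl"


def load_year(year, root="."):
    """Read one year's JSONL file into a list of dicts, one per non-blank line."""
    records = []
    with open(year_to_path(year, root), encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

With the Hugging Face `datasets` library, the same selection would typically be made by passing the year as the config name, for example `load_dataset("<repo_id>", "2024", split="train")`, where `<repo_id>` is a placeholder for this dataset's repository id, which is not stated here.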