Export annotated citations in a format usable for Excel analysis #6

Description

@handecelikkanat

Our annotated citations currently live in .bib files. Two keys that are especially important for data analysis, cc-class and keywords, each hold multiple values.

Example:

@Misc{cc:KoenigRauchWoerter:2025:Monitoring-of-economic-shocks,
  title        = "Real-time Monitoring of Economic Shocks using Company Websites",
  author       = "Michael Koenig and Jakob Rauch and Martin Woerter",
  year         = "2025",
  ...
  primaryclass = "econ.GN",
  keywords     = "large language models, natural language processing, crisis, economic shocks, economic monitoring, Covid-19",
  URL          = "https://arxiv.org/abs/2502.17161",
  abstract     = ...
  cc-author-affiliation = "ETH Zurich, ...
  cc-class     = "economics, economic-monitoring, web-archiving, nlp/large-language-models",
  cc-snippet   = ...
}

Excel needs one value per key for the data to be usable in pivot tables and charts. The proposed solution is to export the citations to CSV files with multiple rows per paper, each row identified by the paper's unique ID.
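A minimal sketch of the splitting step, assuming each multi-valued field is a plain comma-separated string as in the example above. The helper name `split_values` is hypothetical and not part of the existing export-csv.py:

```python
def split_values(raw: str) -> list[str]:
    """Split a comma-separated BibTeX field value into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


# cc-class value taken from the example entry above.
cc_class = "economics, economic-monitoring, web-archiving, nlp/large-language-models"
print(split_values(cc_class))
# ['economics', 'economic-monitoring', 'web-archiving', 'nlp/large-language-models']
```

Each item returned here would become its own row in the exported CSV.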

Proposal:

Add functionality to export-csv.py that produces one big CSV from all of our citations, which:

  • can be exported (or copy-pasted) as an Excel file,
  • and has multiple rows per citation (one row per key-value pair), which enables pivot tables and charts in Excel.

Proposed format: one new aggregate column (cc-topic) that combines the values of the keywords and cc-class fields, and another new column (cc-og-key) that records which original (OG) field (keyword or cc-class) each topic came from.

| id | year | primaryclass | cc-og-key | cc-topic | title | authors | cc-author-affiliation | URL | DOI | cc-snippet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cc:Koenig... | 2025 | econ.GN | keyword | large language models | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | keyword | natural language processing | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | ... | ... | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | keyword | Covid-19 | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | economics | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | economic-monitoring | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | ... | ... | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | web-archiving | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | .... |
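The table above could be produced roughly as follows. This is only a sketch under stated assumptions: the entries are taken to be already parsed from the .bib files into a dict of `{id: {field: value}}` (the parsing itself is not shown), the function names `explode_entry` and `write_long_csv` are hypothetical, and the author string is copied unchanged rather than reformatted.

```python
import csv
import io

# Column order follows the proposed format.
COLUMNS = ["id", "year", "primaryclass", "cc-og-key", "cc-topic", "title",
           "authors", "cc-author-affiliation", "URL", "DOI", "cc-snippet"]

# Assumption: BibTeX field name -> label written into the cc-og-key column.
MULTI_VALUED = {"keywords": "keyword", "cc-class": "cc-class"}


def explode_entry(entry_id, fields):
    """Yield one row per (cc-og-key, cc-topic) pair for a single citation."""
    shared = {col: fields.get(col, "") for col in COLUMNS}
    shared["id"] = entry_id
    shared["authors"] = fields.get("author", "")  # copied as-is
    for bib_key, label in MULTI_VALUED.items():
        for topic in fields.get(bib_key, "").split(","):
            topic = topic.strip()
            if topic:
                yield {**shared, "cc-og-key": label, "cc-topic": topic}


def write_long_csv(entries, out):
    """Write all citations in long format to an open text stream."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for entry_id, fields in entries.items():
        writer.writerows(explode_entry(entry_id, fields))


# Usage with the fields visible in the example entry (elided fields left out).
entries = {
    "cc:KoenigRauchWoerter:2025:Monitoring-of-economic-shocks": {
        "title": "Real-time Monitoring of Economic Shocks using Company Websites",
        "author": "Michael Koenig and Jakob Rauch and Martin Woerter",
        "year": "2025",
        "primaryclass": "econ.GN",
        "keywords": "large language models, natural language processing, crisis, "
                    "economic shocks, economic monitoring, Covid-19",
        "URL": "https://arxiv.org/abs/2502.17161",
        "cc-class": "economics, economic-monitoring, web-archiving, "
                    "nlp/large-language-models",
    }
}
buffer = io.StringIO()
write_long_csv(entries, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, opening it with `newline=""` is what the `csv` module documentation recommends, and `encoding="utf-8-sig"` is a common choice to help Excel detect UTF-8.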
  • This format should allow pivot tables and graphs, either by filtering out all rows with a given OG key (e.g. cc-og-key = keyword) or by aggregating all keys (keyword, cc-class, ...) in the same analysis.
  • It gives one row per value for multiple keys with multiple values (without needing an NxN mapping).
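To illustrate the filtering and aggregation described in these points outside of Excel, here is a small sketch that counts papers per topic, either for one OG key or across all of them. The rows and the `count_topics` helper are hypothetical and only mimic the relevant columns of the proposed format:

```python
from collections import Counter

# Hypothetical rows in the proposed long format (only the relevant columns).
rows = [
    {"id": "paper-1", "cc-og-key": "keyword", "cc-topic": "crisis"},
    {"id": "paper-1", "cc-og-key": "cc-class", "cc-topic": "economics"},
    {"id": "paper-2", "cc-og-key": "cc-class", "cc-topic": "economics"},
    {"id": "paper-2", "cc-og-key": "cc-class", "cc-topic": "web-archiving"},
]


def count_topics(rows, og_keys=None):
    """Count papers per topic, optionally restricted to some cc-og-key values."""
    # Deduplicate (paper, topic) pairs so a paper is counted once per topic,
    # even if the same topic appears under both keyword and cc-class.
    seen = {
        (row["id"], row["cc-topic"])
        for row in rows
        if og_keys is None or row["cc-og-key"] in og_keys
    }
    return Counter(topic for _, topic in seen)


print(count_topics(rows, og_keys={"cc-class"}))  # cc-class rows only
print(count_topics(rows))                        # all OG keys together
```

This mirrors what a pivot table on cc-topic with a filter on cc-og-key would show.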
