Our annotated citations currently live in .bib files. Two keys that are especially important for data analysis, cc-class and keywords, hold multiple values per entry.
Example:
@Misc{cc:KoenigRauchWoerter:2025:Monitoring-of-economic-shocks,
title = "Real-time Monitoring of Economic Shocks using Company Websites",
author = "Michael Koenig and Jakob Rauch and Martin Woerter",
year = "2025",
...
primaryclass = "econ.GN",
keywords = "large language models, natural language processing, crisis, economic shocks, economic monitoring, Covid-19",
URL = "https://arxiv.org/abs/2502.17161",
abstract = ...
cc-author-affiliation = "ETH Zurich, ...
cc-class = "economics, economic-monitoring, web-archiving, nlp/large-language-models",
cc-snippet = ...
}
Excel needs one value per key for the data to be usable in pivot tables and charts. The proposed solution is to export the citations to CSV with multiple rows per paper, each row identified by a unique per-paper ID.
Proposal:
Add functionality to export-csv.py that produces one big CSV from all of our citations, which:
- can be exported (or copy-pasted) as an Excel file,
- and has multiple rows per citation (one row per key-value pair), enabling pivot tables and chart-making in Excel.
Proposed format: one new aggregate column (cc-topic) that combines the values from the keyword and cc-class columns, and another new column (cc-og-key) that records which original (OG) column (keyword or cc-class) each topic came from.
| id | year | primaryclass | cc-og-key | cc-topic | title | authors | cc-author-affiliation | URL | DOI | cc-snippet |
|---|---|---|---|---|---|---|---|---|---|---|
| cc:Koenig... | 2025 | econ.GN | keyword | large language models | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | keyword | natural language processing | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | ... | ... | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | keyword | Covid-19 | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | economics | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | economic-monitoring | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | ... | ... | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| cc:Koenig... | 2025 | econ.GN | cc-class | web-archiving | Real-time Monitoring ... | Koenig, ... | ETH ... | http://... | ... | .... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | .... |
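A minimal sketch of how rows in this shape could be produced, using only the Python standard library. It assumes each citation has already been parsed into a dict keyed by the .bib field names, with the citation key stored under `id`; the names `explode_entry`, `write_long_csv`, `MULTI_VALUE_KEYS`, and `COLUMNS` are hypothetical and not taken from the existing export-csv.py. It also assumes values are separated by plain commas, with no commas inside an individual value.

```python
import csv
import io

# Multi-valued .bib keys mapped to the label written into cc-og-key
# (labels follow the proposed table; the mapping itself is an assumption).
MULTI_VALUE_KEYS = {"keywords": "keyword", "cc-class": "cc-class"}

# Column order taken from the proposed table.
COLUMNS = ["id", "year", "primaryclass", "cc-og-key", "cc-topic", "title",
           "authors", "cc-author-affiliation", "URL", "DOI", "cc-snippet"]


def explode_entry(entry):
    """Return one row (dict) per value of each multi-valued key."""
    shared = {col: entry.get(col, "") for col in COLUMNS
              if col not in ("cc-og-key", "cc-topic")}
    # The .bib field is "author"; the proposed column is "authors".
    shared["authors"] = entry.get("author", "")
    rows = []
    for bib_key, og_label in MULTI_VALUE_KEYS.items():
        for value in entry.get(bib_key, "").split(","):
            topic = value.strip()
            if topic:  # skip empty values, e.g. after a trailing comma
                rows.append({**shared, "cc-og-key": og_label, "cc-topic": topic})
    return rows


def write_long_csv(entries, out):
    """Write all entries to `out` as one long-format CSV."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for entry in entries:
        writer.writerows(explode_entry(entry))


# Fields copied from the example citation; elided fields are left out.
EXAMPLE = {
    "id": "cc:KoenigRauchWoerter:2025:Monitoring-of-economic-shocks",
    "title": "Real-time Monitoring of Economic Shocks using Company Websites",
    "author": "Michael Koenig and Jakob Rauch and Martin Woerter",
    "year": "2025",
    "primaryclass": "econ.GN",
    "keywords": "large language models, natural language processing, crisis, "
                "economic shocks, economic monitoring, Covid-19",
    "URL": "https://arxiv.org/abs/2502.17161",
    "cc-class": "economics, economic-monitoring, web-archiving, "
                "nlp/large-language-models",
}

if __name__ == "__main__":
    buffer = io.StringIO()
    write_long_csv([EXAMPLE], buffer)
    print(buffer.getvalue())
```

Fields missing from an entry are written as empty cells, so every row has the same columns regardless of which optional keys a citation carries.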
- This long format should allow pivot tables and graphs, either by filtering out all rows with a given OG key (e.g. cc-og-key = keyword) or by aggregating all keys (keyword, cc-class, ...) into the same analysis.
- It gives one row per value for multiple keys with multiple values (without needing an NxN mapping).
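As a rough illustration of the two analysis modes described above (filtering by OG key versus aggregating all keys), the sketch below counts distinct papers per topic, much like a pivot table would. The function name `topic_counts` and the sample rows are made up for illustration; only the column names come from the proposal.

```python
from collections import Counter


def topic_counts(rows, og_keys=None):
    """Count distinct papers per cc-topic.

    og_keys: optional set of cc-og-key values to keep (None keeps all keys).
    """
    seen = set()
    counts = Counter()
    for row in rows:
        if og_keys is not None and row["cc-og-key"] not in og_keys:
            continue
        pair = (row["id"], row["cc-topic"])
        if pair not in seen:  # a topic may appear under both keys for one paper
            seen.add(pair)
            counts[row["cc-topic"]] += 1
    return counts


# Made-up sample rows in the proposed long format (not real citations).
SAMPLE_ROWS = [
    {"id": "paper-a", "cc-og-key": "keyword", "cc-topic": "crisis"},
    {"id": "paper-a", "cc-og-key": "keyword", "cc-topic": "economics"},
    {"id": "paper-a", "cc-og-key": "cc-class", "cc-topic": "economics"},
    {"id": "paper-b", "cc-og-key": "cc-class", "cc-topic": "economics"},
    {"id": "paper-b", "cc-og-key": "cc-class", "cc-topic": "web-archiving"},
]

if __name__ == "__main__":
    print(topic_counts(SAMPLE_ROWS))                # all keys aggregated
    print(topic_counts(SAMPLE_ROWS, {"cc-class"}))  # cc-class rows only
```

Deduplicating on the (id, cc-topic) pair matters when keys are aggregated: a paper that lists the same topic under both keyword and cc-class would otherwise be counted twice.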