Update citations 2025/2026 December - March #11

Merged
sebastian-nagel merged 6 commits into main from update-citations-2025-2026-dev-jan-feb
Mar 19, 2026

Conversation

@sebastian-nagel

Contributor
  • Update GScholar citations
  • Allow incremental updates of GScholar citations
    • Read the clean citations.jsonl as the base and add/update citations from the EML folder.
    • This allows the citations to be updated from an incomplete set of alert emails. Manual inspection of the updated data is strongly recommended so that no citations are lost.
  • Add a few interesting citations to the extended citations list.
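
The incremental update described in this list can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: the helper names (`read_jsonl`, `citation_key`, `merge_citations`) and the choice of normalized title plus year as the identity key are assumptions made for illustration, and parsing the alert emails from the EML folder is left out entirely.

```python
import json


def read_jsonl(path):
    """Read one JSON record per non-empty line (the layout of citations.jsonl)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def citation_key(record):
    """Hypothetical identity key: lower-cased, stripped title plus year."""
    return (record.get("title", "").strip().lower(), record.get("year", ""))


def merge_citations(base, updates):
    """Start from the base records and add or update records from `updates`.

    Records from `base` are never dropped, so an incomplete set of
    updates cannot remove existing citations.
    """
    merged = {citation_key(rec): dict(rec) for rec in base}
    for rec in updates:
        key = citation_key(rec)
        if key not in merged:
            merged[key] = dict(rec)
            continue
        existing = merged[key]
        # Keep all known URLs, base URLs first, without repeats.
        urls = list(existing.get("url", []))
        for url in rec.get("url", []):
            if url not in urls:
                urls.append(url)
        existing.update(rec)
        existing["url"] = urls
    return list(merged.values())
```

Because the base is kept as-is and only extended, a diff of the output against the previous citations.jsonl shows exactly what the update changed, which is what the recommended manual inspection would look at.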

Read the clean citations.jsonl as base and add/update citations
from EML folder.
Add GScholar citations of March 2026 using the incremental update.

@handecelikkanat handecelikkanat left a comment

Contributor

@sebastian-nagel LGTM apart from a systematic issue: non-author information (year, journal name) ends up in the final entry of the Authors field.

I marked several occurrences and stopped after a while because there were too many.

The issue exists in older entries as well (some marked), not only in the ones from this PR.

It is likely a code issue, but I couldn't spot where yet. Shall I check in more detail, or do you have an idea?
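
To gauge how widespread the issue is, a quick scan of the records could look something like the sketch below. The pattern is an assumption based on the affected entries quoted in this thread (a " - " separator followed by a venue and/or a four-digit year at the end of the last author entry); it is not taken from the parsing code and will miss other forms of the problem.

```python
import re

# Assumed shape of the residue: " - <venue>, 2026" or just " - 2026"
# at the end of the final entry in the "authors" list.
TRAILING_META = re.compile(r"\s-\s.*\b(?:19|20)\d{2}\s*$")


def has_trailing_metadata(authors):
    """Return True if the last author entry appears to end in venue/year info."""
    if not authors:
        return False
    return TRAILING_META.search(authors[-1]) is not None


def flag_records(records):
    """Return the titles of records whose authors field looks polluted."""
    return [
        rec.get("title", "")
        for rec in records
        if has_trailing_metadata(rec.get("authors", []))
    ]
```

Running `flag_records` over the records read from citations.jsonl would give a list to review by hand; the heuristic only flags candidates and does not attempt to repair them.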

10 comment threads on gscholar_alerts/citations.jsonl (one outdated)
@handecelikkanat

handecelikkanat commented Mar 16, 2026

Contributor

@sebastian-nagel I cannot see this, is it me?

Add few interesting citations to extended citations list.

EDIT: Nvm, my mistake.

@handecelikkanat

Contributor

@sebastian-nagel @jenenglish I had a better idea; I think this is the conventional way.

  • The issue I found (TBC by Sebastian) does not look related to this PR.
  • So I think I should approve these changes, then open a new issue for the problem.
  • Then we handle it in a new PR.

Therefore I'll approve this PR and raise an issue.

@handecelikkanat handecelikkanat left a comment

Contributor

The unrelated information in the Authors field looks older than this PR.

Let's merge this and look into it separately.

@jenenglish


@sebastian-nagel Looks good. A few minor things from spot-checking:

{"year":"2025","title":"NOTE: UNDER PEER REVIEW Not for dissemination without author permission.","authors":["TES Charlesworth, LK Werden, J van den Hoogen…"],"snippet":"Disentangling the cultural drivers of ecological degradation and recovery remains a central challenge for a regenerative future. Here, we use language to develop the first systematic record of global variation in nature attitudes and explore the …","url":["https://osf.io/download/vumza/"]}

  • Looks like the title is: Language Reveals Global Links Between Nature Attitudes and Sustainable Development

{"year":"2025","title":"Peasant movements by country or region","authors":["V Campesina"],"snippet":"Several peasant movement in India arose during the colonial era, when economic policies by various British colonial administrations led to the decline of traditional handicraft industries. These policies lead to change of ownership in lands, land …","url":["https://reference.org/facts/Peasant_movement/u6NltEmc"]}

  • Suggest removing this and the other reference.org articles, which just have a general statement about using Common Crawl.
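
Listing the affected records before removing them could be done along these lines. This is a sketch under assumptions: the function names are hypothetical, and matching on the URL host alone says nothing about whether an individual entry is worth keeping, so the result is a review list rather than a delete list.

```python
from urllib.parse import urlparse


def is_on_host(url, host):
    """Return True if the URL's hostname is `host` or one of its subdomains."""
    hostname = (urlparse(url).hostname or "").lower()
    return hostname == host or hostname.endswith("." + host)


def records_on_host(records, host):
    """Return the records that have at least one URL on the given host."""
    return [
        rec
        for rec in records
        if any(is_on_host(url, host) for url in rec.get("url", []))
    ]
```

For example, `records_on_host(records, "reference.org")` would return the "Peasant movements by country or region" record quoted above.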

{"year":"2026","title":"AI-Generated Creativity and the Law","authors":["PH Originality, AI Fears - 2026"],"snippet":"The legal inferences of AI-driven change of copyrighted works advance complicated questions under intellectual property and copyright law including the boundary between conversion in use or infringement. AI-generated modifications from …","url":["https://www.igi-global.com/viewtitle.aspx?titleid=400874"]}
{"year":"2026","title":"AI-Generated Creativity and the Law: Protecting Human Originality Amid Imposter Fears","authors":["B Chouhan - Imposter Syndrome and AI: Navigating Human Identity …, 2026"],"snippet":"The legal inferences of AI-driven change of copyrighted works advance complicated questions under intellectual property and copyright law including the boundary between conversion in use or infringement. AI-generated modifications from …","url":["https://www.igi-global.com/chapter/ai-generated-creativity-and-the-law/400874"]}

  • duplicates

{"year":"2026","title":"JOHAN MICHEL & SARA LINDBERG","authors":["J MICHEL"],"snippet":"… The dataset Bias in Bios was extracted from Common Crawl by filtering text lines … on an initial analysis of a small subset of Common Crawl. In some cases, closely …","url":["https://gupea.ub.gu.se/server/api/core/bitstreams/8b5e23f4-554e-4712-ae84-dd2c1b84aad7/content"]}

  • title is: Interpretable Methods for Information Removal in Text-Based Learning

{"year":"2026","title":"Multimodal Large Models","authors":["L Lin, Y Liu"],"snippet":"… GPT-3 [61]: Released in May 2020, it utilized 175 billion parameters and 45 TB of CommonCrawl data for massive-scale learning—over … data and CommonCrawl data as low-quality data, train a simple logistic regression model to assess data …","url":["https://link.springer.com/content/pdf/10.1007/978-981-95-4929-0.pdf"]}
{"year":"2026","title":"Multimodal Large Models: A New Paradigm of Artificial Intelligence","authors":["L Lin"],"snippet":"… GPT-2 was pre-trained on larger text datasets including CommonCrawl, WebText, and BooksCorpus. Compared to GPT-1, GPT-2 showed … GPT-3 [61]: Released in May 2020, it utilized 175 billion parameters and 45 TB of CommonCrawl data for …","url":["https://books.google.de/books?hl=en&lr=lang_en&id=LcHAEQAAQBAJ&oi=fnd&pg=PR6&dq=commoncrawl&ots=zHslsni1nJ&sig=7pLEHDkbH_gU-R_EguY3YZXXtIo"]}

  • duplicates
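
Both duplicate pairs in this comment share a year and have one title that is a prefix of the other, so a heuristic like the following could surface similar cases. It is an illustrative sketch only, with hypothetical function names; prefix matching on short titles can flag unrelated works, so every pair it reports still needs a manual check.

```python
def normalize_title(title):
    """Lower-case the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def likely_duplicates(a, b):
    """Heuristic: same year and one normalized title is a prefix of the other."""
    if a.get("year") != b.get("year"):
        return False
    title_a = normalize_title(a.get("title", ""))
    title_b = normalize_title(b.get("title", ""))
    if not title_a or not title_b:
        return False
    return title_a.startswith(title_b) or title_b.startswith(title_a)


def find_duplicate_pairs(records):
    """Return (title, title) pairs of records that look like duplicates.

    Compares every pair of records, which is quadratic in the number
    of records; acceptable for a one-off check, not for large inputs.
    """
    pairs = []
    for index, first in enumerate(records):
        for second in records[index + 1:]:
            if likely_duplicates(first, second):
                pairs.append((first.get("title", ""), second.get("title", "")))
    return pairs
```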

@sebastian-nagel

Contributor Author

Thanks, @handecelikkanat and @jenenglish! Merging...

@sebastian-nagel sebastian-nagel merged commit 51b1323 into main Mar 19, 2026