137 changes: 137 additions & 0 deletions paper-explorer/README.md
@@ -0,0 +1,137 @@
# CC-Citations: Paper Explorer

A visual tool for exploring research papers that cite Common Crawl, based on embedding similarity. The tool is deployed as a [Hugging Face Space](https://huggingface.co/spaces/commoncrawl/cc-citations). This folder contains all code for generating paper embeddings, topic modeling, and the web app.

## Setup

- Python 3.12 (recommended)

```bash
# install dependencies via pip
pip install -r requirements.txt
```

## Generate paper embeddings

The paper explorer requires 2D representations of the paper embeddings. To obtain them, we generate an embedding from the title and abstract of each paper and then apply dimensionality reduction.

```bash
python embed_papers.py --input_path=<path to OpenAlex JSONL> \
--json_output_path=papers.json \
--js_output_path=hf_space/papers.js \
--model_name_or_path=malteos/scincl

python embed_papers.py --input_path=../gscholar_alerts/citations.jsonl \
--json_output_path=papers_full.json \
--js_output_path=papers_full.js \
--model_name_or_path=malteos/scincl \
--batch_size=12 \
--title_field=title \
--url_field=url \
--authors_field=authors \
--abstract_field=snippet \
--embedding_fields title

python embed_papers.py --input_path=./merged_citations.jsonl \
--json_output_path=papers_merged.json \
--js_output_path=papers_merged.js \
--model_name_or_path=malteos/scincl \
--batch_size=12 \
--title_field=title \
--url_field=url \
--authors_field=authors \
--abstract_field=abstract \
--id_field=openalex_id \
--embedding_fields title abstract

```
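How `embed_papers.py` turns the selected fields into model input is not shown in this README. As a minimal sketch of one plausible reading of `--embedding_fields`, the fields could be joined into a single string per paper before encoding. The helper names `read_jsonl` and `build_embedding_text` and the `[SEP]` separator are assumptions made for illustration, not taken from the script.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def build_embedding_text(
    record: dict, fields: Iterable[str], sep: str = " [SEP] "
) -> str:
    """Join the selected fields of one record into a single input string.

    Missing or empty fields are skipped, so a paper without an abstract
    falls back to its title alone. The separator is an assumption.
    """
    parts = [str(record.get(field) or "").strip() for field in fields]
    return sep.join(part for part in parts if part)
```

With `fields=["title"]` this reduces to the title-only setup of the second command above, and with `fields=["title", "abstract"]` to the setup of the third.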

## Topic detection with LDA

To assign topics to each paper, we run LDA on the titles and abstracts.

```bash
python classify_paper_topics.py \
--input_path=papers.json \
--topics_path=topics.json \
--paper_to_topic_path=paper_topics.json \
--n_topics=12 --n_words=20 --max_iter=100 --use_abstracts

python classify_paper_topics.py \
--input_path=papers_full.json \
--topics_path=topics_full.json \
--paper_to_topic_path=paper_topics_full.json \
--n_topics=30 --n_words=20 --max_iter=100


python classify_paper_topics.py \
--input_path=papers_merged.json \
--topics_path=topics_merged.json \
--paper_to_topic_path=paper_topics_merged.json \
--n_topics=50 --n_words=20 --max_iter=100 --use_abstracts
```
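`classify_paper_topics.py` writes two mappings: topic index to keywords, and paper index to topic index. Both are built with integer keys, but JSON object keys are always strings, so any code that reads the files back has to look up string keys. A small sketch of that round trip, using made-up sample data in the same shape as the script's output:

```python
import json

# Same shape as the script's output; the values are made-up sample data.
topics = {0: {"keywords": ["web", "crawl"]}, 1: {"keywords": ["language", "model"]}}
paper_topics = {0: 1, 1: 0, 2: 1}

# Serializing turns the integer keys into strings ("0", "1", ...).
loaded_topics = json.loads(json.dumps(topics))
loaded_paper_topics = json.loads(json.dumps(paper_topics))


def keywords_for_paper(paper_idx: int) -> list[str]:
    """Look up the keywords of the topic assigned to one paper."""
    topic_idx = loaded_paper_topics[str(paper_idx)]   # paper keys are strings
    return loaded_topics[str(topic_idx)]["keywords"]  # topic keys are strings
```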

Since LDA produces keyword lists rather than topic titles, we use an LLM to assign titles and colors (e.g., with the following Claude Code prompt):

> Assign a `topic_title` to each topic in `topics.json` based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned use "Other" as a topic title.

Or use a CLI command:

```bash
claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "In the file topics_full.json, assign a 'topic_title' field to each topic in the JSON list of topics based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned use 'Other' as a topic title."

claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "In the file topics_merged.json, assign a 'topic_title' field to each topic in the JSON list of topics based on the provided keywords (LDA output) and assign colors such that the color reflects topic similarity. If no meaningful title can be assigned use 'Other' as a topic title."

claude --permission-mode acceptEdits --allowedTools Read,Edit,Glob -p "The file topics_merged.json holds a list of many topics with titles and keywords (LDA output). Group these topics into 15 meaningful main topics. Assign a new field 'main_topic_title' to each topic. If certain topics cannot be meaningfully grouped, assign them the 'Other' main topic title. Save the output into a new file with the '_grouped' suffix."

```
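Because the titles are written by an LLM, it can be worth checking the edited file before building the web app. The following is a minimal sketch, assuming the topics file is a JSON object keyed by topic index (as written by `classify_paper_topics.py`) and that the LLM added a `topic_title` string to each entry. The function names are hypothetical and not part of this repository.

```python
import json


def find_untitled_topics(topics: dict) -> list[str]:
    """Return the keys of topics that lack a non-empty `topic_title`."""
    missing = []
    for key, topic in topics.items():
        title = topic.get("topic_title")
        if not isinstance(title, str) or not title.strip():
            missing.append(key)
    return missing


def check_topics_file(path: str) -> list[str]:
    """Load a topics JSON file and list the topics that still need a title."""
    with open(path, "r", encoding="utf-8") as f:
        return find_untitled_topics(json.load(f))
```

An empty list from `check_topics_file("topics_full.json")` would mean every topic received a title.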

## JavaScript data

To load all results in a web page, the papers, topics, and paper-to-topic mapping need to be combined into a single JavaScript file:

```bash
python create_papers_js.py \
--papers papers_full.json \
--topics topics_full.json \
--paper-topics paper_topics_full.json \
--output hf_space/papers.js


python create_papers_js.py \
--papers papers_merged.json \
--topics topics_merged.json \
--paper-topics paper_topics_merged.json \
--output hf_space/papers.js

```
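The source of `create_papers_js.py` is not part of this README. As a rough sketch of the general idea, the three JSON structures can be bundled into one object and emitted as a JavaScript assignment, since an ASCII-escaped JSON document is also a valid JavaScript expression. The function name `render_papers_js`, the variable name `PAPER_DATA`, and the payload keys are assumptions for illustration and may differ from what the real script emits.

```python
import json


def render_papers_js(
    papers: list, topics: dict, paper_topics: dict, var_name: str = "PAPER_DATA"
) -> str:
    """Render papers, topics and their mapping as one JavaScript assignment."""
    payload = {"papers": papers, "topics": topics, "paperTopics": paper_topics}
    # json.dumps escapes non-ASCII characters by default, which keeps the
    # output safe to embed as a JavaScript literal.
    body = json.dumps(payload)
    return f"const {var_name} = {body};\n"


def write_papers_js(output_path: str, papers: list, topics: dict, paper_topics: dict) -> None:
    """Write the rendered JavaScript to `output_path`."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(render_papers_js(papers, topics, paper_topics))
```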

## View Web page

The resulting web app is a single HTML file with JavaScript and can be viewed in a browser.

```bash
cd hf_space

# from local FS
open index.html

# via local web server at http://localhost
python -m http.server 80
```
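Binding to port 80 requires elevated privileges on many systems. As an alternative, here is a minimal sketch that serves a directory on an unprivileged port using only the standard library. The helper name `make_server` and the default port 8000 are arbitrary choices for illustration.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Create a static file server for `directory`, bound to localhost only."""
    # The `directory` keyword tells the handler which folder to serve,
    # independent of the current working directory.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling `make_server("hf_space").serve_forever()` would then make the app available at `http://127.0.0.1:8000/` until interrupted.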

## Push to HF space

To deploy the web app to Hugging Face, upload the relevant files as follows:

```bash
huggingface-cli upload commoncrawl/cc-citations ./hf_space --repo-type space --commit-message "Uploading paper explorer"

huggingface-cli upload malteos/some-tests ./hf_space --repo-type space --commit-message "Uploading paper explorer"
```

## References

- Embedding model: https://github.com/malteos/scincl
- Dimensionality reduction method: https://github.com/lmcinnes/umap
- Topic modeling: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
245 changes: 245 additions & 0 deletions paper-explorer/classify_paper_topics.py
@@ -0,0 +1,245 @@
"""Classify research paper topics with scikit-learn's LDA implementation.

Inputs:

- Path to a JSON list of research papers (title, abstract).
- Number of topics

Topics are inferred from titles only, or from title + abstract when
--use_abstracts is set.

Outputs:

- Mapping of topic idx to keywords
- Mapping of paper idx to topic idx

Usage:

    python paper-explorer/classify_paper_topics.py --input_path=paper-explorer/papers.json --topics_path=paper-explorer/topics.json --paper_to_topic_path=paper-explorer/paper_topics.json

"""

import json
import argparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# English stop words tailored to research papers
stop_words = {
    # Common English stop words
    'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are',
    'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but',
    'by', 'can', 'could', 'did', 'do', 'does', 'doing', 'down', 'during', 'each', 'few', 'for',
    'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself',
    'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just',
    'me', 'might', 'more', 'most', 'must', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off',
    'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 's',
    'same', 'she', 'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs',
    'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to',
    'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what', 'when', 'where', 'which',
    'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'you', 'your', 'yours', 'yourself',
    'yourselves',
    # Research paper specific stop words
    'abstract', 'article', 'paper', 'study', 'research', 'results', 'conclusion', 'introduction',
    'method', 'methodology', 'approach', 'analysis', 'figure', 'table', 'section', 'based',
    'using', 'show', 'shows', 'presented', 'propose', 'proposed', 'discuss', 'discussed',
    'demonstrate', 'demonstrated', 'investigate', 'investigated', 'examine', 'examined'
}


def load_papers(input_path):
    """Load papers from JSON file."""
    with open(input_path, 'r', encoding='utf-8') as f:
        papers = json.load(f)
    return papers


def preprocess_papers(papers, use_abstracts: bool = False):
    """Build one document per paper: title only, or title + abstract."""
    documents = []
    for paper in papers:
        # `or ''` also covers fields that are present but null in the JSON
        title = paper.get('title') or ''
        if use_abstracts:
            abstract = paper.get('abstract') or ''
            # Combine title and abstract, handle missing values
            text = f"{title} {abstract}".strip()
        else:
            text = title

        documents.append(text)
    return documents


def train_lda_model(documents, n_topics, max_features=3000, max_iter=50, random_state=42):
    """Train LDA model on documents."""
    # Corpus size, used to scale min_df for better filtering
    n_docs = len(documents)

    # Create document-term matrix
    vectorizer = CountVectorizer(
        max_features=max_features,
        stop_words=list(stop_words),
        lowercase=True,
        min_df=max(3, int(n_docs * 0.002)),  # Filter rare terms (min 3 docs or 0.2%)
        max_df=0.7  # Ignore very common terms (70% threshold)
    )
    doc_term_matrix = vectorizer.fit_transform(documents)

    # Train LDA model
    # Use batch learning for better topic balance
    # alpha: controls document-topic density (higher = more topics per doc)
    # eta (beta): controls topic-word density
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=max_iter,
        learning_method='batch',  # Use batch for better convergence
        learning_offset=10.,  # Only relevant for online learning
        doc_topic_prior=None,  # Use symmetric prior (1/n_topics)
        topic_word_prior=None,  # Use symmetric prior
        random_state=random_state,
        n_jobs=-1,
        evaluate_every=5,
        perp_tol=0.01
    )
    lda.fit(doc_term_matrix)

    return lda, vectorizer, doc_term_matrix


def extract_topics(lda, vectorizer, n_words=10):
    """Extract top keywords for each topic."""
    topics = {}
    feature_names = vectorizer.get_feature_names_out()

    for topic_idx, topic in enumerate(lda.components_):
        top_indices = topic.argsort()[-n_words:][::-1]
        top_words = [feature_names[i] for i in top_indices]
        topics[topic_idx] = {"keywords": top_words}

    return topics


def assign_papers_to_topics(lda, doc_term_matrix):
    """Assign each paper to its most probable topic."""
    topic_distributions = lda.transform(doc_term_matrix)
    paper_topics = {}

    for paper_idx, distribution in enumerate(topic_distributions):
        topic_idx = int(np.argmax(distribution))
        paper_topics[paper_idx] = topic_idx

    return paper_topics


def save_json(data, output_path):
    """Save data to JSON file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def main():
    """Main CLI function."""
    parser = argparse.ArgumentParser(
        description='Classify research paper topics using LDA'
    )
    parser.add_argument(
        '--input_path',
        type=str,
        required=True,
        help='Path to input JSON file containing papers'
    )
    parser.add_argument(
        '--topics_path',
        type=str,
        required=True,
        help='Path to output JSON file for topic keywords'
    )
    parser.add_argument(
        '--paper_to_topic_path',
        type=str,
        required=True,
        help='Path to output JSON file for paper-to-topic mapping'
    )
    parser.add_argument(
        '--n_topics',
        type=int,
        default=10,
        help='Number of topics to extract (default: 10)'
    )
    parser.add_argument(
        '--n_words',
        type=int,
        default=10,
        help='Number of keywords per topic (default: 10)'
    )
    parser.add_argument(
        '--max_features',
        type=int,
        default=3000,
        help='Maximum number of features for vectorizer (default: 3000)'
    )
    parser.add_argument(
        '--max_iter',
        type=int,
        default=50,
        help='Maximum number of iterations for LDA (default: 50)'
    )
    parser.add_argument(
        '--limit',
        type=int,
        default=None,
        help='Limit number of papers to process (default: None, process all)'
    )
    parser.add_argument(
        '--use_abstracts',
        action="store_true",
        default=False,
        help='Use paper titles and abstracts (otherwise only titles are used)'
    )

    args = parser.parse_args()

    print(f"Loading papers from {args.input_path}...")
    papers = load_papers(args.input_path)

    if args.limit is not None:
        papers = papers[:args.limit]
        print(f"Limited to {len(papers)} papers")
    else:
        print(f"Loaded {len(papers)} papers")

    print(f"Preprocessing papers... {args.use_abstracts=}")
    documents = preprocess_papers(papers, use_abstracts=args.use_abstracts)

    print(f"Training LDA model with {args.n_topics} topics...")
    lda, vectorizer, doc_term_matrix = train_lda_model(
        documents,
        args.n_topics,
        max_features=args.max_features,
        max_iter=args.max_iter
    )

    print(f"Extracting top {args.n_words} keywords per topic...")
    topics = extract_topics(lda, vectorizer, n_words=args.n_words)

    print("Assigning papers to topics...")
    paper_topics = assign_papers_to_topics(lda, doc_term_matrix)

    print(f"Saving topics to {args.topics_path}...")
    save_json(topics, args.topics_path)

    print(f"Saving paper-to-topic mapping to {args.paper_to_topic_path}...")
    save_json(paper_topics, args.paper_to_topic_path)

    print("\nTopic keywords:")
    for topic_idx, topic_data in topics.items():
        keywords = topic_data["keywords"]
        print(f"Topic {topic_idx}: {', '.join(keywords)}")

    print(f"\nDone! Classified {len(paper_topics)} papers into {args.n_topics} topics.")


if __name__ == '__main__':
    main()