Commit 3b61af3

Author: Alden Page
Add supercharger project
1 parent 6be5306 commit 3b61af3

1 file changed, +27 -0 lines changed
  • content/gsoc-2019/project-ideas/project-ideas-list/supercharge-our-indexing-engine

@@ -0,0 +1,27 @@
_model: project-idea
---
_hidden: yes
---
title: Supercharge our search indexer
---
description:

CC Search is a system for searching hundreds of millions (eventually billions) of Creative Commons works. We store all of these documents in a PostgreSQL database. To enable fast search on a dataset of this size, we mirror the documents to Elasticsearch weekly. Indexing 276 million documents currently takes about 20 hours, but this could be greatly improved by parallelizing the work across multiple nodes and using multiple threads on each node. This project is a great opportunity to learn about the challenges of distributed computing.
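
As a rough illustration of the parallelization idea, the sketch below splits a range of document IDs into contiguous chunks and hands each chunk to a worker thread. It is only a minimal sketch under stated assumptions: the names `partition_id_range`, `index_in_parallel`, and `index_chunk` are hypothetical and not part of the existing codebase, and `index_chunk` stands in for whatever would read a range of rows from PostgreSQL and send them to Elasticsearch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def partition_id_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split the half-open range [start, end) into `parts` contiguous,
    non-overlapping half-open ranges of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(end - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks each take one additional ID.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def index_in_parallel(
    start: int,
    end: int,
    workers: int,
    index_chunk: Callable[[int, int], int],
) -> int:
    """Run the hypothetical `index_chunk(lo, hi)` on each chunk in its own
    thread and return the total number of documents it reports indexing."""
    chunks = partition_id_range(start, end, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda chunk: index_chunk(*chunk), chunks)
        return sum(counts)


if __name__ == "__main__":
    # Stub indexer (assumption): pretends every ID in the range was indexed.
    total = index_in_parallel(0, 1000, 5, lambda lo, hi: hi - lo)
    print(total)
```

The same partitioning could in principle be applied one level up, giving each node its own ID range and letting each node split that range across its threads. How work is actually assigned to nodes is left open here.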
---
rationale:

Faster indexing allows us to deliver higher quality search results to our users in less time.
---
resources:

- [CC Catalog API README](https://github.com/creativecommons/cccatalog-api/README.md)
---
expected_result:

Ideally, distributing the indexing process across 5 nodes should cut the indexing time by 80%, from about 20 hours for the current single-node, single-threaded implementation to about 4 hours.
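
The arithmetic behind this target can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes perfectly linear scaling with no coordination overhead, which a real cluster is unlikely to reach exactly. The helper name `ideal_indexing_time` is hypothetical, and the inputs are the figures quoted in the description (20 hours, 276 million documents).

```python
def ideal_indexing_time(baseline_hours: float, nodes: int) -> float:
    """Ideal wall-clock time, assuming the work divides evenly across nodes
    and there is no coordination overhead (an optimistic assumption)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return baseline_hours / nodes


baseline_hours = 20.0      # current single-node, single-threaded run
documents = 276_000_000    # documents indexed per run
nodes = 5

ideal_hours = ideal_indexing_time(baseline_hours, nodes)
reduction = 1 - ideal_hours / baseline_hours
baseline_docs_per_second = documents / (baseline_hours * 3600)

if __name__ == "__main__":
    print(f"ideal time: {ideal_hours:.1f} h")
    print(f"reduction: {reduction:.0%}")
    print(f"baseline throughput: {baseline_docs_per_second:.0f} docs/s")
```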
---
skills_recommended: Python, basic understanding of threads, basic understanding of databases, benchmark-driven mindset.
---
mentors: Alden Page
---
difficulty: Hard
