Commit 58f28fe

draft blog post

1 parent afca20a commit 58f28fe

2 files changed: +125 −8 lines changed

content/blog/authors/NaishaSinha/contents.lr (+4 −2)

@@ -2,11 +2,13 @@ username: naishasinha
---
name: Naisha Sinha
---
-md5_hashed_email:
+md5_hashed_email: c6f768d61d96f508d9523bf28664cb64
---
about:

Naisha worked on [Automating Quantifying the Commons][repository] as a developer for [Google
-Summer of Code (GSoC) 2024](/programs/history/).
+Summer of Code (GSoC) 2024](/programs/history/). <br>
+GitHub: [`@naishasinha`][github]

[repository]: https://github.com/creativecommons/quantifying
+[github]: https://github.com/naishasinha
(second file)

@@ -1,13 +1,12 @@
title: Automating Quantifying the Commons: Part 1
---
categories:
-cc-dataviz
-collaboration
-community
-quantifying-the-commons
-open-source
gsoc-2024
gsoc
+quantifying-the-commons
+open-source
+collaboration
+community
---
author: naishasinha
---
@@ -17,4 +16,120 @@ body:

![GSoC 2024](Automating - GSoC Logo.png)

-## Project
+## Introduction
***

Quantifying the Commons, an initiative emerging from the UC Berkeley Data Science Discovery Program, aims to quantify the frequency of public domain and CC license usage for future accessibility and analysis purposes (refer to the initial CC article for Quantifying **[here!][quantifying]**). To date, the project's scope had not included automation or combined reporting, both of which are necessary to minimize the potential for human error and to allow for more timely updates, especially for a system that engages with substantial streams of data. <br>

As a selected developer for Google Summer of Code 2024, my goal for this summer is to develop automation software for data gathering, flow, and report generation, ensuring that reports are never more than three months out of date. This blog post serves as a technical journal of my endeavor up to the midterm evaluation period. Part 2 will be posted after successful completion of the entire summer program.

## Pre-Program Knowledge and Associated Challenges
***

As an undergraduate CS student, I had no prior experience working with codebases as intricate as this one; the most complex software I had worked on before this undertaking was a medium-complexity full-stack application. In my pre-GSoC contributions to Quantifying, I successfully implemented logging across all the Python files (**[PR #97][logging]**), but admittedly, I was not familiar with many of the other modules used in those files. This unfamiliarity caused minor inconveniences to my development process from the very beginning. For example, inexperience with the operating system (`os`) module left me confused about how to join new directories, and I had never worked with such large streams of data before, so mapping out pseudocode for handling big data effectively was an initial challenge. The next section elaborates on my development process and how I resolved these setbacks.
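
To make those two points concrete, here is a minimal sketch of the idioms that initially tripped me up: a per-script logger (illustrative only, not the exact code from **[PR #97][logging]**) and a portable directory join; all names here are placeholders of my own.

```python
import logging
import os

# Per-script logger, similar in spirit to the logging added across the
# Python files (placeholder sketch, not the actual Quantifying code).
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
LOGGER = logging.getLogger(__name__)

# Joining directories portably with os.path instead of concatenating
# strings, and creating them if they do not exist yet.
DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
os.makedirs(DATA_DIR, exist_ok=True)
LOGGER.info("Data directory ready: %s", DATA_DIR)
```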

## Development Process (Midterm)
***

### I. Data Flow Diagram Construction
Before starting the code implementation, I decided to develop a **Data Flow Diagram (DFD)**, which provides a visual representation of how data flows through a software system. While researching effective DFDs for inspiration, I came across a **[technical whitepaper by Amazon Web Services (AWS)][AWS-whitepaper]** on distributed data management, and I found it very helpful in drafting my own DFD. As I was still relatively new to the codebase, the diagram helped me simplify the current system into manageable components and better understand how to implement the rest of the project.

[insert DFD here with explanation of directory setup]

### II. Identifying the First Data Source to Target
The main approach for implementing this project was to target one specific data source and complete its data extraction, analysis, and report generation process before adding more data sources to the codebase. There were two possible strategies to consider: (1) work on the easier data sources first, or (2) begin with the most complex data source and add the easier ones later. Both approaches have notable pros and cons; however, I decided to adopt the second strategy and start with the most complex data source. Although this would take slightly longer to implement, it would simplify the process later on. As a result, I began implementing the software for the **Google Custom Search** data source, which has the greatest data retrieval potential of all the sources.
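
For a sense of what querying this data source involves, here is a minimal sketch using the official Google API client library; the environment variable names (`GOOGLE_API_KEY`, `PSE_ID`) are placeholders of my own, not the project's actual configuration.

```python
# pip install google-api-python-client
import os

from googleapiclient.discovery import build

# Build a client for the Custom Search JSON API.
service = build("customsearch", "v1", developerKey=os.environ["GOOGLE_API_KEY"])

# One query; the total hit count is what matters for quantifying usage.
response = (
    service.cse()
    .list(q='"Creative Commons"', cx=os.environ["PSE_ID"], num=1)
    .execute()
)
print(response["searchInformation"]["totalResults"])
```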

### III. Directory Setup + Code Implementation
Based on the DFD, **[Timid Robot][timid-robot]** (my mentor) and I identified the following directory structure: within our `scripts` directory, we would have separate sub-directories reflecting the phases of data flow, `1-fetched`, `2-processed`, and `3-reports`. The code would then be set up to run in chronological order. Additionally, a shared module was implemented to consolidate common functions and paths. <br>
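
As a rough illustration of that layout (file and function names here are my own placeholders, not the actual Quantifying code), a shared module might centralize the phase paths like this:

```python
# shared.py (placeholder sketch, not the actual Quantifying code)
import os

# Root of the scripts/ directory, resolved relative to this file.
SCRIPTS_DIR = os.path.dirname(os.path.abspath(__file__))

# One sub-directory per phase of the data flow.
PHASE_DIRS = {
    "fetched": os.path.join(SCRIPTS_DIR, "1-fetched"),
    "processed": os.path.join(SCRIPTS_DIR, "2-processed"),
    "reports": os.path.join(SCRIPTS_DIR, "3-reports"),
}


def phase_path(phase, *parts):
    """Return a path inside the given phase directory, creating it if needed."""
    directory = PHASE_DIRS[phase]
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, *parts)
```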

**`1-fetched`**

As I mentioned in the previous sections, starting to code the initial file was a challenge, as I had to learn how to use new technologies and libraries on the go. In fact, my struggles began when I couldn't even import the shared module correctly. Slowly but surely, though, consistent research of the available documentation, along with constant insights from Timid Robot, brought me to the point where I finally understood everything I was working with. Two things helped me especially, and I would like to share them here in case they help any software developer reading this post:

1. **Reading Technical Whitepapers:** As I mentioned earlier, I studied a technical whitepaper by AWS to help me design my DFD. From this, I realized that consulting relevant whitepapers by industry giants, to see how they approach similar tasks, goes a long way toward understanding best practices for implementing a system. Here is another resource by Meta that I referenced, called **[Composable Data Management at Meta][meta-whitepaper]** (I mainly used the _Building on Similarities_ section to study the logical components of data systems).

2. **Referencing the Most Recent Quantifying Codebase:** The pre-automation code already implemented by previous developers for _Quantifying the Commons_ was the closest thing to my own project that I could reference. Although not all of it was relevant to the Automating project, many aspects of the codebase were helpful to take inspiration from, especially when online research led to a dead end.

As for the license data retrieval process using the Google Custom Search API key, I hesitated a little before running everything for the first time. Since I had never worked with confidential information or such large data inputs before, I was scared of messing something up. Sure enough, the first time I ran everything with the language and country parameters, it caused a crash: the API's query-per-day limit was exceeded in a single script run. As I continued to update the script, I learned a very useful trick for handling big data: to avoid hitting the query limit while testing, you can replace the actual API calls with logging statements that show the parameters being used. This helps you understand the outputs without consuming API quota, and it makes bugs easier to identify. <br>
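
A minimal sketch of that trick (the function and parameter names are placeholders of my own):

```python
import itertools
import logging

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)


def fetch_counts(countries, languages, dry_run=True):
    """Fetch (or, while testing, only log) one query per parameter combination."""
    for country, language in itertools.product(countries, languages):
        if dry_run:
            # No quota consumed; we still see exactly what would be queried.
            LOGGER.info("Would query CSE with cr=%s, lr=%s", country, language)
            continue
        # The real API call (as sketched earlier) would go here.


# Dry run over a few country/language restrict values.
fetch_counts(["countryUS", "countryFR"], ["lang_en", "lang_fr"])
```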

Upon successful completion of basic data retrieval and state management in Phase 1, I felt much more confident about the trajectory of this project, and implementing future steps and fixing new bugs became progressively easier.
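
By state management I mean recording which queries have already completed, so an interrupted run can resume without re-spending quota; here is a rough sketch (the file name and schema are assumptions of my own, not the project's):

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical name


def load_state():
    """Return saved progress, or a fresh state if none exists yet."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as file_obj:
            return json.load(file_obj)
    return {"completed": []}


def save_state(state):
    """Persist progress after each successful query."""
    with open(STATE_FILE, "w") as file_obj:
        json.dump(state, file_obj, indent=2)
```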

**`2-processed`**

coming soon!

**`3-reports`**

coming soon!

## Mid-Program Conclusions and Upcoming Tasks
***

Coming soon!

## Additional Readings
***

- Automating Quantifying the Commons: Part 2 (stay posted for the second part of this series, coming soon!)
- [Data Science Discovery: Quantifying the Commons][quantifying] | Author: Dun-Ming Huang (Brandon Huang) | Dec. 2022

[quantifying]: https://opensource.creativecommons.org/blog/entries/2022-12-07-berkeley-quantifying/
[logging]: https://github.com/creativecommons/quantifying/pull/97
[AWS-whitepaper]: https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/distributed-data-management.html
[meta-whitepaper]: https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
[timid-robot]: https://opensource.creativecommons.org/blog/authors/TimidRobot/
