Commit 5c1810f

Merge pull request creativecommons#789 from naishasinha/main

Google Summer of Code '24 Mid-Program Blog Post (Automating Quantifying the Commons)

2 parents 56039bd + 5ba9b49

7 files changed: +205 −0 lines changed
@@ -0,0 +1,14 @@
username: naishasinha
---
name: Naisha Sinha
---
md5_hashed_email: c6f768d61d96f508d9523bf28664cb64
---
about:

Naisha worked on [Automating Quantifying the Commons][repository] as a developer for [Google
Summer of Code (GSoC) 2024](/programs/history/). <br>
GitHub: [`@naishasinha`][github]

[repository]: https://github.com/creativecommons/quantifying
[github]: https://github.com/naishasinha
@@ -0,0 +1 @@
name: big-data
@@ -0,0 +1 @@
name: cc-software
@@ -0,0 +1 @@
name: gsoc-2024
@@ -0,0 +1,188 @@
title: Automating Quantifying the Commons: Part 1
---
categories:
gsoc-2024
gsoc
big-data
quantifying-the-commons
cc-software
open-source
community
---
author: naishasinha
---
pub_date: 2024-07-10
---
body:

![GSoC 2024](Automating - GSoC Logo.png)

## Introduction
***

Quantifying the Commons, an initiative that emerged from the UC Berkeley Data Science Discovery Program,
aims to quantify the frequency of public domain and CC license usage for future accessibility and analysis purposes
(refer to the initial CC article for Quantifying **[here!][quantifying]**).
To date, the scope of previous project advancements has not included automation or combined reporting,
both of which are necessary to minimize the potential for human error and allow for more timely updates,
especially for a system that engages with substantial streams of data. <br>

As a selected developer for Google Summer of Code 2024,
my goal this summer is to develop automation software for data gathering, flow, and report generation,
ensuring that reports are never more than three months out of date. This blog post serves as a technical journal
of my endeavor through the midterm evaluation period. Part 2 will be posted after successful completion of the
entire summer program.

## Pre-Program Knowledge and Associated Challenges
***

As an undergraduate CS student, I had not yet worked with a codebase as intricate as this one;
the most complex software I had worked on prior to this undertaking was a medium-complexity full-stack application.
In my pre-GSoC contributions to Quantifying, I did successfully implement logging across all the Python files
(**[PR #97][logging]**), but admittedly, I was not familiar with many of the other modules used in these files.
This caused minor inconveniences in my development process from the very beginning. For example, inexperience
with the operating system (`os`) module left me confused about how to join directory paths (a minimal sketch of
the fix appears below). In addition, I had never worked with such large streams of data before, so it was initially
a challenge to map out pseudocode for handling big data effectively. The next section elaborates on my development
process and how I resolved these setbacks.
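
For anyone who hits the same wall, here is a minimal standard-library sketch of OS-agnostic path joining (the
directory names are purely illustrative):

```python
import os
from pathlib import Path

# os.path.join builds a path using the correct separator for the host OS
data_dir = os.path.join("data", "2024Q3")
os.makedirs(data_dir, exist_ok=True)  # create the directory if it is missing

# pathlib expresses the same idea with the / operator
reports_dir = Path("data") / "2024Q3" / "reports"
reports_dir.mkdir(parents=True, exist_ok=True)
```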

## Development Process (Midterm)
***

### I. Data Flow Diagram Construction
Before starting the code implementation, I decided to develop a **Data Flow Diagram (DFD)**, which provides a visual
representation of how data flows through a software system. While researching effective DFDs for inspiration, I came across
a **[technical whitepaper by Amazon Web Services (AWS)][AWS-whitepaper]** on distributed data management, and I found it very helpful in drafting
my own DFD. As I was still relatively new to the codebase, the diagram helped me simplify
the current system into manageable components and better understand how to implement the rest of the project.

![DFD](DFD.png)
This was the initial layout for the data directory flow; however, the more I delved into the development process,
the more the steps changed. I will present the final directory flow in Part 2 at the end of the program.

### II. Identifying the First Data Source to Target
The main approach for implementing this project was to target one specific data source and complete its data extraction, analysis,
and report generation process before adding more data sources to the codebase. There were two possible strategies to consider:
(1) work on the easier data sources first, or (2) begin with the most complex data source and then add the easier
ones later. Both approaches have notable pros and cons; however, I decided to adopt the second strategy and
start with the most complex data source. Although this would take slightly longer to implement, it would simplify the process
later on. As a result, I began implementing the software for the **Google Custom Search**
data source, which has the greatest data-retrieval potential of all the sources.

### III. Directory Setup + Code Implementation
Based on the DFD, **[Timid Robot][timid-robot]** (my mentor) and I identified the directory process as follows: within our `scripts` directory, we would have
separate sub-directories reflecting the phases of data flow: `1-fetch`, `2-process`, and `3-report`. The code would then be
set up to interact between systems in chronological order. Additionally, a `shared` module was implemented to optimize similar functions and paths. <br>
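
Because directory names such as `1-fetch` are not valid Python package names, scripts inside the phase
directories cannot use an ordinary package import to reach shared code. The sketch below shows one common
workaround; the layout and the `shared` module name are assumptions for illustration, not the exact
Quantifying code:

```python
import os
import sys

# Make the parent "scripts" directory importable so that a script living in
# scripts/1-fetch/ can import a sibling module such as scripts/shared.py
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import shared  # noqa: E402  (imported after the sys.path tweak)
```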
**`1-fetch`**

As I mentioned in the previous sections, starting to code the initial file was a challenge, as I had to learn how to use
new technologies and libraries on the go. As a matter of fact, my struggles began when I couldn't even import the
shared module correctly. Slowly but surely, though, consistent research of the available documentation, along with
constant insights from Timid Robot, brought me to the point where I finally understood everything I was working with.
A few specific practices helped me especially, and I would like to share them here in case they help any software
developer reading this post:

1. **Reading Technical Whitepapers:** As I mentioned earlier, I studied a technical whitepaper by AWS to help me design my DFD.
From this, I realized that consulting relevant whitepapers by industry giants to see how they approach similar tasks
helped me a lot in understanding best practices for implementing the system. Here is another resource by Meta that I referenced,
called **[Composable Data Management at Meta][meta-whitepaper]** (I mainly used the _Building on Similarities_ section
to study the logical components of data systems).

2. **Referencing the Most Recent Quantifying Codebase:** The pre-automation code that was already implemented by previous developers
for _Quantifying the Commons_
was the closest thing to my own project that I could reference. Although not all of the code was relevant to the Automating project,
there were many aspects of the codebase I found very helpful to take inspiration from, especially when online research led to a
dead end.

3. **Writing Documentation for the Code:** As a part of this project, I assigned myself the task of developing documentation for
the Automating Quantifying the Commons project (**[it can be accessed here!][documentation]**). Heavily inspired by the "rubber duck debugging"
method, where explaining the code or problem step-by-step to someone or something makes the solution present itself, I decided to create documentation
for future developers to reference, in which I break down the code step-by-step to explain each module and function. I found that in
doing this, I was able to understand my own code better.

As for the license data retrieval process using the Google Custom Search API key,
I did have a little hesitation running everything for the first time.
Since I had never worked with confidential information or such large data inputs before,
I was scared of messing something up. Sure enough, the first time I ran everything with the language and country parameters,
it caused a crash: the API query-per-day limit was exceeded by a single script run. As I continued to update
the script, I learned a very useful trick for handling big data:
to avoid hitting the query limit while testing, you can replace the actual API calls
with logging statements that show the parameters being used. This helps you
understand the outputs without actually consuming API quota, and it can make bugs easier to identify. <br>
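
As a concrete illustration, here is a minimal sketch of that trick. The function, flag, and values are
hypothetical, although `cr` (country restrict) and `lr` (language restrict) follow the public Custom Search
API parameter names:

```python
import logging

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

DRY_RUN = True  # flip to False only when you actually want to spend quota


def fetch_license_counts(query, country, language):
    """Query Google Custom Search, or only log the parameters in dry-run mode."""
    params = {"q": query, "cr": country, "lr": language}
    if DRY_RUN:
        # Log the exact parameters instead of consuming the daily API quota
        LOGGER.info("Would call Google Custom Search with params=%s", params)
        return None
    # The real HTTP request would go here (e.g. via the requests library)
    raise NotImplementedError


fetch_license_counts("creativecommons.org/licenses/by/4.0", "countryUS", "lang_en")
```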
A notable aspect of this software is the directory organization. Throughout the process, I designed it so that the datasets are automatically stored within their
respective quarter's directory rather than all together. This ensures efficient organization, allowing users to easily access the data in the future,
especially as the number of datasets multiplies.
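
A minimal sketch of the idea, assuming quarter names like `2024Q3` (the `data` base directory is illustrative):

```python
import datetime
import os


def quarter_directory(base="data"):
    """Return (and create, if needed) the directory for the current quarter."""
    today = datetime.date.today()
    quarter = (today.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    path = os.path.join(base, f"{today.year}Q{quarter}")
    os.makedirs(path, exist_ok=True)
    return path  # e.g. data/2024Q3
```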

Upon successful completion of basic data retrieval and state management in Phase 1,
I felt much more confident about the trajectory of this project, and implementing
future steps and fixing new bugs became progressively easier.

**`2-process`**

The long-term goal of the Quantifying project is to have comprehensive datasets for each quarter, encompassing
license data that scales up to millions and even billions of records. For the `2-process` phase specifically, the aim is
to analyze and compare data between quarters for display in the reports. However, given our Google Custom Search
API constraints as well as the time period we're working with for GSoC (most of it falls within
2024Q3), it is not possible to have a fully completed Phase 2. To deploy as complete an automation pipeline as possible,
I have set up basic pseudocode that can be implemented
and built upon by future development efforts as more data is collected in the upcoming quarters and years.
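
To give a flavor of what that pseudocode might turn into, here is a hypothetical sketch of a quarter-over-quarter
comparison; the CSV layout and column names are assumptions for illustration:

```python
import csv


def load_counts(csv_path):
    """Read one quarter's dataset into a {license: count} mapping."""
    with open(csv_path, newline="") as file:
        return {row["LICENSE"]: int(row["COUNT"]) for row in csv.DictReader(file)}


def compare_quarters(previous_csv, current_csv):
    """Yield (license, change in count) between two quarters' datasets."""
    previous = load_counts(previous_csv)
    current = load_counts(current_csv)
    for license_type in sorted(set(previous) | set(current)):
        yield license_type, current.get(license_type, 0) - previous.get(license_type, 0)
```
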
**`3-report`**

As mentioned earlier, the Google Custom Search API constraints made it difficult to create a comprehensive and detailed dataset, so I plan to
initiate the development of a more fleshed-out Google Custom Search process post-GSoC, when more data can be accumulated (discussed further in the next section).
As of now, there are three main completed report visualization schemes: **(1)** Reports by Country, **(2)** Reports by License Type,
and **(3)** Reports by Language. Although the visualizations are basic in design, I made sure to incorporate accessibility into the
visualizations for a better user experience. This included adding elements like labels on top of the bars with exact counts for better
readability and understanding of the reports. In addition, I included three key features in the reports codebase to cater to the various possible
needs of report users (a small sketch of these ideas follows the list below).

1. **Key Feature #1:** I implemented command-line arguments so that users can choose any quarter to visualize, as I believe this would be useful
for anyone in need of individual reports from previous quarters, not just reports from the current one.
2. **Key Feature #2:** The program stores reports in the data reports directory specific to each quarter for optimal organization (similar to the
dataset organization in Phase 1). This way, reports from one quarter will not be mixed up with reports from another, making them easier
for users to navigate and use.
3. **Key Feature #3:** The program automatically generates and/or updates an individual `README` file for each quarter's reports. This `README` organizes
all generated report images within that quarter onto one page, alongside basic report descriptions.
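
Here is a hedged sketch of how Key Feature #1 and the labeled bars might fit together; the argument name, sample
data, file paths, and plotting details are illustrative rather than the exact Quantifying code:

```python
import argparse

import matplotlib.pyplot as plt


def plot_report(labels, counts, title, output_path):
    """Draw a bar chart with the exact count printed above each bar."""
    figure, axes = plt.subplots()
    bars = axes.bar(labels, counts)
    axes.bar_label(bars)  # accessibility: exact numbers, not just bar heights
    axes.set_title(title)
    figure.savefig(output_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate quarterly reports")
    parser.add_argument("--quarter", default="2024Q3", help="quarter to visualize")
    args = parser.parse_args()
    # Illustrative data; the real script would load the chosen quarter's dataset
    plot_report(
        ["BY", "BY-SA", "BY-NC"],
        [120, 80, 45],
        f"Reports by License Type ({args.quarter})",
        f"report_{args.quarter}.png",
    )
```
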
## Mid-Program Conclusions and Upcoming Tasks
***

Overall, my understanding of and skill set for this project increased tenfold after completing all the phases for Google Custom Search.
Going into the second half of the Google Summer of Code program, I expect to complete the remaining data sources at a faster, more efficient rate,
given their smaller license data sizes and my heightened expertise. In fact, as of the midterm evaluation point, I have completed
a relatively detailed Phase 1 for Flickr, which involves only 10 licenses. My biggest takeaway from the first half of the coding period is that rather than developing
a basic querying process and adding on later, it is easier to start with a complex, detailed version before moving on to Phases 2 and 3. Additionally, using
the `shared` module within the scripts can be very beneficial in simplifying the coding process.

In the second half of the GSoC program, I plan to keep both of these takeaways in mind when developing scripts for the rest of the data sources. On a formal level,
the final goal for the end of GSoC 2024 is to have a working codebase for Phases 1, 2, and 3 of all data sources,
including a completed automation setup for these scripts. Due to the effectiveness of the current directory organization and report generation features,
I will be standardizing them across all data sources.

Finally, after the software is complete to the extent possible during the GSoC period,
I plan to raise issues in the repository for each of the next steps that
could be taken post-GSoC by open-source developers toward a more comprehensive software system.

So far, my journey at Creative Commons has significantly enhanced my skill set as a software developer, and I have never felt more motivated to take on challenging tasks.
I'm looking forward to further growth and accomplishments in the second half of the program.
I'll be back with Part 2 at the end of the summer with an updated, completed project!

## Additional Readings
***

- Automating Quantifying the Commons: Part 2 (stay posted for the second part of this series, coming soon!)
- [Data Science Discovery: Quantifying the Commons][quantifying] | Author: Dun-Ming Huang (Brandon Huang) | Dec. 2022

[quantifying]: https://opensource.creativecommons.org/blog/entries/2022-12-07-berkeley-quantifying/
[logging]: https://github.com/creativecommons/quantifying/pull/97
[AWS-whitepaper]: https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/distributed-data-management.html
[meta-whitepaper]: https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
[timid-robot]: https://opensource.creativecommons.org/blog/authors/TimidRobot/
[documentation]: https://unmarred-gym-686.notion.site/Automating-Quantifying-the-Commons-Documentation-441056ae02364d8a9a51d5e820401db5?pvs=4
