Quantifying the Commons, an initiative emerging from the UC Berkeley Data Science Discovery Program,
aims to quantify the frequency of open domain and CC license usage for future accessibility and analysis purposes
(Refer to the initial CC article for Quantifying **[here!][quantifying]**).
To date, the scope of previous project advancements had not included automation or combined reporting,
both of which are necessary to minimize the potential for human error and allow for more timely updates,
especially for a system that engages with substantial streams of data. <br>

As a selected developer for Google Summer of Code 2024,
my goal for this summer is to develop automation software for data gathering, flow, and report generation,
ensuring that reports are never more than 3 months out of date. This blog post serves as a technical journal
of my endeavor up to the midterm evaluation period. Part 2 will be posted after successful completion of the
entire summer program.
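To make that freshness goal concrete, here is a small, purely illustrative sketch of the kind of staleness check the automation needs to perform. The 90-day threshold mirrors the three-month goal above, but the function name and report filename are my own assumptions, not the project's actual code.

```python
from datetime import datetime, timedelta
from pathlib import Path


def report_is_stale(report_path: Path, max_age_days: int = 90) -> bool:
    """Return True if the report file is missing or older than the allowed age."""
    if not report_path.exists():
        return True
    last_modified = datetime.fromtimestamp(report_path.stat().st_mtime)
    return datetime.now() - last_modified > timedelta(days=max_age_days)


# Hypothetical usage: regenerate the combined report once it falls out of date.
if report_is_stale(Path("combined_report.md")):
    print("Report is more than 3 months old; time to regenerate it.")
```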
## Pre-Program Knowledge and Associated Challenges
***
As an undergraduate CS student, I had not yet had any experience working with codebases
as intricate as this one; the most complex software I had worked on prior to this undertaking
was probably a medium-complexity full-stack application. In my pre-GSoC contributions to Quantifying, I did successfully
implement logging across all the Python files (**[PR #97][logging]**), but admittedly, I was not familiar with many of the other modules
used in these files. This caused minor inconveniences to my development process from the very beginning.
For example, inexperience with operating system (OS) modules left me confused about how to
join new directory paths. In addition, I had never worked with such large streams of data before, so it was initially a
challenge to map out pseudocode for handling big data effectively. The next section elaborates on my development process and how I resolved these setbacks.
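For readers who, like past me, have not set up project-wide logging before, here is a minimal sketch of the kind of per-script logger that **[PR #97][logging]** introduced. The helper name and log format are illustrative; the actual code in the repository may differ.

```python
import logging


def setup_logger(name: str) -> logging.Logger:
    # Illustrative helper: one console logger per script, configured once.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger


LOGGER = setup_logger(__name__)
LOGGER.info("Script started")
```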
## Development Process (Midterm)
***
### I. Data Flow Diagram Construction
Before starting the code implementation, I decided to develop a **Data Flow Diagram (DFD)**, which provides a visual
representation of how data flows through a software system. While researching effective DFDs for inspiration, I came across
a **[technical whitepaper by Amazon Web Services (AWS)][AWS-whitepaper]** on Distributed Data Management, and I found it very helpful in drafting
my own DFD. As I was still relatively new to the codebase, it helped me simplify
the current system into manageable components and better understand how to implement the rest of the project.

[insert DFD here with explanation of directory setup]
### II. Identifying the First Data Source to Target
The main approach for implementing this project was to target one specific data source and complete its data extraction, analysis,
and report generation process before adding more data sources to the codebase. There were two possible strategies to consider:
(1) work on the easier data sources first, or (2) begin with the highest-complexity data source and then add the easier
ones later. Both approaches have notable pros and cons; however, I decided to adopt the second strategy and
start with the most complex data source first. Although this would take slightly longer to implement, it would simplify the process
later on. As a result, I began implementing the software for the **Google Custom Search**
data source, which has the greatest data retrieval potential of all the sources.
### III. Directory Setup + Code Implementation
Based on the DFD, **[Timid Robot][timid-robot]** (my mentor) and I identified the directory structure as follows: within our `scripts` directory, we would have
separate sub-directories reflecting the phases of data flow: `1-fetched`, `2-processed`, and `3-reports`. The code would then be
set up to interact in chronological order. Additionally, a shared directory was implemented to optimize similar functions and paths. <br>
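As a rough sketch of how such a shared module can centralize the phase paths (the module and constant names here are placeholders, not necessarily what the Quantifying repository uses):

```python
# Hypothetical shared paths module (e.g. scripts/shared.py); the real shared
# directory in the Quantifying codebase may be organized differently.
import os

# Resolve the repository root relative to this file so scripts can be run
# from any working directory.
REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# One sub-directory per phase of the data flow.
PATH_FETCHED = os.path.join(REPO_ROOT, "scripts", "1-fetched")
PATH_PROCESSED = os.path.join(REPO_ROOT, "scripts", "2-processed")
PATH_REPORTS = os.path.join(REPO_ROOT, "scripts", "3-reports")


def ensure_phase_directories() -> None:
    """Create the phase directories if they do not already exist."""
    for path in (PATH_FETCHED, PATH_PROCESSED, PATH_REPORTS):
        os.makedirs(path, exist_ok=True)
```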
**`1-fetched`**

As I mentioned in the previous sections, starting to code the initial file was a challenge, as I had to learn how to use
new technologies and libraries on the go. As a matter of fact, my struggles began when I couldn't even import the
shared module correctly. However, slowly but surely, I found that consistent research of available documentation, as well
as constant insights from Timid Robot, meant that I finally understood everything I was working with. There were
two specific things that helped me especially, and I would like to share them here in case they help any software
developer reading this post:
1. **Reading Technical Whitepapers:** As I mentioned earlier, I studied a technical whitepaper by AWS to help me design my DFD.
From this, I realized that consulting relevant whitepapers by industry giants to see how they approach similar tasks
helped me a lot in understanding best practices for implementing the system. Here is another resource by Meta that I referenced,
called **[Composable Data Management at Meta][meta-whitepaper]** (I mainly used the _Building on Similarities_ section
to study the logical components of data systems).

2. **Referencing the Most Recent Quantifying Codebase:** The pre-automation code that was already implemented by previous developers
for _Quantifying the Commons_
was the closest thing to my own project that I could reference. Although not all of the code was relevant to the Automating project,
there were many aspects of the codebase I found very helpful to take inspiration from, especially when online research led to a
dead end.
As for the license data retrieval process using the Google Custom Search API key,
I did have a little hesitation running everything for the first time.
Since I had never worked with confidential information or such large data inputs before,
I was scared of messing something up. Sure enough, the first time I ran everything with the language and country parameters,
it did cause a crash, since the API query-per-day limit was exceeded by a single script run. As I continued to update
the script, I learned a very useful trick when it comes to handling big data:
to avoid hitting the query limit while testing, you can replace the actual API calls
with logging statements that show the parameters being used. This helps you
understand the outputs without actually consuming API quota, and it can help you identify bugs more easily. <br>
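As a rough illustration of that trick, here is a hedged sketch of a fetch loop with a dry-run switch; the function names and query parameters are examples of my own, not the actual Quantifying script.

```python
import logging

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)


def send_request(params):
    # Placeholder for the real Google Custom Search call; left unimplemented
    # here so this sketch can never consume API quota by accident.
    raise NotImplementedError("plug the real API call in here")


def fetch_license_counts(queries, dry_run=True):
    """Run (or merely log) a batch of search queries.

    With dry_run=True, each query is only logged, so no quota is used and the
    parameter combinations can still be inspected for bugs.
    """
    results = []
    for params in queries:
        if dry_run:
            LOGGER.info("Would query Google Custom Search with params: %s", params)
            continue
        results.append(send_request(params))
    return results


# Example: inspect the language/country parameter combinations without
# touching the API.
fetch_license_counts(
    [{"q": '"creativecommons.org/licenses/by/4.0"', "lr": "lang_en", "cr": "countryUS"}],
    dry_run=True,
)
```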
Upon successful completion of basic data retrieval and state management in Phase 1,
I felt much more confident about the trajectory of this project, and implementing
future steps and fixing new bugs became progressively easier.
**`2-processed`**

Coming soon!

**`3-reports`**

Coming soon!
## Mid-Program Conclusions and Upcoming Tasks
***

Coming soon!

## Additional Readings
***
- Automating Quantifying the Commons: Part 2 (stay posted for the second part of this series, coming soon!)