title: Automating Quantifying the Commons: Part 1
---
categories:
gsoc-2024
gsoc
big-data
quantifying-the-commons
cc-software
open-source
community
---
author: naishasinha
---
pub_date: 2024-07-10
---
body:

## Introduction
***

Quantifying the Commons, an initiative emerging from the UC Berkeley Data Science Discovery Program,
aims to quantify the frequency of public domain and CC license usage for future accessibility and analysis purposes
(refer to the initial CC article for Quantifying **[here!][quantifying]**).
To date, the scope of the previous project advancements has not included automation or combined reporting,
both of which are necessary to minimize the potential for human error and allow for more timely updates,
especially for a system that engages with substantial streams of data. <br>

As a selected developer for Google Summer of Code 2024,
my goal this summer is to develop automation software for data gathering, flow, and report generation,
ensuring that reports are never more than 3 months out-of-date. This blog post serves as a technical journal
for my endeavor through the midterm evaluation period. Part 2 will be posted after successful completion of the
entire summer program.

## Pre-Program Knowledge and Associated Challenges
***

As an undergraduate CS student, I had not yet worked with a codebase
as intricate as this one; the most complex software I had worked on prior to this undertaking
was a medium-complexity full-stack application. In my pre-GSoC contributions to Quantifying, I did successfully
implement logging across all the Python files (**[PR #97][logging]**), but admittedly, I was not familiar with many of the other modules
used in these files. This caused minor inconveniences in my development process from the very beginning.
For example, inexperience with Python's operating system (`os`) module left me confused about how I was supposed to
join paths when creating new directories. In addition, I had never worked with such large streams of data before, so it was initially a
challenge to map out pseudocode for handling big data effectively. The next section elaborates on my development process and how I resolved these setbacks.
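
As a preview of one such resolution, here is a minimal sketch of the `os` pattern that eventually clicked for me (the paths are illustrative, not the project's actual ones):

```python
import os

# Build a nested path portably, then create it only if it is missing.
data_dir = os.path.join("data", "2024Q3", "1-fetch")
os.makedirs(data_dir, exist_ok=True)

# Joining a filename onto the directory gives the full dataset path.
file_path = os.path.join(data_dir, "dataset.csv")
```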

## Development Process (Midterm)
***

### I. Data Flow Diagram Construction
Before starting the code implementation, I decided to develop a **Data Flow Diagram (DFD)**, which provides a visual
representation of how data flows through a software system. While researching effective DFDs for inspiration, I came across
a **[technical whitepaper by Amazon Web Services (AWS)][AWS-whitepaper]** on Distributed Data Management, and I found it very helpful in drafting
my own DFD. As I was still relatively new to the codebase, the exercise helped me simplify
the current system into manageable components and better understand how to implement the rest of the project.

This was the initial layout for the data directory flow; however, the more I delved into the development process,
the more the steps changed. I will present the final directory flow in Part 2 at the end of the program.

### II. Identifying the First Data Source to Target
The main approach for implementing this project was to target one specific data source and complete its data extraction, analysis,
and report generation process before adding more data sources to the codebase. There were two possible strategies to consider:
(1) work on the easier data sources first, or (2) begin with the highest-complexity data source and then add the easier
ones later. Both approaches have notable pros and cons; however, I decided to adopt the second strategy of
starting with the most complex data source first. Although this would take slightly longer to implement, it would simplify the process
later on. As a result, I began implementing the software for the **Google Custom Search**
data source, which has the largest data retrieval potential of all the sources.

### III. Directory Setup + Code Implementation
Based on the DFD, **[Timid Robot][timid-robot]** (my mentor) and I mapped out the directory structure as follows: within our `scripts` directory, we would have
separate sub-directories reflecting the phases of data flow: `1-fetch`, `2-process`, and `3-report`. The code would then be
set up to interact between systems in chronological order. Additionally, a shared directory was implemented to consolidate common functions and paths. <br>
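
In simplified form, the layout looks roughly like this (a sketch, not the repository's exact file and directory names):

```
quantifying/
├── data/           # generated datasets and reports, organized by quarter
└── scripts/
    ├── shared/     # common paths, logging, and helper functions
    ├── 1-fetch/    # query each data source and save raw license counts
    ├── 2-process/  # analyze and compare data between quarters
    └── 3-report/   # generate visualizations and per-quarter READMEs
```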

**`1-fetch`**

As I mentioned in the previous sections, starting to code the initial file was a challenge, as I had to learn
new technologies and libraries on the go. As a matter of fact, my struggles began when I couldn't even import the
shared module correctly. Slowly but surely, however, consistent research of the available documentation, along
with constant insights from Timid Robot, brought me to the point where I finally understood everything I was working with. A few
specific things helped me especially, and I would like to share them here in case they help any software
developer reading this post:

1. **Reading Technical Whitepapers:** As I mentioned earlier, I studied a technical whitepaper by AWS to help me design my DFD.
From this, I realized that consulting relevant whitepapers by industry giants to see how they approach similar tasks
helped me a lot in understanding best practices for implementing the system. Here is another resource by Meta that I referenced,
called **[Composable Data Management at Meta][meta-whitepaper]** (I mainly used the _Building on Similarities_ section
to study the logical components of data systems).

2. **Referencing the Most Recent Quantifying Codebase:** The pre-automation code already implemented by previous developers
for _Quantifying the Commons_
was the closest thing to my own project that I could reference. Although not all of the code was relevant to the Automating project,
there were many aspects of the codebase I found very helpful to take inspiration from, especially when online research led to a
dead end.

3. **Writing Documentation for the Code:** As a part of this project, I assigned myself the task of developing documentation for
the Automating Quantifying the Commons project (**[it can be accessed here!][documentation]**). Heavily inspired by the "rubber duck debugging"
method, in which explaining the code or problem step-by-step to someone or something makes the solution present itself, I decided to create documentation
for future developers to reference, in which I break down the code step-by-step to explain each module or function. I found that in
doing this, I was able to understand my own code better.

As for the license data retrieval process using the Google Custom Search API key,
I did have a little hesitation running everything for the first time.
Since I had never worked with confidential information or such large data inputs before,
I was scared of messing something up. Sure enough, the first time I ran everything with the language and country parameters,
it caused a crash, since the API's query-per-day limit was exceeded in a single script run. As I continued to update
the script, I learned a very useful trick for handling big data:
to avoid hitting the query limit while testing, you can replace the actual API calls
with logging statements that show the parameters being used. This helps you
understand the outputs without actually consuming API quota, and it can help you identify bugs more easily. <br>
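
A minimal sketch of that trick, with names of my own invention (`service` stands in for a `googleapiclient` Custom Search handle, and the real script is structured differently):

```python
import logging

logger = logging.getLogger(__name__)
DRY_RUN = True  # flip to False for a real run that consumes quota

def fetch_license_count(service, cx, query, country, language):
    params = {"q": query, "cx": cx, "cr": country, "lr": language}
    if DRY_RUN:
        # Log the parameters instead of spending one of the day's
        # limited Google Custom Search queries.
        logger.info("Would call Custom Search with %s", params)
        return None
    return service.cse().list(**params).execute()
```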

A notable aspect of this software is the directory organization. Throughout the process, I designed it so that the datasets are automatically stored within their
respective quarter's directories rather than being stored all together. This keeps the data organized so that users can easily access it in the future,
especially as the number of datasets multiplies.
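
The quarter directories follow a year-and-quarter naming scheme; a hypothetical helper for deriving the label from a date could look like this (the function name and layout are my own assumptions, not the project's actual code):

```python
import datetime
import os

def quarter_label(date: datetime.date) -> str:
    """Return a label like '2024Q3' for the quarter containing date."""
    return f"{date.year}Q{(date.month - 1) // 3 + 1}"

# Each run saves its datasets under the current quarter's directory.
quarter_dir = os.path.join("data", quarter_label(datetime.date.today()))
os.makedirs(quarter_dir, exist_ok=True)
```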

Upon successful completion of basic data retrieval and state management in Phase 1,
I felt much more confident about the trajectory of this project, and implementing
future steps and fixing new bugs became progressively easier.

**`2-process`**

The long-term goal of the Quantifying project is to have comprehensive datasets for each quarter, encompassing
license data that scales up to millions and even billions of records. For the `2-process` phase specifically, the aim is
to analyze and compare data between quarters so that the results can be displayed in the reports. However, given our Google Custom Search
API constraints as well as the time period we're working with for GSoC (most of which falls within
2024Q3), it is not possible to have a fully completed Phase 2. Still, in order to deploy as complete an automation system as possible,
I have set up basic pseudocode that can be implemented
and built upon by future development efforts as more data is collected in the upcoming quarters and years.
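
To give a sense of the intended direction, a quarter-over-quarter comparison might eventually look like this sketch (the file paths and column names are assumptions about the dataset format, not the project's actual schema):

```python
import pandas as pd

# Load two quarters' fetched license counts and compute the change.
previous = pd.read_csv("data/2024Q2/gcs_fetched.csv")
current = pd.read_csv("data/2024Q3/gcs_fetched.csv")

merged = previous.merge(current, on="license", suffixes=("_prev", "_curr"))
merged["change"] = merged["count_curr"] - merged["count_prev"]
print(merged[["license", "count_prev", "count_curr", "change"]])
```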

**`3-report`**

As mentioned earlier, the Google Custom Search API constraints made it difficult to create a comprehensive and detailed dataset, so I plan to
initiate the development of a more fleshed-out Google Custom Search integration post-GSoC, when more data can be accumulated (discussed further in the next section).
As of now, there are three main completed report visualization schemes: **(1)** Reports by Country, **(2)** Reports by License Type,
and **(3)** Reports by Language. Although the visualizations are basic in design, I made sure to incorporate accessibility into them
for a better user experience. This included adding elements like labels on top of the bars with specific number counts for better
readability and understanding of the reports. In addition, I included three key features in the reports codebase to cater to the various possible
needs of report users.

1. **Key Feature #1:** I implemented command-line arguments with which users can choose any quarter to visualize, as I believe this would be useful
for anyone in need of individual reports from previous quarters, not just reports from this quarter (see the sketch after this list).
2. **Key Feature #2:** The program stores reports in the reports directory specific to each quarter for optimal organization (similar to the
dataset organization in Phase 1). This way,
reports from one quarter will not be mixed up with reports from another quarter, making them easier for users to navigate and use.
3. **Key Feature #3:** The program automatically generates and/or updates an individual `README` file for each quarter's reports. This `README` organizes
all generated report images within that quarter into one page, alongside basic report descriptions.
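
Here is a small sketch combining the quarter argument with labeled bars (the file paths and column names are illustrative assumptions, not the actual reporting script):

```python
import argparse

import matplotlib.pyplot as plt
import pandas as pd

# Key Feature #1: let the user pick which quarter to visualize.
parser = argparse.ArgumentParser(description="Generate a quarterly report")
parser.add_argument("--quarter", default="2024Q3", help="quarter to visualize")
args = parser.parse_args()

# Accessibility: draw the exact count above each bar for readability.
data = pd.read_csv(f"data/{args.quarter}/gcs_fetched.csv")
ax = data.plot.bar(x="license", y="count", legend=False)
ax.bar_label(ax.containers[0])
plt.tight_layout()
plt.savefig(f"data/{args.quarter}/reports/by_license.png")
```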

## Mid-Program Conclusions and Upcoming Tasks
***

Overall, my understanding of and skill set for this project increased tenfold after completing all the phases for Google Custom Search.
Going into the second half of the Google Summer of Code program, I expect to complete the remaining data sources at a faster, more efficient rate,
given their license data sizes and my heightened expertise. In fact, as of now (the midterm evaluation point), I have completed
a relatively detailed Phase 1 for Flickr, which involves only 10 licenses. My biggest takeaway from the first half of the coding period is that rather than developing
a basic querying process and adding on later, it is easier to start off with a complex and detailed version before moving on to Phases 2 and 3. Additionally, using
the `shared` module within the scripts can be very beneficial for simplifying the coding process.
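
As an illustration of that second takeaway, a shared helper can centralize the path logic so that every phase's script builds directories the same way (a hypothetical sketch, not the actual contents of the `shared` module):

```python
# shared/paths.py (hypothetical)
import os

def dataset_path(quarter: str, phase: str, filename: str) -> str:
    """Build (and ensure) the path for a file in a quarter's phase directory."""
    directory = os.path.join("data", quarter, phase)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

# A fetch script can then handle paths in one line, for example:
# csv_path = dataset_path("2024Q3", "1-fetch", "flickr_fetched.csv")
```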

In the second half of the GSoC program, I plan to keep both of these takeaways in mind while developing scripts for the rest of the data sources. On a formal level,
the final goal for the end of GSoC 2024 is to have a working codebase for Phases 1, 2, and 3 of all data sources,
including a completed automation setup for these scripts. Due to the effectiveness of the current directory organization and report generation features,
I will be standardizing them across all data sources.

Finally, after the software is complete to the extent possible during the GSoC period,
I plan to raise issues in the repository for all the next steps that
could be taken post-GSoC by open-source developers toward a more comprehensive software system.

So far, my journey at Creative Commons has significantly enhanced my skill set as a software developer, and I have never felt more motivated to take on challenging tasks.
I'm looking forward to more growth and accomplishments in the second half of the program.
I'll be back with Part 2 at the end of the summer with an updated, completed project!

## Additional Readings
***

- Automating Quantifying the Commons: Part 2 (stay posted for the second part of this series, coming soon!)
- [Data Science Discovery: Quantifying the Commons][quantifying] | Author: Dun-Ming Huang (Brandon Huang) | Dec. 2022

[quantifying]: https://opensource.creativecommons.org/blog/entries/2022-12-07-berkeley-quantifying/
[logging]: https://github.com/creativecommons/quantifying/pull/97
[AWS-whitepaper]: https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/distributed-data-management.html
[meta-whitepaper]: https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
[timid-robot]: https://opensource.creativecommons.org/blog/authors/TimidRobot/
[documentation]: https://unmarred-gym-686.notion.site/Automating-Quantifying-the-Commons-Documentation-441056ae02364d8a9a51d5e820401db5?pvs=4