| 1 | +title: Many Mona Lisas? Artistic Data Quantification and Assessment |
| 2 | +--- |
| 3 | +categories: |
| 4 | +cc-dataviz |
| 5 | +collaboration |
| 6 | +community |
| 7 | +quantifying-the-commons |
| 8 | +--- |
| 9 | +author: |
| 10 | +grace_coleman |
| 11 | +anthony_ho |
| 12 | +tyler_phillips |
| 13 | +claire_wan |
| 14 | +--- |
| 15 | +pub_date: 2023-04-26 |
| 16 | +--- |
| 17 | +body: |
| 18 | + |
| 19 | +Quantifying the Commons |
| 20 | + |
| 21 | +University of Michigan, School of Information |
| 22 | + |
| 23 | + |
| 24 | +## Project Objective and Problem Statement |
| 25 | + |
| 26 | +Creative Commons (CC) has over one billion licensed works. However, there is no |
| 27 | +central dataset or catalog of those works, which makes it difficult to quantify |
| 28 | +them and to analyze which licenses are useful or should be retired. The goal of |
| 29 | +this project is to help CC staff identify redundant licenses and use quantitative |
| 30 | +data in marketing CC’s impact. The project focuses on Open Educational Resources (OER). |
| 31 | + |
| 32 | + |
| 33 | +## Data Collection |
| 34 | + |
| 35 | +Data was collected from [OER Commons][oercommons], a digital library of |
| 36 | +educational resources that hosts CC-licensed works. The first step in data |
| 37 | +collection was identifying which licenses this data source uses and how many |
| 38 | +works are under each license within OER Commons. OER Commons uses the licenses |
| 39 | +CC-BY, CC-BY-SA, CC-BY-ND, CC-BY-NC, CC-BY-NC-SA, and CC-BY-NC-ND, which cover |
| 40 | +both works that permit commercial use and works limited to non-commercial use |
| 41 | +(the NC variants). The next step was querying the Application Programming |
| 42 | +Interface (API) by license. To retrieve all works for a license, queries are |
| 43 | +batched, with at most 50 works retrieved per request. This process is repeated |
| 44 | +until all works for a license are retrieved, and these steps are run for every |
| 45 | +license. Each API response is XML, which is parsed for features including |
| 46 | +education level, subject area, material type, media format, languages, primary |
| 47 | +user, and educational use. The results are written to a tab-separated values |
| 48 | +file. |
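The batched retrieval and parsing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the element name `item`, the field names in `FIELDS`, and the `fetch_page` callable are all assumptions, and `fake_fetch_page` is a synthetic stand-in so the example runs without network access. The real OER Commons XML schema and request parameters may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

BATCH_SIZE = 50  # from the text: at most 50 works per request

# Hypothetical field names; the real XML schema may differ.
FIELDS = ["license", "education_level", "subject_area", "material_type"]


def parse_batch(xml_text):
    """Turn one XML response into a list of row dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):  # "item" is an assumed element name
        rows.append({f: item.findtext(f, default="") for f in FIELDS})
    return rows


def collect_license(fetch_page, license_code):
    """Request batches until a short (or empty) batch signals the end."""
    rows, offset = [], 0
    while True:
        batch = parse_batch(fetch_page(license_code, offset, BATCH_SIZE))
        rows.extend(batch)
        if len(batch) < BATCH_SIZE:
            return rows
        offset += BATCH_SIZE


def write_tsv(rows, handle):
    """Write the rows as tab-separated values with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def fake_fetch_page(license_code, offset, limit, total=120):
    """Stand-in for the real API call: serves 120 synthetic works."""
    items = "".join(
        f"<item><license>{license_code}</license>"
        f"<education_level>level-{i}</education_level>"
        f"<subject_area>subject-{i}</subject_area>"
        f"<material_type>type-{i}</material_type></item>"
        for i in range(offset, min(offset + limit, total))
    )
    return f"<results>{items}</results>"


rows = collect_license(fake_fetch_page, "cc-by")
buffer = io.StringIO()
write_tsv(rows, buffer)
```

Passing the fetch function in as a parameter keeps the paging loop independent of any particular HTTP client, so the same loop could be run once per license with a real request function swapped in.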
| 49 | + |
| 50 | +[oercommons]: https://www.oercommons.org/ |
| 51 | + |
| 52 | + |
| 53 | +## Exploratory Data Analysis (EDA) |
| 54 | + |
| 55 | +After collecting all of our data, we began exploring the different columns in |
| 56 | +our dataframe. In particular, we looked at the distribution of different |
| 57 | +languages, the distribution of items by license type, and when items were added |
| 58 | +to the OER Commons API. Through this exploration, we were able to further |
| 59 | +specify our analysis and dig deeper into the different relationships of the |
| 60 | +data. |
| 61 | + |
| 62 | + |
| 63 | +### Diagram #1: |
| 64 | + |
| 65 | + |
| 66 | + |
| 67 | +Diagram #1 shows the distribution of items taken from OER Commons by license |
| 68 | +type. It is clear that the CC-BY license type is the most popular, with 43% of |
| 69 | +the items having that license type. The CC-BY-SA license is also fairly |
| 70 | +popular, accounting for 27% of the items collected. |
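Percentages like the ones quoted for Diagram #1 can be derived from a column of license labels. The sketch below assumes the licenses are available as a plain list of strings; the sample values are synthetic and do not reproduce the project's data.

```python
from collections import Counter


def license_shares(licenses):
    """Return each license's share of all items, in percent, largest first."""
    counts = Counter(licenses)
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.most_common()}


# Synthetic sample for illustration only.
sample = ["CC-BY"] * 5 + ["CC-BY-SA"] * 3 + ["CC-BY-NC"] * 2
shares = license_shares(sample)
```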
| 71 | + |
| 72 | + |
| 73 | +### Diagram #2: |
| 74 | + |
| 75 | + |
| 76 | + |
| 77 | +Diagram #2 shows when items were added to the OER Commons API. There is |
| 78 | +little activity from December 2015 until the beginning of 2023. However, close |
| 79 | +to 30,000 items were added to the API in early 2023. |
| 80 | + |
| 81 | + |
| 82 | +### Diagram #3: |
| 83 | + |
| 84 | + |
| 85 | + |
| 86 | +Diagram #3 shows the percentage of items by language. English is the most used |
| 87 | +language, with about 86% of the items being in English. Each of the other |
| 88 | +languages accounts for only a small share of the items. |
| 89 | + |
| 90 | + |
| 91 | +### Diagram #4: |
| 92 | + |
| 93 | + |
| 94 | + |
| 95 | +Since English is clearly the most popular language, we decided to see the |
| 96 | +license distribution for items that are in English. Diagram #4 shows a similar |
| 97 | +distribution to the pie chart depicting the overall license distribution. This |
| 98 | +is to be expected, since items in English account for 86% of all items and |
| 99 | +therefore dominate the overall distribution. |
| 100 | + |
| 101 | + |
| 102 | +### Diagram #5: |
| 103 | + |
| 104 | + |
| 105 | + |
| 106 | +We continued by looking at the distribution of licenses for each language. |
| 107 | +Diagram #5 shows that for the items in French, the CC-BY license is the most |
| 108 | +popular at 49%, with CC-BY-SA right behind it at 32%. |
| 109 | + |
| 110 | + |
| 111 | +## Visualizations |
| 112 | + |
| 113 | + |
| 114 | +### Diagram #6: |
| 115 | + |
| 116 | + |
| 117 | + |
| 118 | +Diagram #6 shows the distribution of items on OER Commons by primary user, |
| 119 | +broken down by license type. The platform predominantly contains items designed |
| 120 | +for teachers and students, with the rest aimed at parents, administrators, and |
| 121 | +librarians, among others. The breakdown of licenses for each primary user is |
| 122 | +relatively consistent with the overall breakdown of the platform, as seen in |
| 123 | +the charts below (Diagram #7 and Diagram #8). |
| 124 | + |
| 125 | + |
| 126 | +### Diagram #7: |
| 127 | + |
| 128 | + |
| 129 | + |
| 130 | + |
| 131 | +### Diagram #8: |
| 132 | + |
| 133 | + |
| 134 | + |
| 135 | + |
| 136 | +### Diagram #9: |
| 137 | + |
| 138 | + |
| 139 | + |
| 140 | +We also examined the subject areas and the licenses of their items, as shown |
| 141 | +in Diagram #9. Some preliminary data cleaning was needed because the platform |
| 142 | +has too many subjects to chart individually and some subjects had very low |
| 143 | +counts. The team grouped similar subjects into nine categories. For example, |
| 144 | +social science, anthropology, sociology, communication, world cultures, |
| 145 | +psychology, women’s studies, and social work were grouped into social |
| 146 | +sciences. |
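The subject grouping described above amounts to a lookup table from raw subject to category. Below is a minimal sketch under stated assumptions: only the social sciences group comes from the text, the `"other"` fallback is a placeholder for the eight categories not listed here, and the sample records are made up.

```python
from collections import Counter

# Subjects the text says were merged into "social sciences".
SOCIAL_SCIENCES = [
    "social science", "anthropology", "sociology", "communication",
    "world cultures", "psychology", "women's studies", "social work",
]
SUBJECT_TO_CATEGORY = {s: "social sciences" for s in SOCIAL_SCIENCES}


def categorize(subject, fallback="other"):
    """Map a raw subject label to its category (fallback is an assumption)."""
    return SUBJECT_TO_CATEGORY.get(subject.strip().lower(), fallback)


# Made-up (subject, license) records for illustration only.
records = [
    ("Anthropology", "CC-BY"),
    ("Psychology", "CC-BY-NC-SA"),
    ("Sociology", "CC-BY"),
    ("Calculus", "CC-BY-SA"),
]

# Count items per (category, license) pair, the shape a stacked bar chart needs.
counts = Counter(
    (categorize(subject), license_code) for subject, license_code in records
)
```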
| 147 | + |
| 148 | +Diagram #9 shows that the most popular subject areas on the platform are |
| 149 | +health sciences, language/arts, and other sciences. Looking more closely at |
| 150 | +these subject areas, health sciences and language/arts have a higher |
| 151 | +proportion of items with the CC-BY-NC-SA license. |
| 152 | + |
| 153 | + |
| 154 | +### Diagram #10: |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | +Finally, the team analyzed the material types of the items and sorted them by |
| 159 | +the education level that the items were created for. Again, some data cleaning |
| 160 | +was required, as there were too many material types to analyze and some had |
| 161 | +very small counts. The seven material types shown in Diagram #10 were the most |
| 162 | +popular and represented roughly two-thirds of the total. |
| 163 | + |
| 164 | +After sorting the education levels from lowest to highest, an |
| 165 | +interesting trend emerged: the number of items increases with |
| 166 | +education level from preschool, peaks at the community college |
| 167 | +level, and then decreases. The graph also shows a shift in material |
| 168 | +types. Lesson plans represent a large proportion of items from |
| 169 | +preschool to high school but become insignificant from the college |
| 170 | +level onwards, where they are replaced by a higher proportion of |
| 171 | +readings. Textbooks also make up a higher proportion of items at |
| 172 | +the college level. |
| 173 | + |
| 174 | + |
| 175 | +## Key Value |
| 176 | + |
| 177 | +The insights from this project’s analysis will be helpful for CC’s |
| 178 | +marketing efforts. Understanding the distribution of license types in |
| 179 | +different contexts, such as education level, will leave CC better |
| 180 | +equipped to target its marketing toward key segments, for example |
| 181 | +preschool education materials. Another takeaway concerns CC’s |
| 182 | +initiative for long-term preservation. CC’s need to centralize its |
| 183 | +collaborators’ content into a data warehouse system has been an |
| 184 | +identified direction since the start of this project. Our prototype |
| 185 | +database of OER Commons contributes to these efforts as a small-scale |
| 186 | +implementation that stays within the scope of our database system |
| 187 | +modeling. As other CC cohort chapters contribute their own databases of |
| 188 | +licensed works, we hope that these databases will eventually be merged |
| 189 | +with those of other CC chapters. |
| 190 | + |
| 191 | + |
| 192 | +## Next Steps |
| 193 | + |
| 194 | +As CC brings more contributing members into its open-source |
| 195 | +initiative of bringing licensed works to the world, internal systems |
| 196 | +for data preservation and maintenance become a serious point of |
| 197 | +interest, since the databases are expected to be integrated in the |
| 198 | +future. Running our prototype case study of the OER Commons database |
| 199 | +has given us insight into the direction of CC’s current database |
| 200 | +system and how it could evolve into a data warehouse hub as a |
| 201 | +long-term solution. Throughout data mining and data analysis, |
| 202 | +Python 3 has been a staple of both our group’s efforts and CC’s |
| 203 | +established practices with Git. Complementing this framework with |
| 204 | +Python libraries that allow for easier database querying would |
| 205 | +therefore be a step in the right direction for the next cohort of CC |
| 206 | +contributors. One example is pandasql, which combines the familiar |
| 207 | +pandas library with SQL query logic that makes database maintenance |
| 208 | +easy and manageable. Besides updating the data storage, future work |
| 209 | +can continue to collect data from other sources of CC-licensed work, |
| 210 | +including GLAM institutions and the Internet Archive. |
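To give a sense of the SQL-style querying suggested above, here is a small sketch. It deliberately uses Python's standard-library `sqlite3` module rather than pandasql, so that it runs without extra dependencies; pandasql would let a similar query run directly against a pandas DataFrame. The `works` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Synthetic (license, language) rows for illustration only.
rows = [
    ("CC-BY", "English"),
    ("CC-BY", "French"),
    ("CC-BY-SA", "English"),
    ("CC-BY", "English"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (license TEXT, language TEXT)")
conn.executemany("INSERT INTO works VALUES (?, ?)", rows)

# Count English-language works per license, most common first.
query = """
    SELECT license, COUNT(*) AS n
    FROM works
    WHERE language = 'English'
    GROUP BY license
    ORDER BY n DESC, license
"""
english_counts = conn.execute(query).fetchall()
conn.close()
```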
| 211 | + |
| 212 | + |
| 213 | +## Acknowledgements |
| 214 | + |
| 215 | +We would like to express our gratitude to Timid Robot Zehta, our client, |
| 216 | +for working on behalf of CC, as well as to [OER Commons][oercommons] for |
| 217 | +their valuable contributions to the development of digital licensing and |
| 218 | +open source database initiatives. Without them, this project would not |
| 219 | +have been possible. Their efforts have been instrumental in giving us the |
| 220 | +tools and resources to advance the open-source initiative and promote the |
| 221 | +free exchange of ideas, knowledge, and resources within the art, health, |
| 222 | +and education sectors of non-profit endeavors. Open source projects are |
| 223 | +important because they allow the public to use and work on projects |
| 224 | +without restrictions or keys. Since this initiative is open source, our |
| 225 | +efforts can be added to and built upon, allowing the project to continue |
| 226 | +through the addition of new contributors with fresh perspectives. Their |
| 227 | +shared commitment to promoting accessible and inclusive content has |
| 228 | +enabled individuals and organizations around the world to create and |
| 229 | +distribute digital assets with fewer legal barriers. It has been an |
| 230 | +absolute pleasure to work with these organizations and be a part of their |
| 231 | +mission to democratize access to information. |