Commit 99cf2d3

add blog post and authors
1 parent 0b6ce83 commit 99cf2d3

15 files changed

+277
-0
lines changed

Original file line number | Diff line number | Diff line change
@@ -0,0 +1,10 @@
1+
username: anthony_ho
2+
---
3+
name: Anthony Ho
4+
---
5+
md5_hashed_email:
6+
---
7+
about:
8+
As of 2023, Anthony Ho is a senior studying information analysis and economics
9+
at the University of Michigan. Upon graduation, he is returning to Hong Kong
10+
and pursuing a career in consulting.
@@ -0,0 +1,10 @@
1+
username: claire_wan
2+
---
3+
name: Claire Wan
4+
---
5+
md5_hashed_email:
6+
---
7+
about:
8+
As of 2023, Claire Wan is a senior studying Information and Cognitive Science
9+
with a minor in Computer Science at the University of Michigan. She is working
10+
as a software engineer in Chicago after graduation.
@@ -0,0 +1,10 @@
1+
username: grace_coleman
2+
---
3+
name: Grace Coleman
4+
---
5+
md5_hashed_email:
6+
---
7+
about:
8+
As of 2023, Grace Coleman is a senior studying information analysis at the
9+
University of Michigan. She is staying at Michigan to pursue a Masters in
10+
Business Analytics at the Ross School of Business upon graduation.
@@ -0,0 +1,14 @@
1+
username: tyler_phillips
2+
---
3+
name: Tyler Phillips
4+
---
5+
md5_hashed_email:
6+
---
7+
about:
8+
As of 2023, Tyler Phillips is a graduating senior at the University of Michigan
9+
School of Information (UMSI), receiving his Bachelor of Science in Information
10+
Analysis accompanied with a minor in Art & Design - Sonic & Visual Arts
11+
Integration. Tyler will continue his graduate studies in the fall, with
12+
reasonable expectations of a dual degree, heading back to UMSI to obtain a
13+
Masters of Science in Information - Big Data Analytics as well as a Masters in
14+
Business Analytics, at the UM Ross School of Business.
@@ -0,0 +1,233 @@
1+
title: Many Mona Lisas? Artistic Data Quantification and Assessment
2+
---
3+
categories:
4+
cc-dataviz
5+
collaboration
6+
community
7+
quantifying-the-commons
8+
---
9+
author:
10+
grace_coleman
11+
anthony_ho
12+
tyler_phillips
13+
claire_wan
14+
---
15+
pub_date: 2023-04-27
16+
---
17+
body:
18+
19+
Quantifying the Commons
20+
21+
University of Michigan, School of Information
22+
23+
24+
## Project Objective and Problem Statement
25+
26+
Creative Commons (CC) has over one billion licensed works. However, there is no
27+
central data or organization of CC’s licensed works, making it difficult to
28+
quantify the number of works and to analyze which licenses are useful or should
29+
be retired. The goal of this project is to help CC staff identify redundant
30+
licenses and use quantitative data in marketing its impact. It focuses on Open Educational Resources (OER).
31+
32+
33+
## Data Collection
34+
35+
Data was collected from [OER Commons][oercommons], which is one of CC’s
36+
platforms and a library containing digital education resources. The first step
37+
in data collection was identifying which licenses this data source uses and how
38+
many works are under each license within OER Commons. OER Commons uses the
39+
licenses CC-BY, CC-BY-SA, CC-BY-ND, CC-BY-NC, CC-BY-NC-SA, and CC-BY-NC-ND
40+
which together cover works that permit commercial use and works limited to non-commercial use.
41+
The next step in data collection was querying the Application Programming
42+
Interface (API) by license. In order to retrieve all works for a license,
43+
queries are batched, with at most 50 works retrieved per request. This process
44+
is repeated until all works for a license are retrieved. These steps are run
45+
for every license. For every API call, the response is in XML which is parsed
46+
for features including education level, subject area, material type, media
47+
format, languages, primary user, and educational use. The results are written
48+
to a tab-separated values file.
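
The batching loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `fetch_page` callable, the XML element names (`work`, `title`, `education_level`, `subject_area`), and the offset-based paging are all assumptions made for the example, and a fake fetcher stands in for the real API so the sketch runs offline.

```python
import csv
import io
import xml.etree.ElementTree as ET

BATCH_SIZE = 50  # maximum number of works per request, as stated in the post


def parse_works(xml_text):
    """Extract one dict per <work> element (element names are hypothetical)."""
    root = ET.fromstring(xml_text)
    works = []
    for work in root.iter("work"):
        works.append({
            "title": work.findtext("title", default=""),
            "education_level": work.findtext("education_level", default=""),
            "subject_area": work.findtext("subject_area", default=""),
        })
    return works


def collect_license(fetch_page, license_code):
    """Request batches until one comes back short, then stop."""
    rows, offset = [], 0
    while True:
        batch = parse_works(fetch_page(license_code, offset, BATCH_SIZE))
        for row in batch:
            row["license"] = license_code
        rows.extend(batch)
        if len(batch) < BATCH_SIZE:
            return rows
        offset += BATCH_SIZE


def write_tsv(rows, handle):
    """Write the collected rows as tab-separated text."""
    fields = ["license", "title", "education_level", "subject_area"]
    writer = csv.DictWriter(handle, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)


def fake_fetch_page(license_code, offset, limit):
    """Stand-in for the real API call: serves 120 synthetic works as XML."""
    items = "".join(
        f"<work><title>Work {i}</title>"
        "<education_level>High School</education_level>"
        "<subject_area>Mathematics</subject_area></work>"
        for i in range(offset, min(offset + limit, 120))
    )
    return f"<results>{items}</results>"
```

Calling `collect_license(fake_fetch_page, "cc-by")` issues three requests (50, 50, and 20 works) before stopping. In a real run, the same loop would be repeated once per license.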
49+
50+
[oercommons]: https://www.oercommons.org/
51+
52+
53+
## Exploratory Data Analysis (EDA)
54+
55+
After collecting all of our data, we began exploring the different columns in
56+
our dataframe. In particular, we looked at the distribution of different
57+
languages, the distribution of items by license type, and when items were added
58+
to the OER Commons API. Through this exploration, we were able to further
59+
specify our analysis and dig deeper into the different relationships of the
60+
data.
61+
62+
63+
### Diagram #1:
64+
65+
![img](diagram_01.png)
66+
67+
Diagram #1 shows the distribution of items taken from OER Commons by license
68+
type. It is clear that the CC-BY license type is the most popular, with 43% of
69+
the items having that license type. The CC-BY-SA license is also fairly
70+
popular, accounting for 27% of the items collected.
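
Percentages like these can be computed with a simple tally. The sketch below uses only the standard library and a small synthetic sample; the function name `license_shares` is hypothetical and the numbers are not the project's data.

```python
from collections import Counter


def license_shares(licenses):
    """Return each license's share of the total as a rounded percentage."""
    counts = Counter(licenses)
    total = sum(counts.values())
    # most_common() orders the result from the most to the least frequent license
    return {name: round(100 * n / total, 1) for name, n in counts.most_common()}


# Synthetic sample for illustration only.
sample = ["CC-BY"] * 5 + ["CC-BY-SA"] * 3 + ["CC-BY-NC"] * 2
shares = license_shares(sample)
```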
71+
72+
73+
### Diagram #2:
74+
75+
![img](diagram_02.png)
76+
77+
Diagram #2 shows when items have been added to the OER Commons API. There is
78+
little activity from December 2015 up to the beginning of 2023. However, close
79+
to 30,000 items were added to the API in early 2023.
80+
81+
82+
### Diagram #3:
83+
84+
![img](diagram_03.png)
85+
86+
Diagram #3 shows the percentage of items by language. English is the most used
87+
language, with about 86% of the items being in English. The other languages
88+
each account for only a small share of the items.
89+
90+
91+
### Diagram #4:
92+
93+
![img](diagram_04.png)
94+
95+
Since English is clearly the most popular language, we decided to see the
96+
license distribution for items that are in English. Diagram #4 shows a similar
97+
distribution to the pie chart depicting the overall license distribution. This
98+
is to be expected, since items in English account for 86% of all items and
99+
therefore dominate the overall distribution.
100+
101+
102+
### Diagram #5:
103+
104+
![img](diagram_05.png)
105+
106+
We continued to look at the distribution of licenses by each language.
107+
Diagram #5 shows that, for the items in French, the CC-BY license is the most
108+
popular at 49%, with CC-BY-SA being right behind it at 32%.
109+
110+
111+
## Visualizations
112+
113+
114+
### Diagram #6:
115+
116+
![img](diagram_06.png)
117+
118+
Diagram #6 shows the distribution of items on OER Commons by primary user and
119+
broken down by license type. The platform predominantly contains items designed
120+
for teachers and students, with the rest for parents, administrators,
121+
librarians, and others. The breakdown of licenses for each primary user is
122+
relatively consistent with the overall breakdown of the platform, as seen from
123+
the charts below (Diagram #7 and Diagram #8).
124+
125+
126+
### Diagram #7:
127+
128+
![img](diagram_07.png)
129+
130+
131+
### Diagram #8:
132+
133+
![img](diagram_08.png)
134+
135+
136+
### Diagram #9:
137+
138+
![img](diagram_09.png)
139+
140+
We also examined the subject areas and the licenses their items carry, as
141+
shown in Diagram #9. Some preliminary data cleaning was needed because there
142+
were too many subjects on the platform and some subjects
143+
had very low counts. The team grouped similar subjects into nine
144+
categories; for example, social science, anthropology, sociology,
145+
communication, world cultures, psychology, women’s studies, and social work
146+
were grouped into social sciences.
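
A grouping step of this kind can be expressed as a lookup. The sketch below encodes only the one group the post spells out; the other eight categories are not listed in the post, so unmatched subjects fall through to a placeholder `"other"` label, which is an assumption of this example rather than the team's actual scheme.

```python
# The only grouping stated in the post; the remaining categories are unknown here.
SOCIAL_SCIENCES = {
    "social science", "anthropology", "sociology", "communication",
    "world cultures", "psychology", "women's studies", "social work",
}


def group_subject(subject, fallback="other"):
    """Map a raw subject label to its broader category."""
    # Normalize case, whitespace, and curly apostrophes before the lookup.
    key = subject.strip().lower().replace("\u2019", "'")
    if key in SOCIAL_SCIENCES:
        return "social sciences"
    return fallback
```

Normalizing the label first keeps variants such as `" Sociology "` or `"Women’s Studies"` from ending up in separate low-count groups.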
147+
148+
It can be seen from Diagram #9 that the most popular subject areas on the
149+
platform are health sciences, language/arts and other sciences. Diving deeper
150+
into these subject areas, health sciences and language/arts have a higher
151+
proportion of items with the CC-BY-NC-SA license.
152+
153+
154+
### Diagram #10:
155+
156+
![img](diagram_10.png)
157+
158+
Finally, the team analyzed the material types of the items and sorted them by
159+
the education level that the items were created for. Again, some data cleaning was
160+
required as there were too many material types to analyze and some also had
161+
very small data counts. The seven material types shown in Diagram #10 were the
162+
most popular, and represented roughly 2/3 of the total.
163+
164+
After sorting the education levels in chronological order, an interesting trend
165+
that emerged is that the number of items increases with education level from
166+
preschool, hits a peak at the community college level, and then decreases
167+
afterwards. A shift in the material types can also be drawn from the graph, as
168+
lesson plans represent a large proportion of items from preschool to high
169+
school, but become insignificant from the college level onwards, where
169+
they are replaced by a higher proportion of readings. Textbooks
170+
also make up a higher proportion of items at the
171+
college level.
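
Proportions like the ones discussed here come from counting material types within each education level. The following is a minimal standard-library sketch with synthetic records; the function name `type_share_by_level` and the sample values are illustrative assumptions, not the project's code or data.

```python
from collections import Counter, defaultdict


def type_share_by_level(records):
    """Map each education level to {material type: share of that level's items}."""
    counts = defaultdict(Counter)
    for level, material_type in records:
        counts[level][material_type] += 1
    shares = {}
    for level, type_counts in counts.items():
        total = sum(type_counts.values())
        shares[level] = {t: n / total for t, n in type_counts.items()}
    return shares


# Synthetic (education level, material type) pairs for illustration only.
records = (
    [("High School", "Lesson Plan")] * 3
    + [("High School", "Reading")] * 1
    + [("College", "Reading")] * 3
    + [("College", "Textbook")] * 1
)
shares = type_share_by_level(records)
```

Comparing shares within each level, rather than raw counts, keeps levels with many items from drowning out the shift in material types.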
173+
174+
175+
## Key Value
176+
177+
The insights created through the analysis of this project will be helpful for
178+
CC’s marketing efforts. The ability to understand the distribution of license
179+
types in different contexts, such as education level, will help CC be better
179+
equipped to target its marketing toward key segments, such as preschool
180+
education materials. Another takeaway in terms of key value was
181+
CC’s initiative toward long-term preservation. CC’s need to centralize its
182+
collaborators' content into a data warehouse system has been an identified
184+
direction since the start of this project. Our prototype database of the OER
185+
Commons has contributed to these efforts in both small scale implementation as
186+
well as meeting the scope of our database system modeling. As other CC cohort
187+
chapters contribute their own databases of licensed works, we hope that
187+
a merger of the acquired data will take place across CC chapters
188+
in the future.
190+
191+
192+
## Next Steps
193+
194+
As CC expands its contributing members into the open-source initiative of
195+
bringing licensed works to the world, other internal systems of data
196+
preservation and maintenance start to become a point of serious interest as the
197+
databases start to become an integrated endeavor in the future. Running our
198+
prototype case study of the OER-Commons database has given us insights on the
199+
direction of CC’s current database system and how this system will be better
200+
suited to evolve into a data warehouse hub as a long-term solution. When we
201+
started the process of data mining and data analysis, Python 3 was a
202+
staple of both our group’s efforts and CC’s previous protocols with Git.
203+
So, complementing this framework with other Python libraries that allow for
204+
easier database querying will be a step in the right direction for the next
205+
cohort of CC contributors to further this process along. An example of this
206+
library integration would be pandasql, to utilize the familiar pandas library
207+
methods along with the SQL command logic that makes database maintenance easy
208+
and manageable. Besides updating the data storage, future work can continue to
209+
collect data from other sources of CC-licensed work, including GLAM institutions and the
210+
Internet Archive.
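
To give a feel for the SQL-style querying suggested above, here is a small sketch. It does not use pandasql. To stay dependency-free it swaps in Python's built-in `sqlite3` module with an in-memory database, which demonstrates the same kind of `GROUP BY` aggregation. The table name, columns, and rows are hypothetical.

```python
import sqlite3


def count_by_license(rows):
    """Load (license, language) rows into SQLite and count works per license."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE works (license TEXT, language TEXT)")
        conn.executemany("INSERT INTO works VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT license, COUNT(*) AS n FROM works "
            "GROUP BY license ORDER BY n DESC, license"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Synthetic rows for illustration only.
rows = [("CC-BY", "English"), ("CC-BY", "French"), ("CC-BY-SA", "English")]
result = count_by_license(rows)
```

With pandasql, a query of the same shape would be written against a DataFrame instead of a table created by hand.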
211+
212+
213+
## Acknowledgements
214+
215+
We would like to express our gratitude towards Timid Robot Zehta, our client,
216+
for working on behalf of CC, as well as [OER Commons][oercommons] for their
217+
valuable contributions towards the development of digital licensing and open
218+
source databasing initiatives. Without them, this project would not have been
219+
possible. Their efforts have been instrumental in giving us the tools and
220+
resources to help progress in the open-source initiative by allowing us to
221+
promote the free exchange of ideas, knowledge, and resources within the art,
222+
health, and education sectors of non-profit endeavors. Open source projects are
223+
important because they allow the public to use and work on projects without
224+
restrictions or keys. Since this initiative is open source, our efforts can be
225+
added to and built upon, allowing the project to continue through the addition
226+
of new contributors with fresh perspectives. Their shared commitment to
227+
promoting accessible and inclusive content has enabled individuals and
228+
organizations to create and distribute digital assets without facing any legal
229+
restrictions around the world. It has been an absolute pleasure to work with
230+
these organizations and be a part of their mission to democratize access to
231+
information.
232+
233+
[oercommons]: https://www.oercommons.org/