Quantifying the Commons — Data Science Discovery Program Fall 2022

Nov 29, 2022

Background:

Creative Commons is a nonprofit organization that helps overcome the legal difficulties of sharing knowledge and creativity by providing Creative Commons licenses and public domain tools, creating a free, simple, and standardized way to grant copyright permissions.

Creative Commons licenses come in several categories. For example, CC BY 4.0 is Attribution 4.0 International, which grants everyone the right to "copy and redistribute the material in any medium or format" and to "remix, transform, and build upon the material for any purpose, even commercially".

Our goal in this project, Quantifying the Commons, is to quantify how often Creative Commons licenses are used: which fields they are used in, where they are used, how usage changes over time, how usage is distributed among the different licenses, and so on. To give a sense of what the results may look like, here is an example from past quantification work.

2017 State of the Commons — Worldwide Usage of all Creative Commons licenses

Data Science Discovery Program: a program of the Data Science department at the University of California, Berkeley. Founded in 2015, its mission is to incubate cutting-edge data science, machine learning, and AI projects with partners including academic researchers, government agencies, non-profits, and industry. The Discovery Program connects student researchers and volunteers with these project partners.

In our fall 2022 project, we decided to improve the way data is collected. Instead of the past procedure of contacting people to get access to datasets, we use reproducible methods that retrieve data directly from the APIs of platforms that are popular with users or where CC licenses are widely used.

This blog goes over the general process of what I did in this project; for more detailed information, check out our GitHub repo!

Step 1: Retrieving the Data

I worked on retrieving data from the Flickr API. The CC licenses used on Flickr are: CC BY-NC-SA 2.0, CC BY-NC 2.0, CC BY-NC-ND 2.0, CC BY 2.0, CC BY-SA 2.0, CC BY-ND 2.0, CC0 1.0, and Public Domain Mark 1.0.

According to the official Flickr API documentation, two methods, flickr.photos.search and flickr.photos.getInfo, are the most useful for our project. I used flickr.photos.search to get the photo IDs under each license, and flickr.photos.getInfo to get detailed information for each ID.
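
For context, here is a minimal sketch of the setup the snippets below assume, using the flickrapi Python library; the API key and secret are placeholders, and the value of total is my assumption based on the 4,000-ID cap mentioned below. The repo may construct the client differently.

import json
import time

import flickrapi  # assumed client library

# Placeholder credentials - substitute your own Flickr API key and secret
flickr = flickrapi.FlickrAPI("YOUR_API_KEY", "YOUR_API_SECRET", format="json")

# Flickr's numeric IDs for the CC licenses covered in this project:
# 1 = CC BY-NC-SA 2.0, 2 = CC BY-NC 2.0, 3 = CC BY-NC-ND 2.0, 4 = CC BY 2.0,
# 5 = CC BY-SA 2.0, 6 = CC BY-ND 2.0, 9 = CC0 1.0, 10 = Public Domain Mark 1.0
license_list = [1, 2, 3, 4, 5, 6, 9, 10]

# assumed page budget: 40 pages x 100 photos per page = the 4000-ID cap
total = 40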

First, get all the photo IDs (due to a limitation of the Flickr API, we can only get the first 4,000 IDs per license):

for i in license_list:
    j = 1
    while j <= total:
        # use the search method to pull photo IDs under each license
        photosJson = flickr.photos.search(license=i, per_page=100, page=j)
        time.sleep(1)
        photos = json.loads(photosJson.decode("utf-8"))
        id = [x["id"] for x in photos["photos"]["photo"]]
        j += 1

Then use each ID to get detailed information (including the upload date, the uploader's location, the photo's description, the number of comments, views, tags, etc.):

for index in range(0, len(id)):
    detailJson = flickr.photos.getInfo(license=i, photo_id=id[index])
    time.sleep(1)
    photos_detail = json.loads(detailJson.decode("utf-8"))

Then query out the useful fields and save them as 2D arrays:

def query_helper1(raw, part, detail, temp_list, index):
    """Helper function 1 for query_data"""
    # part and detail should be strings
    queried_raw = raw["photo"][part][detail]
    yield queried_raw


def query_helper2(raw, part, temp_list, index):
    """Helper function 2 for query_data"""
    # part should be a string
    queried_raw = raw["photo"][part]
    yield queried_raw


def query_data():
    for a in range(0, len(name_list)):
        # name_list: ["id", "dateuploaded", "isfavorite",
        # "license", "realname", "location", "title", "description",
        # "dates", "views", "comments", "tags"]
        if (0 <= a < 4) or a == 9:
            # "id", "dateuploaded", "isfavorite", "license", and "views"
            # sit directly under "photo", so use query_helper2
            temp = query_helper2(raw_data, name_list[a], data_list, a)
            data_list[a].append(next(temp))
        elif a == 4 or a == 5:
            # "realname" and "location" sit under the "owner" key
            temp = query_helper1(raw_data, "owner", name_list[a], data_list, a)
            data_list[a].append(next(temp))
        elif a == 6 or a == 7 or a == 10:
            # "title", "description", and "comments" use the "_content" sub-key
            temp = query_helper1(
                raw_data, name_list[a], "_content", data_list, a
            )
            data_list[a].append(next(temp))
        elif a == 8:
            # "dates" uses the "taken" sub-key
            temp = query_helper1(raw_data, name_list[a], "taken", data_list, a)
            data_list[a].append(next(temp))

After that, save the 2D arrays into a pandas DataFrame:

def to_df(datalist, namelist):
    # transform the pulled and queried data into a dataframe
    # by iterating through the list of columns
    df = pd.DataFrame(datalist).transpose()
    df.columns = namelist
    return df

Finally, save into a CSV (to guard against unexpected interruptions or errors, I save every 100 records to a temporary CSV and merge it into the final CSV):

def df_to_csv(temp_list, name_list, temp_csv, final_csv):
    df = to_df(temp_list, name_list)
    df.to_csv(temp_csv)
    df = pd.concat(map(pd.read_csv, [temp_csv, final_csv]), ignore_index=True)
    df.to_csv(final_csv)

The final dataset looks like this:

CC Licenses on Flickr Dataset

Step 2: Cleaning the Data

The data pulled from the Flickr API has several issues:

  1. Many duplicates exist
  2. Inconsistent formats
  3. NaN values
  4. Data in the Tags column is saved as the string representation of a list
  5. Data in the Location column mixes colloquial and official place names and has different scopes (some entries are at the national level, some at the state level)

Duplicates are checked and dropped based on repeated photo IDs and high similarity between the dateuploaded and description columns.
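
As a rough illustration (not the exact cleaning code from the repo), dropping duplicates could look like the sketch below; the file path is hypothetical, and matching identical dateuploaded/description pairs is only a simple proxy for the similarity check described above.

import pandas as pd

merged = pd.read_csv("final.csv")  # hypothetical path to the merged dataset
# drop rows that repeat the same photo ID
merged = merged.drop_duplicates(subset=["id"], keep="first")
# drop rows whose upload date and description are identical
# (a simple stand-in for the "high similarity" check)
merged = merged.drop_duplicates(subset=["dateuploaded", "description"], keep="first")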

NaN values are dropped using the function below or the DataFrame.dropna() method:

# Remove NaN rows in description
import math
import numpy as np
for i in np.arange(len(merged1_2['description'])):
    if type(merged1_2['description'][i]) == float and math.isnan(merged1_2['description'][i]):
        merged1_2.replace(merged1_2['description'][i], np.NaN, inplace=True)
# Colab cannot use drop() to drop rows for some reason........
merged1_2.dropna(inplace=True)

These issues are handled in the preprocessing step; issues 4 and 5 are additionally dealt with as the different analyses require.
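
For issue 4, for example, the string-of-a-list Tags values can be turned back into real Python lists. This is a minimal sketch, assuming the column is named "tags" and holds strings like "['cat', 'dog']"; it is not the exact code from the repo.

import ast

# parse each Tags string back into a list; leave non-string entries untouched
merged1_2["tags"] = merged1_2["tags"].apply(
    lambda t: ast.literal_eval(t) if isinstance(t, str) else t
)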

Step 3: Visualization and Analysis

Some bar charts, line charts, and a word cloud for Flickr (with partial code):

def view_compare():
    license1 = pd.read_csv("../flickr/dataset/cleaned_license1.csv")
    license2 = pd.read_csv("../flickr/dataset/cleaned_license2.csv")
    license3 = pd.read_csv("../flickr/dataset/cleaned_license3.csv")
    license4 = pd.read_csv("../flickr/dataset/cleaned_license4.csv")
    license5 = pd.read_csv("../flickr/dataset/cleaned_license5.csv")
    license6 = pd.read_csv("../flickr/dataset/cleaned_license6.csv")
    license9 = pd.read_csv("../flickr/dataset/cleaned_license9.csv")
    license10 = pd.read_csv("../flickr/dataset/cleaned_license10.csv")
    licenses = [license1, license2, license3, license4,
                license5, license6, license9, license10]
    maxs = []
    for lic in licenses:
        maxs.append(view_compare_helper(lic))
    print(maxs)
    temp_data = pd.DataFrame()
    temp_data["Licenses"] = ["CC BY-NC-SA 2.0", "CC BY-NC 2.0", "CC BY-NC-ND 2.0", "CC BY 2.0",
                             "CC BY-SA 2.0", "CC BY-ND 2.0", "CC0 1.0", "Public Domain Mark 1.0"]
    temp_data["views"] = maxs
    fig, ax = plt.subplots(figsize=(13, 10))
    ax.grid(b=True, color='grey',
            linestyle='-.', linewidth=0.5,
            alpha=0.6)
    sns.set_style("dark")
    sns.barplot(data=temp_data, x="Licenses", y="views", palette="flare", errorbar="sd")
    ax.bar_label(ax.containers[0])
    ax.text(x=0.5, y=1.1, s='Maximum Views of Pictures under all Licenses', fontsize=15,
            weight='bold', ha='center', va='bottom', transform=ax.transAxes)
    ax.text(x=0.5, y=1.05, s='Data range: first 4000 pictures for each license',
            fontsize=13, alpha=0.75, ha='center',
            va='bottom', transform=ax.transAxes)
    current_values = plt.gca().get_yticks()
    plt.gca().set_yticklabels(['{:,.0f}'.format(x) for x in current_values])
    plt.savefig('../analyze/compare_graphs/max_views.png', dpi=300, bbox_inches='tight')
    plt.show()
Maximum Views of Photos on Flickr under CC Licenses

We can see that photos under the CC BY-NC-ND 2.0 and CC BY-ND 2.0 licenses have gained the highest view counts, while photos under CC BY-SA 2.0 seem to be the least popular.

Yearly Trend of Usage of Flickr Photos under CC Licenses

The yearly trend chart shows that, among photos taken between 2018 and 2022, those under the Public Domain Mark 1.0, CC BY-SA 2.0, and CC BY-ND 2.0 licenses grew quickly and strongly, while those under CC BY 2.0, CC BY-NC 2.0, and CC BY-NC-ND 2.0 were taken relatively more often in 2020.

Total Usage of all Licenses on Flickr (interactive HTML)

The last license, Public Domain Mark 1.0, has higher usage than the others.

License CC BY-ND 2.0 on Flickr Word Cloud
License CC0 1.0 Usage on Flickr Monthly Trend 1967–2022

OTL — Open Textbook Library graphs:

Furthermore, to strengthen the datasets we collected and to broaden the fields they cover, I included a dataset from the OTL — Open Textbook Library:

This dataset is still a bit small to cover the usage of CC licenses in the education field; further collection from the OER — Open Education Resources API may follow.

Based on the OTL data, I also generated several graphs to visualize and quantify license usage:

Word Cloud for books under CC Licenses on OTL
Yearly Trend of Amount of Books under CC Licenses on OTL
Amounts of Books under each CC License on OTL
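
As a rough sketch of how a chart like "Amounts of Books under each CC License on OTL" could be produced: the file name otl_books.csv and its "license" column are assumptions for illustration, not the actual dataset layout or code from the repo.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

otl = pd.read_csv("otl_books.csv")  # hypothetical OTL export with a "license" column
counts = otl["license"].value_counts().reset_index()
counts.columns = ["license", "count"]

# bar chart of the number of books per CC license
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=counts, x="license", y="count", ax=ax)
ax.set_title("Amounts of Books under each CC License on OTL")
plt.xticks(rotation=45, ha="right")
plt.show()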

Flickr Data Geographical Visualization using the GeoPandas Library (with core code):

from googletrans import Translator
import pycountry

Country = []
for row in merged_all_cleaned["location"][0:]:
    location_list = str(row).split(",")
    country = location_list[len(location_list) - 1]
    country = country.rstrip()  # remove trailing whitespace
    country = country.lstrip()  # remove leading whitespace
    country = country.title()
    if country == "Usa":
        country = "United States"
    if country == "Uk":
        country = "United Kingdom"
    Country.append(country)

# Transform country names into alpha-3 codes (e.g. United States -> USA)
def alpha3code(column):
    CODE = []
    for country in column:
        try:
            code = pycountry.countries.get(name=country).alpha_3
            # .alpha_3 means 3-letter country code
            # .alpha_2 means 2-letter country code
            CODE.append(code)
        except:
            CODE.append('None')
    return CODE

# Translate the locations with non-English names to English first:
def translate(df):
    count = 0
    text1 = []
    translator = Translator()
    for row in df["country"][0:]:
        # print(translator.detect(row).lang)
        if translator.detect(row).lang != "en" and row != "nan" and row != "":
            text1.append(translator.translate(row, dest='en').text)
            count += 1
            print("translate", count)
        else:
            text1.append(row)
            count += 1
            print(count)
    df["country"] = text1
Geographical Distribution of all CC Licenses

We can see from the map that Creative Commons licenses are used most in Canada, the United States, Brazil, South Africa, the UK, Spain, France, Ukraine, Australia, and similar countries, but less in Asian countries such as China.
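
For completeness, here is a minimal sketch of how the alpha-3 codes could be joined to world geometries and plotted with GeoPandas. The counting step, column names, and use of the built-in naturalearth_lowres dataset (shipped with older GeoPandas releases) are assumptions, not the exact code from the repo.

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# assumed: one row per photo, converted to alpha-3 codes with alpha3code() above
photo_countries = pd.DataFrame({"code": alpha3code(Country)})
counts = photo_countries.value_counts("code").rename("photos").reset_index()

# low-resolution world map bundled with older GeoPandas releases
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world.merge(counts, how="left", left_on="iso_a3", right_on="code")

fig, ax = plt.subplots(figsize=(15, 8))
world.plot(column="photos", cmap="OrRd", legend=True,
           missing_kwds={"color": "lightgrey"}, ax=ax)
ax.set_title("Geographical Distribution of all CC Licenses")
plt.show()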

Step 4: Machine Learning and Model Building

Since our Flickr dataset mostly consists of large amounts of photo description text, I decided to use NLP (Natural Language Processing) to clean and simplify the original text and extract keywords from it, and then to use word vectorization to build an SVM (Support Vector Machine) classifier.

The whole process uses the nltk and sklearn libraries.

The machine learning process for this project can be divided into three parts:

1. Dropping NaN values, cleaning the data, and translating text

2. Word tokenization, stemming/lemmatization, stop-word removal, and word vectorization

3. Random shuffling of the data, train/test splitting, model building, and accuracy calculation

In this example, I used the merged dataset of license 1 (CC BY-NC-SA 2.0) and license 2 (CC BY-NC 2.0) and the photos under them to run the ML process.
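
The merge itself is not shown in this post; a minimal sketch, assuming the cleaned per-license CSVs from the earlier step are the inputs, could look like this:

import pandas as pd

# combine the cleaned datasets for license 1 and license 2 into one DataFrame
merged1_2 = pd.concat(
    [pd.read_csv("../flickr/dataset/cleaned_license1.csv"),
     pd.read_csv("../flickr/dataset/cleaned_license2.csv")],
    ignore_index=True,
)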

The dataset before NLP:

We want to apply NLP to the "description" column.

Since the descriptions are typed by the photo uploaders, they come in different languages. To minimize data loss, I chose to use the googletrans library to first translate non-English text into English:

# Translate non-English text to English
translator = Translator()
for i in (list(merged1_2.index)):
    temp_each_new_description = ""
    if translator.detect(merged1_2['description'][i][0]).lang != "en":
        for j in np.arange(len(merged1_2['description'][i])):
            temp_each_new_description += translator.translate(merged1_2['description'][i][j], dest='en').text
        merged1_2.replace(merged1_2['description'][i], temp_each_new_description, inplace=True)

Then convert all text to lowercase so that Python does not treat "Dog" and "dog" as two different words:

# Change all the text to lower case
merged1_2['description'] = [entry.lower() for entry in merged1_2['description']]

Next comes tokenization:

# Tokenization: each entry in the dataset will be broken into a set of words
merged1_2['description'] = [word_tokenize(entry) for
                            entry in merged1_2['description']]

After tokenization, the description column looks like this:

Then stop words are removed (this part is inspired by and adapted from: https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34):

import nltk
nltk.download('stopwords')
from collections import defaultdict
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

# Remove stop words and non-alphabetic tokens and perform word lemmatization
# WordNetLemmatizer requires POS tags to understand
# whether the word is a noun, verb, adjective, etc.
# By default it is set to Noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
for (index, entry) in zip(merged1_2.index, list(merged1_2['description'])):
    # Declare an empty list to store the words that follow the rules for this step
    Final_words = []
    # Initialize WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # The pos_tag function below provides the 'tag', i.e.
    # whether the word is a Noun (N), Verb (V), or something else
    for word, tag in pos_tag(entry):
        # The condition below checks for stop words and
        # considers only alphabetic tokens
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration
    # is stored in 'text_final'
    merged1_2.loc[index, 'text_final'] = str(Final_words)

Now we need to shuffle the data to get a reliable classification and prediction result; let's say we repeat the shuffling 1,000 times.

Then we need to split the datasets into train and test parts — 70% for training and 30% for testing.

And apply word vectorization, which maps each word to a numerical feature index — e.g. ‘unstoppable’: 4573, ‘charles’: 735, ‘broadway’: 564, ‘image’: 2096, ‘june’: 2268, ‘web’: 4795, …
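
Those numbers come from the TF-IDF vectorizer's vocabulary; as a quick illustrative sketch (not the exact code from the repo):

from sklearn.feature_extraction.text import TfidfVectorizer

Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(merged1_2['text_final'])
# vocabulary_ maps each word to its feature index, e.g. {'image': 2096, ...}
print(Tfidf_vect.vocabulary_)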

And use the SVM implementation in sklearn to classify and predict.

import random

import numpy as np
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

def shuffle_data(original_dataset):
    # Reorder the dataset randomly
    index = list(original_dataset.index)
    # Shuffle the index
    random.shuffle(index)
    # Reindex with the shuffled order
    shuffled_dataset = original_dataset.reindex(index)
    return shuffled_dataset

def svm_data(shuffled_dataset):
    # Split into train and test datasets
    Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
        shuffled_dataset['text_final'], shuffled_dataset['license'], test_size=0.3)
    # Word vectorization: turn the collection of text documents into numerical feature vectors
    Tfidf_vect = TfidfVectorizer(max_features=5000)
    Tfidf_vect.fit(shuffled_dataset['text_final'])
    Train_X_Tfidf = Tfidf_vect.transform(Train_X)
    Test_X_Tfidf = Tfidf_vect.transform(Test_X)
    # Use SVM to predict the outcome
    # fit the classifier on the training dataset
    SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
    SVM.fit(Train_X_Tfidf, Train_Y)
    # predict the labels on the validation dataset
    predictions_SVM = SVM.predict(Test_X_Tfidf)
    # use the accuracy_score function to get the accuracy
    svm_accuracy = accuracy_score(predictions_SVM, Test_Y) * 100
    return svm_accuracy

accuracy_scores = []
repetitions = 1000
for i in np.arange(repetitions):
    temp_shuffled = shuffle_data(merged1_2)
    temp_accuracy = svm_data(temp_shuffled)
    accuracy_scores.append(temp_accuracy)
    print("current process:", i)
    if i < 20:
        print(accuracy_scores)

And here is the distribution of accuracy across the 1,000 model-building runs:

We can see that the mean accuracy, marked with a red point, falls between 69 and 70 percent, which can be considered reasonably reliable.
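
As a minimal sketch of how a distribution plot like the one above could be drawn from accuracy_scores (the exact plotting code lives in the repo):

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(accuracy_scores, bins=30, color="steelblue", alpha=0.8)
# mark the mean accuracy with a red point on the x-axis
ax.plot(np.mean(accuracy_scores), 0, "ro", markersize=10, clip_on=False)
ax.set_xlabel("Accuracy (%)")
ax.set_ylabel("Count")
ax.set_title("Distribution of SVM Accuracy over 1000 Runs")
plt.show()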

This brings the Quantifying the Commons project to a close. Thanks to my project partner, Brandon, and my project leader, Timid, for their huge contributions and guidance throughout my research!

If you are interested in learning more or would like to contribute as well, please check out our open GitHub repository!

This blog is licensed under Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
