"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2: 0.979\n"
]
}
],
"source": [
"licenses = sorted(list(licenses), key=lambda x: -usages[x])\n",
"important_licenses = list(filter(lambda x: usages[x] > 100, licenses))\n",
"\n",
"plt.rcParams['figure.figsize'] = [20, 10]\n",
"plt.style.use('ggplot')\n",
"y = [usages[i] for i in important_licenses]\n",
"plt.bar(important_licenses, y, log=True)\n",
"plt.xticks(rotation=90, fontsize='large')\n",
"plt.xlabel(\"License\")\n",
"plt.ylabel(\"Log Usage\")\n",
"plt.title(\"Most Popular Licenses\")\n",
"plt.show()\n",
"\n",
"_, _, rvalue, _, _ = scipy.stats.linregress(np.arange(1, len(y)+1), np.log(np.array(y)))\n",
"print(\"R2:\", round(rvalue**2, 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the distribution of licenses is very skewed. This is to be expected, as some licenses are going to be more versatile and more popular than others. `by-sa 3.0` (Attribution-ShareAlike 3.0) is by far the most popular license, accounting for over 25% of all Creative Commons license usages. The top 8 licenses account for 81% of the data, and any license outside the top 16 accounts for less than 1% of the dataset. Any license outside the top 46 has fewer than 100 appearances and accounts for less than 0.001% of the dataset. We also notice a strong exponential relationship between the rank of a license and its usage ($R^2=0.98$ for a linear regression of log usage versus rank)."
]
},
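{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what the regression above measures (the symbols $u_r$, $a$ and $b$ are notation introduced here, not names from the code): writing $u_r$ for the usage of the license of rank $r$, the fit is the log-linear model\n",
"\n",
"$$\\ln u_r \\approx a + b\\,r \\quad\\Longleftrightarrow\\quad u_r \\approx e^{a}\\,e^{b r},$$\n",
"\n",
"with $b < 0$ for a decaying distribution. The reported $R^2$ is the squared correlation coefficient (`rvalue**2`) between $r$ and $\\ln u_r$, computed only over the licenses with more than 100 usages, so it says nothing about the long tail that was filtered out."
]
},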
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Local License Attributes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we consider how licenses are distributed at the node level. For example, how many different works does the average domain have? How are these works distributed among license types? Do domains generally use only a single license type, or are the different license types randomly distributed among domains?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"all_cc_licenses = cc_graph_ops.cc_licenses_by_domain(g)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Proportion\n",
"Num domains w/ CC works: 225260 95.5%\n",
"Num domains w/ > 1 license type: 45331 19.22%\n",
"Num domains w/ exactly 1 CC work: 85436 36.22%\n",
"Num domains w/ >= 5 CC works: 99238 42.07%\n",
"Num domains w/ predominantly* 1 license: 194518 82.47%\n",
"\n",
"* over 75% of works are of the same license\n"
]
}
],
"source": [
"mult_licenses = 0\n",
"single_work = 0\n",
"zero_works = 0\n",
"ge_five_works = 0\n",
"predominanly_single_license = 0\n",
"\n",
"all_cc_licenses = cc_graph_ops.cc_licenses_by_domain(g)\n",
"for node_id, cc_licenses in all_cc_licenses.items():\n",
"    if len(cc_licenses) > 1:\n",
"        mult_licenses += 1\n",
"    licenses_qty = sum(cc_licenses.values())\n",
"    if licenses_qty == 1:\n",
"        single_work += 1\n",
"    if licenses_qty >= 5:\n",
"        ge_five_works += 1\n",
"\n",
"    for license in cc_licenses:\n",
"        total_licenses = sum(cc_licenses.values())\n",
"        if cc_licenses[license] > 0.75*total_licenses:\n",
"            predominanly_single_license += 1\n",
"\n",
"template = \"{:<40} {:>10} {:>20}%\"\n",
"template2 = \"{:>73}\"\n",
"total_domains = len(g.nodes())\n",
"print(template2.format(\"Proportion\"))\n",
"print(template.format(\"Num domains w/ CC works:\", len(all_cc_licenses), round(100*len(all_cc_licenses)/total_domains, 2)))\n",
"print(template.format(\"Num domains w/ > 1 license type:\", mult_licenses, round(100*mult_licenses/total_domains,2)))\n",
"print(template.format(\"Num domains w/ exactly 1 CC work:\", single_work, round(100*single_work/total_domains,2)))\n",
"print(template.format(\"Num domains w/ >= 5 CC works:\", ge_five_works, round(100*ge_five_works/total_domains,2)))\n",
"print(template.format(\"Num domains w/ predominantly* 1 license:\", predominanly_single_license,\n",
" round(100*predominanly_single_license/total_domains,2)))\n",
"print()\n",
"print(\"* over 75% of works are of the same license\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"36% of domains in the dataset contain only one work licensed by Creative Commons. It is unlikely that these nodes will matter very much when we do our impact analysis. However, we also see that 19% of domains use more than one license type, meaning that the remaining 40% or so of domains (out of the 95.5% that host any CC works) have multiple CC-licensed works all under the same license. Furthermore, we see that 82% of domains predominantly use one license. I run some $\\chi^2$ goodness-of-fit tests below to see if there is an overall tendency to host a few types of licenses, or if licenses are more or less randomly distributed."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most popular licenses: [\"('by-sa', '3.0')\", \"('by', '4.0')\", \"('by-sa', '4.0')\", \"('by-nc-sa', '3.0')\", \"('by-nc-nd', '3.0')\", \"('by-nc-sa', '4.0')\", \"('by', '3.0')\", \"('by-nc-nd', '4.0')\"]\n"
]
}
],
"source": [
"total_usages = sum([total for license, total in usages.items()])\n",
"most_popular_licenses = list(filter(lambda x: usages[x]/total_usages > 0.05, licenses))\n",
"p = [usages[license]/total_usages for license in most_popular_licenses]\n",
"\n",
"print(\"Most popular licenses:\", most_popular_licenses)\n",
"chi2_vals = []\n",
"pvals = []\n",
"qtys = []\n",
"for node_id, cc_licenses in all_cc_licenses.items():\n",
"    obs = []\n",
"    for i, license in enumerate(most_popular_licenses):\n",
"        if license in cc_licenses:\n",
"            obs.append(cc_licenses[license])\n",
"        else:\n",
"            obs.append(0)\n",
"\n",
"    qty = sum(cc_licenses.values())\n",
"    exp = [p[i]*qty for i in range(len(obs))]\n",
"\n",
"    if sum(exp) > 100:\n",
"        obs = np.array(obs)\n",
"        exp = np.array(exp)\n",
"\n",
"        res = scipy.stats.chisquare(obs, exp)\n",
"        chi2_vals.append(res.statistic)\n",
"        pvals.append(res.pvalue)\n",
"        qtys.append(qty)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Domains Analyzed: 27688\n",
"Weighted avg chi-squared: 3.46E+09\n",
"Unweighted avg chi-squared: 9.04E+03\n",
"Weighted avg p-value: 8.37E-08\n",
"Unweighted avg p-value: 5.47E-10\n"
]
}
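],
"source": [
"# Quantity-weighted sums, divided by the number of domains analyzed\n",
"# (not by the total weight sum(qtys)).\n",
"chi2_avg = 0\n",
"pvals_avg = 0\n",
"for i in range(len(qtys)):\n",
"    chi2_avg += chi2_vals[i] * qtys[i]\n",
"    pvals_avg += pvals[i] * qtys[i]\n",
"chi2_avg /= len(qtys)\n",
"pvals_avg /= len(qtys)\n",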
"\n",
"template = \"{:<30}{:>10.3G}\"\n",
"print(\"{:<30}{:>10}\".format(\"Domains Analyzed:\", len(qtys)))\n",
"print(template.format(\"Weighted avg chi-squared:\", chi2_avg))\n",
"print(template.format(\"Unweighted avg chi-squared:\", sum(chi2_vals)/len(chi2_vals)))\n",
"print(template.format(\"Weighted avg p-value:\", pvals_avg))\n",
"print(template.format(\"Unweighted avg p-value:\", sum(pvals)/len(pvals)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Regardless of how we weight the average, the $p$-values for our $\\chi^2$ tests are extremely small. Thus, we can conclude that domains do not select licenses for the works they host at random. Instead, a domain is more likely to host works under only a few different licenses.\n",
"\n",
"Some technical details: we ran the computation only for the 8 most popular licenses and only for domains whose expected counts across those licenses sum to more than 100 (roughly, domains with more than 100 licensed works), which leaves only around 12% of the domains in the dataset. Both restrictions ensure that the expected count for each license type is above 5, the minimum recommended for applying a $\\chi^2$ test. We could alternatively opt for the exact multinomial test, but this would be very computationally expensive, especially if we wanted to consider more than just the top 8 licenses. "
]
},
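{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a sketch of the statistic behind each per-domain test, in notation introduced here rather than taken from the code ($O_i$, $E_i$, $n$ and $k$ are assumed symbols). For a domain with $n$ licensed works, let $O_i$ be its observed number of works under the $i$-th of the $k = 8$ most popular licenses and $p_i$ that license's share of all usages in the dataset:\n",
"\n",
"$$E_i = p_i\\,n, \\qquad \\chi^2 = \\sum_{i=1}^{k} \\frac{(O_i - E_i)^2}{E_i}.$$\n",
"\n",
"`scipy.stats.chisquare` compares this value against a $\\chi^2$ distribution with $k - 1 = 7$ degrees of freedom by default. The two restrictions guarantee the usual rule of thumb: every selected license has $p_i > 0.05$, and $\\sum_i E_i = n \\sum_i p_i > 100$ with $\\sum_i p_i \\le 1$ forces $n > 100$, so each $E_i = p_i\\,n > 0.05 \\cdot 100 = 5$."
]
},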
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### License Attributes for Popular Domains"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now turn our attention to the domains with the most CC-licensed works."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"licenses_qty = collections.defaultdict(int)\n",
"\n",
"for node_id, cc_licenses in all_cc_licenses.items():\n",
"    for lisc, qty in cc_licenses.items():\n",
"        licenses_qty[node_id] += qty\n",
"\n",
"sorted_domains = list(licenses_qty.items())\n",
"sorted_domains.sort(key=lambda x: x[1], reverse=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking at large domains that have 98% of works with the same license\n",
"Domain License Count Prop\n",
"wikipedia ('by-sa', '3.0') 5340549 48.119%\n",
"stackexchange ('by-sa', '4.0') 1228722 28.994%\n",
"wiktionary ('by-sa', '3.0') 564437 5.089%\n",
"wikisource ('by-sa', '3.0') 296031 2.662%\n",
"wikiquote ('by-sa', '3.0') 183751 1.657%\n",
"globalvoices ('by', '3.0') 123121 5.353%\n",
"wikibooks ('by-sa', '3.0') 108649 0.977%\n",
"stackoverflow ('by-sa', '4.0') 108560 2.556%\n",
"google ('by', '4.0') 103215 2.112%\n",
"ifixit ('by-nc-sa', '3.0') 86196 2.290%\n",
"dbpedia ('by-sa', '3.0') 58892 0.531%\n",
"europa ('by', '2.5') 57168 12.099%\n",
"formez ('by-sa', '4.0') 55177 1.289%\n",
"ingv ('by', '4.0') 53519 1.098%\n",
"ucla ('by', '4.0') 49421 0.997%\n",
"tue ('by-nc-sa', '4.0') 48876 1.813%\n",
"igem ('by', '3.0') 48399 2.083%\n",
"waset ('by', '4.0') 46692 0.960%\n",
"openstack ('by', '3.0') 45506 1.984%\n",
"libreoffice ('by-sa', '3.0') 42556 0.381%\n",
"wikicfp ('by-sa', '3.0') 41569 0.375%\n",
"bashkortostan ('by', '4.0') 41113 0.846%\n",
"upenn ('by', '4.0') 41094 0.829%\n",
"wikidoc ('by-sa', '3.0') 39845 0.358%\n",
"fanpage ('by-nc-nd', '3.0') 38376 1.149%\n",
"mozilla ('by-sa', '3.0') 38284 0.339%\n",
"familysearch ('by-sa', '4.0') 38153 0.900%\n",
"cite-sciences ('by-sa', '3.0') 37727 0.340%\n",
"xfce ('by-nc-sa', '4.0') 36836 1.370%\n",
"ladyada ('by-sa', '4.0') 35934 0.849%\n",
"re3data ('cc0', '1.0') 35916 4.427%\n",
"uni-greifswald ('cc0', '1.0') 34469 4.229%\n",
"embrapa ('by-nc-nd', '4.0') 34377 1.512%\n",
"wn ('by-sa', '3.0') 33265 0.300%\n",
"mixxx ('by-nc-sa', '4.0') 33132 1.232%\n",
"libhunt ('by-sa', '4.0') 32610 0.770%\n",
"ubuntu-fr ('by-sa', '3.0') 32457 0.292%\n",
"kernel ('by', '4.0') 31681 0.650%\n",
"meneame ('by-sa', '3.0') 30838 0.278%\n",
"wikifur ('by-sa', '4.0') 30731 0.720%\n",
"freecadweb ('by', '3.0') 30688 1.341%\n",
"stanford ('by', '3.0') 30428 1.310%\n",
"deakin ('by', '4.0') 29863 0.606%\n",
"nh ('by', '4.0') 29448 0.606%\n",
"desciclopedia ('by-nc-sa', '3.0') 28988 0.770%\n",
"bu ('by', '4.0') 28721 0.581%\n",
"aari ('by-nc-sa', '3.0') 28259 0.751%\n",
"kremlin ('by', '4.0') 28212 0.580%\n",
"secondlife ('by-sa', '3.0') 28080 0.253%\n",
"kasahorow ('cc0', '1.0') 27828 3.435%\n",
"teara ('by-nc', '3.0') 27054 3.462%\n",
"audio-lingua ('by-nc-sa', '3.0') 27042 0.719%\n",
"kcl ('by-nc', '4.0') 26867 2.584%\n",
"musee-mccord ('by-nc-nd', '2.5') 26803 2.608%\n",
"astroerrante ('by-nc-sa', '3.0') 26341 0.700%\n",
"uni-trier ('cc0', '1.0') 26207 3.233%\n"
]
}
],
"source": [
"mult_lisc_domains = []\n",
"\n",
"print(\"Looking at large domains that have 98% of works with the same license\")\n",
"template = \"{:<20}{:<20}{:>10} {:>10.3f}%\"\n",
"print(\"{:<20}{:<20}{:>10} {:>10}\".format(\"Domain\", \"License\", \"Count\", \"Prop\"))\n",
"for domain, _ in sorted_domains[:100]:\n",
"    cc_licenses = g.nodes[domain]['cc_licenses']\n",
"    xlabels = []\n",
"    heights = []\n",
"    for lisc, qty in cc_licenses.items():\n",
"        xlabels.append(lisc)\n",
"        heights.append(qty)\n",
"    xlabels.sort(key=lambda x: cc_licenses[x], reverse=True)\n",
"    heights.sort(reverse=True)\n",
"\n",
"    total = sum(heights)\n",
"    if heights[0] > 0.98*total:\n",
"        print(template.format(domain, xlabels[0], total, 100 * heights[0]/usages[xlabels[0]]))\n",
"    else:\n",
"        mult_lisc_domains.append(domain)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('by-sa', '3.0') 11089625 wikipedia 48.119\n",
"('by-sa', '4.0') 4234494 stackexchange 28.994\n",
"('by-nd', '3.0') 384638 wordpress 20.463\n",
"('by-nd', '2.5') 35781 blogspot 17.965\n",
"('by-nd-nc', '1.0') 17183 blogspot 10.580\n",
"('by-nc-nd', '2.5') 1027532 blogspot 11.783\n",
"('by-nc', '2.5') 237377 blogspot 34.191\n",
"('by', '2.5') 469121 wikinews 19.731\n",
"('by', '2.5') 469121 europa 12.099\n",
"('by-nc-nd', '2.0') 394289 sherpa 12.688\n",
"('gpl', '2.0') 129596 tuwien 18.799\n",
"('by', '1.0') 30028 xwiki 67.567\n",
"('by-nc', '2.0') 114218 aadl 14.309\n",
"('pdm', '1.0') 101682 publicdomainfiles 12.118\n",
"('by-nd-nc', '1.0') 17183 indymedia 16.819\n",
"('by-sa', '2.1') 18105 oops 49.301\n",
"('by-nc-sa', '1.0') 15202 luomus 45.770\n",
"('by-nc', '2.1') 13489 agora-web 41.597\n",
"('by-nd-nc', '1.0') 17183 keskusta 26.491\n",
"('by-nc', '2.1') 13489 foroatletismo 31.596\n",
"('by-nd', '2.5') 35781 book-log 11.288\n",
"('by', '2.1') 18472 nig 17.383\n",
"('by-nc-sa', '2.1') 14935 yamaguchistore 14.376\n",
"('by', '2.1') 18472 diarimaresme 10.995\n",
"('by-nd-nc', '1.0') 17183 noreporter 11.407\n",
"('by-nd', '2.1') 4226 g-mark 40.724\n",
"('by-sa', '1.0') 4408 wikicreole 31.647\n",
"('by-nc', '1.0') 2485 mensa 40.523\n",
"('by-nd', '1.0') 1208 coastsider 81.705\n",
"('zero', '1.0') 3464 k1v1n 25.693\n",
"('sa', '1.0') 1063 jpn 44.591\n",
"('by-nd', '2.1') 4226 wp-simplicity 13.867\n",
"('by-nc', '1.0') 2485 samizdata 23.018\n",
"('by-nd', '2.1') 4226 wp-cocoon 10.530\n",
"('zero', '1.0') 3464 italianostranieri 12.385\n",
"('sa', '1.0') 1063 communitywiki 26.152\n",
"('by-nc', '1.0') 2485 tofufortwo 10.342\n"
]
}
],
"source": [
"lisc_10_pct_single_domain = []\n",
"for domain, _ in sorted_domains:\n",
"    cc_licenses = g.nodes[domain]['cc_licenses']\n",
"    for lisc in cc_licenses:\n",
"        if cc_licenses[lisc] > 0.10 * usages[lisc]:\n",
"            lisc_10_pct_single_domain.append((\n",
"                lisc,\n",
"                usages[lisc],\n",
"                domain,\n",
"                100 * cc_licenses[lisc]/usages[lisc]))\n",
"\n",
"# It's only interesting that a license is dominated by a single domain\n",
"# if that license is sufficiently large.\n",
"# We threshold at 1000 usages for that license.\n",
"lisc_10_pct_single_domain = list(filter(lambda x: x[1] > 1000, lisc_10_pct_single_domain))\n",
"for t in lisc_10_pct_single_domain:\n",
"    print(\"{:<20}{:>10}{:>20}{:>10.3f}\".format(*t))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some observations: many of the domains with the most CC works use one type of license almost exclusively. There isn't an immediately obvious correlation between the type of institution and the license it uses; for example, UCLA and Uni Trier use different licenses (`by 4.0` and `cc0 1.0`, respectively) despite both being university websites. \n",
"\n",
"However, one thing to note is that Wikipedia accounts for almost half (48%) of all uses of the `by-sa 3.0` license. (Similarly, StackExchange accounts for more than a quarter (29%) of the usage of the `by-sa 4.0` license.) Perhaps in future analysis, we should consider controlling for these effects. For example, it is reasonable to think that many domains with `by-sa 3.0`-licensed works are merely linking from Wikipedia, so we would want to compute the influence of Wikipedia on promoting CC works separately from the rest of the domains that use `by-sa 3.0`. \n",
"\n",
"We also see that, surprisingly, blogspot contributes more than 10% of the usage of each of 4 different licenses, the smallest of which has about 17,000 usages. "
]
},
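{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the last column of the two tables above concrete (the symbols $c_{d,\\ell}$, $u_\\ell$ and $s_{d,\\ell}$ are notation introduced here, not names from the code): if $c_{d,\\ell}$ is the number of works on domain $d$ under license $\\ell$ and $u_\\ell$ is the total usage of $\\ell$ across the dataset, the printed percentage is\n",
"\n",
"$$s_{d,\\ell} = 100 \\cdot \\frac{c_{d,\\ell}}{u_\\ell}.$$\n",
"\n",
"For example, Wikipedia's 48.119% share of the 11,089,625 `by-sa 3.0` usages corresponds to roughly $0.48119 \\times 11{,}089{,}625 \\approx 5.34$ million works. Note that this is the domain's share of the license's global usage, not the share of the domain's own works that fall under that license."
]
},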
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Warning: This function is not optimized.\n",
"# Cell takes ~45 s to run\n",
"def compute_entropy(dist):\n",
"    \"\"\"Computes the entropy of the distribution with a leave-one-out error bar.\n",
"\n",
"    The entropy is in nats (scipy.stats.entropy defaults to base e).\n",
"\n",
"    Parameters:\n",
"        dist: a list or iterable containing count data\n",
"\n",
"    Returns:\n",
"        H: entropy of the distribution\n",
"        error: error computed using the max change in H due to leaving out one point\n",
"    \"\"\"\n",
"    dist = list(dist)\n",
"    if sum(dist) == 1:\n",
"        return 0, 0\n",
"    H = scipy.stats.entropy(dist)\n",
"    error = 0\n",
"    for i in range(len(dist)):\n",
"        if dist[i] >= 1:\n",
"            dist[i] -= 1\n",
"            H_alt = scipy.stats.entropy(dist)\n",
"            error = max(error, abs(H_alt - H))\n",
"            dist[i] += 1\n",
"    return H, error\n",
"\n",
"# The third column is named 'qty' to match the assignments in the loop below\n",
"entropies = pd.DataFrame(index=all_cc_licenses.keys(),\n",
"                         columns=['entropy', 'error', 'qty'],\n",
"                         dtype='float64')\n",
"for node_id, cc_licenses in all_cc_licenses.items():\n",
"    H, error = compute_entropy(cc_licenses.values())\n",
"    qty = sum(cc_licenses.values())\n",
"    entropies.loc[node_id, 'entropy'] = H\n",
"    entropies.loc[node_id, 'error'] = error\n",
"    entropies.loc[node_id, 'qty'] = qty"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking at the high entropy domains\n"
]
},
{
"data": {
"text/plain": [
" entropy error total qty\n",
"ctan 2.847428 0.048607 NaN 37.0\n",
"moneysoldiers 2.775219 0.015649 NaN 70.0\n",
"tecnologiahechapalabra 2.772993 0.042073 NaN 53.0\n",
"dogmazic 2.635713 0.004798 NaN 751.0\n",
"k-blogg 2.575363 0.034064 NaN 85.0\n",
"kritisches-netzwerk 2.544010 0.004908 NaN 490.0\n",
"freecomputerbooks 2.522859 0.021830 NaN 165.0\n",
"marisolcollazos 2.511022 0.010756 NaN 146.0\n",
"southernspaces 2.510129 0.029601 NaN 108.0\n",
"blogspot 2.503455 0.000016 NaN 777251.0\n",
"arc2020 2.499437 0.046797 NaN 54.0\n",
"ethz 2.477322 0.005917 NaN 901.0\n",
"archive 2.476347 0.001085 NaN 6770.0\n",
"amerika21 2.470962 0.015124 NaN 275.0\n",
"fu-berlin 2.465522 0.011975 NaN 373.0\n",
"kontactr 2.449990 0.054750 NaN 43.0\n",
"ubc 2.434573 0.001457 NaN 4840.0\n",
"wixsite 2.432783 0.006688 NaN 783.0\n",
"umd 2.430791 0.088797 NaN 15.0\n",
"livescience 2.422214 0.053938 NaN 45.0\n",
"facilisimo 2.409830 0.020131 NaN 192.0\n",
"wordpress 2.407837 0.000015 NaN 785389.0\n",
"paperblog 2.403983 0.044051 NaN 63.0\n",
"ncsu 2.394921 0.003533 NaN 1714.0\n",
"educavox 2.387413 0.017227 NaN 238.0\n",
"thewavingcat 2.372595 0.034808 NaN 91.0\n",
"auboutdufil 2.357432 0.016818 NaN 248.0\n",
"forskning 2.350311 0.057123 NaN 43.0\n",
"bund 2.337589 0.039470 NaN 77.0\n",
"yahoo 2.333567 0.039523 NaN 77.0\n",
"wildflowersearch 2.332792 0.000962 NaN 7954.0\n",
"bibliothekarisch 2.330734 0.014471 NaN 304.0\n",
"curationist 2.327312 0.054611 NaN 47.0\n",
"belltower 2.312601 0.067769 NaN 33.0\n",
"interweb3000 2.306437 0.082136 NaN 23.0\n",
"haykranen 2.303488 0.097890 NaN 15.0\n",
"indrastra 2.303471 0.016727 NaN 254.0\n",
"sciencenews 2.297355 0.014544 NaN 305.0\n",
"freitag 2.296689 0.043947 NaN 67.0\n",
"motargument 2.296587 0.017957 NaN 232.0\n",
"ifross 2.292613 0.054640 NaN 48.0\n",
"scienceline 2.285979 0.031082 NaN 111.0\n",
"patagoniawildflowers 2.283880 0.016360 NaN 263.0\n",
"theseus 2.281336 0.019655 NaN 207.0\n",
"ctolib 2.275146 0.044721 NaN 66.0\n",
"globalrights 2.274977 0.060657 NaN 41.0\n",
"schlaglichter 2.271072 0.030626 NaN 114.0\n",
"israelnetz 2.269855 0.028175 NaN 128.0\n",
"uni-koeln 2.269696 0.012294 NaN 381.0\n",
"queensu 2.268783 0.031238 NaN 111.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking at the high entropy domains with > 1000 works\n"
]
},
{
"data": {
"text/plain": [
" entropy error total qty\n",
"blogspot 2.503455 0.000016 NaN 777251.0\n",
"archive 2.476347 0.001085 NaN 6770.0\n",
"ubc 2.434573 0.001457 NaN 4840.0\n",
"wordpress 2.407837 0.000015 NaN 785389.0\n",
"ncsu 2.394921 0.003533 NaN 1714.0\n",
"wildflowersearch 2.332792 0.000962 NaN 7954.0\n",
"weebly 2.176437 0.001168 NaN 6509.0\n",
"free 2.162823 0.000327 NaN 27726.0\n",
"github 2.155364 0.000745 NaN 10940.0\n",
"creativecommons 2.155298 0.003488 NaN 1822.0\n",
"hatenablog 2.146505 0.005132 NaN 1151.0\n",
"mentalfloss 2.143123 0.003733 NaN 1685.0\n",
"unlp 2.127393 0.001742 NaN 4133.0\n",
"blogs 2.112163 0.004780 NaN 1262.0\n",
"uab 2.102620 0.000236 NaN 33613.0\n",
"reset 2.101141 0.003802 NaN 1662.0\n",
"lse 2.067014 0.000744 NaN 11087.0\n",
"freemusicarchive 2.064793 0.002142 NaN 3283.0\n",
"uio 2.043913 0.004793 NaN 1275.0\n",
"merlot 2.009391 0.002407 NaN 2893.0\n",
"ranker 2.007076 0.004732 NaN 1304.0\n",
"ua 1.995738 0.003822 NaN 1684.0\n",
"ceon 1.994244 0.004155 NaN 1117.0\n",
"xn--untergrund-blttle-2qb 1.979329 0.004271 NaN 1481.0\n",
"snl 1.978340 0.002457 NaN 2164.0\n",
"fluchtgrund 1.958010 0.004335 NaN 1069.0\n",
"livejournal 1.942301 0.004938 NaN 1255.0\n",
"fro 1.925841 0.003328 NaN 2008.0\n",
"ektoplazm 1.917095 0.004848 NaN 1289.0\n",
"tsarizm 1.910577 0.005477 NaN 1116.0\n",
"lecturio 1.905781 0.004422 NaN 1441.0\n",
"boell 1.899089 0.002933 NaN 2339.0\n",
"demokratisch-links 1.893770 0.005399 NaN 1139.0\n",
"unesp 1.868239 0.002034 NaN 1604.0\n",
"hypotheses 1.867037 0.000674 NaN 12734.0\n",
"blog 1.849684 0.000367 NaN 25308.0\n",
"libsyn 1.843547 0.002548 NaN 2783.0\n",
"hs-hannover 1.843229 0.003191 NaN 1272.0\n",
"tolweb 1.819488 0.001467 NaN 5286.0\n",
"uv 1.816176 0.002626 NaN 2699.0\n",
"unizar 1.812974 0.005031 NaN 1258.0\n",
"uba 1.808644 0.001764 NaN 4284.0\n",
"canberra 1.790755 0.005386 NaN 1165.0\n",
"greciantiga 1.790670 0.002921 NaN 2394.0\n",
"eklablog 1.789438 0.000595 NaN 14825.0\n",
"unc 1.786398 0.005220 NaN 1210.0\n",
"uc 1.784737 0.000837 NaN 10072.0\n",
"ufba 1.784536 0.004375 NaN 1492.0\n",
"news 1.770598 0.002627 NaN 2718.0\n",
"workingpreacher 1.770494 0.002264 NaN 1506.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"entropies = entropies.sort_values(by='entropy', ascending=False)\n",
"print(\"Looking at the high entropy domains\")\n",
"display(entropies.iloc[:50, :])\n",
"print(\"Looking at the high entropy domains with > 1000 works\")\n",
"display(entropies[entropies['qty'] > 1000].iloc[:50, :])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Earlier, we considered domains that predominantly use a single type of license. For sites like Wikipedia and StackExchange, all articles/answers are licensed under the same CC license. However, it is also interesting to consider domains that host a diverse catalog of CC-licensed materials. One way to measure this mathematically is to compute the entropy of the distribution, which is computed as $$H = -\\sum_i p_i \\ln(p_i)$$ where $p_i$ is the probability that a randomly sampled work from the domain will have license type $i$ (the natural logarithm is the default of `scipy.stats.entropy`, which the code above uses). (I also compute the maximum leave-one-out error for this quantity to capture information about the uncertainty of this value.) High entropies correspond to domains that have a diverse catalog.\n",
"\n",
"When we look only at the sites with the highest entropies, we see some unexpected results: CTAN, MoneySoldiers, and TecnologiaHechaPalabra are not sites that most people are familiar with. However, when we restrict to domains with more than 1000 CC-licensed works, we see a lot of educational sites (ubc, ncsu, unc, uc) and hosting platforms with user-submitted content (blogspot, archive, wordpress, weebly, github). It seems that user-contributed content on such platforms might be the key to increasing the diversity of CC licenses. We have already noted that license usage decays exponentially with rank, which would suggest that the most popular licenses influence others to use the same license in a network effect. However, exposure to a diverse set of licenses might help users learn about CC and better choose the correct license for their needs, so assessing the impact of license diversity seems worthwhile."
]
},
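{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the entropy calculation described above, the following cell computes $H$ from a dictionary of license counts. The helper names `license_entropy` and `max_leave_one_out_error` and the toy counts are hypothetical. The leave-one-out step is only one plausible reading of the error estimate (remove a single work from each license in turn and report the largest change in $H$), so the notebook's own implementation may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def license_entropy(counts):\n",
"    # Shannon entropy (in bits) of a {license: count} dictionary\n",
"    total = sum(counts.values())\n",
"    probs = [c / total for c in counts.values() if c > 0]\n",
"    return -sum(p * math.log2(p) for p in probs)\n",
"\n",
"def max_leave_one_out_error(counts):\n",
"    # Assumed definition: largest change in entropy when one work is removed\n",
"    base = license_entropy(counts)\n",
"    errors = []\n",
"    for lic, c in counts.items():\n",
"        if c == 0:\n",
"            continue\n",
"        reduced = dict(counts)\n",
"        reduced[lic] = c - 1\n",
"        if sum(reduced.values()) == 0:\n",
"            continue  # nothing left to sample from\n",
"        errors.append(abs(license_entropy(reduced) - base))\n",
"    return max(errors, default=0.0)\n",
"\n",
"# Hypothetical counts for two toy domains\n",
"uniform = {('by', '4.0'): 50, ('by-sa', '4.0'): 50}\n",
"single = {('by-sa', '3.0'): 100}\n",
"print(license_entropy(uniform))\n",
"print(license_entropy(single))\n",
"print(max_leave_one_out_error(uniform))"
]
},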
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## License Subgraphs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also consider some simple degree statistics on the subgraphs of the most popular licenses. Each subgraph is induced by the nodes whose most popular license is the given license (e.g. a domain with 2 `by-sa 3.0` works and 1 `gpl 3` work would be a node in the `by-sa 3.0` subgraph)."
]
},
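{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rule above can be sketched on a toy graph as follows. This is only an illustration under assumptions: the helper `dominant_license_subgraph` is hypothetical and is not the implementation of `cc_graph_ops.restrict_graph_by_license`, which is not shown in this notebook. It assumes each node stores a `cc_licenses` dictionary mapping a license to a usage count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"\n",
"def dominant_license_subgraph(graph, license):\n",
"    # Keep only the nodes whose most-used license equals `license`\n",
"    keep = []\n",
"    for node, data in graph.nodes(data=True):\n",
"        usage = data.get('cc_licenses')\n",
"        if isinstance(usage, dict) and usage:\n",
"            if max(usage, key=usage.get) == license:\n",
"                keep.append(node)\n",
"    return graph.subgraph(keep)\n",
"\n",
"# Toy graph with made-up domains and counts\n",
"toy = nx.DiGraph()\n",
"toy.add_node('a', cc_licenses={('by-sa', '3.0'): 2, ('gpl', '3'): 1})\n",
"toy.add_node('b', cc_licenses={('gpl', '3'): 5})\n",
"toy.add_node('c', cc_licenses={('by-sa', '3.0'): 7})\n",
"toy.add_edges_from([('a', 'b'), ('a', 'c')])\n",
"\n",
"sub = dominant_license_subgraph(toy, ('by-sa', '3.0'))\n",
"print(sorted(sub.nodes()))\n",
"print(list(sub.edges()))"
]
},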
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Takes ~10 sec to run\n",
"subgraph_by_license = dict()\n",
"for license in licenses:\n",
"    subgraph_by_license[license] = cc_graph_ops.restrict_graph_by_license(g, license)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('by-sa', '3.0')\n",
"DescribeResult(nobs=26357, minmax=(0, 17133), mean=7.628713434761164, variance=17692.071884567937, skewness=107.89284684691681, kurtosis=12744.57371228335)\n",
"('by', '4.0')\n",
"DescribeResult(nobs=21077, minmax=(0, 10212), mean=10.224225459031171, variance=9707.230608772095, skewness=81.81495211055676, kurtosis=7804.796615616107)\n",
"('by-sa', '4.0')\n",
"DescribeResult(nobs=19536, minmax=(0, 1927), mean=4.002354627354627, variance=762.7109235570588, skewness=39.66665737679886, kurtosis=2194.2244732485296)\n",
"('by-nc-sa', '3.0')\n",
"DescribeResult(nobs=16266, minmax=(0, 1169), mean=3.4518627812615272, variance=373.13303464849236, skewness=30.292218809745354, kurtosis=1373.006675423128)\n",
"('by-nc-nd', '3.0')\n",
"DescribeResult(nobs=14407, minmax=(0, 7862), mean=5.515374470743389, variance=8778.608520377255, skewness=76.01113591801361, kurtosis=6128.541715246596)\n",
"('by-nc-sa', '4.0')\n",
"DescribeResult(nobs=15456, minmax=(0, 4045), mean=2.962085921325052, variance=1290.5780512681913, skewness=95.67269014823553, kurtosis=10463.057344924542)\n",
"('by', '3.0')\n",
"DescribeResult(nobs=26354, minmax=(0, 10471), mean=4.463914396296578, variance=10040.110886145443, skewness=95.97746991654843, kurtosis=9592.317754133524)\n",
"('by-nc-nd', '4.0')\n",
"DescribeResult(nobs=13650, minmax=(0, 2330), mean=2.4205128205128204, variance=509.380046626878, skewness=84.48061120837981, kurtosis=8416.113115282138)\n",
"('by-nc', '4.0')\n",
"DescribeResult(nobs=7390, minmax=(0, 192), mean=1.6056833558863328, variance=41.9478902827247, skewness=15.067042565267231, kurtosis=336.6939509412717)\n",
"('by-nc-nd', '2.5')\n",
"DescribeResult(nobs=2940, minmax=(0, 182), mean=1.554421768707483, variance=40.1851965011932, skewness=16.777186908225385, kurtosis=373.1504485459005)\n",
"('cc0', '1.0')\n",
"DescribeResult(nobs=5762, minmax=(0, 914), mean=2.073932662270045, variance=198.97491836241497, skewness=49.07750640942431, kurtosis=3056.0173156535498)\n",
"('by-nc', '3.0')\n",
"DescribeResult(nobs=3942, minmax=(0, 79), mean=1.3221714865550482, variance=18.837057145921104, skewness=7.829982663912705, kurtosis=84.7900918646856)\n",
"('by-nc-sa', '2.5')\n",
"DescribeResult(nobs=2972, minmax=(0, 341), mean=1.465006729475101, variance=54.96309683603684, skewness=33.618144924075196, kurtosis=1486.9969985639234)\n",
"('by', '2.5')\n",
"DescribeResult(nobs=2045, minmax=(0, 132), mean=1.1814180929095355, variance=28.59574064947679, skewness=16.64853953887616, kurtosis=352.21903308304604)\n",
"('by-nc-sa', '2.0')\n",
"DescribeResult(nobs=5407, minmax=(0, 951), mean=1.1514703162567044, variance=179.4086100279293, skewness=66.19203829014891, kurtosis=4676.8026178683795)\n",
"('by', '2.0')\n",
"DescribeResult(nobs=22775, minmax=(0, 17814), mean=4.931899012074643, variance=15345.361876498104, skewness=132.63169473930125, kurtosis=18822.733879101943)\n",
"('by-nc-nd', '2.0')\n",
"DescribeResult(nobs=5759, minmax=(0, 281), mean=0.9369682236499393, variance=24.613775523651434, skewness=35.14375300043892, kurtosis=1812.819731743664)\n",
"('by-nd', '3.0')\n",
"DescribeResult(nobs=1939, minmax=(0, 118), mean=0.8313563692625064, variance=12.929748984906519, skewness=20.670694918830787, kurtosis=605.8473106698779)\n",
"('by-sa', '2.0')\n",
"DescribeResult(nobs=11471, minmax=(0, 900), mean=1.270682590881353, variance=107.79045602051553, skewness=61.43580641837381, kurtosis=4999.628095953764)\n",
"('by-nc', '2.5')\n",
"DescribeResult(nobs=712, minmax=(0, 86), mean=0.5589887640449438, variance=11.158259454163305, skewness=23.614206904434706, kurtosis=599.9442323987463)\n",
"('by-sa', '2.5')\n",
"DescribeResult(nobs=1920, minmax=(0, 139), mean=0.7541666666666667, variance=17.90305714782005, skewness=20.663567247361858, kurtosis=607.5698143239433)\n",
"('by-nd', '4.0')\n",
"DescribeResult(nobs=2028, minmax=(0, 82), mean=0.7830374753451677, variance=9.518764918180498, skewness=14.532084119611664, kurtosis=306.09515431033935)\n",
"('gpl', '2.0')\n",
"DescribeResult(nobs=798, minmax=(0, 747), mean=2.7593984962406015, variance=702.6898425486556, skewness=27.79204101713537, kurtosis=777.5793846096772)\n",
"('by-nc', '2.0')\n",
"DescribeResult(nobs=2639, minmax=(0, 29), mean=0.35392194012883665, variance=2.081666758119662, skewness=9.422718433041185, kurtosis=126.92577372081226)\n",
"('pdm', '1.0')\n",
"DescribeResult(nobs=434, minmax=(0, 48), mean=0.7649769585253456, variance=9.616692031800428, skewness=11.238990806583137, kurtosis=150.6147039948908)\n",
"('by-nd', '2.0')\n",
"DescribeResult(nobs=2268, minmax=(0, 23), mean=0.09876543209876543, variance=0.38547708125711344, skewness=24.288089424016846, kurtosis=835.7485646742024)\n",
"('by-nd', '2.5')\n",
"DescribeResult(nobs=202, minmax=(0, 8), mean=0.40594059405940597, variance=1.0881237377469088, skewness=3.7970412675102683, kurtosis=18.211733053861508)\n",
"('by', '1.0')\n",
"DescribeResult(nobs=77, minmax=(0, 2), mean=0.15584415584415584, variance=0.21223513328776478, skewness=3.002701176533076, kurtosis=8.182886415085537)\n",
"('by', '2.1')\n",
"DescribeResult(nobs=251, minmax=(0, 33), mean=1.5059760956175299, variance=14.266964143426295, skewness=5.200808296163103, kurtosis=31.91242345332809)\n",
"('by-sa', '2.1')\n",
"DescribeResult(nobs=151, minmax=(0, 11), mean=0.9139072847682119, variance=3.8258719646799113, skewness=3.1885825695405807, kurtosis=10.898761415704524)\n",
"('by-nd-nc', '1.0')\n",
"DescribeResult(nobs=112, minmax=(0, 5), mean=0.3392857142857143, variance=0.6045688545688545, skewness=3.14409504726106, kurtosis=12.283202398657927)\n",
"('by-nc-nd', '2.1')\n",
"DescribeResult(nobs=193, minmax=(0, 26), mean=2.010362694300518, variance=40.11447538860104, skewness=3.2919026546646006, kurtosis=9.059583916364614)\n",
"('by-nc-sa', '1.0')\n",
"DescribeResult(nobs=184, minmax=(0, 3), mean=0.21739130434782608, variance=0.280351627464956, skewness=2.607903779792756, kurtosis=6.759218615340421)\n",
"('by-nc-sa', '2.1')\n",
"DescribeResult(nobs=174, minmax=(0, 19), mean=0.9310344827586207, variance=5.162846322503489, skewness=5.585469557167569, kurtosis=37.29971369607226)\n",
"('by-nc', '2.1')\n",
"DescribeResult(nobs=94, minmax=(0, 17), mean=0.5957446808510638, variance=4.544497826584306, skewness=5.6917591359425455, kurtosis=37.35958038625602)\n",
"('by-sa', '1.0')\n",
"DescribeResult(nobs=166, minmax=(0, 7), mean=0.37349397590361444, variance=1.605111354508945, skewness=3.720487066225204, kurtosis=13.480478982808247)\n",
"('by-nd', '2.1')\n",
"DescribeResult(nobs=39, minmax=(0, 4), mean=0.3076923076923077, variance=0.7449392712550608, skewness=3.0987907428123607, kurtosis=9.05006498109641)\n",
"('zero', '1.0')\n",
"DescribeResult(nobs=22, minmax=(0, 1), mean=0.09090909090909091, variance=0.08658008658008656, skewness=2.8460498941515424, kurtosis=6.100000000000003)\n",
"('by-nc', '1.0')\n",
"DescribeResult(nobs=28, minmax=(0, 1), mean=0.07142857142857142, variance=0.06878306878306878, skewness=3.3282011773513753, kurtosis=9.076923076923077)\n",
"('by-nd', '1.0')\n",
"DescribeResult(nobs=11, minmax=(0, 0), mean=0.0, variance=0.0, skewness=0.0, kurtosis=-3.0)\n",
"('sa', '1.0')\n",
"DescribeResult(nobs=29, minmax=(0, 2), mean=0.13793103448275862, variance=0.1945812807881773, skewness=3.2476863816034554, kurtosis=9.833600384553764)\n",
"('by-nd-nc', '2.0')\n",
"DescribeResult(nobs=11, minmax=(0, 3), mean=0.9090909090909091, variance=1.8909090909090907, skewness=0.8940540422332118, kurtosis=-1.0888498520710062)\n",
"('by-nc-nd', '1.0')\n",
"DescribeResult(nobs=6, minmax=(0, 0), mean=0.0, variance=0.0, skewness=0.0, kurtosis=-3.0)\n",
"('nc-sa', '1.0')\n",
"DescribeResult(nobs=9, minmax=(0, 0), mean=0.0, variance=0.0, skewness=0.0, kurtosis=-3.0)\n",
"('lgpl', '2.1')\n",
"DescribeResult(nobs=26, minmax=(0, 2), mean=0.15384615384615385, variance=0.21538461538461542, skewness=3.0280669583701694, kurtosis=8.347755102040813)\n",
"('by', '5.0')\n",
"DescribeResult(nobs=7, minmax=(0, 2), mean=0.5714285714285714, variance=0.9523809523809522, skewness=0.9486832980505139, kurtosis=-1.0999999999999994)\n"
]
}
],
"source": [
"for license in licenses:\n",
"    subgraph = subgraph_by_license[license]\n",
"    # Skip very small subgraphs\n",
"    if len(subgraph) > 5:\n",
"        degree_sequence = [d for n, d in subgraph.degree()]\n",
"        basic_stats = scipy.stats.describe(degree_sequence)\n",
"        print(license)\n",
"        print(basic_stats)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Centrality and Community Measures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also want to measure some basic centrality metrics on a subset of the data, to see whether anything interesting appears and to help us decide what to implement when we scale up. Because these metrics are usually expensive to compute, we restrict to the nodes with the highest in-degree. We don't expect this to change the results much, because most domains on the internet have little to no influence."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"tags": [
"outputPrepend"
]
},
"outputs": [],
"source": [
"in_degrees = list(g.in_degree())\n",
"in_degrees.sort(key=lambda x: x[1], reverse=True)\n",
"\n",
"cited_domains = [domain for domain, degree in in_degrees]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def benchmark_centrality_metrics(metrics, graph, print_output=True):\n",
"    for mt in metrics:\n",
"        res = cc_graph_ops.time_method(mt, graph)\n",
"        # Rank nodes from highest to lowest score\n",
"        res = list(res.items())\n",
"        res.sort(key=lambda x: x[1], reverse=True)\n",
"        if print_output:\n",
"            display(res[:20])\n",
"            print()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"metrics = [\n",
"    nx.eigenvector_centrality,\n",
"    nx.pagerank,\n",
"    nx.closeness_centrality,\n",
"    # nx.betweenness_centrality,  # takes a long time to run\n",
"    # nx.katz_centrality,  # takes a long time to run\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"eigenvector_centrality computed in 8.474 seconds.\n"
]
},
{
"data": {
"text/plain": [
"[('twitter', 0.23540802031621272),\n",
" ('facebook', 0.2184444648150715),\n",
" ('youtube', 0.1905154762829384),\n",
" ('google', 0.18691832786081633),\n",
" ('wikipedia', 0.16381111350448735),\n",
" ('flickr', 0.14828851331967058),\n",
" ('wordpress', 0.1295664715342409),\n",
" ('linkedin', 0.1288637560340983),\n",
" ('github', 0.11695900964987041),\n",
" ('wikimedia', 0.1146460801644215),\n",
" ('blogspot', 0.10415498505531817),\n",
" ('doi', 0.09828839961687742),\n",
" ('amazon', 0.0903121200037344),\n",
" ('apple', 0.08726888809148506),\n",
" ('nytimes', 0.08023058425430811),\n",
" ('archive', 0.07837667439460239),\n",
" ('nih', 0.07598451281768512),\n",
" ('europa', 0.07316699026604065),\n",
" ('bit', 0.07187090391051508),\n",
" ('vimeo', 0.07091365610106368)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"pagerank computed in 10.239 seconds.\n"
]
},
{
"data": {
"text/plain": [
"[('twitter', 0.06392698058549683),\n",
" ('github', 0.0620463960745103),\n",
" ('google', 0.05884529759649454),\n",
" ('facebook', 0.04029837567812295),\n",
" ('wikipedia', 0.03806386389015411),\n",
" ('wikimedia', 0.02862591970399904),\n",
" ('youtube', 0.02615629754617124),\n",
" ('opensource', 0.014831076517465068),\n",
" ('mediawiki', 0.01283726284266241),\n",
" ('flickr', 0.012091441850861142),\n",
" ('wordpress', 0.011472483868506534),\n",
" ('linkedin', 0.010064851025013161),\n",
" ('doi', 0.009995044423576207),\n",
" ('android', 0.008361774860182677),\n",
" ('wikidata', 0.008349132901763564),\n",
" ('blogspot', 0.008146777871015093),\n",
" ('blogger', 0.008070574129244943),\n",
" ('w3', 0.007815071202658388),\n",
" ('stackoverflow', 0.006437766584758897),\n",
" ('apache', 0.0064181348624635075)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"closeness_centrality computed in 319.699 seconds.\n"
]
},
{
"data": {
"text/plain": [
"[('twitter', 0.7313914384093927),\n",
" ('facebook', 0.6946596503740903),\n",
" ('youtube', 0.632544168169017),\n",
" ('google', 0.6285516840657449),\n",
" ('wikipedia', 0.5898840026737472),\n",
" ('flickr', 0.564464566377419),\n",
" ('wordpress', 0.5380454788090449),\n",
" ('linkedin', 0.5345656865518397),\n",
" ('github', 0.5299047727570552),\n",
" ('wikimedia', 0.522534748101185),\n",
" ('blogspot', 0.517108023364899),\n",
" ('amazon', 0.4980689753217563),\n",
" ('apple', 0.4944340877827688),\n",
" ('doi', 0.48560335147550354),\n",
" ('bit', 0.48251340603996185),\n",
" ('nytimes', 0.4797686699248053),\n",
" ('archive', 0.4784910643432064),\n",
" ('vimeo', 0.476175518231625),\n",
" ('w3', 0.4727980005413881),\n",
" ('europa', 0.47182638688574874)]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"subg = g.subgraph(cited_domains[:10_000])\n",
"benchmark_centrality_metrics(metrics, subg, print_output=True)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"subg = g.subgraph(cited_domains[:1000])\n",
"communities_generator = community.girvan_newman(subg)"
]
},
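{
"cell_type": "markdown",
"metadata": {},
"source": [
"`girvan_newman` returns a lazy generator, so the cell above does not compute any communities yet. A minimal sketch of how the first few partitions could be pulled from such a generator is shown below on a small built-in graph (Zachary's karate club, used here purely as a stand-in). Running the same loop on `subg` may take a long time, since Girvan-Newman recomputes edge betweenness after every edge removal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"import networkx as nx\n",
"from networkx.algorithms import community\n",
"\n",
"toy = nx.karate_club_graph()  # stand-in graph, not the CC domain graph\n",
"toy_generator = community.girvan_newman(toy)\n",
"\n",
"# Each item is a tuple of node sets; take the first two levels of the hierarchy\n",
"partitions = list(itertools.islice(toy_generator, 2))\n",
"for partition in partitions:\n",
"    sizes = sorted(len(c) for c in partition)\n",
"    print(len(partition), sizes)"
]
},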
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Country Breakdown"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We do a simple breakdown of license usage by domain suffix (for example, a country-code top-level domain), first per license and then per suffix."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def get_suffix(url):\n",
"    # Text after the last dot, with any trailing ':port' removed\n",
"    suffix = url[url.rfind('.')+1:]\n",
"    if ':' in suffix:\n",
"        return suffix[:suffix.find(':')]\n",
"    return suffix"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"licenses_country_usage = collections.defaultdict(lambda: collections.defaultdict(int))"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"for node, data in g.nodes(data=True):\n",
"    suffix = get_suffix(data['provider_domain'])\n",
"    if isinstance(data['cc_licenses'], dict):\n",
"        for license, usage in data['cc_licenses'].items():\n",
"            licenses_country_usage[license][suffix] += usage"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"licenses_suffixes = dict()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"with open('country_codes.txt', 'w') as f:\n",
"    for license, suffix_dict in licenses_country_usage.items():\n",
"        print(license, file=f)\n",
"        # Sort suffixes by usage, keep alphabetic ones with more than 10 uses\n",
"        suffixes = sorted(suffix_dict.keys(), key=lambda x: suffix_dict[x], reverse=True)\n",
"        suffixes = list(filter(lambda x: x.isalpha(), suffixes))\n",
"        suffixes = [(s, suffix_dict[s]) for s in suffixes]\n",
"        suffixes = list(filter(lambda x: x[1] > 10, suffixes))\n",
"        print(suffixes, file=f, end='\\n\\n')\n",
"        licenses_suffixes[license] = suffixes"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"with open('country_codes.pkl', 'wb') as f:\n",
"    pickle.dump(licenses_suffixes, f)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"country_license_usage = collections.defaultdict(lambda: collections.defaultdict(int))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# Invert the nested mapping: license -> suffix becomes suffix -> license\n",
"for license, country_usage in licenses_country_usage.items():\n",
"    for country, usage in country_usage.items():\n",
"        country_license_usage[country][license] = usage"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"country_total = list()\n",
"for country, license_usage in country_license_usage.items():\n",
"    total = sum(license_usage.values())\n",
"    if total > 100:\n",
"        # print(country)\n",
"        # print(total)\n",
"        # print(license_usage)\n",
"        country_total.append((country, total))"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# sorted(country_total, key=lambda x: -x[1])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.5 64-bit ('linked_commons': conda)",
"language": "python",
"name": "python38564bitlinkedcommonsconda8c925ff8f8704234b7d011f0d1aa2749"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}