
Commit 9863d6c

Synchronized build
1 parent bffd571

2 files changed: 8 additions & 8 deletions

blog/entries/cc-datacatalog-data-processing/index.html

Lines changed: 4 additions & 4 deletions
@@ -121,10 +121,10 @@ <h2 class="mb-0">Visualize CC Catalog data - data processing</h2>
 <h3 id="data-extraction">Data Extraction</h3><p>Each month, Creative Commons uses <a href="http://commoncrawl.org/">Common Crawl</a> data to find all domains that contain CC licensed content. As you might be guessing, the amount of data is very big, so the CC Catalog data is stored in <a href="http://commoncrawl.org/the-data/get-started/">S3</a> buckets and <a href="https://spark.apache.org/">Apache Spark</a> is used to extract the data from Common Crawl.</p>
 <p>Spark is used again in this project to extract the data, in the form of parquet files, from the buckets. In order to facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). The dataset I work on contains the following fields:</p>
 <ul>
-<li>provider_domain: name of the domain with licensed content.</li>
-<li>cc_license: path to the Creative Commons license deed used by the _provider_domain_.</li>
-<li>images: number of images showed in the _provider_domain_ web page.</li>
-<li>links: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when _provider_domain_ has an href tag in its web page that points to the domain key.</li>
+<li><code>provider_domain</code>: name of the domain with licensed content.</li>
+<li><code>cc_license</code>: path to the Creative Commons license deed used by the <code>provider_domain</code>.</li>
+<li><code>images</code>: number of images showed in the <code>provider_domain</code> web page.</li>
+<li><code>links</code>: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when <code>provider_domain</code> has an href tag in its web page that points to the domain key.</li>
 </ul>
 <p>Each file can easily contain dozens of millions of rows. My first aproach is to load the information in a Pandas Dataframe, but this can become very slow. Therefore, I will test the scripts for the data processing with a portion of the real data. Afterwards, I will use <a href="https://dask.org/">Dask</a> with the entire dataset. Dask provides advanced parallelism for analytics, and has a behaviour similar to Pandas.</p>
 <h3 id="cleansing-and-filtering">Cleansing and Filtering</h3><p>This step is about preparing the data for analysis and reducing the amount of data, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty neat (no strange characters for example, or incomplete rows). Nevertheless as a first step, duplicate rows are deleted (given by duplicate URLs). Next I, develop pruning rules. I try to:</p>

blog/feed.xml

Lines changed: 4 additions & 4 deletions
@@ -21,10 +21,10 @@
 &lt;h3 id=&quot;data-extraction&quot;&gt;Data Extraction&lt;/h3&gt;&lt;p&gt;Each month, Creative Commons uses &lt;a href=&quot;http://commoncrawl.org/&quot;&gt;Common Crawl&lt;/a&gt; data to find all domains that contain CC licensed content. As you might be guessing, the amount of data is very big, so the CC Catalog data is stored in &lt;a href=&quot;http://commoncrawl.org/the-data/get-started/&quot;&gt;S3&lt;/a&gt; buckets and &lt;a href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; is used to extract the data from Common Crawl.&lt;/p&gt;
 &lt;p&gt;Spark is used again in this project to extract the data, in the form of parquet files, from the buckets. In order to facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). The dataset I work on contains the following fields:&lt;/p&gt;
 &lt;ul&gt;
-&lt;li&gt;provider_domain: name of the domain with licensed content.&lt;/li&gt;
-&lt;li&gt;cc_license: path to the Creative Commons license deed used by the _provider_domain_.&lt;/li&gt;
-&lt;li&gt;images: number of images showed in the _provider_domain_ web page.&lt;/li&gt;
-&lt;li&gt;links: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when _provider_domain_ has an href tag in its web page that points to the domain key.&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;provider_domain&lt;/code&gt;: name of the domain with licensed content.&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;cc_license&lt;/code&gt;: path to the Creative Commons license deed used by the &lt;code&gt;provider_domain&lt;/code&gt;.&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;images&lt;/code&gt;: number of images showed in the &lt;code&gt;provider_domain&lt;/code&gt; web page.&lt;/li&gt;
+&lt;li&gt;&lt;code&gt;links&lt;/code&gt;: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when &lt;code&gt;provider_domain&lt;/code&gt; has an href tag in its web page that points to the domain key.&lt;/li&gt;
 &lt;/ul&gt;
 &lt;p&gt;Each file can easily contain dozens of millions of rows. My first aproach is to load the information in a Pandas Dataframe, but this can become very slow. Therefore, I will test the scripts for the data processing with a portion of the real data. Afterwards, I will use &lt;a href=&quot;https://dask.org/&quot;&gt;Dask&lt;/a&gt; with the entire dataset. Dask provides advanced parallelism for analytics, and has a behaviour similar to Pandas.&lt;/p&gt;
 &lt;h3 id=&quot;cleansing-and-filtering&quot;&gt;Cleansing and Filtering&lt;/h3&gt;&lt;p&gt;This step is about preparing the data for analysis and reducing the amount of data, in order to get a meaningful visualization. The data that comes from the S3 buckets is actually pretty neat (no strange characters for example, or incomplete rows). Nevertheless as a first step, duplicate rows are deleted (given by duplicate URLs). Next I, develop pruning rules. I try to:&lt;/p&gt;
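The post mirrored in this feed entry also mentions prototyping the processing with Pandas on a sample of the data and then switching to Dask for the full dataset, with de-duplication as the first cleansing step. Below is a small, hypothetical sketch of that loading and cleansing stage; the TSV glob pattern, the plain drop_duplicates call, and the JSON parsing of the links column are illustrative assumptions, not the project's actual scripts.

# Hypothetical sketch of loading the TSV output with Dask and applying the
# cleansing step described in the post. Paths and rules are placeholders.
import json
import dask.dataframe as dd

# Read all TSV parts produced by the extraction job (placeholder glob pattern).
df = dd.read_csv("cc-catalog-tsv/part-*.tsv", sep="\t")

# First cleansing step from the post: drop duplicate rows.
df = df.drop_duplicates()

# The links column holds a JSON dictionary of {linked_domain: link_count};
# parse it into Python dicts for later filtering and aggregation.
df["links"] = df["links"].apply(json.loads, meta=("links", "object"))

# Inspect a small sample; Dask only computes what head() needs.
print(df.head())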
