Commit a410ed6

Change the appearance of field names
1 parent a249a92 commit a410ed6

1 file changed: +4 -4 lines changed
content/blog/entries/cc-datacatalog-data-processing/contents.lr

Lines changed: 4 additions & 4 deletions
@@ -22,10 +22,10 @@ Each month, Creative Commons uses [Common Crawl](http://commoncrawl.org/) data t
 
 Spark is used again in this project to extract the data, in the form of parquet files, from the buckets. In order to facilitate the analysis and processing of the data, the files are converted to TSV (tab-separated values). The dataset I work on contains the following fields:
 
-- provider_domain: name of the domain with licensed content.
-- cc_license: path to the Creative Commons license deed used by the _provider\_domain_.
-- images: number of images showed in the _provider\_domain_ web page.
-- links: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when _provider\_domain_ has an href tag in its web page that points to the domain key.
+- `provider_domain`: name of the domain with licensed content.
+- `cc_license`: path to the Creative Commons license deed used by the `provider_domain`.
+- `images`: number of images showed in the `provider_domain` web page.
+- `links`: JSON field that contains a dictionary with domains as keys, and number of links as values. A link appears when `provider_domain` has an href tag in its web page that points to the domain key.
 
 
 
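
As an illustration of the record layout described in the changed lines, the following is a minimal sketch of reading one such TSV row in Python. The presence of a header row, the sample values, and the `parse_rows` helper are assumptions made for this example; none of them are taken from the project's code.

```python
import csv
import io
import json

# Hypothetical sample data: the header row and all values are made up
# for illustration and do not come from the real dataset.
SAMPLE_TSV = (
    "provider_domain\tcc_license\timages\tlinks\n"
    'example.org\t/licenses/by/4.0/\t12\t{"example.com": 3, "example.net": 1}\n'
)


def parse_rows(text):
    """Yield one dict per TSV row, decoding the numeric and JSON fields."""
    # QUOTE_NONE keeps the double quotes inside the JSON field untouched.
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        yield {
            "provider_domain": row["provider_domain"],
            "cc_license": row["cc_license"],
            "images": int(row["images"]),
            # `links` maps a target domain to the number of links pointing to it.
            "links": json.loads(row["links"]),
        }


rows = list(parse_rows(SAMPLE_TSV))
total_links = sum(rows[0]["links"].values())
```

Decoding `links` with `json.loads` gives an ordinary dictionary, so per-domain counts can be summed or filtered directly.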
