Improve HTML link extraction #72

Merged: ldko merged 1 commit into iipc:master from sebastian-nagel:wat-improved-link-extraction on Mar 21, 2017

@sebastian-nagel

Collaborator

- add extractors for more elements that can take URLs as attribute
  values, and add missing attributes
- generalize extraction of "global" attributes (`background`)
- add custom data attributes frequently used for linking (`data-href`,
  `data-uri`)
- add a unit test covering link extraction
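
For readers unfamiliar with this kind of extraction, the idea in the list above can be sketched roughly as follows. This is a minimal illustration in Python using the standard `html.parser` module, not the Java code changed in this pull request. The names `ELEMENT_URL_ATTRS`, `GLOBAL_URL_ATTRS`, `LinkCollector`, and `extract_links` are hypothetical, and the per-element table is an assumed sample rather than the set of elements the PR actually covers.

```python
from html.parser import HTMLParser

# Assumed sample of element-specific URL attributes (illustrative only,
# not the table used by the library).
ELEMENT_URL_ATTRS = {
    "a": ("href",),
    "img": ("src", "longdesc"),
    "script": ("src",),
    "link": ("href",),
    "form": ("action",),
    "video": ("src", "poster"),
}

# Attributes checked on every element regardless of its tag:
# `background` plus the custom data attributes named in the description.
GLOBAL_URL_ATTRS = ("background", "data-href", "data-uri")


class LinkCollector(HTMLParser):
    """Collects (tag, attribute, url) triples from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        present = dict(attrs)
        for name in ELEMENT_URL_ATTRS.get(tag, ()) + GLOBAL_URL_ATTRS:
            url = present.get(name)
            if url:  # skip missing or empty attribute values
                self.links.append((tag, name, url))


def extract_links(html):
    """Return all (tag, attribute, url) triples found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Keeping the "global" attributes in a separate tuple means a new attribute such as `data-href` is picked up on any element without touching the per-element table, which is one plausible reading of the generalization described above.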
@sebastian-nagel force-pushed the wat-improved-link-extraction branch from 3ee87b7 to 11579c2 on February 22, 2017 13:07
sebastian-nagel added a commit to commoncrawl/ia-web-commons that referenced this pull request on Feb 22, 2017

@ldko left a comment

Member

This looks good to me, and with these changes more links were extracted and written to the WAT file I generated. Thanks, @sebastian-nagel!

@ldko merged commit 11579c2 into iipc:master on Mar 21, 2017