
Use CharsetDetector to guess encoding of HTML documents #68

Merged
ldko merged 4 commits into iipc:master from sebastian-nagel:HTMLResourceFactory_to_use_CharsetDetector on Feb 15, 2017


@sebastian-nagel

Collaborator

@sebastian-nagel sebastian-nagel force-pushed the HTMLResourceFactory_to_use_CharsetDetector branch from f9c8d42 to 607acaa Compare November 24, 2016 12:00
  <META http-equiv=Content-Type content="text/html; charset=windows-1256">
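For readers unfamiliar with this kind of declaration, here is a minimal sketch of how a charset named in such a META tag could be pulled out of the document head. It is an illustration only, not the code in this pull request and not the CharsetDetector implementation; the class name `MetaCharsetSketch`, the method `declaredCharset`, and the regular expression are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this pull request.
final class MetaCharsetSketch {
    // Matches charset=<name>, with an optional quote before the name.
    // The character class covers the characters legal in a Java charset name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:+-]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the first charset name declared in the given markup, or null
     * if none is declared or the running JVM does not support it.
     */
    static String declaredCharset(String htmlHead) {
        Matcher m = CHARSET.matcher(htmlHead);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException: the name is syntactically invalid.
            return null;
        }
    }
}
```

A declared charset is only a hint; a page can declare one encoding and be served in another, which is presumably why a detector that also inspects the bytes is useful here.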

@ldko ldko left a comment

Member


I can confirm that running ResourceExtractor with this change worked for me to fix an encoding issue I had previously encountered in a generated WAT. Just one question...

return value;
}
String result = value;
if (result.isEmpty())

Member


Is this redundant to the prior if statement?

@sebastian-nagel sebastian-nagel Jan 27, 2017

Collaborator Author


Yes, you're right. I first fixed the charset detection in commoncrawl/ia-web-commons and didn't pay enough attention when transferring the fix (CharsetDetector didn't have the "is empty" check on value).

I'll update the pull request. Thanks!
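To make the point of the review comment concrete, here is a minimal sketch of why a second emptiness check is unreachable once an early return has handled the empty case. The class `RedundantCheckSketch`, the method `cleanValue`, the null check, and the final `trim()` are hypothetical; only the early return followed by `String result = value;` mirrors the fragment quoted in the review.

```java
// Hypothetical illustration, not the code from this pull request.
final class RedundantCheckSketch {
    static String cleanValue(String value) {
        // Early return: null or empty input leaves the method here.
        if (value == null || value.isEmpty()) {
            return value;
        }
        String result = value;
        // A second `if (result.isEmpty())` at this point could never be true:
        // `result` refers to the same string as `value`, which was just
        // shown to be non-empty. The check can be dropped safely.
        return result.trim(); // assumed follow-up processing
    }
}
```

The check only becomes meaningful again if `result` is modified before it is tested, for example after trimming.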

@ldko ldko added the accepted label Jan 27, 2017
@ldko ldko merged commit ed33bae into iipc:master Feb 15, 2017
