Add encoding detection to WET text extraction #4

The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions [.wet file encoding](https://groups.google.com/d/msg/common-crawl/fTkLw4efzNo/Nsf8gQxEAAAJ) and [problem with East European encodings in WET files](https://groups.google.com/forum/#!searchin/common-crawl/.wet$20file$20encoding|sort:relevance/common-crawl/HuKCk-o9t_w/bda0FEueJM4J).
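
To illustrate the kind of fallback chain this issue asks for, here is a minimal sketch in Python using only the standard library. It is not the extractor's actual code: the function name `decode_payload`, its parameters, and the `windows-1252` last-resort default are all assumptions made for illustration. A truly robust solution would likely add a statistical charset detector as a further step; this sketch only shows the ordering "BOM, then strict UTF-8, then the declared charset, then a lossy fallback".

```python
import codecs
from typing import Optional, Tuple

# Byte order marks, longest first so that the UTF-32 LE mark is not
# mistaken for the UTF-16 LE mark it begins with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_payload(
    raw: bytes,
    declared: Optional[str] = None,   # hypothetical: charset from Content-Type or <meta>
    fallback: str = "windows-1252",   # assumption: a common legacy default
) -> Tuple[str, str]:
    """Return (text, charset_used) for a page body of unknown encoding."""
    # 1. A byte order mark is the strongest signal available.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name, errors="replace"), name

    # 2. Strict UTF-8: legacy single-byte text rarely happens to be valid UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Trust the declared charset only if it is known and decodes cleanly.
    if declared:
        try:
            codecs.lookup(declared)
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass

    # 4. Last resort: decode lossily rather than fail on the record.
    return raw.decode(fallback, errors="replace"), fallback
```

For example, a Polish page served as ISO-8859-2 fails the strict UTF-8 step and is then decoded with its declared charset, instead of being turned into replacement characters by an unconditional UTF-8 decode.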
