The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions [.wet file encoding](https://groups.google.com/d/msg/common-crawl/fTkLw4efzNo/Nsf8gQxEAAAJ) and [problem with East European encodings in WET files](https://groups.google.com/forum/#!searchin/common-crawl/.wet$20file$20encoding|sort:relevance/common-crawl/HuKCk-o9t_w/bda0FEueJM4J).
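To illustrate what detection instead of a hard-coded UTF-8 could look like, here is a minimal Python sketch. It is not the actual WET extractor code: `decode_payload`, `META_CHARSET`, and the `windows-1252` fallback are hypothetical names and choices made for this example. It tries a BOM, the charset from the HTTP `Content-Type` header, a `<meta>` charset declaration, and strict UTF-8, in that order, and only then falls back.

```python
import codecs
import re

# Hypothetical helper pattern: finds a charset declared in an HTML <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def _is_known(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def decode_payload(payload, http_charset=None, fallback="windows-1252"):
    """Decode raw response bytes; return (text, charset_used).

    Assumption: the fallback encoding is an arbitrary choice for this sketch.
    """
    candidates = []
    if payload.startswith(codecs.BOM_UTF8):
        candidates.append("utf-8-sig")  # strips the BOM while decoding
    if http_charset:
        candidates.append(http_charset)
    match = META_CHARSET.search(payload[:4096])  # only scan the document head
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        if not _is_known(name):
            continue  # ignore bogus or unsupported charset labels
        try:
            return payload.decode(name), name  # strict decoding
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, replace undecodable bytes instead.
    return payload.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    czech = "Příliš žluťoučký kůň"
    page = b'<html><head><meta charset="iso-8859-2"></head>' + czech.encode("iso-8859-2")
    text, used = decode_payload(page)
    print(used, czech in text)
```

A declared single-byte charset decodes any byte sequence without error, so a wrong declaration still produces garbled text here. That is the case where a statistical detector over the byte content would be needed on top of this ordering.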