Improvements and fixes for processing HTML from WARC files
- encoding detection: use EncodingDetector which tries BOM, metadata
charset or detects the encoding from byte content
- fix method is_html for the case that WARC-Identified-Payload-Type is absent:
use Content-Type from the HTTP header (not the WARC header) if present
- TagCountJob: use improved method is_html
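
The two behaviours described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code from this commit: `is_html` here takes two plain dicts (hypothetical stand-ins for the WARC record headers and the HTTP response headers, with exact-case keys), and `detect_encoding` is a simplified stand-in for `EncodingDetector` that follows the same order of BOM, metadata charset, then byte content. The final fallback to `windows-1252` is an assumption made for the sketch.

```python
import codecs
import re

HTML_TYPES = ("text/html", "application/xhtml+xml")

# UTF-32 BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.I)


def is_html(warc_headers, http_headers):
    """Hypothetical helper: True if the record payload looks like HTML."""
    identified = warc_headers.get("WARC-Identified-Payload-Type")
    if identified is not None:
        return identified in HTML_TYPES
    # Fallback: Content-Type of the HTTP response, not of the WARC record
    # (for response records the WARC Content-Type is application/http).
    content_type = http_headers.get("Content-Type")
    if content_type is None:
        return False
    mime = content_type.split(";")[0].strip().lower()
    return mime in HTML_TYPES


def detect_encoding(payload):
    """Simplified stand-in: BOM, then <meta> charset, then byte content."""
    for bom, name in _BOMS:
        if payload.startswith(bom):
            return name
    match = _META_CHARSET.search(payload[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    try:
        payload.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback for this sketch only.
        return "windows-1252"
```

A job such as TagCountJob would call `is_html` before parsing a record and skip the record when it returns False.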