- abstract = "Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. The size of large web corpora necessitates the use of efficient algorithms while a high accuracy directly improves the quality of the final corpus. In this paper, I present and evaluate a supervised machine learning approach to general-purpose boilerplate detection for languages based on Latin alphabets which is both very efficient and very accurate. Using a Multilayer Perceptron and a high number of carefully engineered features, I achieve between 95\% and 99\% correct classifications (depending on the input language) with precision and recall over 0.95. Since the perceptrons are trained on language-specific data, I also evaluate how well perceptrons trained on one language perform on other languages. The single features are also evaluated for the merit they contribute to the classification. I show that the accuracy of the Multilayer Perceptron is on a par with that of other classifiers such as Support Vector Machines. I conclude that the quality of general-purpose boilerplate detectors depends mainly on the availability of many well-engineered features and which are highly language-independent. The method has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl data sets.",
+ abstract = "Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing.
+ Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational
+ elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus.
+ Also, search engines should not index such material because it can lead to spurious results for search
+ terms if these terms appear in boilerplate regions of the web page. The size of large web corpora
+ necessitates the use of efficient algorithms while a high accuracy directly improves the quality of the
+ final corpus. In this paper, I present and evaluate a supervised machine learning approach to
+ general-purpose boilerplate detection for languages based on Latin alphabets which is both very
+ efficient and very accurate. Using a Multilayer Perceptron and a high number of carefully engineered
+ features, I achieve between 95\% and 99\% correct classifications (depending on the input language)
+ with precision and recall over 0.95. Since the perceptrons are trained on language-specific data, I
+ also evaluate how well perceptrons trained on one language perform on other languages. The single
+ features are also evaluated for the merit they contribute to the classification. I show that the
+ accuracy of the Multilayer Perceptron is on a par with that of other classifiers such as Support Vector
+ Machines. I conclude that the quality of general-purpose boilerplate detectors depends mainly on the
+ availability of many well-engineered features and which are highly language-independent. The method has
+ been implemented in the open-source texrex web page cleaning software, and large corpora constructed
+ using it are available from the COW initiative, including the CommonCOW corpora created from
+ CommonCrawl data sets.",
  keywords = "Boilerplate, Corpus construction, Non-destructive corpus normalization, Web corpora",
  cc-author-affiliation = "Freie Universität Berlin, Germany",