Making Presentation Math Computable
A Context-Sensitive Approach for Translating LaTeX to Computer Algebra Systems
André Greiner-Petter
Berlin, Germany
© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the
book’s Creative Commons license and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with
regard to jurisdictional claims in published maps and institutional affiliations.
This Springer Vieweg imprint is published by the registered company Springer Fachmedien
Wiesbaden GmbH, part of Springer Nature.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Front Matter
Contents
FRONT MATTER
List of Figures
List of Tables
Abstract
Zusammenfassung
Acknowledgements
CHAPTER 1
Introduction
1.1 Motivation & Problem
1.2 Research Gap
1.3 Research Objective
1.4 Thesis Outline
1.4.1 Publications
1.4.2 Research Path
CHAPTER 2
Mathematical Information Retrieval
2.1 Background and Overview
2.2 Mathematical Formats and Their Conversions
2.2.1 Web Formats
2.2.2 Word Processor Formats
2.2.3 Computable Formats
2.2.4 Images and Tree Representations
2.2.5 Math Embeddings
2.3 From Presentation to Content Languages
2.3.1 Background
2.3.2 Benchmarking MathML
2.3.3 Evaluation of Context-Agnostic Conversion Tools
2.3.4 Summary of MathML Converters
2.4 Mathematical Information Retrieval for LaTeX Translations
CHAPTER 3
Semantification of Mathematical LaTeX
3.1 Semantification via Math-Word Embeddings
3.1.1 Foundations and Related Work
3.1.2 Semantic Knowledge Extraction
3.1.3 On Overcoming the Issues of Knowledge Extraction Approaches
3.1.4 The Future of Math Embeddings
3.2 Semantification with Mathematical Objects of Interest
3.2.1 Related Work
3.2.2 Data Preparation
3.2.3 Frequency Distributions of Mathematical Formulae
3.2.4 Relevance Ranking for Formulae
3.2.5 Applications
3.2.6 Outlook
3.3 Semantification with Textual Context Analysis
3.3.1 Semantification, Translation & Evaluation Pipeline
CHAPTER 4
From LaTeX to Computer Algebra Systems
4.1 Context-Agnostic Neural Machine Translation
4.1.1 Training Datasets & Preprocessing
4.1.2 Methodology
4.1.3 Evaluation of the Convolutional Network
4.2 Context-Sensitive Translation
4.2.1 Motivation
4.2.2 Related Work
4.2.3 Formal Mathematical Language Translations
4.2.4 Document Pre-Processing
4.2.5 Annotated Dependency Graph Construction
4.2.6 Semantic Macro Replacement Patterns
CHAPTER 5
Qualitative and Quantitative Evaluations
5.1 Evaluations on the Digital Library of Mathematical Functions
5.1.1 The DLMF dataset
5.1.2 Semantic LaTeX to CAS translation
5.1.3 Evaluation of the DLMF using CAS
5.1.4 Results
5.1.5 Conclude Quantitative Evaluations on the DLMF
5.2 Evaluations on Wikipedia
5.2.1 Symbolic and Numeric Testing
5.2.2 Benchmark Testing
5.2.3 Results
5.2.4 Error Analysis & Discussion
5.2.5 Conclude Qualitative Evaluations on Wikipedia
CHAPTER 6
Conclusion and Future Work
6.1 Summary
6.2 Contributions and Impact of the Thesis
6.3 Future Work
6.3.1 Improved Translation Pipeline
6.3.2 Improve LaTeX to MathML Converters
6.3.3 Enhanced Formulae in Wikipedia
6.3.4 Language Independence
FRONT MATTER
Abstract
This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax
of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Science,
Technology, Engineering, and Mathematics (STEM), LaTeX has become the de-facto standard
to typeset mathematical formulae in publications. Since scientists are generally required to
publish their work, LaTeX has become an integral part of today’s publishing workflow. On the
other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and
visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple
arithmetic expressions and are, therefore, insufficient for most use cases. Consequently, the
workflow of experimenting and publishing in the Sciences often includes time-consuming and
error-prone manual conversions between presentational LaTeX and computational CAS formats.
To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS,
this thesis makes the following three contributions.
First, it provides an approach to semantically enhance LaTeX expressions with sufficient semantic
information for translations into CAS syntaxes. This so-called semantification process analyzes
the structure of the formula and its textual context to infer semantic information. The
research for this semantification process additionally contributes towards related Mathematical
Information Retrieval (MathIR) tasks, such as mathematical education assistance, math
recommendation and question answering systems, search engines, automatic plagiarism detection,
and math type assistance systems.
Second, this thesis demonstrates the first context-aware LaTeX to CAS translation framework,
LACAST. LACAST uses the developed semantification approach to transform LaTeX expressions
into an intermediate semantic LaTeX format, which is then further translated to CAS based
on translation patterns. These patterns were manually crafted by mathematicians to ensure
accurate and reliable translations. For comparison, this thesis additionally elaborates a
non-context-aware neural machine translation approach trained on a mathematical library generated
by Mathematica.
Third, the thesis provides a novel approach to evaluate the performance of LaTeX to CAS
translations on large-scale datasets with an automatic verification of equations in digital
mathematical libraries. This evaluation approach is based on the assumption that equations in digital
mathematical libraries can be computationally verified by CAS, if a translation between both
systems exists. In addition, the thesis provides an in-depth manual evaluation on mathematical
articles from the English Wikipedia.
The presented context-aware translation framework LACAST increases the efficiency and reliability
of translations to CAS. Via LACAST, we strengthened the Digital Library of Mathematical Functions
(DLMF) by identifying numerous issues, from missing or wrong semantic annotations to sign
errors. Further, via LACAST, we were able to discover several issues with the commercial CAS
Maple and Mathematica. The fundamental approaches to semantically enhance mathematics
developed in this thesis additionally contributed towards several related MathIR tasks. For
instance, the large-scale analysis of mathematical notations and the studies on math-embeddings
motivated new approaches for math plagiarism detection systems and search engines, and enabled
typing assistance for mathematical inputs. Finally, LACAST translations will have a direct
real-world impact, as they are scheduled to be integrated into upcoming versions of the DLMF and
Wikipedia.
FRONT MATTER
Zusammenfassung
This dissertation deals with the problem of translating mathematical formulae between LaTeX
and Computer Algebra Systems (CAS). Over the course of the digital age, LaTeX has become the
de-facto standard for writing mathematical formulae on the computer, especially in the
disciplines of Science, Technology, Engineering, and Mathematics (STEM). Since scientists
generally publish their work, LaTeX has become an integral part of modern research. Likewise,
scientists increasingly rely on the capabilities of modern CAS to work effectively with
mathematical formulae, for example, by transforming, solving, or visualizing them. However,
the current approaches that allow a translation from LaTeX to CAS, such as the internal import
functions of some CAS, are often limited to simple arithmetic expressions and are therefore of
little help in everyday work. As a result, the work of modern scientists in the STEM
disciplines is often characterized by time-consuming and error-prone manual translations
between LaTeX and CAS.
This dissertation makes the following contributions to solve the problem of translating
mathematical expressions between LaTeX and CAS.
First, LaTeX is a format that encodes only the visual presentation of mathematical
expressions, not their semantic information. This semantic information, however, is necessary
for CAS, which do not permit ambiguous inputs. Therefore, as a first step towards a
translation, this thesis introduces a so-called semantification of mathematical expressions.
This semantification extracts semantic information from the context and the components of a
formula in order to draw conclusions about its meaning. Since semantification is a classic
task in the field of mathematical information retrieval, this part of the dissertation also
contributes to related research areas. The approaches presented here are thus also useful for
educational programs, question answering systems, search engines, and digital plagiarism
detection.
As its second contribution, this dissertation presents the first context-aware LaTeX to CAS
translation program, called LACAST. LACAST uses the previously introduced semantification to
transform LaTeX into an intermediate format that represents the semantic information
explicitly. This format is called semantic LaTeX because it is a technical extension of
LaTeX. The further translation to CAS is realized via heuristic translation patterns for
mathematical functions. These translation patterns were defined in collaboration with
mathematicians to ensure a correct translation in this last step. To better understand the
benefits of a context-aware translation, this thesis also presents, for comparison, a machine
translation based on neural networks that does not take the context of a formula into account.
The third contribution of this dissertation introduces a new method for evaluating
mathematical translations, which allows even a large number of translations to be checked for
correctness. This method follows the premise that equations in mathematical libraries should
still be correct after being translated into a CAS. If this is not the case, then either the
source equation, the translation, or the CAS is faulty. Note that every source of error
represents added value for the respective system. In addition to this automatic evaluation, a
manual analysis of translations is carried out on the basis of English Wikipedia articles.
In summary, the context-aware translation program LACAST enables a more efficient way of
working with CAS. With the help of these translations, several problems, such as incorrect
information or sign errors, in the Digital Library of Mathematical Functions (DLMF), as well
as errors in the commercially distributed CAS Maple and Mathematica, could be automatically
uncovered and fixed.
The basic research on semantically enriching mathematical expressions presented here has also
made numerous contributions to related research topics. For example, the analysis of the
distribution of mathematical notations in large datasets has enabled new approaches in digital
plagiarism detection. Furthermore, work is currently underway to integrate the translations of
LACAST into upcoming versions of Wikipedia and the DLMF.
FRONT MATTER
Acknowledgements
This thesis would not have been possible without the tremendous help and support from nu-
merous family members, friends, colleagues, supervisors, and several international institutions.
In the following, I want to take the opportunity to thank all the individuals and organizations
that helped me along the way to make this work possible.
My first sincere thanks go to my prodigious doctoral advisers Bela Gipp and Akiko Aizawa.
Their continuous support and counsel enabled me to realize this thesis at marvelous places and
together with numerous wonderful people from all over the world. Their enduring encouragement
and assistance, Bela’s abiding and infectious positivity, and Akiko’s steadfast and kind
endorsement empowered my personal and professional life. Their competent and sincere
guidance helped me to find my way in the intricate maze of research and career decisions and
turned my often onerous time into a joyful and memorable experience.
Moreover, I am very grateful to my adviser and friend Moritz Schubotz, who supported and
guided me throughout the entire time of my doctoral thesis and even beyond. Our fruitful and
always engaging discussions, even when exhausting, enriched and positively affected most, if
not all, of my work. It is not an exaggeration to admit that my career, including my Master’s
thesis and this doctoral thesis, would not have been possible and nearly as successful and joyful
as it has been without his continuous and sincere support over the years. I am wholeheartedly
thankful for all the years we worked together.
I further wish to gratefully acknowledge my friends, colleagues, and advisers Howard Cohl,
Abdou Youssef, and Bruce Miller at the National Institute of Standards and Technology (NIST) for
their valuable advice, continuous drive to perfection, and our rewarding collaborations. I thank
Jürgen Gerhard at Maplesoft, who kindly provided me access and support for Maple on several
occasions. I am just as thankful for the assistance and support from Norman Meuschke, who
always helped me to overcome governmental and organizational hurdles, Corinna Breitinger,
who never failed to refit my gibberish, and my colleagues and friends Terry Lima Ruas and
Philipp Scharpf for many visionary discussions. I also thank all my collaborators and colleagues
with whom I had the distinct opportunity to work together, including Takuto Asakura, Fabian
Müller, Olaf Teschke, William Grosky, Marjorie McClain, Yusuke Miyao, Malte Ostendorff,
Bonita Saunders, Kenichi Iwatsuki, Takuma Udagawa, Anastasia Zhukova, and Felix Hamborg.
I further want to thank the students I worked with, including Avi Trost, Rajen Dey, Joon Bang,
Kevin Chen, and Felix Petersen. I especially appreciate the help and assistance from people
at the National Institute of Informatics (NII) to overcome governmental and daily life issues. I
wish to especially thank Rie Ayuzawa, Noriko Katsu, Akiko Takenaka, and Goran Topic.
My genuine gratitude also goes to my host organizations and those that provided financial
support for my research. I am thankful to the German Academic Exchange Service (DAAD)
for enabling two research stays at the NII in Tokyo, the NII for providing me a wonderful
work environment, the German Research Foundation (DFG) for financially supporting many of
my projects, the NIST for hosting me as a guest researcher, and Maplesoft for offering me
an internship during my preliminary research project on the Digital Library of Mathematical
Functions (DLMF). I finally thank the ACM Special Interest Group on Information Retrieval
(SIGIR), the University of Konstanz, the University of Wuppertal, and Maplesoft for supporting
several conference participations.
My last and most crucial gratitude goes to my family and friends, who always cheered me on in
good and bad times and constantly backed and supported me so that I could selfishly pursue my
dreams. I am deeply grateful for my lovely parents Rolf & Regina, who have always been on
my side and made all this possible behind the scenes. I am also tremendously thankful for the
enduring personal support from my dear friends Kevin, Lena, Vici, Dong, Peter, Vitor, Ayuko,
and countless more. Finally, I thank my lovely partner Aimi for brightening even the darkest
times and pushing every possible obstacle aside. I dedicate this thesis to my lovely parents, my
dear friends, and my enchanting girlfriend.
I went to the woods because I wanted to live deliberately. I wanted
to live deep and suck out all the marrow of life. To put to rout all
that was not life; and not, when I had come to die, discover that I
had not lived.
Neil Perry - Dead Poets Society
CHAPTER 1
Introduction
Contents
1.1 Motivation & Problem
1.2 Research Gap
1.3 Research Objective
1.4 Thesis Outline
1.4.1 Publications
1.4.2 Research Path
This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax
of Computer Algebra Systems (CAS), which is typically a time-consuming and error-prone task
in the modern life of many researchers. A reliable and comprehensive translation approach
requires analyzing the textual context of mathematical formulae. In turn, research advances
in translating LaTeX contribute directly towards related tasks in the Mathematical Information
Retrieval (MathIR) arena. In this chapter, I provide an introduction to the topic. Section 1.1
introduces my motivation and provides an overview of the problem. Section 1.2 summarizes
the research gap. In Section 1.3, I define the research objective and research tasks of this thesis.
Section 1.4 concludes with an outline of the thesis including an overview of the publications
that contributed to the goals of this thesis and the research path that led to these publications.
In order to analyze this new equation, e.g., to validate it, she wants to use CAS. CAS are
powerful mathematical software tools with numerous applications [207]. Today’s most widely
used CAS include Maple [36], Mathematica [393], and MATLAB [246]. Scientists use CAS (footnote 2) to
simplify, manipulate, evaluate, compute, or even visualize mathematical expressions. Thus,
CAS play a crucial role in the modern era for pure and applied mathematics [8, 184, 207, 262]
and even found their way into classrooms [237, 363, 365, 389, 390]. In turn, CAS are the perfect
tool for the researcher in our example to examine the formula further. In order to use a CAS,
she needs to translate the expression into the correct CAS syntax.
Footnote 1: https://en.wikipedia.org/wiki/Jacobi_polynomials [accessed 2021-10-01].
Hereafter, dates follow the ISO 8601 standard, i.e., YYYY-MM-DD.
System              Representation
Rendered Version    $P_n^{(\alpha,\beta)}(\cos(a\Theta))$
Generic LaTeX       P_n^{(\alpha, \beta)}(\cos(a\Theta))
Table 1.1 illustrates the differences between computable and presentational encodings for a
Jacobi polynomial. While the rendered version and the LaTeX [220] encoding only provide
visual information, semantic LaTeX [403] and the CAS encodings explicitly encode the meaning,
i.e., the semantics, of the formula. On the one hand, LaTeX (footnote 3) has become the de-facto standard
to typeset mathematics in scientific publications [129, 248, 402], especially in the domain of
Science, Technology, Engineering, and Mathematics (STEM). On the other hand, computational
advances make CAS an essential asset in the modern workflow of experimenting and publishing
in the Sciences. Translating expressions between LaTeX and CAS syntaxes is, therefore, a
typical task in the everyday life of our hypothetical researcher. Despite this common need, no
reliable translation from a presentational format, such as LaTeX, to a computable format, such as
Mathematica, is available to date. The only option our hypothetical researcher has is to manually
translate the expression into the specific syntax of a CAS. This process is time-consuming and
often error-prone.
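To make the manual translation step more concrete, the following Python sketch hard-codes a single reading of the generic LaTeX string from Table 1.1 and rewrites it into a Maple-style JacobiP call. It is only an illustration under stated assumptions, not the approach developed in this thesis: the regular expression, the token table, and the function names to_maple and translate_tokens are hypothetical, and the rule silently assumes that P with a subscript and a parenthesized superscript always denotes a Jacobi polynomial.

```python
import re

# Hypothetical rule for one notation only: P_n^{(a, b)}(arg) read as a Jacobi polynomial.
JACOBI = re.compile(r"P_(\w+)\^\{\((.+?),\s*(.+?)\)\}\((.+)\)$")

# Hand-written, incomplete mapping from LaTeX commands to Maple-style names.
TOKENS = {r"\alpha": "alpha", r"\beta": "beta", r"\Theta": "Theta", r"\cos": "cos"}


def translate_tokens(fragment: str) -> str:
    # Make implicit multiplication explicit, e.g. a\Theta becomes a*\Theta.
    fragment = re.sub(r"(?<=[A-Za-z0-9])(?=\\)", "*", fragment)
    for latex_name, cas_name in TOKENS.items():
        fragment = fragment.replace(latex_name, cas_name)
    return fragment


def to_maple(latex: str) -> str:
    match = JACOBI.match(latex)
    if match is None:
        # Any notation outside the single hard-coded pattern is rejected.
        raise ValueError(f"unsupported notation: {latex}")
    n, a, b, arg = (translate_tokens(group) for group in match.groups())
    return f"JacobiP({n}, {a}, {b}, {arg})"


print(to_maple(r"P_n^{(\alpha, \beta)}(\cos(a\Theta))"))
```

Every additional notation would need another hand-written rule, and the rule above cannot tell whether P actually denotes a Jacobi polynomial in a given document, which is why such pattern lists alone do not yield a reliable translation.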
If a translation between LaTeX and CAS is so essential, why are there no translation tools
available? As is often the case in research, the reasons for this are manifold. First, there are
translation approaches available. Some CAS, such as Mathematica and SymPy, allow importing
LaTeX expressions. Most CAS support at least the Mathematical Markup Language (MathML),
since it is the current web standard to encode mathematical formulae. With numerous tools
available to convert LaTeX to MathML [18], a translation from LaTeX to CAS syntaxes should
not be a difficult task. However, none of these available translation techniques are reliable
and comprehensive. Table 1.2 illustrates how Mathematica, one of the major proprietary CAS,
fails to import even simple formulae. Another option is SnuggleTeX [251], a LaTeX to MathML
converter which also supports translations to Maxima [324]. SnuggleTeX fails to translate all
expressions in Table 1.2. Alternative translations via MathML as an intermediate format perform
similarly (as we will show later in Section 2.3).
Footnote 2: In the sequel, the acronym CAS is used interchangeably with its plural.
Footnote 3: https://www.latex-project.org/ [accessed 2021-10-01]
While the simple cases shown in Table 1.2 could be solved with a more comprehensive and flex-
ible parser and mapping strategy, such a solution would ignore the real challenge of translating
mathematics to CAS, the ambiguity. The interpretation of the majority of mathematical expres-
sions is context-dependent, i.e., the same formula may refer to different concepts in different
contexts. Take the expressions π(x + y) as an example. In number theory, the expression most
likely refers to the number of primes less than or equal to x + y. In another context, however,
it may just refer to a multiplication πx + πy. Without considering the context, an appropriate
translation of this ambiguous expression is infeasible. Today’s translation solutions, however,
do not consider the context of an input. Instead, they translate the expression based on internal
decisions, which are often not transparent to a user.
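The two readings of π(x + y) described above lead to entirely different computations. The following Python sketch makes this concrete. The function names `prime_pi` and `pi_times` and the sample values for x and y are chosen here for illustration only and do not come from any CAS.

```python
import math


def prime_pi(n: float) -> int:
    """Number-theoretic reading: count the primes p with p <= n (sieve of Eratosthenes)."""
    limit = math.floor(n)
    if limit < 2:
        return 0
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return sum(is_prime)


def pi_times(x: float, y: float) -> float:
    """Arithmetic reading: the constant pi multiplied by (x + y)."""
    return math.pi * (x + y)


# Illustrative sample values (an assumption of this sketch).
x, y = 4, 6
as_prime_count = prime_pi(x + y)   # counts the primes 2, 3, 5, 7
as_product = pi_times(x, y)        # pi * 10
print(as_prime_count, as_product)
```

A translator that silently picks one of the two readings returns a syntactically valid result in both cases, which is why the error is easy to miss.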
Table 1.3 shows the results of importing π(x + y) to different CAS. Each CAS in Table 1.3
interprets π as a function call but does not associate it with the prime counting function (nor
any other predefined function). Only SnuggleTeX translated π as the mathematical constant
to Maxima syntax. However, Maxima does not contain a prime counting function. The CAS
import functions consider the expression as a generic function with the name π. Surprisingly,
Mathematica still links π with the mathematical constant, which results in peculiar behaviour
for numeric evaluations. The expression N[Pi[x+y]] (numeric evaluation of the imported
expression) is evaluated to 3.14159[x + y]. Associating the variables x and y with numbers,
say x = y = 1, would result in the rather odd expression 3.14159[2].
Table 1.3: The results of importing π(x + y) in different CAS. For Maple, a MathML
representation was used. Content MathML was not tested, since there is no content dictionary
available that defines the prime counting function. SnuggleTeX translated the expression to
the CAS Maxima. The two rightmost columns show the expected expressions in the context
of the prime counting function or a multiplication. None of the CAS chooses either of the two
expected interpretations. Note that the prime counting function in Maple can also be written
as pi(x+y) and requires pre-loading the extra package NumberTheory. Nonetheless, this
function pi(x+y) is still different from the actual imported expression Pi(x+y). Note further
that Maxima does not define a prime counting function.
Why do existing translation techniques not allow specifying a context? Mainly because it
is an open research question of what this context is or needs to be. The exact information
needs for performing translations to CAS syntaxes, and where to find them, are unclear [11]. Some
required information is indeed encoded in the structure of the expression itself. Consider a
simple fraction $\frac{1}{2}$. This expression is context-independent and can be directly translated. The
expression $P_n^{(\alpha,\beta)}(x)$ in the context of OPSF is also often unambiguous for general-purpose
CAS. Since Mathematica supports no other formula with this presentational structure, i.e.,
P followed by a subscript and a superscript in parentheses, Mathematica is able to correctly
associate $P_\bullet^{(\bullet,\bullet)}(\bullet)$, where $\bullet$ are wildcards, with the function JacobiP. In other cases, the
immediate textual context of the formula provides sufficient information to disambiguate the
expression [54, 329]. Suppose an author explicitly declares π(x) as the prime counting function
right before she uses it with π(x + y). In this case, it might be sufficient to scan the surrounding
context for key phrases [183, 214, 329], like ‘prime counting function’, in order to map π to, for
instance, PrimePi in Mathematica.
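The key-phrase idea above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the approach developed in this thesis: the lookup table `KEY_PHRASES`, the function `translate_pi_call`, and the fallback to a multiplication are all hypothetical. The only external fact used is that `PrimePi` is the name of Mathematica's built-in prime counting function.

```python
# Hypothetical lookup table: key phrase found in the surrounding text
# -> name of the Mathematica function that pi should be mapped to.
KEY_PHRASES = {
    "prime counting function": "PrimePi",
}


def translate_pi_call(argument: str, context: str) -> str:
    """Translate pi(<argument>) to Mathematica syntax based on the textual context."""
    lowered = context.lower()
    for phrase, head in KEY_PHRASES.items():
        if phrase in lowered:
            return f"{head}[{argument}]"
    # Assumed fallback for this sketch: read pi as the constant times the argument.
    return f"Pi*({argument})"


print(translate_pi_call("x + y", "Let π(x) denote the prime counting function."))
print(translate_pi_call("x + y", "The perimeter grows linearly."))
```

Such a scan only works when the defining phrase actually appears near the formula, which is often not the case.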
Often, the semantic explanations of mathematical objects in an article are scattered around in
the context or absent entirely [394]. An interested reader needs to retrieve sufficient semantic
explanations and correctly link them with mathematical objects in order to comprehend
the meaning of a complex formula. Sometimes, an author presumes the interpretation of an
expression can be considered as common knowledge and, therefore, does not require further
explanations. Suppose π(x + y) refers to a multiplication between π and (x + y). In general,
an author may consider π (the mathematical constant) as common knowledge and does not
explicitly declare its meaning. The same could be true for scientific articles, where the length is
often limited. An article about prime numbers probably does not explicitly declare the meaning of
π(x + y) because the author presumes the semantics are unambiguous given the overall context
of the article.
In other cases, the information needs go beyond a simple text analysis. Consider π(x + y)
as a generic function that was previously defined in the article and simply has no name. An
appropriate translation would require retrieving the definition of the function from the context.
But even if a function is well-known and supported by a CAS, a direct translation might be
inappropriate because the definition in the CAS is not what our researcher expected [3, 13].
Legendre’s incomplete elliptic integral of the first kind F (φ, k), for example, is defined with
the amplitude φ as its first argument in the DLMF and Mathematica. In Maple, however, one
needs to use the sine of the amplitude sin(φ) for the first argument⁴. In turn, an appropriate
translation to Maple might be EllipticF(sin(phi), k) rather than EllipticF(phi, k),
depending on the source of the original expression. The English Wikipedia article about elliptic
integrals⁵ contains both versions and refers to them with F (φ, k) and F (x; k), respectively.
Even though both versions in Wikipedia refer to the same function, correct translations to
Maple of F (φ, k) and F (x; k) are not the same.
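The effect of the two argument conventions can be checked numerically. The Python sketch below does not call any CAS. It integrates the two standard integral representations, F(φ, k) = ∫₀^φ dθ / √(1 − k² sin²θ) and F(x; k) = ∫₀^x dt / √((1 − t²)(1 − k²t²)), with a simple Simpson rule. The helper names, the sample values, and the restriction to 0 ≤ φ < π/2 and 0 ≤ k < 1 are assumptions of this sketch.

```python
import math


def simpson(f, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def elliptic_f_amplitude(phi: float, k: float) -> float:
    """F(phi, k): the amplitude phi is the first argument."""
    return simpson(lambda t: 1 / math.sqrt(1 - (k * math.sin(t)) ** 2), 0.0, phi)


def elliptic_f_sine(x: float, k: float) -> float:
    """F(x; k): the sine of the amplitude, x = sin(phi), is the first argument."""
    return simpson(lambda t: 1 / math.sqrt((1 - t * t) * (1 - (k * t) ** 2)), 0.0, x)


phi, k = 0.5, 0.3                              # illustrative values
reference = elliptic_f_amplitude(phi, k)       # value meant by F(phi, k)
adapted = elliptic_f_sine(math.sin(phi), k)    # corresponds to EllipticF(sin(phi), k)
naive = elliptic_f_sine(phi, k)                # corresponds to EllipticF(phi, k)
print(reference, adapted, naive)
```

The adapted call agrees with the reference value, while the naive one-to-one translation yields a different number without raising any error.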
In cases of multi-valued functions, translations between different systems can become
eminently more complex [83, 91, 172]. Even for simple cases, such as the arccotangent
function arccot(x), the behavior of different CAS might be confusing. For example, since
arccot(x) is multi-valued, there are multiple solutions of arccot(−1). CAS, like any general
calculator, only compute values on the principal branches and, therefore, return
only a single value. The principal branches, however, are not necessarily uniformly
positioned among multiple systems [84, 172]. In turn, the returned value of a multi-valued
function may depend on the system, see Table 1.4. A translation of arccot(x) from the
DLMF to arccot(x) in Maple would be only correct for x > 0. Finally, CAS may also
compute irrational looking expressions without objections, e.g., arccot(1/0) returns 1.5708
in MATLAB⁶. Even for field experts, it can be challenging to keep track of every property
and characteristic of CAS [20, 100].

Table 1.4: Different computation results for arccot(−1) (inspired by [84]).
System or Source                arccot(−1)
[276] 1st printing              3π/4
[276] 9th printing              −π/4
Maple [36] v.2020.2             3π/4
Mathematica [393] v.12.1.1      −π/4
SymPy [252] v.1.5.1             −π/4
Axiom [173] v.Aug.2014          3π/4
Reduce [151] v.5865             3π/4
MATLAB [246] v.R2021a           −π/4
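The two values reported for arccot(−1) correspond to two common ways of fixing the principal branch. The Python sketch below reproduces both with the standard library only. The function names are chosen for illustration, and the sketch does not claim that any particular CAS implements arccot with exactly these formulas.

```python
import math


def arccot_cut_at_zero(x: float) -> float:
    """arccot(x) = arctan(1/x): values in (-pi/2, pi/2), jump at x = 0.

    x must be nonzero in this sketch.
    """
    return math.atan(1 / x)


def arccot_continuous(x: float) -> float:
    """arccot(x) = pi/2 - arctan(x): values in (0, pi), continuous at x = 0."""
    return math.pi / 2 - math.atan(x)


for x in (2.0, -1.0):
    print(x, arccot_cut_at_zero(x), arccot_continuous(x))
```

For positive arguments both definitions agree. For negative arguments they differ by π, which is exactly the gap between −π/4 and 3π/4.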
In combination, all of these issues underline that an accurate manual translation to the syntax of
CAS is challenging, time-consuming, error-prone, and requires deep and substantial knowledge
about the target system. Especially with the increasing complexity of the translated expressions,
errors during the translation process might be inevitable. Real-world scenarios often include
4 https://www.maplesoft.com/support/help/maple/view.aspx?path=EllipticF [accessed 2021-10-01]
5 https://en.wikipedia.org/wiki/Elliptic_integral [accessed 2021-10-01]
6 MATLAB evaluates 1/0 to infinity and the limit in positive infinity of the arccotangent function is π/2 (or roughly
1.5708). Yet, the interpretation of the division by zero is not wrong, since it follows the official IEEE 754 standard
for floating-point arithmetic [170].
Chapter 1 5
Introduction
Section 1.2. Research Gap
much more complicated formulae compared to the expressions in Table 1.2 or even equation (1.1).
Moreover, if an error occurs, its cause can be very challenging to detect and trace
back to its origin. The issue of translating arccot(x) to Maple, for example, may remain
undiscovered until a user evaluates the function at negative arguments. If the function is
embedded into a more complex equation, even experts can lose track of potential issues. In
combination with unreliable translation tools, working with CAS may even be frustrating.
Mathematica, for example, is able to import our test expression (1.1) mentioned earlier without
throwing an error⁷. However, investigating the imported expression reveals an incorrect
translation due to an issue with factorials. To productively work with CAS, our hypothetical
researcher from above needs to carefully evaluate if the automatically imported expression was
correct. As a consequence, existing translation approaches are not practically useful.
In this thesis, I will focus on discovering the information needs to perform correct translations
from presentational formats, here mainly LaTeX, to computational formats, here mainly CAS
syntaxes. My personal motivation is to improve the workflow of researchers by providing them with a
reliable translation tool that offers crucial additional information about the translation process.
Further, I limit the support of such a translation tool to general-purpose CAS, since many
general mathematical expressions simply cannot be translated to appropriate CAS expressions
for task-specific CAS (or other mathematical software, such as theorem provers). The focus on
general-purpose CAS allows me to provide a broad solution to a general audience. Note further
that, in this thesis, I mostly focus on the two major CAS Maple and Mathematica. However,
the goal is to provide a translation tool that is easy to extend to support more CAS.
Further, the real-world applications of such a translation tool go far beyond an improved workflow
with CAS. A computable formula can be automatically verified with CAS [51, 52, 2,
8, 13, 153, 184, 414, 415], translated to other semantically enhanced formats, such as
OpenMath [53, 57, 119, 152, 303, 361], content MathML [59, 60, 159, 270, 318, 342] or other CAS
syntaxes [110, 361], imported into theorem provers [35, 57, 152, 163, 338, 375], or embedded in
interactive documents [85, 131, 150, 162, 201, 284]. Since an appropriate translation is generally
context-dependent, a translator must use MathIR [141] techniques to access sufficient semantic
information. Hence, advances in translating LaTeX to CAS syntaxes also contribute directly
towards related MathIR tasks, including entity linking [150, 208, 212, 316, 319, 321, 322], math
search engines [92, 181, 182, 203, 211, 236, 274], semantic tagging of math formulae [71, 402],
recommendation systems [30, 31, 50, 319], type assistance systems [103, 106, 14, 321, 400], and
even plagiarism detection platforms [253, 254, 334].
6 Chapter 1
Introduction
Section 1.3. Research Objective
3. inflexible, i.e., slight changes in the notation can cause the translation to fail (see the
integral imports from Table 1.2); and
4. limited to simple expressions due to missing mappings between function definition sources,
i.e., even with semantic information, a translation often fails.
Issue 4 arises from the fact that there are semantically enhanced data formats that have been
specifically developed to make expressions between CAS interchangeable, such as
OpenMath [119, 303, 361] and content MathML [318, 343]. Nonetheless, most CAS do not support
OpenMath natively [303] and the support for content MathML is limited to school
mathematics [318]. The reason is that such a translation requires a database that maps functions between
different semantic sources. As discussed above, creating such a comprehensive database can be
time-consuming due to slight differences between the systems (e.g., positions of branch cuts,
different supported domains, etc.) [361]. Hence, for economic reasons, crafting and maintaining
such a library is unreasonable. Translations between semantically enhanced formats, e.g., between
CAS syntaxes, OpenMath, or content MathML, are consequently often unreliable.
In previous research, I focused on issues 2-4 by developing a rule-based LaTeX to
CAS translator, called LACAST. Originally, LACAST performed translations from semantic LaTeX to
Maple. Relying on semantic LaTeX allows LACAST to largely ignore the ambiguity issue 1 and
focus on the other problems. For this thesis, I continued to develop LACAST to further mitigate
the inflexibility and limitation issues 3 and 4. Further, I focused on extending LACAST to become
the first context-aware translator to tackle the context-independency issue 1.
Research Objective
Develop and evaluate an automated context-sensitive process that makes presentational
mathematical expressions computable via computer algebra systems.
Research Tasks
I Analyze the strengths and weaknesses of existing semantification approaches for
translating mathematical expressions to computable formats.
II Develop a semantification process that will improve on the weaknesses of current
approaches.
III Implement a system for the automated semantification of mathematical expressions
in scientific documents.
IV Implement an extension of the system to provide translations to computer algebra
systems.
V Evaluate the effectiveness of the developed semantification and translation system.
Chapter 1 7
Introduction
Section 1.4. Thesis Outline
1.4.1 Publications
Most parts of this thesis were published in international peer-reviewed conferences and journals.
Table 1.5 provides an overview of the publications that are reused in this thesis. The first column
identifies the chapter a publication contributed to. The venue rating was taken from the Core
ranking8 for conferences and the Scimago Journal Rank (SJR)9 for journal articles. Each rank
8 http://portal.core.edu.au/conf-ranks/ with the ranks: A* – flagship conference (top 5%),
A – excellent conference (top 15%), B – good conference (top 27%), and C – remaining conferences [accessed
2021-10-01].
9 https://www.scimagojr.com/ with the ranks Q1 – Q4 where Q1 refers to the best 25% of journals in the
field, Q2 to the second best quarter, and so on [accessed 2021-10-01].
was retrieved for the year of publication (or year of submission, in case the paper has not been
published yet). Table 1.6 similarly shows publications that partially contributed towards the goal
of this thesis but are not reused within a chapter. Note that the publication [3] (in Table 1.6) was
part of my Master’s thesis and contributed towards this doctoral thesis as a preliminary project.
The journal publication [13] (also in Table 1.6) is an extended and (with new results) updated
version of the thesis and the mentioned article [3]. The venue abbreviations in both tables are
explained in the glossary. Lastly, note that the TPAMI journal article [11] is reused in Chapter 4 (for
the methodology) and in Chapter 5 (for the evaluation) to provide a coherent structure. My
publications, talks, and submissions are separated from the general bibliography in the back
matter and can be found on page 171.
Table 1.6: Overview of secondary publications that partially contributed to this thesis.
Year   Venue   Type         Length   Author Position   Venue Rating   Ref.
2020   CLEF    Workshop     Full     4 of 6            n/a            [16]
2020   EMNLP   Workshop     Full     2 of 4            Core A         [1]
2019   AJIM    Journal      Full     1 of 4            SJR Q1         [13]
2018   CICM    Conference   Short    1 of 4            n/a            [12]
2017   CICM    Conference   Full     4 of 9            n/a            [3]
Preliminary Work I had the first contact with the problem of translating LATEX to CAS
syntaxes during my undergraduate studies in mathematics. During that time, I regularly used
10
The methodology part of this journal is reused in Chapter 4 while the evaluation part is reused in Chapter 5.
CAS like MATLAB and SymPy for numeric simulations and for plotting results. At the same
time, we were required to hand in our homework as LaTeX files. While exporting content from the
CAS to LaTeX files was rather straightforward, the other way around, i.e., importing LaTeX into
the CAS, required manual conversions. I decided to explore the reasons for this shortcoming in
my Master’s thesis. During that time, I developed the first version of a semantic LaTeX to CAS
translator, which was later coined LACAST¹¹. The results from this first study were published at
the Conference on Intelligent Computer Mathematics (CICM) in 2017.
This first version of LACAST focused specifically on the CAS Maple but was designed modularly
to allow later extensions to other CAS. The main limitation of LACAST, however, was the
requirement of using semantic LaTeX macros to disambiguate mathematical expressions manually.
An automatic disambiguation process did not exist at the time. Moreover, only a few previous
projects focused on a semantification for translating mathematical formats. Hence, I continued
my research in this direction.
In the subsequent parts of this thesis, I will use ‘we’ rather than ‘I’, since none
of the presented contributions would have been possible without the tremendous and fruitful
discussions and help from advisors, colleagues, students, and friends.
Chapter 2 — [18]
11
LaTeX to CAS Translator.
We discovered that three of the nine tools were able to generate content MathML but with
insufficient accuracy. None of the available tools were capable of analyzing a context for a
given formula. Hence, the converters were unable to conclude the correct semantic information
for most of the symbols and functions. In our study, we proposed a manual semantification
approach that semantically enriches the translation process of existing converters by feeding
them semantic information from the surrounding context of a formula. The enrichment process
was manually illustrated via the converter LaTeXML, which allowed us to add custom semantic
macros to improve the generated MathML data. In fact, we used this manual approach to create
the entries of MathMLben in the first place.
Naturally, our next goal was to automatically retrieve semantic information from the context
of a given formula. Around this time, word embeddings [256] began to gain interest in the
MathIR community [121, 215, 242, 400, 404]. It seemed that vector representations were able to
capture some semantic properties of tokens in natural languages. Can we create such semantic
vector representations of mathematical expressions too? Unfortunately, we discovered that
the related work in this new area of interest did not discuss a crucial underlying issue with
embedding mathematical expressions. In math expressions, certain symbols or entire groups of
tokens are fixed, such as the red tokens in the Gamma function Γ(x) or the Jacobi polynomial
$P_n^{(\alpha,\beta)}(x)$, while others may vary (gray). Inspired by words in natural languages, we call these
fixed tokens the stem of a mathematical object or operation. Unfortunately, in mathematics, this
stem is context-dependent. If π is a function, the red tokens are its stem π(x + y). However,
if π is not a function, the stem is just the symbol itself π(x + y). If we do not know the stem
of a mathematical object, how can we group them so that a trained model understands the
connection between variations like Γ(z) and Γ(x)? The answer is: we cannot. The only
alternative is to use context-independent representations, e.g., we only embed the identifiers or
the entire expression. Each of these approaches has advantages and disadvantages. We shared
our discussion with the community at the BIRNDL Workshop at the conference on Research
and Development in Information Retrieval (SIGIR) in 2019.
Chapter 2 — [9]
Chapter 3 — [15]
Unfortunately, this sets us back to the beginning, where we need manually crafted semantic
LaTeX. We started to investigate the issue of interpreting the semantics of mathematical
expressions from a different perspective. As we will see later in Section 2.2.4, humans tend to visualize
mathematical expressions in a tree structure, where operators, functions, or relations are parent
nodes of their components. Identifiers and other terminal symbols are the leaves of these trees.
The MathML tree data structure comes close to these so-called expression trees (see Section 2.2.4)
but does not strictly follow the same idea [331]. The two aforementioned context-independent
approaches to embed mathematical expressions take either the leaves or the roots of such trees.
The subtrees in between are the context-dependent mathematical objects we need. Not all
subtrees, however, are meaningful, and the mentioned expression trees are only theoretical
interpretations. In searching for an approach to discover meaningful subexpressions, which we
call Mathematical Objects of Interest (MOI), we performed the first large-scale study of
mathematical notations on real-world scientific articles. In this study, we followed the assumption
that every subexpression with at least one identifier can be semantically important. Hence, we
split every formula into its MathML subtrees and analyzed their frequency in the corpora.
Overall, we analyzed over 2.5 billion subexpressions in 300 million documents and showed
that the frequency distribution of mathematical subexpressions is similar to words in natural
language corpora. By applying known frequency-based ranking functions, such as BM25, we
were also able to discover topic-relevant notations. We published these results at The Web
Conference (WWW) in 2020.
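The counting step described above can be illustrated with a small Python sketch. It is a strong simplification: expression trees are modeled as nested tuples of the form (operator, child, ...) instead of MathML, a leaf counts as an identifier if it is purely alphabetic, and the three sample formulae are made up for this example.

```python
from collections import Counter


def has_identifier(tree) -> bool:
    """True if the (sub)tree contains at least one identifier leaf."""
    if isinstance(tree, str):
        return tree.isalpha()  # assumption: alphabetic leaves are identifiers
    return any(has_identifier(child) for child in tree[1:])


def subexpressions(tree):
    """Yield every subtree, including the tree itself, that contains an identifier."""
    if has_identifier(tree):
        yield tree
    if not isinstance(tree, str):
        for child in tree[1:]:  # tree[0] is the operator label, not a child
            yield from subexpressions(child)


# Made-up mini corpus: Gamma(x) + 1, 2 * Gamma(x), Gamma(z)
formulae = [
    ("plus", ("apply", "Gamma", "x"), "1"),
    ("times", "2", ("apply", "Gamma", "x")),
    ("apply", "Gamma", "z"),
]
counts = Counter(sub for formula in formulae for sub in subexpressions(formula))
for subexpression, frequency in counts.most_common(3):
    print(frequency, subexpression)
```

Frequency tables of this kind are what ranking functions such as BM25 would then operate on.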
Chapter 3 — [14]
The applications that we derived from simply counting mathematical notations were surprisingly
versatile. For example, with the large set of indexed math notations, we implemented the
first type assistant system for math equations, developed a new faceted search engine for
zbMATH, and enabled new approaches to measure potential plagiarism in equations. Besides these
practical applications, it also gave us the confidence to continue focusing on subexpressions for
our LaTeX semantification. Previous projects that aimed to semantically enrich mathematical
expressions with information from the surrounding context primarily focused on one of the
earlier mentioned extremes, i.e., the leaves or roots in expression trees [139, 214, 279, 329, 330].
Our study also revealed that the majority of unique mathematical formulae are neither single
identifiers nor highly complex mathematical expressions. Hence, we concluded that we should
focus on semantically enriching subexpressions (subtrees) rather than the roots or leaves. We
proposed a novel context-sensitive translation approach based on semantically annotated MOI
and shared our theoretical concept with the community at the International Conference on
Mathematical Software (ICMS) in 2020.
Chapter 3 — [10]
Afterward, we started to realize the proposed pipeline with a specific focus on Wikipedia. We
focused on this encyclopedia for two reasons. First, Wikipedia is a free and community-driven
encyclopedia and, therefore, (a) less strict on writing styles and (b) more descriptive compared to
scientific articles. Second, Wikipedia can actively benefit from our contribution since additional
semantic information about mathematical formulae can support users of all experience levels
to read and comprehend articles more efficiently [150]. Moreover, a successful translation from
a formula in Wikipedia to a CAS makes the formula computable, which enables numerous
additional applications. In theory, a mathematical article could be turned into an interactive
document to some degree with our translations. However, the most valuable application of a
translation of formulae in Wikipedia would be the ability to check equations for their
plausibility. With the help of CAS, we are able to analyze if an equation is semantically correct or
suspicious. This evaluation would enable existing quality measures in Wikipedia to incorporate
mathematical equations for the first time. The results from our novel context-sensitive
translator, including the plausibility check algorithms, have been accepted for publication in the IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI) journal and are currently
in press.
Currently, we are also actively working on extending the backbone of Wikipedia itself for
presenting additional semantic information about mathematical expressions by hovering over
or clicking on the formula. This new feature helps Wikipedia users to better understand the
meaning of mathematical formulae by providing details on the elements of formulae. Moreover,
it paves the way towards an interface to actively interact with mathematical content in Wikipedia
articles. We presented our progress and discussed our plans in the poster session at the JCDL
in 2020.
Chapters 6 — [17]
Evaluating Digital Mathematical Libraries Alongside this main research path, we continuously
improved and extended LACAST with new features and new supported CAS. Our first goal
was to verify the translated, now computable, formulae in the DLMF. The primary motivation
behind this approach was to quantitatively measure the accuracy of LACAST translations. How
can we verify if a translation was correct? The well-established Bilingual Evaluation Understudy
(BLEU) [282] measure in natural language translations is not directly applicable for mathematical
languages because an expression may contain entirely different tokens but still be equivalent
to the gold standard. Since the translation is computable, however, we can take advantage of
the power of CAS to verify a translation. The basic idea is that a human-verified equation in
one system must remain valid in the target system. If this is not the case, only three sources
of errors are possible: either the source equation, the translation, or the CAS verification was
incorrect. With the assumption that equations in the DLMF and major proprietary CAS are
mostly error-free, we can translate equations from the DLMF to discover issues within LACAST.
First, we focused on symbolic verifications, i.e., we used the CAS to symbolically simplify the
difference between the left- and right-hand side of an equation. If the simplified difference is 0,
the CAS symbolically verified the equivalence of the left- and right-hand side and confirmed a
correct translation via LACAST. Additionally, we extended the verification approach to include
more precise numeric evaluations. If a symbolic manipulation failed to return 0, it could also
mean the CAS was unable to simplify the expression. To overcome this issue, we numerically
calculate the difference on specific test values and check if it is below a given threshold.
If all test calculations are below the threshold, we consider the equation numerically verified. Even
though this approach cannot verify equivalence, it is very effective in discovering disparity. We
published the first paper with this new verification approach based on Maple at the CICM in
2018.
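The numeric part of this verification idea can be sketched in Python as follows. The sketch only covers the numeric check, not the symbolic simplification, and it runs on plain Python functions rather than on translated CAS expressions. The function name `numerically_verified`, the test values, and the threshold are assumptions made for this illustration.

```python
import math
from typing import Callable, Iterable


def numerically_verified(
    lhs: Callable[[float], float],
    rhs: Callable[[float], float],
    test_values: Iterable[float],
    threshold: float = 1e-10,
) -> bool:
    """Return True if |lhs(x) - rhs(x)| stays below the threshold on all test values."""
    return all(abs(lhs(x) - rhs(x)) < threshold for x in test_values)


test_values = [-1.5, -0.5, 0.5, 1.5, 2.0]  # illustrative test points

# A valid identity: sin(2x) = 2 sin(x) cos(x)
valid = numerically_verified(
    lambda x: math.sin(2 * x),
    lambda x: 2 * math.sin(x) * math.cos(x),
    test_values,
)

# A deliberately broken right-hand side, e.g. a factor lost during translation
broken = numerically_verified(
    lambda x: math.sin(2 * x),
    lambda x: 2 * math.sin(x),
    test_values,
)
print(valid, broken)
```

A passing check on finitely many points is evidence, not proof, of equivalence, whereas a single failing point is enough to flag a suspicious equation or translation.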
Chapter 5 — [2]
The extension of the system and the new results led us to an extended journal version of the
initial LACAST publication [3]. This extended version mostly covered parts of my Master’s thesis
and is not reused in this thesis. For technical details about LACAST, see the journal publication [13].
In Appendix D available in the electronic supplementary material, we summarized all significant
issues and reported bugs we discovered via LACAST. The section also includes new issues that we
discovered during the work on the journal publication. This journal version was published in
the Aslib Journal of Information Management in 2019.
Appendix D — [13]
It turned out that LACAST translations on semantic LaTeX were so stable that we can use the
approach for verifying translations also to specifically search for errors in the DLMF
and issues in CAS. To maximize the number of supported DLMF formulae, we implemented
additional heuristics in LACAST, such as a logic to identify the end of a sum or to correctly
interpret prime notations as derivatives. Additionally, we added support for translations to
Mathematica and SymPy. We extended the support for Mathematica even further to perform
the same verifications in Mathematica as in Maple. The Mathematica support finally allows
us to identify computational differences between two major proprietary CAS. Moreover, we extended
the previously introduced symbolic and numeric evaluation pipeline with more sophisticated
variable extraction algorithms, more comprehensive numeric test values, resolved substitutions,
and improved constraint-awareness. All discovered issues are summarized in Appendix D
available in the electronic supplementary material. We further made all translations of the
DLMF formulae publicly available, including the symbolic and numeric verification results. The
results of this recent study have been published at the international conference on Tools and
Algorithms for the Construction and Analysis of Systems (TACAS).
Chapter 5 — [8]
We also applied the same verification technique to the Wikipedia articles we mentioned ear-
lier, which enabled LACAST to symbolically and numerically verify even complex equations in
Wikipedia articles. This evaluation is also part of the TPAMI submission.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.