André Greiner-Petter

Making Presentation Math Computable
A Context-Sensitive Approach for Translating LaTeX to Computer Algebra Systems
André Greiner-Petter
Berlin, Germany

ISBN 978-3-658-40472-7 ISBN 978-3-658-40473-4 (eBook)


https://doi.org/10.1007/978-3-658-40473-4

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the
book’s Creative Commons license and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with
regard to jurisdictional claims in published maps and institutional affiliations.

This Springer Vieweg imprint is published by the registered company Springer Fachmedien
Wiesbaden GmbH, part of Springer Nature.
The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany
Front Matter

Contents

FRONT MATTER
List of Figures
List of Tables
Abstract
Zusammenfassung
Acknowledgements

CHAPTER 1
Introduction
1.1 Motivation & Problem
1.2 Research Gap
1.3 Research Objective
1.4 Thesis Outline
1.4.1 Publications
1.4.2 Research Path

CHAPTER 2
Mathematical Information Retrieval
2.1 Background and Overview
2.2 Mathematical Formats and Their Conversions
2.2.1 Web Formats
2.2.2 Word Processor Formats
2.2.3 Computable Formats
2.2.4 Images and Tree Representations
2.2.5 Math Embeddings
2.3 From Presentation to Content Languages
2.3.1 Background
2.3.2 Benchmarking MathML
2.3.3 Evaluation of Context-Agnostic Conversion Tools
2.3.4 Summary of MathML Converters
2.4 Mathematical Information Retrieval for LaTeX Translations

CHAPTER 3
Semantification of Mathematical LaTeX
3.1 Semantification via Math-Word Embeddings
3.1.1 Foundations and Related Work
3.1.2 Semantic Knowledge Extraction
3.1.3 On Overcoming the Issues of Knowledge Extraction Approaches
3.1.4 The Future of Math Embeddings
3.2 Semantification with Mathematical Objects of Interest
3.2.1 Related Work
3.2.2 Data Preparation
3.2.3 Frequency Distributions of Mathematical Formulae
3.2.4 Relevance Ranking for Formulae
3.2.5 Applications
3.2.6 Outlook
3.3 Semantification with Textual Context Analysis
3.3.1 Semantification, Translation & Evaluation Pipeline

CHAPTER 4
From LaTeX to Computer Algebra Systems
4.1 Context-Agnostic Neural Machine Translation
4.1.1 Training Datasets & Preprocessing
4.1.2 Methodology
4.1.3 Evaluation of the Convolutional Network
4.2 Context-Sensitive Translation
4.2.1 Motivation
4.2.2 Related Work
4.2.3 Formal Mathematical Language Translations
4.2.4 Document Pre-Processing
4.2.5 Annotated Dependency Graph Construction
4.2.6 Semantic Macro Replacement Patterns

CHAPTER 5
Qualitative and Quantitative Evaluations
5.1 Evaluations on the Digital Library of Mathematical Functions
5.1.1 The DLMF dataset
5.1.2 Semantic LaTeX to CAS translation
5.1.3 Evaluation of the DLMF using CAS
5.1.4 Results
5.1.5 Conclude Quantitative Evaluations on the DLMF
5.2 Evaluations on Wikipedia
5.2.1 Symbolic and Numeric Testing
5.2.2 Benchmark Testing
5.2.3 Results
5.2.4 Error Analysis & Discussion
5.2.5 Conclude Qualitative Evaluations on Wikipedia

CHAPTER 6
Conclusion and Future Work
6.1 Summary
6.2 Contributions and Impact of the Thesis
6.3 Future Work
6.3.1 Improved Translation Pipeline
6.3.2 Improve LaTeX to MathML Converters
6.3.3 Enhanced Formulae in Wikipedia
6.3.4 Language Independence

BACK MATTER
Glossary
Bibliography of Publications, Submissions & Talks
Bibliography

List of Figures

2.1 Reference map of mathematical formats and translations between them.
2.2 The math template editor of Microsoft’s Word [395].
2.3 An expression tree representation of the explicit Jacobi polynomial definition in terms of the hypergeometric function.
2.4 GUI to support the creation of our gold standard MathMLben.
2.5 Overview of the MathML tree edit distances to the gold standard.
2.6 Runtime performances of LaTeX to MathML conversion tools.
2.7 Four different layers of math objects in a single mathematical expression.

3.1 t-SNE plot of top-1000 closest vectors of the identifier f.
3.2 Unique subexpressions for each complexity in arXiv and zbMATH.
3.3 Frequency and Complexity Distributions of Math Expressions.
3.4 Most frequent math expressions in arXiv.
3.5 Comparison plot of most frequent expressions in arXiv and zbMATH.
3.6 Top-20 search results from topic-specific subsets.
3.7 Search results for the query ‘Jacobi polynomial.’
3.8 Pipeline of the proposed context-sensitive conversion process.

4.1 Mathematical semantic annotation in Wikipedia.
4.2 The workflow of our context-sensitive translation pipeline.

5.1 Example argument identifications for sums.
5.2 The workflow of the evaluation engine and the overall results.
5.3 The numeric test values and global constraints.

6.1 Layers of a mathematical expression with mathematical objects (MOI).
6.2 The annotated defining formula of Jacobi polynomials in the English Wikipedia article.
6.3 Translation information for equation (6.3).
6.4 Proposed pipeline to improve existing LaTeX to MathML converters.
6.5 Semantic enhancement of the formula E = mc².


List of Tables

1.1 Different representations of a Jacobi polynomial.
1.2 Examples of Mathematica’s LaTeX import function.
1.3 The results of importing π(x + y) in different CAS.
1.4 Different computation results for arccot(−1) (inspired by [84]).
1.5 Overview of the primary publications.
1.6 Overview of secondary publications.

2.1 Overview table of available mathematical format translations.
2.2 LaTeX to CAS translation comparison.
2.3 Special content symbols added to LaTeXML.

3.1 Find the Term where Term is a word that is to X what Y is to Z.
3.2 The cosine distances of f regarding to the hits in Table 3.1.
3.3 Descriptive terms for f in a given context.
3.4 Mathematical Objects of Interests Dataset Overview.
3.5 Settings for the retrieval experiments.
3.6 Top s(t, D) scored expressions in zbMATH.
3.7 Most frequent expressions on topic-specific subsets of zbMATH.
3.8 Suggested autocompleted math expressions.

4.1 Results of our neural machine translations.
4.2 Comparison between Mathematica and our machine translation.
4.3 Machine translations on 100 random DLMF samples.
4.4 Examples of our machine translations from LaTeX to Mathematica.
4.5 Mappings and likelihoods for a semantic LaTeX macro.

5.1 Examples blueprints for subscripts of sums and products.
5.2 Translations for the prime derivative of the Hurwitz zeta function.
5.3 The symbolic and numeric evaluations on Wikipedia.
5.4 Performance of description extractions via MLP.
5.5 Performance of semantification from LaTeX to semantic LaTeX.
5.6 Performance comparison for translating LaTeX to Mathematica.


Abstract

This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax
of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Science,
Technology, Engineering, and Mathematics (STEM), LaTeX has become the de facto standard
to typeset mathematical formulae in publications. Since scientists are generally required to
publish their work, LaTeX has become an integral part of today’s publishing workflow. On the
other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and
visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple
arithmetic expressions and are, therefore, insufficient for most use cases. Consequently, the
workflow of experimenting and publishing in the sciences often includes time-consuming and
error-prone manual conversions between presentational LaTeX and computational CAS formats.
To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS,
this thesis makes the following three contributions.
First, it provides an approach to semantically enhance LaTeX expressions with sufficient semantic
information for translations into CAS syntaxes. This so-called semantification process analyzes
the structure of the formula and its textual context to infer semantic information. The
research for this semantification process additionally contributes towards related Mathematical
Information Retrieval (MathIR) tasks, such as mathematical education assistance, math
recommendation and question answering systems, search engines, automatic plagiarism detection,
and math typing assistance systems.
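
To make the role of textual context concrete, the following is a minimal sketch in Python and not the approach developed in this thesis. It takes the expression π(x + y), which the List of Tables names as an import example, and chooses between two readings based on cue phrases in the surrounding text. The cue list, the function name `translate_pi_expression`, and the restriction to this single expression shape are assumptions made purely for illustration; `PrimePi` and `Pi` are existing Mathematica symbols.

```python
import re

# Hypothetical cue phrases (an assumption for illustration, not taken from the thesis).
PRIME_COUNTING_CUES = ("prime counting", "number of primes")


def translate_pi_expression(latex: str, context: str) -> str:
    """Translate '\\pi(<arg>)' to Mathematica syntax.

    The surrounding text decides whether \\pi is read as the
    prime-counting function or as the constant multiplied by <arg>.
    """
    match = re.fullmatch(r"\\pi\s*\((.+)\)", latex.strip())
    if match is None:
        raise ValueError(f"unsupported expression: {latex!r}")
    argument = match.group(1).strip()

    text = context.lower()
    if any(cue in text for cue in PRIME_COUNTING_CUES):
        # Function application: pi is the prime-counting function.
        return f"PrimePi[{argument}]"
    # Default reading: the constant pi times the parenthesized term.
    return f"Pi*({argument})"


with_cue = translate_pi_expression(
    r"\pi(x + y)", "where \\pi denotes the prime counting function"
)
without_cue = translate_pi_expression(
    r"\pi(x + y)", "the circumference scales with the sum of both radii"
)
print(with_cue)
print(without_cue)
```

The same LaTeX string yields two different CAS expressions depending only on the accompanying sentence, which is the kind of ambiguity a context-agnostic import function cannot resolve.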
Second, this thesis presents LaCASt, the first context-aware LaTeX to CAS translation framework.
LaCASt uses the developed semantification approach to transform LaTeX expressions
into an intermediate semantic LaTeX format, which is then further translated to CAS based
on translation patterns. These patterns were manually crafted by mathematicians to ensure
accurate and reliable translations. For comparison, this thesis additionally elaborates on a
non-context-aware neural machine translation approach trained on a mathematical library generated
by Mathematica.
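
The pattern-based step from semantic LaTeX to CAS syntax can be pictured with a small Python sketch. It is an illustration under stated assumptions, not LaCASt's implementation: the macro shape `\JacobipolyP{a}{b}{n}@{x}` is assumed here in the style of DLMF semantic macros, and the pattern table, the `$i` placeholder syntax, and the function `translate` are hypothetical. The target forms `JacobiP[n, a, b, x]` (Mathematica) and `JacobiP(n, a, b, x)` (Maple) use functions that exist in both systems.

```python
import re

# Hypothetical pattern table: $0..$3 refer to the macro arguments in order of appearance.
PATTERNS = {
    "JacobipolyP": {
        "Mathematica": "JacobiP[$2, $0, $1, $3]",
        "Maple": "JacobiP($2, $0, $1, $3)",
    },
}

# Assumed macro shape: \Name{p1}{p2}...@{variable}, without nested braces.
MACRO = re.compile(
    r"\\(?P<name>[A-Za-z]+)(?P<params>(?:\{[^{}]*\})*)@\{(?P<var>[^{}]*)\}"
)


def translate(semantic_latex: str, cas: str) -> str:
    """Translate one semantic macro call into the syntax of the given CAS."""
    m = MACRO.fullmatch(semantic_latex.strip())
    if m is None or m["name"] not in PATTERNS:
        raise ValueError(f"no translation pattern for {semantic_latex!r}")
    # Collect parameters and the variable into one positional argument list.
    args = re.findall(r"\{([^{}]*)\}", m["params"]) + [m["var"]]
    pattern = PATTERNS[m["name"]][cas]
    # Replace each placeholder $i with the i-th argument.
    return re.sub(r"\$(\d)", lambda p: args[int(p.group(1))], pattern)


mathematica_form = translate(r"\JacobipolyP{a}{b}{n}@{x}", "Mathematica")
maple_form = translate(r"\JacobipolyP{a}{b}{n}@{x}", "Maple")
print(mathematica_form)
print(maple_form)
```

Because the semantic macro already names the function and fixes the role of each argument, the remaining work is a reordering of arguments per target system, which is why a lookup table suffices in this toy setting.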
Third, the thesis provides a novel approach to evaluate the performance of LaTeX to CAS
translations on large-scale datasets through automatic verification of equations in digital
mathematical libraries. This evaluation approach is based on the assumption that equations in digital
mathematical libraries can be computationally verified by CAS, if a translation between both
systems exists. In addition, the thesis provides an in-depth manual evaluation on mathematical
articles from the English Wikipedia.
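
The idea of verifying a translated equation computationally can be sketched as a numeric test: evaluate both sides at several test values and compare. The Python snippet below is only a stand-in that uses the standard `cmath` module in place of a CAS; the function `numerically_equal`, the test values, and the tolerance are assumptions chosen for illustration.

```python
import cmath


def numerically_equal(lhs, rhs, test_values, tolerance=1e-9):
    """Return True if lhs(x) and rhs(x) agree within tolerance at every test value."""
    return all(abs(lhs(x) - rhs(x)) <= tolerance for x in test_values)


# Hypothetical test values, including one complex point.
TEST_VALUES = [0.5, -1.5, 2.0, 0.3 + 0.7j]

# A correct identity: sin(2x) = 2 sin(x) cos(x).
identity_holds = numerically_equal(
    lambda x: cmath.sin(2 * x),
    lambda x: 2 * cmath.sin(x) * cmath.cos(x),
    TEST_VALUES,
)

# The same identity with a deliberately injected sign error on the right-hand side.
sign_error_holds = numerically_equal(
    lambda x: cmath.sin(2 * x),
    lambda x: -2 * cmath.sin(x) * cmath.cos(x),
    TEST_VALUES,
)

print(identity_holds, sign_error_holds)
```

A failed check does not say where the fault lies: the source equation, the translation, or the evaluating system may each be responsible, so failures are candidates for manual inspection rather than proofs of an error.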
The presented context-aware translation framework LaCASt increases the efficiency and reliability
of translations to CAS. Via LaCASt, we strengthened the Digital Library of Mathematical Functions
(DLMF) by identifying numerous issues, from missing or wrong semantic annotations to sign
errors. Further, via LaCASt, we were able to discover several issues with the commercial CAS
Maple and Mathematica. The fundamental approaches to semantically enhance mathematics
developed in this thesis additionally contributed towards several related MathIR tasks. For
instance, the large-scale analysis of mathematical notations and the studies on math embeddings
motivated new approaches for math plagiarism detection systems and search engines, and enabled
typing assistance for mathematical inputs. Finally, LaCASt translations will have a direct
real-world impact, as they are scheduled to be integrated into upcoming versions of the DLMF and
Wikipedia.

Zusammenfassung (German Abstract)

This dissertation addresses the problem of translating mathematical formulae between LaTeX
and Computer Algebra Systems (CAS). In the course of the digital age, LaTeX has become the
de facto standard for writing mathematical formulae on the computer, especially in the
disciplines of Science, Technology, Engineering, and Mathematics (STEM). Since scientists
commonly publish their work, LaTeX has become an integral part of modern research. At the same
time, scientists rely more and more on the capabilities of modern CAS to work effectively with
mathematical formulae, for example by transforming, solving, or visualizing them. However, the
current approaches that allow a translation from LaTeX to CAS, such as the internal import
functions of some CAS, are often limited to simple arithmetic expressions and are therefore of
little help in real everyday work. As a result, the work of modern scientists in the STEM
disciplines is often characterized by time-consuming and error-prone manual translations
between LaTeX and CAS.
This dissertation makes the following contributions to solve the problem of translating
mathematical expressions between LaTeX and CAS.
First, LaTeX is a format that encodes only the visual presentation of mathematical expressions,
not their semantic information. This semantic information is, however, necessary for CAS,
which do not permit ambiguous inputs. Therefore, as a first step towards a translation, this
thesis introduces a so-called semantification of mathematical expressions. This semantification
extracts semantic information from the context and the components of a formula in order to
draw conclusions about its meaning. Since semantification is a classic task in the field of
mathematical information retrieval, this part of the dissertation also contributes to related
topics. The approaches presented here are thus also useful for educational programs, question
answering systems, search engines, and digital plagiarism detection.
As a second contribution, this dissertation presents the first context-sensitive LaTeX to CAS
translation program, called LaCASt. LaCASt uses the previously introduced semantification to
transform LaTeX into an intermediate format that represents the semantic information
explicitly. This format is called semantic LaTeX because it is a technical extension of LaTeX.
The further translation to CAS is realized by heuristic translation patterns for mathematical
functions. These translation patterns were defined in collaboration with mathematicians to
ensure a correct translation in this last step. To better understand the benefits of a
context-sensitive translation, this thesis also presents, for comparison, a machine translation
based on neural networks that does not take the context of a formula into account.
The third contribution of this dissertation introduces a new method for evaluating mathematical
translations, which makes it possible to check even a large number of translations for
correctness. This method follows the idea that equations in mathematical libraries should
still be correct after being translated into a CAS. If this is not the case, then either the
source equation, the translation, or the CAS is faulty. It should be noted that each identified
source of error provides added value for the respective system. In addition to this automatic
evaluation, a manual analysis of translations is carried out on the basis of English Wikipedia
articles.
In summary, the context-sensitive translation program LaCASt enables a more efficient way of
working with CAS. With the help of these translations, several problems in the Digital Library
of Mathematical Functions (DLMF), such as incorrect information or sign errors, as well as
errors in the commercially distributed CAS Maple and Mathematica, could be automatically
uncovered and fixed.
The fundamental research on semantically enriching mathematical expressions presented here has
also made numerous contributions to related research topics. For example, the analysis of the
distribution of mathematical notations in large datasets has enabled new approaches in digital
plagiarism detection. Furthermore, work is currently underway to integrate the translations of
LaCASt into upcoming versions of Wikipedia and the DLMF.

Acknowledgements

This thesis would not have been possible without the tremendous help and support from
numerous family members, friends, colleagues, supervisors, and several international institutions.
In the following, I want to take the opportunity to thank all the individuals and organizations
that helped me along the way to make this work possible.
My first sincere thanks go to my prodigious doctoral advisers Bela Gipp and Akiko Aizawa.
Their continuous support and counsel enabled me to realize this thesis at marvelous places and
together with numerous wonderful people from all over the world. Their enduring
encouragement and assistance, Bela’s abiding and infectious positivity, and Akiko’s steadfast and kind
endorsement empowered my personal and professional life. Their competent and sincere
guidance helped me to find my way in the intricate maze of research and career decisions and
turned my often onerous time into a joyful and memorable experience.
Moreover, I am very grateful to my adviser and friend Moritz Schubotz, who supported and
guided me throughout the entire time of my doctoral thesis and even beyond. Our fruitful and
always engaging discussions, even when exhausting, enriched and positively affected most, if
not all, of my work. It is no exaggeration to say that my career, including my Master’s
thesis and this doctoral thesis, would not have been possible, or nearly as successful and joyful
as it has been, without his continuous and sincere support over the years. I am wholeheartedly
thankful for all the years we worked together.
I further wish to gratefully acknowledge my friends, colleagues, and advisers Howard Cohl,
Abdou Youssef, and Bruce Miller at the National Institute of Standards and Technology (NIST) for
their valuable advice, continuous drive to perfection, and our rewarding collaborations. I thank
Jürgen Gerhard at Maplesoft, who kindly provided me access and support for Maple on several
occasions. I am just as thankful for the assistance and support from Norman Meuschke, who
always helped me to overcome governmental and organizational hurdles, Corinna Breitinger,
who never failed to refit my gibberish, and my colleagues and friends Terry Lima Ruas and
Philipp Scharpf for many visionary discussions. I also thank all my collaborators and colleagues
with whom I had the distinct opportunity to work together, including Takuto Asakura, Fabian
Müller, Olaf Teschke, William Grosky, Marjorie McClain, Yusuke Miyao, Malte Ostendorff,
Bonita Saunders, Kenichi Iwatsuki, Takuma Udagawa, Anastasia Zhukova, and Felix Hamborg.
I further want to thank the students I worked with, including Avi Trost, Rajen Dey, Joon Bang,
Kevin Chen, and Felix Petersen. I especially appreciate the help and assistance from people
at the National Institute of Informatics (NII) to overcome governmental and daily life issues. I
wish to especially thank Rie Ayuzawa, Noriko Katsu, Akiko Takenaka, and Goran Topic.
My genuine gratitude also goes to my host organizations and those that provided financial
support for my research. I am thankful to the German Academic Exchange Program (DAAD)
for enabling two research stays at the NII in Tokyo, the NII for providing me a wonderful
work environment, the German Research Foundation (DFG) for financially supporting many of
my projects, the NIST for hosting me as a guest researcher, and Maplesoft for offering me
an internship during my preliminary research project on the Digital Library of Mathematical
Functions (DLMF). I finally thank the ACM Special Interest Group on Information Retrieval
(SIGIR), the University of Konstanz, the University of Wuppertal, and Maplesoft for supporting
several conference participations.
My last and most crucial gratitude goes to my family and friends, who always cheered me on in
good and bad times and constantly backed and supported me so that I could selfishly pursue my
dreams. I am deeply grateful to my lovely parents Rolf & Regina, who have always been on
my side and made all this possible behind the scenes. I am also tremendously thankful for the
enduring personal support from my dear friends Kevin, Lena, Vici, Dong, Peter, Vitor, Ayuko,
and uncountably more. Finally, I thank my lovely partner Aimi for brightening even the darkest
times and pushing every possible obstacle aside. I dedicate this thesis to my lovely parents, my
dear friends, and my enchanting girlfriend.

I went to the woods because I wanted to live deliberately. I wanted
to live deep and suck out all the marrow of life. To put to rout all
that was not life; and not, when I had come to die, discover that I
had not lived.
Neil Perry - Dead Poets Society

CHAPTER 1

Introduction

Contents
1.1 Motivation & Problem
1.2 Research Gap
1.3 Research Objective
1.4 Thesis Outline
1.4.1 Publications
1.4.2 Research Path

This thesis addresses the issue of translating mathematical expressions from LATEX to the syntax
of Computer Algebra Systems (CAS), which is typically a time-consuming and error-prone task
in the modern life of many researchers. A reliable and comprehensive translation approach
requires analyzing the textual context of mathematical formulae. In turn, research advances
in translating LATEX contribute directly towards related tasks in the Mathematical Information
Retrieval (MathIR) arena. In this chapter, I provide an introduction to the topic. Section 1.1
introduces my motivation and provides an overview of the problem. Section 1.2 summarizes
the research gap. In Section 1.3, I define the research objective and research tasks of this thesis.
Section 1.4 concludes with an outline of the thesis including an overview of the publications
that contributed to the goals of this thesis and the research path that led to these publications.

1.1 Motivation & Problem


Consider a researcher who is working on Jacobi polynomials and examines the existing English
Wikipedia article about the topic1. While she might be familiar with the Digital Library of
Mathematical Functions (DLMF) [98], a standard resource for Orthogonal Polynomials and
Special Functions (OPSF), equation (1.1) from the article might be new to her:

\[
P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{z-1}{2}\right)^{m}. \tag{1.1}
\]
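As a rough illustration of what validating equation (1.1) involves, the following Python sketch evaluates its right-hand side numerically with the standard library only. The function name `jacobi_series` is hypothetical, and the sketch assumes that the argument of the polynomial on the left-hand side and the variable z in the sum denote the same quantity. It compares the series against two closed forms that are well known for Jacobi polynomials: the degree-one polynomial and the Legendre special case for α = β = 0.

```python
from math import comb, factorial, gamma


def jacobi_series(n, alpha, beta, z):
    """Evaluate the right-hand side of equation (1.1) for real parameters."""
    prefactor = gamma(alpha + n + 1) / (factorial(n) * gamma(alpha + beta + n + 1))
    total = sum(
        comb(n, m)
        * gamma(alpha + beta + n + m + 1)
        / gamma(alpha + m + 1)
        * ((z - 1) / 2) ** m
        for m in range(n + 1)
    )
    return prefactor * total


# Degree one: P_1^(alpha,beta)(z) = (alpha + 1) + (alpha + beta + 2) * (z - 1) / 2
alpha, beta, z = 0.5, 1.5, 0.3
closed_form_degree_one = (alpha + 1) + (alpha + beta + 2) * (z - 1) / 2
print(jacobi_series(1, alpha, beta, z), closed_form_degree_one)

# Legendre special case (alpha = beta = 0): P_2(z) = (3 z^2 - 1) / 2
print(jacobi_series(2, 0.0, 0.0, 0.5), (3 * 0.5**2 - 1) / 2)
```

Such a spot check only covers a few sample points; it does not replace the symbolic manipulation a CAS provides.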

In order to analyze this new equation, e.g., to validate it, she wants to use CAS. CAS are
powerful mathematical software tools with numerous applications [207]. Today’s most widely
1 https://en.wikipedia.org/wiki/Jacobi_polynomials [accessed 2021-10-01]. Hereafter, dates follow the ISO 8601 standard, i.e., YYYY-MM-DD.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-40473-4_1.

© The Author(s) 2023
A. Greiner-Petter, Making Presentation Math Computable, https://doi.org/10.1007/978-3-658-40473-4_1

Table 1.1: Different representations of a Jacobi polynomial.

System               Representation
Rendered Version     P_n^{(α,β)}(cos(aΘ))
Generic LATEX        P_n^{(\alpha, \beta)}(\cos(a\Theta))
Semantic LATEX       \JacobipolyP{\alpha}{\beta}{n}@{\cos@{a\Theta}}
Maple [36]           JacobiP(n, alpha, beta, cos(a*Theta))
Mathematica [393]    JacobiP[n, \[Alpha], \[Beta], Cos[a \[CapitalTheta]]]
SymPy [252]          jacobi(n,Symbol('alpha'),Symbol('beta'),cos(a*Symbol('Theta')))

used CAS include Maple [36], Mathematica [393], and MATLAB [246]. Scientists use CAS2 to
simplify, manipulate, evaluate, compute, or even visualize mathematical expressions. Thus,
CAS play a crucial role in the modern era for pure and applied mathematics [8, 184, 207, 262]
and even found their way into classrooms [237, 363, 365, 389, 390]. In turn, CAS are the perfect
tool for the researcher in our example to examine the formula further. In order to use a CAS,
she needs to translate the expression into the correct CAS syntax.
Table 1.1 illustrates the differences between computable and presentational encodings for a
Jacobi polynomial. While the rendered version and the LATEX [220] encoding only provide
visual information, semantic LATEX [403] and the CAS encodings explicitly encode the meaning,
i.e., the semantics, of the formula. On the one hand, LATEX3 has become the de-facto standard
to typeset mathematics in scientific publications [129, 248, 402], especially in the domain of
Science, Technology, Engineering, and Mathematics (STEM). On the other hand, computational
advances make CAS an essential asset in the modern workflow of experimenting and publishing
in the Sciences. Translating expressions between LATEX and CAS syntaxes is, therefore, a
typical task in the everyday life of our hypothetical researcher. Despite this common need, no
reliable translation from a presentational format, such as LATEX, to a computable format, such as
Mathematica, is available to date. The only option our hypothetical researcher has is to manually
translate the expression into the specific syntax of a CAS. This process is time-consuming and
often error-prone.

Problem: No reliable translation from a presentational mathematical format to a
computable mathematical format exists to date.
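To make the syntactic differences in Table 1.1 concrete, the following Python sketch renders one semantic description of the Jacobi polynomial into the three CAS syntaxes listed there. The `RULES` dictionary and the `translate` function are hypothetical illustrations, not part of any existing translator; only the target strings are taken from Table 1.1.

```python
# Hypothetical per-system rendering rules; only the resulting strings
# are taken from Table 1.1.
RULES = {
    "Maple": {
        "symbols": {},
        "jacobi": "JacobiP({n}, {a}, {b}, {x})",
        "cos": "cos({0})",
        "mul": "{0}*{1}",
    },
    "Mathematica": {
        "symbols": {
            "alpha": r"\[Alpha]",
            "beta": r"\[Beta]",
            "Theta": r"\[CapitalTheta]",
        },
        "jacobi": "JacobiP[{n}, {a}, {b}, {x}]",
        "cos": "Cos[{0}]",
        "mul": "{0} {1}",
    },
    "SymPy": {
        "symbols": {
            "alpha": "Symbol('alpha')",
            "beta": "Symbol('beta')",
            "Theta": "Symbol('Theta')",
        },
        "jacobi": "jacobi({n},{a},{b},{x})",
        "cos": "cos({0})",
        "mul": "{0}*{1}",
    },
}


def translate(system):
    """Render the Jacobi polynomial of Table 1.1 in the syntax of one system."""
    rules = RULES[system]

    def symbol(name):
        # Names without a dedicated mapping are passed through unchanged.
        return rules["symbols"].get(name, name)

    argument = rules["cos"].format(rules["mul"].format(symbol("a"), symbol("Theta")))
    return rules["jacobi"].format(
        n=symbol("n"), a=symbol("alpha"), b=symbol("beta"), x=argument
    )


for system in RULES:
    print(system, "->", translate(system))
```

The sketch works only because the meaning of the expression is already fixed in advance; starting from generic LATEX, that meaning first has to be recovered.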

If a translation between LATEX and CAS is so essential, why are there no translation tools
available? As is often the case in research, the reasons for this are manifold. First, there are
translation approaches available. Some CAS, such as Mathematica and SymPy, allow importing
LATEX expressions. Most CAS support at least the Mathematical Markup Language (MathML),
since it is the current web standard to encode mathematical formulae. With numerous tools
available to transfer LATEX to MathML [18], a translation from LATEX to CAS syntaxes should
not be a difficult task. However, none of these available translation techniques are reliable
2 In the sequel, the acronym CAS is used interchangeably with its plural.
3 https://www.latex-project.org/ [accessed 2021-10-01]


Table 1.2: Examples of Mathematica’s LATEX import function ToExpression["x", TeXForm].
Tested with Mathematica [393] v.12.1.1. The second sum in row 8 (marked with ?) is only
partially correct. Since the second summand contains the summation index n, the second
summand should be part of the sum.

LATEX                        Rendering            Import Result
\int_a^b x dx                ∫_a^b x dx           Error
\int_a^b x \mathrm{d}x       ∫_a^b x dx           Error
\int_a^b x\, dx              ∫_a^b x dx           Integrate[x, {x, a, b}]
\int_a^b x\; dx              ∫_a^b x dx           Error
\int_a^b x\, \mathrm{d}x     ∫_a^b x dx           Error
\int_a^b \frac{dx}{x}        ∫_a^b dx/x           Error
\sum_{n=0}^N n^2             Σ_{n=0}^N n^2        Sum[n^2, {n, 0, N}]
\sum_{n=0}^N n^2 + n         Σ_{n=0}^N n^2 + n    Sum[n^2, {n, 0, N}] + n ?
{n \choose m}                (n choose m)         Error
\binom{n}{m}                 (n choose m)         Binomial[n, m]

and comprehensive. Table 1.2 illustrates how Mathematica, one of the major proprietary CAS,
fails to import even simple formulae. Another option is SnuggleTeX [251], a LATEX to MathML
converter which also supports translations to Maxima [324]. SnuggleTeX fails to translate all
expressions in Table 1.2. Alternative translations via MathML as an intermediate format perform
similarly (as we will show later in Section 2.3).
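The notation sensitivity visible in the integral rows of Table 1.2 can be illustrated with a small normalization step. The following Python sketch is a hypothetical, regex-based importer (the names `import_integral`, `SPACING`, `DIFFERENTIAL`, and `INTEGRAL` are made up for this example). It assumes single-character bounds and a differential at the very end of the input, and it deliberately returns None for anything else, such as the \frac{dx}{x} variant.

```python
import re

# Spacing macros that carry no meaning for the integrand.
SPACING = re.compile(r"\\[,;:!]|\\quad|\\qquad")
# An upright differential, e.g. \mathrm{d}x, is treated like a plain d.
DIFFERENTIAL = re.compile(r"\\mathrm\{d\}\s*")
# \int_<lower>^<upper> <integrand> d<variable>, single-character bounds only.
INTEGRAL = re.compile(r"^\\int_(\w)\^(\w)\s*(.+?)\s*d(\w)$")


def import_integral(latex):
    """Return a Mathematica-style Integrate[...] string, or None if unsupported."""
    text = SPACING.sub(" ", latex)
    text = DIFFERENTIAL.sub("d", text)
    text = re.sub(r"\s+", " ", text).strip()
    match = INTEGRAL.match(text)
    if match is None:
        return None
    lower, upper, integrand, variable = match.groups()
    return f"Integrate[{integrand}, {{{variable}, {lower}, {upper}}}]"


variants = [
    r"\int_a^b x dx",
    r"\int_a^b x \mathrm{d}x",
    r"\int_a^b x\, dx",
    r"\int_a^b x\; dx",
    r"\int_a^b x\, \mathrm{d}x",
    r"\int_a^b \frac{dx}{x}",
]
for variant in variants:
    print(variant, "->", import_integral(variant))
```

A pattern-based normalizer like this handles spacing variants, but it knows nothing about the meaning of the symbols it moves around.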
While the simple cases shown in Table 1.2 could be solved with a more comprehensive and flexible
parser and mapping strategy, such a solution would ignore the real challenge of translating
mathematics to CAS: ambiguity. The interpretation of the majority of mathematical expressions
is context-dependent, i.e., the same formula may refer to different concepts in different
contexts. Take the expression π(x + y) as an example. In number theory, the expression most
likely refers to the number of primes less than or equal to x + y. In another context, however,
it may just refer to a multiplication πx + πy. Without considering the context, an appropriate
translation of this ambiguous expression is infeasible. Today’s translation solutions, however,
do not consider the context of an input. Instead, they translate the expression based on internal
decisions, which are often not transparent to a user.
Table 1.3 shows the results of importing π(x + y) to different CAS. Each CAS in Table 1.3
interprets π as a function call but does not associate it with the prime counting function (nor
any other predefined function). Only SnuggleTeX translated π as the mathematical constant
to Maxima syntax. However, Maxima does not contain a prime counting function. The CAS
import functions consider the expression as a generic function with the name π. Surprisingly,
Mathematica still links π with the mathematical constant, which results in peculiar behaviour
for numeric evaluations. The expression N[Pi[x+y]] (numeric evaluation of the imported
expression) is evaluated to 3.14159[x + y]. Associating the variables x and y with numbers,
say x, y = 1, would result in the rather odd expression 3.14159[2].
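The gap between the two meaningful readings of π(x + y) becomes obvious once both are evaluated. The following Python sketch uses only the standard library; `prime_count` and `times_pi` are hypothetical helper names, and the sieve is a plain textbook implementation chosen for brevity rather than speed.

```python
import math


def prime_count(x):
    """Number of primes less than or equal to x (prime counting function)."""
    n = math.floor(x)
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, math.isqrt(n) + 1):
        if sieve[i]:
            # Mark all multiples of i, starting at i*i, as composite.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return sum(sieve)


def times_pi(x):
    """The multiplication reading: the constant pi times x."""
    return math.pi * x


x, y = 1, 1
print("prime counting reading:", prime_count(x + y))
print("multiplication reading:", times_pi(x + y))
```

Neither result resembles the expression 3.14159[2] discussed above, which corresponds to no meaningful reading at all.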


Table 1.3: The results of importing π(x + y) in different CAS. For Maple, a MathML
representation was used. Content MathML was not tested, since there is no content dictionary
available that defines the prime counting function. SnuggleTeX translated the expression to
the CAS Maxima. The two rightmost columns show the expected expressions in the context
of the prime counting function or a multiplication. None of the CAS chooses either of the two
expected interpretations. Note that the prime counting function in Maple can also be written
with pi(x+y) and requires pre-loading the extra package NumberTheory. Nonetheless, this
function pi(x+y) is still different to the actual imported expression Pi(x+y). Note further
that Maxima does not define a prime counting function.

System                       Translated Expression   Expected: Number of Primes   Expected: Multiplication
Maple [36] v.2019            Pi(x+y)                 PrimeCounting(x+y)           Pi*(x+y)
Mathematica [393] v.12.1.1   Pi[x+y]                 NPrimes[x+y]                 Pi*(x+y)
SymPy [252] v.1.8            pi(x+y)                 primepi(x+y)                 pi*(x+y)
SnuggleTeX [251] v.1.2.2     %pi*(x+y)               -                            %pi*(x+y)

Why do existing translation techniques not allow the user to specify a context? Mainly because it
is an open research question what this context is or needs to be. The exact information
needs for performing translations to CAS syntaxes, and where to find them, are unclear [11]. Some
required information is indeed encoded in the structure of the expression itself. Consider a
simple fraction 1/2. This expression is context-independent and can be directly translated. The
expression P_n^{(α,β)}(x) in the context of OPSF is also often unambiguous for general-purpose
CAS. Since Mathematica supports no other formula with this presentational structure, i.e.,
P followed by a subscript and a superscript with parentheses, Mathematica is able to correctly
associate P_•^{(•,•)}(•), where • are wildcards, with the function JacobiP. In other cases, the
immediate textual context of the formula provides sufficient information to disambiguate the
expression [54, 329]. Suppose an author explicitly declares π(x) as the prime counting function
right before she uses it with π(x + y). In this case, it might be sufficient to scan the surrounding
context for key phrases [183, 214, 329], like ‘prime counting function’, in order to map π to, for
instance, NPrimes in Mathematica.
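The key-phrase idea can be sketched in a few lines. The Python example below is a deliberately naive illustration: the phrase list, the fallback rule, and the function name `translate_pi_call` are assumptions made for this sketch, and the target strings follow the SymPy column of Table 1.3.

```python
# Hypothetical phrase-to-template rules; the target strings follow the
# SymPy column of Table 1.3.
KEY_PHRASES = {
    "prime counting function": "primepi({arg})",
    "number of primes": "primepi({arg})",
}
# Assumption: without any declaration, pi is read as the mathematical constant.
FALLBACK = "pi*({arg})"


def translate_pi_call(arg, context):
    """Choose a translation of pi(arg) based on phrases in the surrounding text."""
    text = " ".join(context.lower().split())
    for phrase, template in KEY_PHRASES.items():
        if phrase in text:
            return template.format(arg=arg)
    return FALLBACK.format(arg=arg)


number_theory = "Let π(x) denote the prime counting function. Then π(x + y) ..."
geometry = "The perimeter of the two circles is π(x + y)."
print(translate_pi_call("x+y", number_theory))
print(translate_pi_call("x+y", geometry))
```

Plain substring matching breaks down as soon as the declaration is phrased differently, placed far away from the formula, or missing altogether.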
Often, the semantic explanations of mathematical objects in an article are scattered around in
the context or absent entirely [394]. An interested reader needs to retrieve sufficient seman-
tic explanations and correctly link them with mathematical objects in order to comprehend
the meaning of a complex formula. Sometimes, an author presumes the interpretation of an
expression can be considered as common knowledge and, therefore, does not require further
explanations. Suppose π(x + y) refers to a multiplication between π and (x + y). In general,
an author may consider π (the mathematical constant) as common knowledge and does not
explicitly declare its meaning. The same could be true for scientific articles, where the length is
often limited. An article about prime numbers probably does not explicitly declare the meaning of
π(x + y) because the author presumes the semantics are unambiguous given the overall context
of the article.


In other cases, the information needs go beyond a simple text analysis. Consider π(x + y)
as a generic function that was previously defined in the article and simply has no name. An
appropriate translation would require retrieving the definition of the function from the context.
But even if a function is well-known and supported by a CAS, a direct translation might be
inappropriate because the definition in the CAS is not what our researcher expected [3, 13].
Legendre’s incomplete elliptic integral of the first kind F (φ, k), for example, is defined with
the amplitude φ as its first argument in the DLMF and Mathematica. In Maple, however, one
needs to use the sine of the amplitude sin(φ) for the first argument4. In turn, an appropriate
translation to Maple might be EllipticF(sin(phi), k) rather than EllipticF(phi, k)
depending on the source of the original expression. The English Wikipedia article about elliptic
integrals5 contains both versions and refers to them with F (φ, k) and F (x; k) respectively.
Even though both versions in Wikipedia refer to the same function, correct translations to
Maple of F (φ, k) and F (x; k) are not the same.
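The two argument conventions can be compared numerically. The following Python sketch integrates the standard integral definitions of F(φ, k) (amplitude form) and F(x; k) (sine-of-amplitude form) with a simple composite Simpson rule. All function names are hypothetical, and the sketch assumes 0 ≤ x < 1 and 0 ≤ k < 1 so that both integrands stay finite.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def elliptic_f_amplitude(phi, k):
    """F(phi, k): integral of 1/sqrt(1 - k^2 sin^2 t) for t from 0 to phi."""
    return simpson(lambda t: 1 / math.sqrt(1 - (k * math.sin(t)) ** 2), 0.0, phi)


def elliptic_f_sine(x, k):
    """F(x; k): integral of 1/sqrt((1 - t^2)(1 - k^2 t^2)) for t from 0 to x."""
    return simpson(lambda t: 1 / math.sqrt((1 - t * t) * (1 - (k * t) ** 2)), 0.0, x)


phi, k = 0.8, 0.5
print("F(phi, k)       =", elliptic_f_amplitude(phi, k))
print("F(sin(phi); k)  =", elliptic_f_sine(math.sin(phi), k))
print("F(phi; k) wrong =", elliptic_f_sine(phi, k))
```

Passing the amplitude where its sine is expected still produces a plausible-looking number, which is why such a mistake is easy to overlook.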
In cases of multi-valued functions, translations between different systems can become
eminently more complex [83, 91, 172]. Even for simple cases, such as the arccotangent
function arccot(x), the behavior of different CAS might be confusing. For example, since
arccot(x) is multi-valued, there are multiple solutions of arccot(−1). CAS, like any general
calculator, only compute values on the principal branches and, therefore, return only a single
value. The principal branches, however, are not necessarily uniformly positioned among
multiple systems [84, 172]. In turn, the returned value of a multi-valued function may depend
on the system, see Table 1.4. A translation of arccot(x) from the DLMF to arccot(x) in Maple
would be only correct for x > 0. Finally, CAS may also compute irrational looking expressions
without objections, e.g., arccot(1/0) returns 1.5708 in MATLAB6. Even for field experts, it can
be challenging to keep track of every property and characteristic of CAS [20, 100].

Table 1.4: Different computation results for arccot(−1) (inspired by [84]).

System or Source             arccot(−1)
[276] 1st printing           3π/4
[276] 9th printing           −π/4
Maple [36] v.2020.2          3π/4
Mathematica [393] v.12.1.1   −π/4
SymPy [252] v.1.5.1          −π/4
Axiom [173] v.Aug.2014       3π/4
Reduce [151] v.5865          3π/4
MATLAB [246] v.R2021a        −π/4
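Both values in Table 1.4 follow from two common ways of defining the principal branch of the arccotangent. The Python sketch below implements both with the standard library; the function names are hypothetical, and no claim is made about how any particular CAS implements the function internally.

```python
import math


def arccot_continuous(x):
    """Branch with values in (0, pi); continuous at x = 0."""
    return math.pi / 2 - math.atan(x)


def arccot_reciprocal(x):
    """Branch defined as arctan(1/x) with values in (-pi/2, pi/2]; requires x != 0."""
    return math.atan(1 / x)


for x in (-1.0, 2.0):
    print(x, arccot_continuous(x), arccot_reciprocal(x))

print("3*pi/4 =", 3 * math.pi / 4, " -pi/4 =", -math.pi / 4)
```

The two definitions agree for positive arguments and differ by π for negative ones, so a translation that silently switches between them only fails on part of the domain.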

Problem: Existing LATEX to CAS converters are context-agnostic, inflexible, limited
to simple expressions, and nontransparent.

In combination, all of the issues underline that an accurate manual translation to the syntax of
CAS is challenging, time-consuming, error-prone, and requires deep and substantial knowledge
about the target system. Especially with the increasing complexity of the translated expressions,
errors during the translation process might be inevitable. Real-world scenarios often include
4 https://www.maplesoft.com/support/help/maple/view.aspx?path=EllipticF [accessed 2021-10-01]
5 https://en.wikipedia.org/wiki/Elliptic_integral [accessed 2021-10-01]
6 MATLAB evaluates 1/0 to infinity and the limit in positive infinity of the arccotangent function is π/2 (or roughly 1.5708). Yet, the interpretation of the division by zero is not wrong, since it follows the official IEEE 754 standard for floating-point arithmetic [170].


much more complicated formulae compared to the expressions in Table 1.2 or even equation (1.1).
Moreover, if an error occurs, the cause of the error can be very challenging to detect and trace
back to its origin. The issue of translating arccot(x) to Maple, for example, may remain
undiscovered until a user calculates negative values. If the function is embedded into a more
complex equation, even experts can lose track of potential issues. In combination with unreliable
translation tools, working with CAS may even be frustrating. Mathematica, for example, is able
to import our test expression (1.1) mentioned earlier without throwing an error7. However,
investigating the imported expression reveals an incorrect translation due to an issue with
factorials. To productively work with CAS, our hypothetical researcher from above needs to
carefully evaluate if the automatically imported expression was correct. As a consequence,
existing translation approaches are not practically useful.
In this thesis, I will focus on discovering the information needs to perform correct translations
from presentational formats, here mainly LATEX, to computational formats, here mainly CAS
syntaxes. My personal motivation is to improve the workflow of researchers by providing them with a
reliable translation tool that offers crucial additional information about the translation process.
Further, I limit the support of such a translation tool to general-purpose CAS, since many
general mathematical expressions simply cannot be translated to appropriate CAS expressions
for task-specific CAS (or other mathematical software, such as theorem provers). The focus on
general-purpose CAS allows me to provide a broad solution to a general audience. Note further
that, in this thesis, I mostly focus on the two major CAS Maple and Mathematica. However,
the goal is to provide a translation tool that is easy to extend to support more CAS.
Further, the real-world applications of such a translation tool go far beyond an improved work-
flow with CAS. A computable formula can be automatically verified with CAS [51, 52, 2,
8, 13, 153, 184, 414, 415], translated to other semantically enhanced formats, such as Open-
Math [53, 57, 119, 152, 303, 361], content MathML [59, 60, 159, 270, 318, 342] or other CAS
syntaxes [110, 361], imported into theorem provers [35, 57, 152, 163, 338, 375], or embedded in
interactive documents [85, 131, 150, 162, 201, 284]. Since an appropriate translation is generally
context-dependent, a translator must use MathIR [141] techniques to access sufficient semantic
information. Hence, advances in translating LATEX to CAS syntaxes also contribute directly
towards related MathIR tasks, including entity linking [150, 208, 212, 316, 319, 321, 322], math
search engines [92, 181, 182, 203, 211, 236, 274], semantic tagging of math formulae [71, 402],
recommendation systems [30, 31, 50, 319], type assistance systems [103, 106, 14, 321, 400], and
even plagiarism detection platforms [253, 254, 334].

1.2 Research Gap


Existing translation approaches from presentational formats to computable formats share the
same issues. Currently, these translation approaches are
1. context-independent, i.e., a translation of an expression is unique regardless of the context
from which the expression came (see the π(x + y) example mentioned earlier);
2. nontransparent, i.e., the internal translation decisions are not communicated to the user,
which makes the translation untrustworthy and errors challenging to trace or detect;
7 If the binomial is given with the \binom macro rather than \choose.


3. inflexible, i.e., slight changes in the notation can cause the translation to fail (see the
integral imports from Table 1.2); and
4. limited to simple expressions due to missing mappings between function definition sources,
i.e., even with semantic information, a translation often fails.
Issue 4 arises from the fact that there are semantically enhanced data formats that have been
specifically developed to make expressions between CAS interchangeable, such as OpenMath
[119, 303, 361] and content MathML [318, 343]. Nonetheless, most CAS do not support
OpenMath natively [303] and the support for content MathML is limited to school mathematics
[318]. The reason is that such a translation requires a database that maps functions between
different semantic sources. As discussed above, creating such a comprehensive database can be
time-consuming due to slight differences between the systems (e.g., positions of branch cuts,
different supported domains, etc.) [361]. Hence, for economic reasons, crafting and maintaining
such a library is unreasonable. Translations between semantically enhanced formats, e.g., between
CAS syntaxes, OpenMath, or content MathML, are consequently often unreliable.
In previous research, I focused on issues 2-4 by developing a rule-based LATEX to
CAS translator, called LACAST. Originally, LACAST performed translations from semantic LATEX to
Maple. Relying on semantic LATEX allows LACAST to largely ignore the ambiguity issue 1 and
focus on the other problems. For this thesis, I continued to develop LACAST to further mitigate
the limitation and inflexibility issues 3 and 4. Further, I focused on extending LACAST to become
the first context-aware translator to tackle the context-independency issue 1.

1.3 Research Objective


This doctoral thesis aims to:

Research Objective
Develop and evaluate an automated context-sensitive process that makes presentational
mathematical expressions computable via computer algebra systems.

Hereafter, I consider the semantic information of a mathematical expression as sufficient if a
translation of the expression into the syntax of a CAS becomes feasible. To achieve the research
objective, I define the following five research tasks:

Research Tasks
I Analyze the strengths and weaknesses of existing semantification approaches for
translating mathematical expressions to computable formats.
II Develop a semantification process that will improve on the weaknesses of current
approaches.
III Implement a system for the automated semantification of mathematical expressions
in scientific documents.
IV Implement an extension of the system to provide translations to computer algebra
systems.
V Evaluate the effectiveness of the developed semantification and translation system.


1.4 Thesis Outline


Chapter 1 provides an introduction for translating presentational mathematical expressions
into computable formats. The chapter further defines the research gap for such translations and
defines the research objective and tasks this thesis addresses. Finally, it outlines the structure
of the thesis and briefly summarizes the main publications.
Chapter 2 provides an overview of related work by examining existing mathematical formats
and translation approaches between them. This chapter focuses on Research Task I by ana-
lyzing the strengths and weaknesses of existing translation approaches with the main focus on
the standard formats LATEX and MathML.
Chapter 3 addresses Research Task II by studying the capability of math embeddings, intro-
ducing a new concept to describe the nested structure of mathematical objects, and presenting
a novel context-sensitive semantification process for LATEX expressions.
Chapter 4 presents the first context-sensitive LATEX to CAS translator: LACAST. In particular, this
chapter focuses on Research Tasks III and IV by implementing the previously introduced
semantification process and integrating it into the rule-based semantic LATEX to CAS translator
LACAST. In addition, the chapter briefly summarizes a context-independent neural machine
translation approach to estimate how much structural information is encoded in mathematical
expressions.
Chapter 5 evaluates the new translation tool LACAST and, therefore, contributes mainly towards
Research Task V. In particular, it introduces the novel evaluation concept of equation veri-
fications to estimate the appropriateness of translated CAS expressions. Our new evaluation
concept not only detects issues in the translation pipeline but is also able to identify errors
in the source equation, e.g., from the DLMF or Wikipedia, and the target CAS, e.g., Maple or
Mathematica. In order to maximize the number of verifiable DLMF equations via our novel eval-
uation technique, this chapter also introduces some heuristic extensions to the LACAST pipeline.
Hence, this chapter partially contributes to Research Task IV too.
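The general idea of checking a translated equation numerically can be sketched as follows. This Python example is only a minimal illustration under simplifying assumptions (real-valued test points, a fixed tolerance, one free variable) and is not the evaluation pipeline described in Chapter 5; the name `verify_numerically` is hypothetical. It tests the relation arccot(x) = arctan(1/x), with arccot taken on the branch with values in (0, π).

```python
import math


def verify_numerically(lhs, rhs, test_values, tolerance=1e-9):
    """Return the test values at which lhs and rhs differ by more than tolerance."""
    failures = []
    for x in test_values:
        try:
            difference = abs(lhs(x) - rhs(x))
        except (ValueError, ZeroDivisionError, OverflowError):
            # Points outside the domain of either side are skipped.
            continue
        if difference > tolerance:
            failures.append(x)
    return failures


def source_side(x):
    # arccot on the branch with values in (0, pi).
    return math.pi / 2 - math.atan(x)


def target_side(x):
    # The same function written as arctan(1/x).
    return math.atan(1 / x)


failures = verify_numerically(source_side, target_side, [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print("failing test values:", failures)
```

An empty list does not prove that an equation holds; it only means that no counterexample was found among the chosen test values. A non-empty list, in turn, can point to an issue in the translation, in the source equation, or in the target system.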
Chapter 6 concludes the thesis by summarizing contributions and their impact on the MathIR
community. It further provides a brief overview of the remaining issues and future work.
An Appendix is available in the electronic supplementary material and provides additional
information about certain aspects of this thesis including an extended error analysis, result
tables, and a summary of bugs and issues we discovered with the help of LACAST in the DLMF,
Maple, Mathematica, and Wikipedia.

1.4.1 Publications
Most parts of this thesis were published in international peer-reviewed conferences and journals.
Table 1.5 provides an overview of the publications that are reused in this thesis. The first column
identifies the chapter a publication contributed to. The venue rating was taken from the Core
ranking8 for conferences and the Scimago Journal Rank (SJR)9 for journal articles. Each rank
8 http://portal.core.edu.au/conf-ranks/ with the ranks: A* – flagship conference (top 5%), A – excellent conference (top 15%), B – good conference (top 27%), and C – remaining conferences [accessed 2021-10-01].
9 https://www.scimagojr.com/ with the ranks Q1 – Q4 where Q1 refers to the best 25% of journals in the field, Q2 to the second best quarter, and so on [accessed 2021-10-01].


was retrieved for the year of publication (or year of submission, in case the paper has not been
published yet). Table 1.6 similarly shows publications that partially contributed towards the goal
of this thesis but are not reused within a chapter. Note that the publication [3] (in Table 1.6) was
part of my Master’s thesis and contributed towards this doctoral thesis as a preliminary project.
The journal publication [13] (also in Table 1.6) is an extended and (with new results) updated
version of the thesis and the mentioned article [3]. The venue abbreviations in both tables are
explained in the glossary. Lastly, note that the TPAMI journal [11] is reused in Chapter 4 (for
the methodology) and in Chapter 5 (for the evaluation) to provide a coherent structure. My
publications, talks, and submissions are separated from the general bibliography in the back
matter and can be found on page 171.

Table 1.5: Overview of the primary publications in this thesis.

Ch.  Venue           Year  Type        Length  Author Position  Venue Rating  Ref.
2    SIGIR           2019  Workshop    Full    1 of 6           Core A*       [9]
2    JCDL            2018  Conference  Full    2 of 6           Core A*       [18]
3    Scientometrics  2020  Journal     Full    1 of 7           SJR Q1        [15]
3    WWW             2020  Conference  Full    1 of 7           Core A*       [14]
3    ICMS            2020  Conference  Full    1 of 4           n/a           [10]
4    TPAMI10         2021  Journal     Full    1 of 6           SJR Q1        [11]
5    TACAS           2021  Conference  Full    1 of 8           Core A        [8]
5    CICM            2018  Conference  Full    2 of 3           n/a           [2]
6    JCDL            2020  Conference  Poster  2 of 5           Core A*       [17]

Table 1.6: Overview of secondary publications that partially contributed to this thesis.

Year  Venue  Type        Length  Author Position  Venue Rating  Ref.
2020  CLEF   Workshop    Full    4 of 6           n/a           [16]
2020  EMNLP  Workshop    Full    2 of 4           Core A        [1]
2019  AJIM   Journal     Full    1 of 4           SJR Q1        [13]
2018  CICM   Conference  Short   1 of 4           n/a           [12]
2017  CICM   Conference  Full    4 of 9           n/a           [3]

1.4.2 Research Path


This section provides a brief overview of my research path that led to this thesis, i.e., it discusses
the primary publications and the motivations behind them. Every publication is marked with
the associated chapter and a reference. This research path is logically (not chronologically)
divided into three sections: preliminary work, the semantification of LATEX, and the evaluation
of translations.

Preliminary Work I had the first contact with the problem of translating LATEX to CAS
syntaxes during my undergraduate studies in mathematics. During that time, I regularly used
10 The methodology part of this journal is reused in Chapter 4 while the evaluation part is reused in Chapter 5.


CAS like MATLAB and SymPy for numeric simulations and for plotting results. At the same
time, we were required to hand in our homework as LATEX files. While exporting content from the
CAS to LATEX files was rather straightforward, the other way around, i.e., importing LATEX into
the CAS, required manual conversions. I decided to explore the reasons for this shortcoming in
my Master’s thesis. During that time, I developed the first version of a semantic LATEX to CAS
translator, which was later coined LACAST11 . The results from this first study were published at
the Conference of Intelligent Computer Mathematics (CICM) in 2017.

“Semantic Preserving Bijective Mappings of Mathematical Formulae Between
Document Preparation Systems and Computer Algebra Systems” by Howard
S. Cohl, Moritz Schubotz, Abdou Youssef, André Greiner-Petter, Jürgen
Gerhard, Bonita Saunders, Marjorie McClain, Joon Bang, and Kevin Chen. In:
Proceedings of the International Conference of Intelligent Computer Mathematics
(CICM), 2017.
Not Reused - [3]

This first version of LACAST focused specifically on the CAS Maple but was designed modularly
to allow later extensions to other CAS. The main limitation of LACAST, however, was the re-
quirement of using semantic LATEX macros to disambiguate mathematical expressions manually.
An automatic disambiguation process did not exist at the time. Moreover, only a few previous
projects focused on a semantification for translating mathematical formats. Hence, I continued
my research in this direction.
In the subsequent parts of this thesis, I will use ‘we’ rather than ‘I’, since none of the
presented contributions would have been possible without the tremendous help of, and fruitful
discussions with, advisors, colleagues, students, and friends.

Semantification of LATEX As an alternative to semantic LATEX, we first closely investigated
existing converters for MathML (see Section 2.2.1). Since MathML was (and still is) the standard
encoding for mathematical expressions on the web, most CAS support MathML. MathML uses
two markups, presentation and content MathML. The former visualizes a formula, while the
latter describes the semantic content. Hence, content MathML can disambiguate math much
like semantic LATEX. Since MathML is the official web standard and LATEX the de facto standard
for writing math, there are numerous converters available that translate LATEX to MathML.
As our first contribution, we developed MathMLben, a benchmark dataset for measuring the
quality of MathML markup that appears in a textual context. With this benchmark, we evaluated
nine state-of-the-art LATEX to MathML converters, including Mathematica as a major CAS. We
published our results at the Joint Conference on Digital Libraries (JCDL) in 2018.

“Improving the Representation and Conversion of Mathematical Formulae by
Considering their Textual Context” by Moritz Schubotz, André Greiner-Petter,
Philipp Scharpf, Norman Meuschke, Howard S. Cohl, and Bela Gipp.
In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
(JCDL), 2018.

Chapter 2 — [18]

11 LaTeX to CAS Translator.


We discovered that three of the nine tools were able to generate content MathML but with
insufficient accuracy. None of the available tools were capable of analyzing the context of a
given formula. Hence, the converters were unable to infer the correct semantic information
for most of the symbols and functions. In our study, we proposed a manual semantification
approach that semantically enriches the translation process of existing converters by feeding
them semantic information from the surrounding context of a formula. We illustrated the
enrichment process manually with the converter LaTeXML, which allowed us to add custom semantic
macros to improve the generated MathML data. In fact, we used this manual approach to create
the entries of MathMLben in the first place.
Naturally, our next goal was to automatically retrieve semantic information from the context
of a given formula. Around this time, word embeddings [256] began to gain interest in the
MathIR community [121, 215, 242, 400, 404]. It seemed that vector representations were able to
capture some semantic properties of tokens in natural languages. Can we create such semantic
vector representations of mathematical expressions too? Unfortunately, we discovered that
the related work in this new area of interest did not discuss a crucial underlying issue with
embedding mathematical expressions. In math expressions, certain symbols or entire groups of
tokens are fixed, such as the red tokens in the Gamma function Γ(x) or the Jacobi polynomial
$P_n^{(\alpha,\beta)}(x)$, while others may vary (gray). Inspired by words in natural languages, we call these
fixed tokens the stem of a mathematical object or operation. Unfortunately, in mathematics, this
stem is context-dependent. If π is a function, its stem in π(x + y) is the symbol π together with
the parentheses. However, if π is not a function, the stem in π(x + y) is just the symbol π itself.
If we do not know the stem of a mathematical object, how can we group them so that a trained
model understands the connection between variations like Γ(z) and Γ(x)? The answer is: we
cannot. The only alternative is to use context-independent representations, e.g., we only embed
the identifiers or the entire expression. Each of these approaches has advantages and
disadvantages. We shared our discussion with the community at the BIRNDL Workshop at the
conference on Research and Development in Information Retrieval (SIGIR) in 2019.
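
The context dependence of the stem can be illustrated with a minimal sketch. It assumes SymPy as the CAS, and the variable names are illustrative only and not taken from LACAST. The same presentation string π(x + y) is built twice: once with π as the constant in a product and once with π as a function symbol applied to x + y.

```python
import sympy as sp

x, y = sp.symbols("x y")

# Reading 1: pi is the constant, so pi(x + y) is a product.
# The stem is just the symbol pi.
as_constant = sp.pi * (x + y)

# Reading 2: pi is a function applied to x + y.
# The stem is the function symbol together with its argument brackets.
pi_function = sp.Function("pi")  # hypothetical, undefined function named "pi"
as_function = pi_function(x + y)

# The two readings are different objects for the CAS.
print(as_constant, "|", as_function)
print(as_constant == as_function)

# The first reading evaluates to a number once x and y are fixed;
# the second stays an unevaluated application of the unknown function.
print(as_constant.subs({x: 1, y: 1}).evalf())
print(as_function.subs({x: 1, y: 1}))
```

Both readings stem from the same visible tokens, so a translator has to decide between them from the surrounding context rather than from the formula alone.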

“Why Machines Cannot Learn Mathematics, Yet” by André Greiner-Petter,
Terry Ruas, Moritz Schubotz, Akiko Aizawa, William I. Grosky, and Bela
Gipp. In: Proceedings of the 4th Joint Workshop on Bibliometric-Enhanced
Information Retrieval and Natural Language Processing for Digital Libraries
(BIRNDL@SIGIR), 2019.

Chapter 2 — [9]

Nonetheless, context-independent math embeddings still have many valuable applications.
Search engines, for example, can profit from a vector representation that represents a mathe-
matical expression in a particular context. Such a trained model would still be unable to tell us
what the expression is, but it can tell us efficiently if the expression is semantically similar (e.g.,
because the surrounding text is similar) to another expression. Further, embedding semantic
LATEX allows us to overcome the issue of unknown stems for most functions since the macro
unambiguously defines the stem. Youssef and Miller [404] trained such a model on the DLMF
formulae. Later, we published an extended version of our workshop paper together with Youssef
and Miller in the Scientometrics journal.


“Math-Word Embedding in Math Search and Semantic Extraction” by André
Greiner-Petter, Abdou Youssef, Terry Ruas, Bruce R. Miller, Moritz
Schubotz, Akiko Aizawa, and Bela Gipp. In: Scientometrics 125(3): 3017-3046,
2020.

Chapter 3 — [15]

Unfortunately, this sets us back to the beginning, where we need manually crafted semantic
LATEX. We started to investigate the issue of interpreting the semantics of mathematical expres-
sions from a different perspective. As we will see later in Section 2.2.4, humans tend to visualize
mathematical expressions in a tree structure, where operators, functions, or relations are parent
nodes of their components. Identifiers and other terminal symbols are the leaves of these trees.
The MathML tree data structure comes close to these so-called expression trees (see Section 2.2.4)
but does not strictly follow the same idea [331]. The two aforementioned context-independent
approaches to embed mathematical expressions take either the leaves or the roots of such trees.
The subtrees in between are the context-dependent mathematical objects we need. Not all
subtrees, however, are meaningful, and the mentioned expression trees are only theoretical
interpretations. In searching for an approach to discover meaningful subexpressions, which we
call Mathematical Objects of Interest (MOI), we performed the first large-scale study of
mathematical notations in real-world scientific articles. In this study, we followed the assumption
that every subexpression with at least one identifier can be semantically important. Hence, we
split every formula into its MathML subtrees and analyzed their frequency in the corpora.
Overall, we analyzed over 2.5 billion subexpressions in 300 million documents and showed
that the frequency distribution of mathematical subexpressions is similar to that of words in
natural language corpora. By applying known frequency-based ranking functions, such as BM25, we
were also able to discover topic-relevant notations. We published these results at The Web
Conference (WWW) in 2020.
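
The counting and ranking idea can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline used in the study: nested tuples stand in for MathML trees, alphabetic leaves are treated as identifiers, and the score is the textbook BM25 term weight with the common default parameters k1 = 1.2 and b = 0.75. All function names are hypothetical.

```python
import math
from collections import Counter

# An expression tree is either a leaf (a string) or a tuple
# (operator, child_1, ..., child_n). This toy encoding stands in for MathML.


def subexpressions(tree):
    """Yield every subtree that contains at least one identifier."""
    if isinstance(tree, str):
        if tree.isalpha():  # identifiers are alphabetic leaves, numbers are not
            yield tree
        return
    has_identifier = False
    for child in tree[1:]:
        for sub in subexpressions(child):
            has_identifier = True
            yield sub
    if has_identifier:
        yield tree


def count_subexpressions(formulae):
    """Count how often each subexpression occurs in one document."""
    counts = Counter()
    for formula in formulae:
        counts.update(subexpressions(formula))
    return counts


def bm25_scores(doc_counts, corpus_counts, k1=1.2, b=0.75):
    """Score each subexpression of one document with a textbook BM25 weight."""
    n_docs = len(corpus_counts)
    avg_len = sum(sum(c.values()) for c in corpus_counts) / n_docs
    doc_len = sum(doc_counts.values())
    scores = {}
    for term, tf in doc_counts.items():
        df = sum(1 for c in corpus_counts if term in c)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        scores[term] = idf * tf * (k1 + 1) / norm
    return scores


corpus = [
    [("Gamma", "z"), ("+", ("Gamma", "z"), "1")],  # document A
    [("+", "x", "y")],                              # document B
    [("^", "x", "2")],                              # document C
]
corpus_counts = [count_subexpressions(doc) for doc in corpus]
scores_b = bm25_scores(corpus_counts[1], corpus_counts)
for term, score in sorted(scores_b.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {term}")
```

In document B, the identifier x also occurs in document C and therefore receives a lower weight than y, which is the behavior that makes such a ranking useful for finding notations that are specific to a topic.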

“Discovering Mathematical Objects of Interest — A Study of Mathematical
Notations” by André Greiner-Petter, Moritz Schubotz, Fabien Müller, Corinna
Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp. In: Proceedings of
the Web Conference (WWW), 2020.

Chapter 3 — [14]

The applications that we derived from simply counting mathematical notations were surpris-
ingly versatile. For example, with the large set of indexed math notations, we implemented the
first type assistant system for math equations, developed a new faceted search engine for zb-
MATH, and enabled new approaches to measure potential plagiarism in equations. Besides these
practical applications, it also gave us the confidence to continue focusing on subexpressions for
our LATEX semantification. Previous projects that aimed to semantically enrich mathematical
expressions with information from the surrounding context primarily focused on one of the
earlier mentioned extremes, i.e., the leaves or roots in expression trees [139, 214, 279, 329, 330].
Our study also revealed that the majority of unique mathematical formulae are neither single
identifiers nor highly complex mathematical expressions. Hence, we concluded that we should


focus on semantically enriching subexpressions (subtrees) rather than the roots or leaves. We
proposed a novel context-sensitive translation approach based on semantically annotated MOI
and shared our theoretical concept with the community at the International Conference on
Mathematical Software (ICMS) in 2020.

“Making Presentation Math Computable: Proposing a Context Sensitive
Approach for Translating LaTeX to Computer Algebra Systems” by André
Greiner-Petter, Moritz Schubotz, Akiko Aizawa, and Bela Gipp. In:
Proceedings of the International Conference on Mathematical Software (ICMS),
2020.

Chapter 3 — [10]

Afterward, we started to realize the proposed pipeline with a specific focus on Wikipedia. We
focused on this encyclopedia for two reasons. First, Wikipedia is a free and community-driven
encyclopedia and, therefore, (a) less strict on writing styles and (b) more descriptive compared to
scientific articles. Second, Wikipedia can actively benefit from our contribution since additional
semantic information about mathematical formulae can support users of all experience levels
to read and comprehend articles more efficiently [150]. Moreover, a successful translation of
a formula in Wikipedia to a CAS makes the formula computable, which enables numerous
additional applications. In theory, a mathematical article could be turned into an interactive
document to some degree with our translations. However, the most valuable application of a
translation of formulae in Wikipedia would be the ability to check equations for their plausi-
bility. With the help of CAS, we are able to analyze if an equation is semantically correct or
suspicious. This evaluation would enable existing quality measures in Wikipedia to incorporate
mathematical equations for the first time. The results from our novel context-sensitive transla-
tor including the plausibility check algorithms have been accepted for publication in the IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI) journal and are currently
in press.

“Do the Math: Making Mathematics in Wikipedia Computable” by André
Greiner-Petter, Moritz Schubotz, Corinna Breitinger, Philipp Scharpf, Akiko
Aizawa, and Bela Gipp. In press: IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI), 2021.

Chapter 4 & 5 — [11]

Currently, we are also actively working on extending the backbone of Wikipedia itself for
presenting additional semantic information about mathematical expressions by hovering over
or clicking on the formula. This new feature helps Wikipedia users to better understand the
meaning of mathematical formulae by providing details on the elements of formulae. Moreover,
it paves the way towards an interface to actively interact with mathematical content in Wikipedia
articles. We presented our progress and discussed our plans in the poster session at the JCDL
in 2020.


“Mathematical Formulae in Wikimedia Projects 2020” by Moritz Schubotz, André
Greiner-Petter, Norman Meuschke, Olaf Teschke, and Bela Gipp. In: Poster
Session at the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2020.

Chapter 6 — [17]

Evaluating Digital Mathematical Libraries Alongside this main research path, we contin-
uously improved and extended LACAST with new features and new supported CAS. Our first goal
was to verify the translated, now computable, formulae in the DLMF. The primary motivation
behind this approach was to quantitatively measure the accuracy of LACAST translations. How
can we verify if a translation was correct? The well-established Bilingual Evaluation Understudy
(BLEU) [282] measure in natural language translations is not directly applicable for mathemati-
cal languages because an expression may contain entirely different tokens but is still equivalent
to the gold standard. Since the translation is computable, however, we can take advantage of
the power of CAS to verify a translation. The basic idea is that a human-verified equation in
one system must remain valid in the target system. If this is not the case, only three sources
of errors are possible: either the source equation, the translation, or the CAS verification was
incorrect. With the assumption that equations in the DLMF and major proprietary CAS are
mostly error-free, we can translate equations from the DLMF to discover issues within LACAST.
First, we focused on symbolic verifications, i.e., we used the CAS to symbolically simplify the
difference between left- and right-hand side of an equation. If the simplified difference is 0,
the CAS symbolically verified the equivalence of the left- and right-hand side and confirmed a
correct translation via LACAST. Additionally, we extended the verification approach to include
more precise numeric evaluations. If a symbolic manipulation failed to return 0, it could also
mean the CAS was unable to simplify the expression. To overcome this issue, we numerically
calculate the difference on specific test values and check if it is below a given threshold.
If all test calculations are below the threshold, we consider the equation numerically verified. Even
though this approach cannot verify equivalence, it is very effective in discovering disparity. We
published the first paper with this new verification approach based on Maple at the CICM in
2018.
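
The two verification steps described above can be sketched in a few lines. The sketch assumes SymPy as the CAS (the study itself started with Maple), and the function names, the test values, and the threshold of 1e-9 are illustrative choices rather than the settings used by LACAST.

```python
import sympy as sp


def verify_symbolically(lhs, rhs):
    """True if the CAS simplifies the difference lhs - rhs to exactly 0."""
    return sp.simplify(lhs - rhs) == 0


def verify_numerically(lhs, rhs, variables, test_points, threshold=1e-9):
    """True if |lhs - rhs| stays below the threshold at every test point."""
    difference = sp.lambdify(variables, lhs - rhs, modules="math")
    return all(abs(difference(*point)) < threshold for point in test_points)


def verify(lhs, rhs, variables, test_points):
    """Try the symbolic check first and fall back to the numeric one."""
    if verify_symbolically(lhs, rhs):
        return "symbolically verified"
    if verify_numerically(lhs, rhs, variables, test_points):
        return "numerically verified"
    return "failed"


x = sp.Symbol("x")
points = [(-1.5,), (0.5,), (2.0,)]  # arbitrary illustrative test values

# Correct translation of sin^2(x) + cos^2(x) = 1.
good = verify(sp.sin(x)**2 + sp.cos(x)**2, sp.Integer(1), [x], points)

# Faulty translation that misreads cos^2(x) as cos(x^2).
bad = verify(sp.sin(x)**2 + sp.cos(x**2), sp.Integer(1), [x], points)

print(good, "|", bad)
```

A "numerically verified" result only means that no test value exposed a difference, so it is weaker than a symbolic confirmation, whereas a "failed" result points to an error in the source equation, the translation, or the CAS.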

“Automated Symbolic and Numerical Testing of DLMF Formulae Using
Computer Algebra Systems” by Howard S. Cohl, André Greiner-Petter, and
Moritz Schubotz. In: Proceedings of the International Conference on Intelligent
Computer Mathematics (CICM), 2018.

Chapter 5 — [2]

The extension of the system and the new results led us to an extended journal version of the
initial LACAST publication [3]. This extended version mostly covered parts of my Master’s thesis
and is not reused in this thesis. For technical details about LACAST, see the journal publication [13].
In Appendix D available in the electronic supplementary material, we summarized all significant
issues and reported bugs we discovered via LACAST. The appendix also includes new issues that we


discovered during the work on the journal publication. This journal version was published in
the Aslib Journal of Information Management in 2019.

“Semantic preserving bijective mappings for expressions involving special
functions between computer algebra systems and document preparation systems”
by André Greiner-Petter, Howard S. Cohl, Moritz Schubotz, and Bela Gipp.
In: Aslib Journal of Information Management 71(3): 415-439, 2019.

Appendix D — [13]

It turned out that LACAST translations on semantic LATEX were so stable that we could also use
the same verification approach to search specifically for errors in the DLMF
and issues in CAS. To maximize the number of supported DLMF formulae, we implemented
additional heuristics in LACAST, such as logic to identify the end of a sum or to correctly
interpret prime notations as derivatives. Additionally, we added support for translations to
Mathematica and SymPy. We extended the support for Mathematica even further so that the
verifications we performed in Maple can also be performed in Mathematica. The Mathematica
support finally allows us to identify computational differences between two major proprietary CAS. Moreover, we extended
the previously introduced symbolic and numeric evaluation pipeline with more sophisticated
variable extraction algorithms, more comprehensive numeric test values, resolved substitutions,
and improved constraint-awareness. All discovered issues are summarized in Appendix D
available in the electronic supplementary material. We further made all translations of the
DLMF formulae publicly available, including the symbolic and numeric verification results. The
results of this recent study have been published at the international conference on Tools and
Algorithms for the Construction and Analysis of Systems (TACAS).

“Comparative Verification of the Digital Library of Mathematical Functions
and Computer Algebra Systems” by André Greiner-Petter, Howard S. Cohl,
Abdou Youssef, Moritz Schubotz, Avi Trost, Rajen Dey, Akiko Aizawa, and
Bela Gipp. In: Tools and Algorithms for the Construction and Analysis of
Systems (TACAS), 2022.

Chapter 5 — [8]

We also applied the same verification technique to the Wikipedia articles we mentioned ear-
lier, which enabled LACAST to symbolically and numerically verify even complex equations in
Wikipedia articles. This evaluation is also part of the TPAMI submission.

Discovering Diverse Content Through
Random Scribd Documents
while he talked, for it appeared that she had been reared in utter
ignorance of his writings, did not know that he had lived beneath
that very roof, nor that he lay buried in the church whose spire could
be seen from the mole. He waxed eloquent as he told her how the
gilded rank and fashion of London had found comfort in silence—
how Tom Moore, long since become one of its complacent satellites,
had read its wishes well: how he had stood in a locked room and
given the smug seal of his approbation while secret flame destroyed
the self-justification of a dead man’s name, the Memoirs which had
been a last bequest to a living daughter.
The shower passed, the sun came out rejoicing—still the master of
the Abbey talked. When he had finished he showed his listener a
portrait, painted by the American, Benjamin West. When she turned
from this, her face was oddly white; she was thinking of another
portrait hidden by a curtain, which had been one of the unsolved
mysteries of her childhood.
On her departure her host drove with her to Hucknall church, and
standing in the empty chancel she read the marble tablet set into
the wall:
IN THE VAULT BENEATH
LIE THE REMAINS OF

GEORGE GORDON, LORD BYRON

THE AUTHOR OF “CHILDE HAROLD’S PILGRIMAGE”.


HE WAS BORN IN LONDON ON THE 22nd OF
JANUARY, 1788.
HE DIED AT MISSOLONGHI IN WESTERN GREECE,
ON THE 19th OF APRIL, 1824,
ENGAGED IN THE GLORIOUS ATTEMPT TO
RESTORE THAT
COUNTRY TO HER ANCIENT FREEDOM AND
RENOWN.
HIS SISTER PLACED THIS TABLET TO HIS
MEMORY.
A long time the girl stood silent, her features quivering with some
strange emotion of reproach and pain. Behind her she heard her
escort’s voice. He was repeating lines from the book she had been
reading an hour before:

“My hopes of being remembered are entwined
With my land’s language: if too fond and far
These aspirations in their scope inclined—
If my fame should be, as my fortunes are,
Of hasty blight, and dull Oblivion bar
My name from out the temple where the dead
Are honored by the nations—let it be—
And light the laurels on a loftier head!
Meantime, I seek no sympathies, nor need;
The thorns which I have reaped are of the tree
I planted. They have torn me—and I bleed.

My task is done—my song hath ceased—my theme
Has died into an echo; it is fit
The spell should break of this protracted dream.
The torch shall be extinguished which hath lit
My midnight lamp—and what is writ, is writ.
Farewell! a word that must be, and hath been—
A sound which makes us linger;—yet—farewell
Ye! who have traced the Pilgrim to the scene
Which is his last, if in your memories dwell
A thought which once was his, if on ye swell
A single recollection, not in vain
He wore his sandal-shoon, and scallop-shell!”

Could he whose ashes lay beneath that recording stone have seen
the look on the girl’s face as she listened—could he have seen her
shrink that night from a woman’s contained kiss—he would have
known that his lips had been touched with prophecy when he said:
“The smiles of her youth have been her mother’s, but the tears of
her maturity shall be mine!”
TRANSCRIBER’S NOTES:
Obvious typographical errors have been corrected.
Inconsistencies in hyphenation have been
standardized.
Archaic or variant spelling has been retained.
*** END OF THE PROJECT GUTENBERG EBOOK THE CASTAWAY ***

Updated editions will replace the previous one—the old editions
will be renamed.

Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States
copyright in these works, so the Foundation (and you!) can copy
and distribute it in the United States without permission and
without paying copyright royalties. Special rules, set forth in the
General Terms of Use part of this license, apply to copying and
distributing Project Gutenberg™ electronic works to protect the
PROJECT GUTENBERG™ concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if
you charge for an eBook, except by following the terms of the
trademark license, including paying royalties for use of the
Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the
free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.

Section 1. General Terms of Use and
Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree
to abide by all the terms of this agreement, you must cease
using and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only
be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project Gutenberg™
works in compliance with the terms of this agreement for
keeping the Project Gutenberg™ name associated with the
work. You can easily comply with the terms of this agreement
by keeping this work in the same format with its attached full
Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.

1.E. Unless you have removed all references to Project
Gutenberg:

1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it,
give it away or re-use it under the terms of the Project
Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country
where you are located before using this eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is
derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of
the copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is
posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.

1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute
this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must,
at no additional cost, fee or expense to the user, provide a copy,
a means of exporting a copy, or a means of obtaining a copy
upon request, of the work in its original “Plain Vanilla ASCII” or
other form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or
providing access to or distributing Project Gutenberg™
electronic works provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project
Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite these
efforts, Project Gutenberg™ electronic works, and the medium
on which they may be stored, may contain “Defects,” such as,
but not limited to, incomplete, inaccurate or corrupt data,
transcription errors, a copyright or other intellectual property
infringement, a defective or damaged disk or other medium, a
computer virus, or computer codes that damage or cannot be
read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except
for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU AGREE
THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT
LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT
EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE
THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY
DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE
TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL,
PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE
NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you
discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person
or entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.

1.F.4. Except for the limited right of replacement or refund set
forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the
Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you
do or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.

Section 2. Information about the Mission of Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status
by the Internal Revenue Service. The Foundation’s EIN or
federal tax identification number is 64-6221541. Contributions
to the Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.

The Foundation’s business office is located at 809 North 1500
West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws
regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or determine
the status of compliance for any particular state visit
www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states
where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot
make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.

Please check the Project Gutenberg web pages for current
donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.

Project Gutenberg™ eBooks are often created from several
printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.
