Domain-Specific Languages in R
Advanced Statistical Programming
Thomas Mailund
Aarhus N, Staden København, Denmark
Table of Contents
Chapter 1: Introduction
  Who This Book Is For
  Domain-Specific Languages
Operator Overloading
Group Generics
Precedence and Evaluation Order
Code Blocks
References
Index
About the Author
Thomas Mailund is an associate professor in bioinformatics at Aarhus
University, Denmark. He has a background in math and computer science.
For the past decade, his main focus has been on genetics and evolutionary
studies, particularly comparative genomics, speciation, and gene flow
between emerging species. He has published Beginning Data Science in
R, Functional Programming in R, and Metaprogramming in R, all from
Apress, as well as other books.
About the Technical Reviewer
Colin Fay works for ThinkR, a French agency
focused on everything R-related.
During the day, he helps companies
take full advantage of the power of R by
providing training (from beginner to
expert), tools (packages, shiny apps…), and
infrastructure. His main areas of expertise
are software engineering, analytics, and data
visualization.
During the night, Colin is a hyperactive open source developer and an
open data advocate. You can find a lot of his work on his GitHub account
(https://github.com/ColinFay).
He is also active in the data science community in France, especially
in his home town Rennes, where he founded the data-blogging web site
Data-Bzh.fr, co-founded the Breizh Data Club association, and organizes
the Breizh Data Club Meetups.
You can learn more about Colin via his web site at https://colinfay.me,
and you can find him on Twitter at https://twitter.com/_ColinFay.
To learn more about ThinkR, please visit www.thinkr.fr/,
https://github.com/ThinkR-open, and https://twitter.com/Thinkr_FR.
CHAPTER 1
Introduction
This book introduces embedded domain-specific languages in R. The
term domain-specific language, or DSL, refers to a programming language
specialized for a particular purpose, as opposed to a general-purpose
programming language. Domain-specific languages ideally give you
a precise way of specifying tasks you want to do and goals you want to
achieve, within a specific context. Regular expressions are one example
of a domain-specific language, where you have a specialized notation
to express patterns of text. You can use this domain-specific language to
define text strings to search for or specify rules to modify text. Regular
expressions are often considered very hard to read, but they do provide
a useful language for describing text patterns. Another example of a
domain-specific language is SQL—a language specialized for extracting
from and modifying a relational database. With SQL, you have an
expressive domain-specific language in which you can specify rules as to
which data points in a database you want to access or modify.
Domain-Specific Languages
With domain-specific languages we often distinguish between “external”
and “embedded” languages. Regular expressions and SQL are typically
specified as strings when you use them in a program, and these strings
must be parsed and interpreted when your program runs. In a sense,
they are languages separated from the programming language you use
them in. They need to be compiled separately and then called by the
main programming language. They are therefore considered “external”
languages. In contrast, embedded domain-specific languages provide
domain-specific languages expressed in the general-purpose language
in which they are used. In R, the grammar of graphics implemented in
ggplot2 or the data transformation operations implemented in dplyr
provides small languages—domain-specific languages—that you can use
from within R, and you write the programs for these languages in R as well.
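The distinction can be illustrated with a small sketch in base R. The data and variable names are made up for illustration, and base R's subset stands in for dplyr-style verbs only so that the example needs no packages; it is not taken from the book.

```r
# External DSL: the regular expression is just a string as far as R's
# parser is concerned; the regex engine interprets it at run time.
words <- c("Aarhus", "data", "R2D2", "Rennes")
capitalized <- grepl("^[A-Z][a-z]+$", words)

# Embedded flavour: the filter condition is ordinary R syntax, parsed by
# R itself and evaluated in the context of the data frame's columns.
df <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))
selected <- subset(df, x > 2 & y < 10)
```

A typo inside the regular expression string only surfaces when grepl runs, whereas a syntax error in the subset condition is caught by R's parser before anything is evaluated.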
Embedded DSLs extend the programming language in which you are
working. They provide more than what you usually find in a framework
in the form of functions and classes as they offer a high level of flexibility
in what you can do with them. They are programming languages, after
all, and you can express complex ideas and tasks in them. They provide a
language for expressing thoughts in a specific domain, so they are not
general-purpose languages like the one you use them from,
but they do extend that surrounding language with new expressive power.
However, being embedded in the general-purpose language means that
they will follow the rules you are familiar with there—or mostly, at least,
then the product will be computed from left to right, like this:
For each multiplication, you produce a new matrix that will be used in
the next multiplication.
Now, matrix multiplication is associative, so you should be able to
set the parentheses in any way, as long as you respect the left-to-right
order of the matrices (matrix multiplication is not commutative, after
all), and you will get the same result. The running time, however, will
not be the same. We can do a small experiment to see this using the
microbenchmark package.
library(microbenchmark)
res <- microbenchmark(A %*% B %*% C %*% D,
                      ((A %*% B) %*% C) %*% D,
                      (A %*% (B %*% C)) %*% D,
                      (A %*% B) %*% (C %*% D),
                      A %*% (B %*% (C %*% D)),
                      A %*% ((B %*% C) %*% D))
options(microbenchmark.unit = "relative")
print(res, signif = 3, order = "mean")
## Unit: relative
##                     expr  min   lq mean median   uq  max neval
##  (A %*% B) %*% (C %*% D) 1.00 1.00 1.00   1.00 1.00 1.00   100
##  A %*% (B %*% (C %*% D)) 3.92 3.87 3.49   3.84 3.62 1.41   100
##      A %*% B %*% C %*% D 6.13 6.06 5.42   6.03 5.57 2.06   100
##  ((A %*% B) %*% C) %*% D 6.12 6.05 5.51   6.04 5.61 2.35   100
##  A %*% ((B %*% C) %*% D) 7.71 7.62 6.75   7.57 7.00 2.30   100
##  (A %*% (B %*% C)) %*% D 9.88 9.76 8.73   9.69 8.99 3.71   100
Here, I’ve computed the matrix product in the five different possible
ways. There are six expressions, but the first two compute the matrix
multiplication in the same order. With microbenchmark we evaluate each
expression 100 times and record the time each evaluation takes. I have
displayed the running times relative to the fastest expression, sorted by
the mean evaluation time.
On average, there is almost a factor of ten between the fastest and
the slowest evaluation (comparing the slowest run of each, the max column,
the difference shrinks to a factor of about four, which is still
substantial). There is something to be gained by setting parentheses
optimally if we multiply together several large matrices. The dimensions
of matrices are not necessarily known before runtime, however, so ideally
we want to set the parentheses optimally at the point where we evaluate
the expressions.
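A minimal sketch of why the parenthesization matters, assuming the usual cost model in which multiplying an n × k matrix by a k × m matrix takes n × k × m scalar multiplications. The dimensions below are hypothetical and deliberately small; the matrices used in the benchmark are not shown in this excerpt.

```r
# Hypothetical dimensions, chosen small so the sketch runs instantly.
set.seed(42)
A <- matrix(rnorm(40 * 30), nrow = 40, ncol = 30)
B <- matrix(rnorm(30 * 3),  nrow = 30, ncol = 3)
C <- matrix(rnorm(3 * 50),  nrow = 3,  ncol = 50)
D <- matrix(rnorm(50 * 40), nrow = 50, ncol = 40)

# Scalar multiplications for an (n x k) times (k x m) product.
mult_cost <- function(n, k, m) n * k * m

# ((A B) C) D, R's default left-to-right order: 3600 + 6000 + 80000
cost_left <- mult_cost(40, 30, 3) + mult_cost(40, 3, 50) +
  mult_cost(40, 50, 40)

# (A B) (C D): 3600 + 6000 + 4800
cost_balanced <- mult_cost(40, 30, 3) + mult_cost(3, 50, 40) +
  mult_cost(40, 3, 40)

# Associativity: both orders give the same matrix up to rounding error.
left <- ((A %*% B) %*% C) %*% D
balanced <- (A %*% B) %*% (C %*% D)
same_result <- isTRUE(all.equal(left, balanced))
```

Under these assumed dimensions the left-to-right order needs 89,600 scalar multiplications against 14,400 for the balanced order, a ratio of roughly six, while the two results agree numerically.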
m <- function(data) {
  structure(data,
            nrow = nrow(data),
            ncol = ncol(data),
            class = c("matrix_expr", class(data)))
}
v <- function(expr)
  eval_matrix_expr(rearrange_matrix_expr(expr))
options(microbenchmark.unit="relative")
print(res, signif = 3, order = "mean")
## Unit: relative
##                          expr  min   lq mean median   uq  max neval
##       (A %*% B) %*% (C %*% D) 1.00 1.00 1.00   1.00 1.00 1.00   100
##  v(m(A) * m(B) * m(C) * m(D)) 1.13 1.19 1.37   1.23 1.26 1.19   100
##           A %*% B %*% C %*% D 6.13 6.09 5.65   6.08 5.99 2.19   100
The automatic solution is only slightly slower than the optimal solution
(about 1.2 to 1.4 times its running time) and roughly four to five times
faster than the default evaluation.
CHAPTER 2
Matrix Expressions
In the next chapter we discuss computer languages in a more theoretical
way, but here we will consider a concrete case—the matrix expressions
mentioned in Chapter 1. This example is a relatively simple domain-
specific language, but parsing matrix expressions, optimizing them, and
then evaluating them are all the phases we usually have to implement in
any DSL, and the implementation will also have examples of most of the
techniques we will cover in more detail later. The example will use some
tricks that I will not explain until later in the book, so some aspects might
not be evident at this point, but the broader strokes should be and will
ideally serve as a sneak peek of what follows in future chapters.
Our goal for writing a language for matrix expressions is to improve
upon the default performance of the built-in matrix expressions. We
achieve this by taking a more global view of expressions than R does—R
will handle each operator one at a time from left to right, but we will
analyze expressions and rearrange them to improve performance. These
are the steps we must take to do this:
library(microbenchmark)
Parsing Expressions
To keep things simple, we will only consider matrix multiplication and
matrix addition. We do not include scalar multiplication or inverting or
transposing matrices or any other functionality. Adding more components
of the expression language in the example will follow the same ideas as
we need for multiplication and addition. It will not teach us anything new
regarding embedding DSLs in R. When you understand the example, you
will be able to do this easily yourself.
With these restrictions, we can say that a matrix expression is either
just a matrix, the product of two matrix expressions, or the sum of two
matrix expressions. We can represent this as a class hierarchy with one
(abstract) superclass representing expressions and three (concrete)
subclasses for actual data, products, and sums. If you are not familiar with
object-oriented programming in R, we will have a short guide to everything
you need to know in Chapter 4. Constructors for creating objects of the
three concrete classes can look like these:
m <- function(data) {
  structure(list(data = data),
            nrow = nrow(data),
            ncol = ncol(data),
            def_expr = deparse(substitute(data)),
            class = c("matrix_data", "matrix_expr"))
}
matrix_mult <- function(A, B) {
cat(toString(x), "\n")
}
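The listing above is cut off in this excerpt. As a minimal sketch of how the class hierarchy could be completed, the code below keeps m as shown and adds assumed versions of the product and sum constructors, the overloaded operators, and the printing helpers. Everything other than m (including the left and right field names, matrix_sum, and the dim method) is a hypothetical reconstruction, not the book's code.

```r
# Leaf node: wraps an ordinary R matrix (as in the text).
m <- function(data) {
  structure(list(data = data),
            nrow = nrow(data),
            ncol = ncol(data),
            def_expr = deparse(substitute(data)),
            class = c("matrix_data", "matrix_expr"))
}

# Assumed helper: lets dim(), nrow(), and ncol() work on expressions.
dim.matrix_expr <- function(x) c(attr(x, "nrow"), attr(x, "ncol"))

# Assumed constructor for products; inner dimensions must agree.
matrix_mult <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))
  structure(list(left = A, right = B),
            nrow = nrow(A), ncol = ncol(B),
            class = c("matrix_mult", "matrix_expr"))
}

# Assumed constructor for sums; both operands must have the same shape.
matrix_sum <- function(A, B) {
  stopifnot(all(dim(A) == dim(B)))
  structure(list(left = A, right = B),
            nrow = nrow(A), ncol = ncol(A),
            class = c("matrix_sum", "matrix_expr"))
}

# Operator overloading: * and + build expression nodes, nothing is computed.
`*.matrix_expr` <- function(A, B) matrix_mult(A, B)
`+.matrix_expr` <- function(A, B) matrix_sum(A, B)

# Text representation, one method per concrete class.
toString.matrix_data <- function(x, ...)
  paste0("[", attr(x, "def_expr"), "]")
toString.matrix_mult <- function(x, ...)
  paste0("(", toString(x$left), " * ", toString(x$right), ")")
toString.matrix_sum <- function(x, ...)
  paste0("(", toString(x$left), " + ", toString(x$right), ")")
print.matrix_expr <- function(x, ...) {
  cat(toString(x), "\n")
  invisible(x)
}

A <- matrix(1, nrow = 2, ncol = 3)
B <- matrix(1, nrow = 3, ncol = 4)
C <- matrix(1, nrow = 2, ncol = 4)
expr <- m(A) * m(B) + m(C)
expr_string <- toString(expr)
```

Because the operators only build a tree, the dimension checks run when the expression is constructed, before any matrix arithmetic takes place.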
You might be wondering why we need the m function. After all, it does
nothing more than wrap matrices; it contributes nothing else to expressions.
Could we not just use the matrices directly? The answer is no, and it has to
do with how we use operator overloading. For * and + to be the matrix
expression versions, we need the first argument given to them to be a
matrix expression. If we simply wrote this:
A * B + C
we would be invoking the operators for R’s matrix class instead. And since
* is not matrix multiplication (for that you need %*%; the * operator is
component-wise multiplication), you get an error unless A and B happen to
have the same dimensions.
We need a way of bootstrapping ourselves from R’s matrices to the matrices in
our expression language. That is what we use m for.
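To see the failure concretely, here is a small sketch with two made-up matrices; the names and dimensions are illustrative assumptions only.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
B <- matrix(1:12, nrow = 3, ncol = 4)

# Component-wise multiplication fails because the dimensions differ.
elementwise <- tryCatch(A * B, error = function(e) e)
is_error <- inherits(elementwise, "error")

# The matrix product is what %*% computes; the result is 2 x 4.
product <- A %*% B
product_dim <- dim(product)
```

Here is_error is TRUE, since plain matrices dispatch to R's built-in component-wise operator rather than to anything in the expression language.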
Meta-Programming Parsing
Using an explicit function such as m to bootstrap us into the matrix
expression language is the simplest way to use R’s own parser for our
benefit, but it is not the only way. In R, we can manipulate expressions as
if they were data, a feature known as meta-programming and something we
return to in Chapter 5. For now, it suffices to know that an expression can
be explored recursively. We can use the predicate is.name to check whether
the expression refers to a variable, and we can use the predicate is.call
to check whether it is a function call (and all operators are function calls).
So, given an expression that does not use the m function and thus does not
enter our DSL, we can transform it into one that does, like this:
  if (is.call(expr)) {
    if (expr[[1]] == as.name("("))
      return(build_matrix_expr(expr[[2]]))
    if (expr[[1]] == as.name("*") ||
        expr[[1]] == as.name("%*%")) {
      return(call('*',
                  build_matrix_expr(expr[[2]]),
                  build_matrix_expr(expr[[3]])))
    }
    if (expr[[1]] == as.name("+")) {
      return(call('+',
                  build_matrix_expr(expr[[2]]),
                  build_matrix_expr(expr[[3]])))
    }
  }
build_matrix_expr(A * B)
build_matrix_expr(quote(A * B))
## m(A) * m(B)
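The listing above shows only the branch that handles calls. A self-contained sketch of the whole function might look as follows; the handling of variable names and the final error are assumptions made to complete the fragment, and identical is used in place of == as a slightly stricter comparison of symbols.

```r
build_matrix_expr <- function(expr) {
  # Assumed base case: a bare variable name is wrapped in a call to m().
  if (is.name(expr)) {
    return(substitute(m(name), list(name = expr)))
  }
  if (is.call(expr)) {
    op <- expr[[1]]
    # Parentheses carry no meaning in the tree; recurse into their content.
    if (identical(op, as.name("(")))
      return(build_matrix_expr(expr[[2]]))
    # Both * and %*% are mapped to the DSL's multiplication operator.
    if (identical(op, as.name("*")) || identical(op, as.name("%*%"))) {
      return(call("*",
                  build_matrix_expr(expr[[2]]),
                  build_matrix_expr(expr[[3]])))
    }
    if (identical(op, as.name("+"))) {
      return(call("+",
                  build_matrix_expr(expr[[2]]),
                  build_matrix_expr(expr[[3]])))
    }
  }
  # Assumed fallback: anything else is outside the language.
  stop("Parse error for ", deparse(expr))
}

simple <- build_matrix_expr(quote(A * B))
nested <- build_matrix_expr(quote((A + B) %*% C))
```

The function only rewrites the expression; neither m nor the matrices need to exist at this point, because nothing is evaluated.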
parse_matrix_expr(A * B)
## m(A) * m(B)
This isn’t a perfect solution, and there are some pitfalls, among which
is that you cannot use this function from other functions directly. The
substitute function can be difficult to work with. A further problem
is that we are creating a new expression, but it’s an R expression and not
the data structure we want in our matrix expression language. You can
think of the R expression as a literal piece of code; it is not yet evaluated
to become the result we want. For that, we need the eval function, and
we need to evaluate the expression in the right context. Working with
expressions, especially evaluating expressions in different environments,
is among the more advanced aspects of R programming, so if it looks
complicated right now, do not despair. We cover it in detail in Chapter 7.
For now, we will just use this function:
parse_matrix_expr(A * B)
## ([A] * [B])
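The function itself is not included in this excerpt. One way it could be written, as a sketch under stated assumptions, is to capture the argument with substitute, rewrite it, and evaluate the result in the caller's environment with eval and parent.frame. To keep the example runnable on its own, it uses compact stand-ins for m, the operators, and the expression builder; all of these are hypothetical simplifications rather than the book's definitions.

```r
# Compact stand-ins for the DSL pieces (assumed shapes, no dimension checks).
m <- function(data) {
  structure(list(data = data),
            def_expr = deparse(substitute(data)),
            class = c("matrix_data", "matrix_expr"))
}
`*.matrix_expr` <- function(A, B)
  structure(list(left = A, right = B),
            class = c("matrix_mult", "matrix_expr"))
`+.matrix_expr` <- function(A, B)
  structure(list(left = A, right = B),
            class = c("matrix_sum", "matrix_expr"))
toString.matrix_data <- function(x, ...)
  paste0("[", attr(x, "def_expr"), "]")
toString.matrix_mult <- function(x, ...)
  paste0("(", toString(x$left), " * ", toString(x$right), ")")
toString.matrix_sum <- function(x, ...)
  paste0("(", toString(x$left), " + ", toString(x$right), ")")

# Compact expression builder: names become m(name), operators are mapped.
build_matrix_expr <- function(expr) {
  if (is.name(expr)) return(substitute(m(name), list(name = expr)))
  if (is.call(expr) && is.name(expr[[1]])) {
    op <- as.character(expr[[1]])
    if (op == "(") return(build_matrix_expr(expr[[2]]))
    if (op %in% c("*", "%*%", "+")) {
      new_op <- if (op == "+") "+" else "*"
      return(call(new_op,
                  build_matrix_expr(expr[[2]]),
                  build_matrix_expr(expr[[3]])))
    }
  }
  stop("Parse error for ", deparse(expr))
}

parse_matrix_expr <- function(expr) {
  expr <- substitute(expr)              # capture the unevaluated argument
  modified_expr <- build_matrix_expr(expr)
  eval(modified_expr, parent.frame())   # evaluate where the matrices live
}

A <- matrix(1, nrow = 2, ncol = 3)
B <- matrix(1, nrow = 3, ncol = 4)
res <- parse_matrix_expr(A * B)
res_string <- toString(res)
```

Evaluating in parent.frame() matters because A and B are defined in the caller's environment, not inside parse_matrix_expr. The pitfall mentioned above still applies: if another function passes its own argument x along, substitute captures the symbol x rather than the expression the user wrote.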
Expression Manipulation
Our goal for writing this matrix DSL is to optimize evaluation of these
matrix expressions. There are several optimizations we can consider, but
R’s matrix implementation is reasonably efficient already. It is hard to beat
if we try to replace any computations by our own implementations—at
least as long as we implement our alternatives in R. Therefore, it makes
sense to focus on the arithmetic rewriting of expressions.
Optimizing Multiplication
Before we start rewriting multiplication expressions, though, we should
figure out how to find the optimal order of multiplication. Let’s assume
that we have this matrix multiplication: $A_1 \times A_2 \times \cdots \times A_n$.
We need to set parentheses somewhere, say
$(A_1 \times A_2 \times \cdots \times A_i) \times (A_{i+1} \times \cdots \times A_n)$,
to select the last matrix multiplication. If we first multiply together, in
some order, the first $i$ and the last $n - i$ matrices, the last
multiplication we have to do is the product of those two. If the dimensions
of $(A_1 \times \cdots \times A_i)$ are $n \times k$ and the dimensions of
$(A_{i+1} \times \cdots \times A_n)$ are $k \times m$, then this approach will
involve $n \times k \times m$ operations plus however long it takes to
produce the two matrices. Assuming that the best possible way of multiplying
the first $i$ matrices involves $N_{1,i}$ operations and assuming the best
possible way of multiplying the last $n - i$ matrices together involves
$N_{i+1,n}$ operations, then the best possible solution that involves setting
the parentheses where we just did involves
$N_{1,i} + N_{i+1,n} + n \times k \times m$ operations.
Obviously, to get the best performance, we must pick the best $i$ for setting
the parentheses at the top level, so we must minimize this expression for $i$.
Recursively, we can then solve for the sequences 1 to $i$ and $i + 1$ to $n$
to get the best performance.
Put another way, the minimum number of operations we need to
multiply matrices $A_i, A_{i+1}, \ldots, A_j$ can be computed recursively as
$N_{i,j} = 0$ when $i = j$ and

$$N_{i,j} = \min_{i \le k < j} \left\{ N_{i,k} + N_{k+1,j} + \mathrm{nrow}(A_i) \times \mathrm{ncol}(A_k) \times \mathrm{ncol}(A_j) \right\}$$

otherwise.
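The recursion can be filled in bottom-up with dynamic programming. The sketch below is one possible way to do so and is not the book's implementation; the function name matrix_chain_cost, the dims argument (one row of nrow and ncol per matrix), and the example dimensions are all assumptions made for illustration.

```r
# dims: a matrix with one row per matrix in the chain and two columns,
# the number of rows and the number of columns of that matrix.
matrix_chain_cost <- function(dims) {
  n <- nrow(dims)
  N <- matrix(0, nrow = n, ncol = n)              # N[i, i] = 0
  split <- matrix(NA_integer_, nrow = n, ncol = n)
  for (len in seq_len(n - 1)) {                   # len = j - i
    for (i in seq_len(n - len)) {
      j <- i + len
      k <- i:(j - 1)                              # candidate split points
      costs <- N[i, k] + N[k + 1, j] +
        dims[i, 1] * dims[k, 2] * dims[j, 2]
      N[i, j] <- min(costs)
      split[i, j] <- k[which.min(costs)]
    }
  }
  list(cost = N, split = split)
}

# Hypothetical chain of four matrices: 40x30, 30x3, 3x50, 50x40.
dims <- rbind(c(40, 30), c(30, 3), c(3, 50), c(50, 40))
result <- matrix_chain_cost(dims)
best_cost <- result$cost[1, 4]
best_split <- result$split[1, 4]
```

For these assumed dimensions the minimum is 14,400 operations with the top-level split after the second matrix, that is, (A1 × A2) × (A3 × A4). The split table records the best k for every subchain, so the full parenthesization can be recovered by following it recursively.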