Introduction to Compilers
and Language Design
Second Edition
Anyone is free to download and print the PDF edition of this book for personal use. Commercial distribution, printing, or reproduction without the author's consent is expressly prohibited. All other rights are reserved.

You can find the latest version of the PDF edition, and purchase inexpensive hardcover copies, at http://compilerbook.org
Contents
1 Introduction
  1.1 What is a compiler?
  1.2 Why should you study compilers?
  1.3 What's the best way to learn about compilers?
  1.4 What language should I use?
  1.5 How is this book different from others?
  1.6 What other books should I read?

2 A Quick Tour
  2.1 The Compiler Toolchain
  2.2 Stages Within a Compiler
  2.3 Example Compilation
  2.4 Exercises

3 Scanning
  3.1 Kinds of Tokens
  3.2 A Hand-Made Scanner
  3.3 Regular Expressions
  3.4 Finite Automata
    3.4.1 Deterministic Finite Automata
    3.4.2 Nondeterministic Finite Automata
  3.5 Conversion Algorithms
    3.5.1 Converting REs to NFAs
    3.5.2 Converting NFAs to DFAs
    3.5.3 Minimizing DFAs
  3.6 Limits of Finite Automata
  3.7 Using a Scanner Generator
  3.8 Practical Considerations
  3.9 Exercises
  3.10 Further Reading

4 Parsing
  4.1 Overview
  4.2 Context Free Grammars
5 Parsing in Practice
  5.1 The Bison Parser Generator
  5.2 Expression Validator
  5.3 Expression Interpreter
  5.4 Expression Trees
  5.5 Exercises
  5.6 Further Reading

7 Semantic Analysis
  7.1 Overview of Type Systems
  7.2 Designing a Type System
  7.3 The B-Minor Type System
  7.4 The Symbol Table
  7.5 Name Resolution
  7.6 Implementing Type Checking
  7.7 Error Messages
12 Optimization
  12.1 Overview
  12.2 Optimization in Perspective
  12.3 High Level Optimizations
    12.3.1 Constant Folding
    12.3.2 Strength Reduction
    12.3.3 Loop Unrolling
    12.3.4 Code Hoisting
    12.3.5 Function Inlining
    12.3.6 Dead Code Detection and Elimination
  12.4 Low-Level Optimizations
    12.4.1 Peephole Optimizations
    12.4.2 Instruction Selection
  12.5 Register Allocation
    12.5.1 Safety of Register Allocation
    12.5.2 Priority of Register Allocation
    12.5.3 Conflicts Between Variables
    12.5.4 Global Register Allocation
  12.6 Optimization Pitfalls
  12.7 Optimization Interactions
  12.8 Exercises
  12.9 Further Reading
Index
Chapter 1 – Introduction
1.3 What's the best way to learn about compilers?
The best way to learn about compilers is to write your own compiler from beginning to end. While that may sound daunting at first, you will find that this complex task can be broken down into several stages of moderate complexity. The typical undergraduate computer science student can write a complete compiler for a simple language in a semester, broken down into four or five independent stages.
1.4 What language should I use?

Without question, you should use the C programming language and the X86 assembly language, of course!

Ok, maybe the answer isn't quite that simple. There is an ever-increasing number of programming languages that all have different strengths and weaknesses. Java is simple, consistent, and portable, albeit not high performance. Python is easy to learn and has great library support, but is weakly typed. Rust offers exceptional static type-safety, but is not (yet)
1.5 How is this book different from others?
Most books on compilers are very heavy on the abstract theory of scanners, parsers, type systems, and register allocation, and rather light on how the design of a language affects the compiler and the runtime. Most are designed for use in a graduate survey of optimization techniques. This book takes a broader approach by giving a lighter dose of optimization, and introducing more material on the process of engineering a compiler, the tradeoffs in language design, and considerations for interpretation and translation.
You will also notice that this book doesn’t contain a whole bunch of
fiddly paper-and-pencil assignments to test your knowledge of compiler
algorithms. (Ok, there are a few of those in Chapters 3 and 4.) If you want
to test your knowledge, then write some working code. To that end, the
exercises at the end of each chapter ask you to take the ideas in the chapter,
and either explore some existing compilers, or write parts of your own. If
you do all of them in order, you will end up with a working compiler,
summarized in the final appendix.
Chapter 2 – A Quick Tour

2.1 The Compiler Toolchain

[Figure: the compiler toolchain. Source code and headers (stdio.h) pass through the preprocessor, compiler, and assembler to produce object code (prog.o); the static linker (ld) combines object code with libraries (libc.a) into the executable (prog), and the dynamic linker (ld.so) loads dynamic libraries (libc.so) into the running process.]
• The preprocessor prepares the source code for the compiler proper. In the C and C++ languages, this means consuming all directives that start with the # symbol. For example, an #include directive causes the preprocessor to open the named file and insert its contents into the source code. A #define directive causes the preprocessor to substitute a value wherever a macro name is encountered, as illustrated in the sketch after this list. (Not all languages rely on a preprocessor.)
• The compiler proper consumes the preprocessed source code, scans and parses it, performs typechecking and other semantic routines, optimizes the code, and then produces assembly language as the output. This part of the toolchain is the main focus of this book.
• The linker consumes one or more object files and library files and
combines them into a complete, executable program. It selects the
final memory locations where each piece of code and data will be
loaded, and then “links” them together by writing in the missing address information. For example, an object file that calls the printf
function does not initially know the address of the function. An
empty (zero) address will be left where the address must be used.
Once the linker selects the memory location of printf, it must go
back and write in the address at every place where printf is called.
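For instance, the preprocessor directives mentioned above would be expanded before the compiler proper ever sees them. (This is a hypothetical fragment for illustration, not an example from the book.)

/* Before preprocessing: */
#include <stdio.h>            /* the preprocessor pastes in the contents of stdio.h here */
#define BUFFER_SIZE 1024      /* defines a macro named BUFFER_SIZE */

char buffer[BUFFER_SIZE];     /* after preprocessing, the compiler sees: char buffer[1024]; */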
2.2 Stages Within a Compiler

In this book, our focus will be primarily on the compiler proper, which is the most interesting component in the toolchain. The compiler itself can be divided into several stages:
[Figure: the stages within a compiler. The character stream flows through the scanner (producing tokens), the parser (producing an abstract syntax tree), and the semantic routines (producing an intermediate representation); one or more optimizers transform the intermediate representation, and the code generator emits assembly code.]
• The scanner consumes the plain text of a program, and groups together individual characters to form complete tokens. This is much like grouping characters into words in a natural language.
• The parser consumes tokens and groups them together into complete statements and expressions, much like words are grouped into sentences in a natural language. The parser is guided by a grammar which states the formal rules of composition in a given language. The output of the parser is an abstract syntax tree (AST) that captures the grammatical structures of the program. The AST also remembers where in the source file each construct appeared, so it is able to generate targeted error messages, if needed.
• The semantic routines traverse the AST and derive additional meaning (semantics) about the program from the rules of the language and the relationship between elements of the program. For example, we might determine that x + 10 is a float expression by observing the type of x from an earlier declaration, then applying the language rule that addition between int and float values yields a float (a small sketch of this idea appears after this list). After the semantic routines, the AST is often converted into an intermediate representation (IR) which is a simplified form of assembly code suitable for detailed analysis. There are many forms of IR which we will discuss in Chapter 8.
• One or more optimizers can be applied to the intermediate representation, in order to make the program smaller, faster, or more efficient. Typically, each optimizer reads the program in IR format, and then emits the same IR format, so that each optimizer can be applied independently, in arbitrary order.
• Finally, a code generator consumes the optimized IR and transforms it into a concrete assembly language program. Typically, a code generator must perform register allocation to effectively manage the limited number of hardware registers, and instruction selection and sequencing to order assembly instructions in the most efficient form.
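As a small illustration of the semantic routines described above, a typechecking routine for an addition node might look like the following. (The node and type names here are invented for this sketch, not the structures developed later in the book.)

typedef enum { TYPE_INT, TYPE_FLOAT } type_t;

struct expr {
    int kind;                  /* e.g. addition, name, integer literal */
    struct expr *left, *right;
    type_t type;               /* filled in by the typechecker */
};

type_t typecheck(struct expr *e);   /* dispatches on e->kind; defined elsewhere */

/* The rule for addition: int + int is int, but mixing int and float
   promotes the result to float. */
type_t typecheck_add(struct expr *e) {
    type_t lt = typecheck(e->left);
    type_t rt = typecheck(e->right);
    e->type = (lt == TYPE_FLOAT || rt == TYPE_FLOAT) ? TYPE_FLOAT : TYPE_INT;
    return e->type;
}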
2.3 Example Compilation

Consider a simple statement such as height = (width+56) * factor(foo); as it passes through the compiler. The first stage of the compiler (the scanner) will read in the text of the source code character by character, identify the boundaries between symbols, and emit a series of tokens. Each token is a small data structure that describes the nature and contents of each symbol.
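The exact structure is not reproduced here, but a rough sketch of what such a token might look like (the field names are illustrative) is:

typedef enum { TOKEN_IDENTIFIER, TOKEN_INTEGER, TOKEN_ASSIGN, TOKEN_ADD,
               TOKEN_MULTIPLY, TOKEN_LPAREN, TOKEN_RPAREN, TOKEN_SEMI } token_kind_t;

struct token {
    token_kind_t kind;   /* what sort of symbol this is */
    const char  *text;   /* the characters it was built from, e.g. "width" or "56" */
    int          line;   /* where it appeared, for error messages */
};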
At this stage, the purpose of each token is not yet clear. For example, factor and foo are simply known to be identifiers, even though one is the name of a function, and the other is the name of a variable. Likewise, we do not yet know the type of width, so the + could potentially represent integer addition, floating point addition, string concatenation, or something else entirely.
The next step is to determine whether this sequence of tokens forms
a valid program. The parser does this by looking for patterns that match
the grammar of a language. Suppose that our compiler understands a
language with the following grammar:
Grammar G1
1. expr → expr + expr
2. expr → expr * expr
3. expr → expr = expr
4. expr → id ( expr )
5. expr → ( expr )
6. expr → id
7. expr → int
Each line of the grammar is called a rule, and explains how various
parts of the language are constructed. Rules 1-3 indicate that an expression
can be formed by joining two expressions with operators. Rule 4 describes
a function call. Rule 5 describes the use of parentheses. Finally, rules 6 and
7 indicate that identifiers and integers are atomic expressions.[1]
The parser looks for sequences of tokens that can be replaced by the
left side of a rule in our grammar. Each time a rule is applied, the parser
creates a node in a tree, and connects the sub-expressions into the abstract
syntax tree (AST). The AST shows the structural relationships between
each symbol: addition is performed on width and 56, while a function
call is applied to factor and foo.
With this data structure in place, we are now prepared to analyze the meaning of the program. The semantic routines traverse the AST and derive additional meaning by relating parts of the program to each other, and to the definition of the programming language. An important component of this process is typechecking, in which the type of each expression is determined, and checked for consistency with the rest of the program. To keep things simple here, we will assume that all of our variables are plain integers.
To generate linear intermediate code, we perform a post-order traversal of the AST and generate an IR instruction for each node in the tree. A typical IR looks like an abstract assembly language, with load/store instructions, arithmetic operations, and an infinite number of registers. For example, this is a possible IR representation of our example program:
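The exact listing is not reproduced here; an illustrative sequence in a generic three-address style (the mnemonics and register names are invented for this sketch) might be:

LOAD  width      -> r1
LOAD  $56        -> r2
IADD  r1, r2     -> r3
ARG   foo
CALL  factor     -> r4
IMUL  r3, r4     -> r5
STOR  r5         -> height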
[1] The careful reader will note that this example grammar has ambiguities; we will discuss this further in Chapter 4.
[Figure: the abstract syntax tree for the example statement. ASSIGN is at the root, with ID height on the left and MUL on the right; MUL has two children, ADD (over ID width and INT 56) and CALL (over ID factor and ID foo).]
Each optimizer reads and writes the same IR, so that optimizers can be enabled and disabled independently. A retargetable compiler contains multiple code generators, so that the same IR can be emitted for a variety of microprocessors.
2.4 Exercises
2. Determine how to change the optimization level for your local compiler. Find a non-trivial source program and compile it at multiple levels of optimization. How do the compile time, program size, and run time vary with the optimization level?
3. Search the internet for the formal grammars for three languages that
you are familiar with, such as C++, Ruby, and Rust. Compare them
side by side. Which language is inherently more complex? Do they
share any common structures?
Chapter 3 – Scanning
Scanning is the process of identifying tokens from the raw text source code of a program. At first glance, scanning might seem trivial – after all, identifying words in a natural language is as simple as looking for spaces between letters. However, identifying tokens in source code requires the language designer to clarify many fine details, so that it is clear what is permitted and what is not.
3.1 Kinds of Tokens

Most languages will have tokens in these categories:
• Keywords are words in the language structure itself, like while or
class or true. Keywords must be chosen carefully to reflect the
natural structure of the language, without interfering with the likely
names of variables and other identifiers.
• Identifiers are the names of variables, functions, classes, and other
code elements chosen by the programmer. Typically, identifiers are
arbitrary sequences of letters and possibly numbers. Some languages
require identifiers to be marked with a sentinel (like the dollar sign
in Perl) to clearly distinguish identifiers from keywords.
• Numbers could be formatted as integers, or floating point values, or
fractions, or in alternate bases such as binary, octal or hexadecimal.
Each format should be clearly distinguished, so that the programmer
does not confuse one with the other.
• Strings are literal character sequences that must be clearly distinguished from keywords or identifiers. Strings are typically quoted with single or double quotes, but also must have some facility for containing quotations, newlines, and unprintable characters.
• Comments and whitespace are used to format a program to make it
visually clear, and in some cases (like Python) are significant to the
structure of a program.
When designing a new language, or designing a compiler for an existing language, the first job is to state precisely what characters are permitted in each type of token. Initially, this could be done informally by stating, for example, “An identifier consists of a letter followed by any number of letters and numerals.”, and then assigning a symbolic constant (TOKEN_IDENTIFIER) for that kind of token. As we will see, an informal approach is often ambiguous, and a more rigorous approach is needed.
3.2 A Hand-Made Scanner

Figure 3.1 shows how one might write a scanner by hand, using simple coding techniques. To keep things simple, we consider just a few tokens: * for multiplication, ! for logical-not, != for not-equal, and sequences of letters and numbers for identifiers.

The basic approach is to read one character at a time from the input stream (fgetc(fp)) and then classify it. Some single-character tokens are easy: if the scanner reads a * character, it immediately returns TOKEN_MULTIPLY, and the same would be true for addition, subtraction, and so forth.
However, some characters are part of multiple tokens. If the scanner
encounters !, that could represent a logical-not operation by itself, or it
could be the first character in the != sequence representing not-equal-to.
Upon reading !, the scanner must immediately read the next character. If the next character is =, then it has matched the sequence != and returns TOKEN_NOT_EQUAL.

But, if the character following ! is something else, then the non-matching character needs to be put back on the input stream using ungetc, because it is not part of the current token. The scanner returns TOKEN_NOT and will consume the put-back character on the next call to scan_token.

In a similar way, once a letter has been identified by isalpha(c), then the scanner keeps reading letters or numbers, until a non-matching character is found. The non-matching character is put back, and the scanner returns TOKEN_IDENTIFIER.
(We will see this pattern come up in every stage of the compiler: an
unexpected item doesn’t match the current objective, so it must be put
back for later. This is known more generally as backtracking.)
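Figure 3.1 itself is not reproduced here; the following is a minimal sketch of a scanner written in this style. The token names follow the conventions above, but the details differ from the book's figure.

#include <stdio.h>
#include <ctype.h>

typedef enum { TOKEN_EOF, TOKEN_MULTIPLY, TOKEN_NOT, TOKEN_NOT_EQUAL,
               TOKEN_IDENTIFIER, TOKEN_ERROR } token_t;

token_t scan_token(FILE *fp) {
    int c;
    do { c = fgetc(fp); } while (isspace(c));   /* skip whitespace */
    if (c == EOF) return TOKEN_EOF;
    if (c == '*') return TOKEN_MULTIPLY;
    if (c == '!') {
        int d = fgetc(fp);
        if (d == '=') return TOKEN_NOT_EQUAL;
        ungetc(d, fp);                 /* put back the non-matching character */
        return TOKEN_NOT;
    }
    if (isalpha(c)) {
        do { c = fgetc(fp); } while (isalnum(c));
        ungetc(c, fp);                 /* the character after the identifier */
        return TOKEN_IDENTIFIER;
    }
    return TOKEN_ERROR;
}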
As you can see, a hand-made scanner is rather verbose. As more token types are added, the code can become quite convoluted, particularly if tokens share common sequences of characters. It can also be difficult for a developer to be certain that the scanner code corresponds to the desired definition of each token, which can result in unexpected behavior on complex inputs. That said, for a small language with a limited number of tokens, a hand-made scanner can be an appropriate solution.
For a complex language with a large number of tokens, we need a more formalized approach to defining and scanning tokens. A formal approach will allow us to have a greater confidence that token definitions do not conflict and the scanner is implemented correctly. Further, a formal approach will allow us to make the scanner compact and high performance – surprisingly, the scanner itself can be the performance bottleneck in a compiler, since every single character must be individually considered.

The formal tools of regular expressions and finite automata allow us to state very precisely what may appear in a given token type. Then, automated tools can process these definitions, find errors or ambiguities, and produce compact, high performance code.
A regular expression is built up from individual characters and ε using three basic operations: alternation (s|t), concatenation (st), and the Kleene closure (s*). Rule #3 is known as the Kleene closure and has the highest precedence. Rule #2 is known as concatenation. Rule #1 has the lowest precedence and is known as alternation. Parentheses can be added to adjust the order of operations in the usual way.
Here are a few examples using just the basic rules. (Note that a finite
RE can indicate an infinite set.)
Regular Expression s    Language L(s)
hello                   { hello }
d(o|i)g                 { dog, dig }
moo*                    { mo, moo, mooo, ... }
(moo)*                  { ε, moo, moomoo, moomoomoo, ... }
a(b|a)*a                { aa, aaa, aba, aaaa, aaba, abaa, ... }
The syntax described so far is entirely sufficient to write any regular expression. But, it is also handy to have a few helper operations built on top of the basic syntax, such as s? for an optional item, s+ for one or more occurrences, and character classes like [a-z] for any one character in a range.
3.4 Finite Automata

A finite automaton (FA) is an abstract machine consisting of a number of states and a number of edges between those states, each edge labeled with one or more symbols. The machine begins in a start state and, for each input symbol, follows the edge whose label matches the input symbol. Some states of the FA are known as accepting states and are indicated by a double circle. If the FA is in an accepting state after all input is consumed, then we say that the FA accepts the input. We say that the FA rejects the input string if it ends in a non-accepting state, or if there is no edge corresponding to the current input symbol.
Every RE can be written as an FA, and vice versa. For a simple regular
expression, one can construct an FA by hand. For example, here is an FA
for the keyword for:
[Figures: an FA for the keyword for, with states 0 through 3 connected by edges labeled f, o, and r; an FA for identifiers, in which an edge labeled a-z leads to an accepting state with a self-loop on a-z and 0-9; and an FA for numbers built from the digits 0-9, with separate treatment of a leading zero.]
The transitions between states are represented by a matrix (M[s, i]) which encodes the next state, given the current state and input symbol. (If the transition is not allowed, we mark it with E to indicate an error.) For each input symbol, we look up the next state s = M[s, i] until all the input is consumed, or an error state is reached.
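A minimal sketch of this table-driven approach in C follows. The table contents and state numbering are invented for a small identifier DFA, not taken from the book.

#include <ctype.h>

#define E -1                     /* error: no transition allowed */

/* States: 0 = start, 1 = in an identifier (accepting).
   Columns: 0 = letter, 1 = digit, 2 = anything else. */
static const int M[2][3] = {
    /* letter digit other */
    {  1,     E,    E },         /* state 0 */
    {  1,     1,    E },         /* state 1 */
};

static int column(int c) {
    if (isalpha(c)) return 0;
    if (isdigit(c)) return 1;
    return 2;
}

/* Returns 1 if the whole string is an identifier, 0 otherwise. */
int match_identifier(const char *text) {
    int s = 0;
    for (; *text; text++) {
        s = M[s][column((unsigned char)*text)];
        if (s == E) return 0;
    }
    return s == 1;               /* accept only if we end in the accepting state */
}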
[Figure: an NFA with states 0 through 3, a self-loop labeled [a-z] on state 0, and edges labeled i, n, and g leading to the accepting state 3.]
Now consider how this automaton would consume the word sing. It
could proceed in two different ways. One would be to move to state 0 on
s, state 1 on i, state 2 on n, and state 3 on g. But the other, equally valid
way would be to stay in state 0 the whole time, matching each letter to the
[a-z] transition. Both ways obey the transition rules, but one results in
acceptance, while the other results in rejection.
The problem here is that state 0 allows for two different transitions on
the symbol i. One is to stay in state 0 matching [a-z] and the other is to
move to state 1 matching i.
Moreover, there is no simple rule by which we can pick one path or another. If the input is sing, the right solution is to proceed immediately from state zero to state one on i. But if the input is singing, then we should stay in state zero for the first ing and proceed to state one for the second ing.
An NFA can also have an ε (epsilon) transition, which represents the empty string. This transition can be taken without consuming any input symbols at all. For example, we could represent the regular expression a*(ab|ac) with this NFA:
[Figure: an NFA for a*(ab|ac). State 0 has a self-loop on a and ε-transitions into two branches: states 1 and 2 match ab and lead to accepting state 3, while states 4 and 5 match ac and lead to accepting state 6.]
For example, on the input aaac, the set of states the NFA could occupy evolves as follows:

States            Action
0, 1, 4           consume a
0, 1, 2, 4, 5     consume a
0, 1, 2, 4, 5     consume a
0, 1, 2, 4, 5     consume c
6                 accept
In principle, one can implement an NFA in software or hardware by
simply keeping track of all of the possible states. But this is inefficient.
In the worst case, we would need to evaluate all states for all characters
on each input transition. A better approach is to convert the NFA into an
equivalent DFA, as we show below.
3.5 Conversion Algorithms

Regular expressions, NFAs, and DFAs are all equally powerful: for every RE, there is an FA, and vice versa. However, a DFA is by far the most straightforward of the three to implement in software. In this section, we will show how to convert an RE into an NFA, then an NFA into a DFA, and then how to optimize the size of the DFA.
3.5.1 Converting REs to NFAs

The NFA for any single character a is a start state connected to an accepting state by an edge labeled a. The NFA for an ε transition is the same, with the edge labeled ε.
Now, suppose that we have already constructed NFAs for the regular expressions A and B, indicated below by rectangles. Both A and B have a single start state (on the left) and accepting state (on the right). If we write the concatenation of A and B as AB, then the corresponding NFA is simply A and B connected by an ε transition. The start state of A becomes the start state of the combination, and the accepting state of B becomes the accepting state of the combination.

[Figure: the NFA A connected to the NFA B by an ε transition.]
[Figures: the NFA constructions for alternation and the Kleene closure. For A|B, a new start state has ε-transitions into both A and B, and the accepting states of A and B have ε-transitions into a new accepting state; the closure A* adds ε-transitions that allow A to be repeated or skipped entirely. These constructions are then applied step by step to build an NFA for the regular expression a(cat|cow)*.]
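The construction can be written down quite directly. The following is a rough sketch of these composition rules in C; the fragment representation and function names are invented for this sketch.

#include <stdlib.h>

#define EPSILON 0  /* label 0 stands for an epsilon edge */

struct edge { int from, to, label; struct edge *next; };
struct nfa  { int start, accept; };

static struct edge *edges = NULL;
static int nstates = 0;

static int new_state(void) { return nstates++; }
static void add_edge(int from, int to, int label) {
    struct edge *e = malloc(sizeof *e);
    e->from = from; e->to = to; e->label = label;
    e->next = edges; edges = e;
}

/* NFA for a single character c: start --c--> accept */
struct nfa nfa_char(int c) {
    struct nfa n = { new_state(), new_state() };
    add_edge(n.start, n.accept, c);
    return n;
}

/* Concatenation AB: connect A's accepting state to B's start by epsilon. */
struct nfa nfa_concat(struct nfa a, struct nfa b) {
    add_edge(a.accept, b.start, EPSILON);
    return (struct nfa){ a.start, b.accept };
}

/* Alternation A|B: new start and accepting states, epsilon edges in and out. */
struct nfa nfa_alt(struct nfa a, struct nfa b) {
    struct nfa n = { new_state(), new_state() };
    add_edge(n.start, a.start, EPSILON);
    add_edge(n.start, b.start, EPSILON);
    add_edge(a.accept, n.accept, EPSILON);
    add_edge(b.accept, n.accept, EPSILON);
    return n;
}

/* Kleene closure A*: allow skipping A entirely or repeating it. */
struct nfa nfa_star(struct nfa a) {
    struct nfa n = { new_state(), new_state() };
    add_edge(n.start, a.start, EPSILON);
    add_edge(n.start, n.accept, EPSILON);
    add_edge(a.accept, a.start, EPSILON);
    add_edge(a.accept, n.accept, EPSILON);
    return n;
}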
You can easily see that the NFA resulting from the construction algorithm, while correct, is quite complex and contains a large number of epsilon transitions. An NFA representing the tokens for a complete language could end up having thousands of states, which would be very impractical to implement. Instead, we can convert this NFA into an equivalent DFA.
3.5.2 Converting NFAs to DFAs

Epsilon closure. ε-closure(n) is the set of NFA states reachable from NFA state n by zero or more ε transitions.

Now we define the subset construction algorithm. First, we create a start state D0 corresponding to ε-closure(N0). Then, for each outgoing character c from the states in D0, we create a new state containing the epsilon closure of the states reachable by c. More precisely:
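The precise algorithm is given in a figure not reproduced here; a rough sketch of the same idea in C, using bitmasks to represent sets of NFA states, follows. All names and the representation are invented for this sketch.

#include <stdint.h>
#include <string.h>

#define MAX_NFA 64        /* NFA states are numbered 0 .. MAX_NFA-1 */
#define NSYMS   128       /* input symbols are plain ASCII */
#define MAX_DFA 256       /* (no overflow check, for brevity) */

typedef uint64_t stateset; /* bit i set means NFA state i is in the set */

/* NFA description, filled in elsewhere. */
stateset eps_edge[MAX_NFA];          /* states reachable by one epsilon edge */
stateset sym_edge[MAX_NFA][NSYMS];   /* states reachable on a given symbol */
int nfa_states;

/* epsilon-closure: everything reachable by zero or more epsilon transitions */
stateset eps_closure(stateset set) {
    stateset result = set, prev;
    do {
        prev = result;
        for (int s = 0; s < nfa_states; s++)
            if (result & (1ULL << s)) result |= eps_edge[s];
    } while (result != prev);
    return result;
}

/* Each DFA state corresponds to a set of NFA states. */
stateset dfa_set[MAX_DFA];
int dfa_next[MAX_DFA][NSYMS];
int dfa_count = 0;

int find_or_add(stateset set) {
    for (int i = 0; i < dfa_count; i++)
        if (dfa_set[i] == set) return i;
    dfa_set[dfa_count] = set;
    memset(dfa_next[dfa_count], -1, sizeof dfa_next[dfa_count]);
    return dfa_count++;
}

void subset_construction(int nfa_start) {
    find_or_add(eps_closure(1ULL << nfa_start));     /* D0 */
    for (int d = 0; d < dfa_count; d++) {             /* work list by index */
        for (int c = 0; c < NSYMS; c++) {
            stateset moved = 0;
            for (int s = 0; s < nfa_states; s++)
                if (dfa_set[d] & (1ULL << s)) moved |= sym_edge[s][c];
            if (moved) dfa_next[d][c] = find_or_add(eps_closure(moved));
        }
    }
}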
[Figures: the NFA for a(cat|cow)* with its states numbered N0 through N13, and the DFA produced by the subset construction. Each DFA state corresponds to a set of NFA states; for example, D0 = {N0}, D1 = {N1, N2, N3, N4, N8, N13}, D2 = {N5, N9}, and the remaining states D3 through D6 cover the cat and cow branches.]
Example. Let’s work out the algorithm on the NFA in Figure 3.4. This
is the same NFA corresponding to the RE a(cat|cow)* with each of the
states numbered for clarity.
7. Remove D4 from the work list, and observe that the only outgoing transition c leads to states N5 and N9, which already exist as state D2, so simply add a transition D4 → D2 on c.

8. Remove D6 from the work list and, in a similar way, add D6 → D2 on c.

9. The work list is empty, so we are done.
3.5.3 Minimizing DFAs

[Figures: a five-state DFA over the alphabet {a, b} to be minimized, and a first attempt at minimization in which states 1, 2, 3, and 4 are merged into a single super-state (1,2,3,4) alongside state (5).]
Now, we ask whether this graph is consistent with respect to all possi-
ble inputs, by referring back to the original DFA. For example, we observe
that, if we are in super-state (1,2,3,4) then an input of a always goes to
state 2, which keeps us within the super-state. So, this DFA is consistent
with respect to a. However, from super-state (1,2,3,4) an input of b can
either stay within the super-state or go to super-state (5). So, the DFA is
inconsistent with respect to b.
To fix this, we try splitting out one of the inconsistent states (4) into a
new super-state, taking the transitions with it:
[Figure: the DFA after splitting state 4 out into its own super-state, leaving super-states (1,2,3), (4), and (5).]
[Figure: the fully minimized DFA, whose states are (1,3), (2), (4), and (5).]
Again, we examine each super-state and observe that each possible in-
put is consistent with respect to the super-state, and therefore we have the
minimal DFA.
3.6 Limits of Finite Automata

Regular expressions and finite automata are powerful and effective at recognizing simple patterns in individual words or tokens, but they are not sufficient to analyze all of the structures in a problem. For example, could you use a finite automaton to match an arbitrary number of nested parentheses?

It's not hard to write out an FA that could match, say, up to three pairs of nested parentheses, like this:
[Figure: an FA with states 0 through 3 that matches up to three levels of nested parentheses, with ( edges moving deeper and ) edges moving back.]
3.7 Using a Scanner Generator

A scanner generator such as flex accepts a specification file with the following overall structure:
%{
(C Preamble Code)
%}
(Character Classes)
%%
(Regular Expression Rules)
%%
(Additional Code)
A rule consisting of a single dot (.) will match any character at all, which is helpful for catching error conditions.

Figure 3.7 shows a simple but complete example to get you started. This specification describes just a few tokens: a single character addition (which must be escaped with a backslash), the while keyword, an identifier consisting of one or more letters, and a number consisting of one or more digits. As is typical in a scanner, any other type of character is an error, and returns an explicit token type for that purpose.
Flex generates the scanner code, but not a complete program, so you must write a main function to go with it. Figure 3.8 shows a simple driver program that uses this scanner. First, the main program must declare as extern the symbols it expects to use in the generated scanner code: yyin is the file from which text will be read, yylex is the function that implements the scanner, and the array yytext contains the actual text of each token discovered. Finally, we must have a consistent definition of the token types across the parts of the program, so into token.h we put an enumeration describing the new type token_t. This file is included in both scanner.flex and main.c.
Figure 3.10 shows how all the pieces come together. scanner.flex is converted into scanner.c by invoking flex -o scanner.c scanner.flex. Then, both main.c and scanner.c are compiled to produce object files, which are linked together to produce the complete program.
%{
#include "token.h"
%}
DIGIT [0-9]
LETTER [a-zA-Z]
%%
(" "|\t|\n) /* skip whitespace */
\+ { return TOKEN_ADD; }
while { return TOKEN_WHILE; }
{LETTER}+ { return TOKEN_IDENT; }
{DIGIT}+ { return TOKEN_NUMBER; }
. { return TOKEN_ERROR; }
%%
int yywrap() { return 1; }
#include "token.h"
#include <stdio.h>

extern FILE *yyin;     /* the file the scanner reads from */
extern int   yylex();  /* the scanner function generated by flex */
extern char *yytext;   /* the text of the most recent token */

int main() {
    yyin = fopen("program.c","r");
    if(!yyin) {
        printf("could not open program.c!\n");
        return 1;
    }
    while(1) {
        token_t t = yylex();
        if(t==TOKEN_EOF) break;
        printf("token: %d text: %s\n",t,yytext);
    }
    return 0;
}
typedef enum {
    TOKEN_EOF=0,
    TOKEN_WHILE,
    TOKEN_ADD,
    TOKEN_IDENT,
    TOKEN_NUMBER,
    TOKEN_ERROR
} token_t;
token.h

[Figure: the complete build. flex converts scanner.flex into scanner.c; the compiler turns main.c and scanner.c into object files such as main.o; and the linker combines them into scanner.exe. Both source files include token.h.]
3.8 Practical Considerations

When the scanner encounters invalid text, it is best to match the minimum amount of invalid text (using the dot rule) and return an explicit token type indicating an error. The code that invokes the scanner can then emit a suitable message, and then ask for the next token.
3.9 Exercises
1. Write regular expressions for the following entities. You may find it
necessary to justify what is and is not allowed within each expres-
sion:
3. Test the regular expressions you wrote in the previous two problems
by translating them into your favorite programming language that
has native support for regular expressions. (Perl and Python are two
good choices.) Evaluate the correctness of your program by writing
test cases that should (and should not) match.
5. Convert the NFAs in the previous problem into DFAs using the sub-
set construction method.