Writing a C Compiler (Early Access) by Nora Sandler
NO STARCH PRESS
EARLY ACCESS PROGRAM:
FEEDBACK WELCOME!
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press,
Inc. Other product and company names mentioned herein may be the trademarks of their
respective owners. Rather than use a trademark symbol with every occurrence of a trademarked
name, we are using the names only in an editorial fashion and to the benefit of the
trademark owner, with no intention of infringement of the trademark.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by
any means, electronic or mechanical, including photocopying, recording, or by any information
storage or retrieval system, without the prior written permission of the copyright owner
and the publisher.
The information in this book is distributed on an “As Is” basis, without warranty. While every
precaution has been taken in the preparation of this work, neither the author nor No Starch
Press, Inc. shall have any liability to any person or entity with respect to any loss or damage
caused or alleged to be caused directly or indirectly by the information contained in it.
CONTENTS
Introduction
2 RETURNING AN INTEGER
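The lexer breaks up the source code into a list of tokens.
The parser converts the list of tokens into an abstract syntax tree (AST),
which represents the program in a form that we can easily traverse and
analyze.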
The code generation pass converts the AST into assembly. At this
stage, we still represent the assembly instructions in a data structure that
the compiler can understand, not as text.
The code emission pass writes the assembly to a file so the assembler
and linker can turn it into an executable.
This is a pretty normal way of structuring a compiler, although the
exact stages and intermediate representations vary. It’s also overkill for this
chapter; the programs you’ll handle here could be compiled in just one pass!
But setting up this structure now will make it easier to expand your compiler
in future chapters. As you implement more language features, you’ll extend
these compiler stages and add a few new ones. Each chapter in the book
starts with a diagram of the compiler's architecture in that chapter, including
the stages you've already implemented and any you'll need to add. Figure 2-1
shows the four stages you'll implement in this chapter.
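To make the difference between the last two passes concrete, here is a minimal Python sketch. It is only an illustration: the book leaves the implementation language and data structures up to you, the names Mov, Ret, AsmFunction, and emit are hypothetical, and the instruction sequence (movl $2, %eax followed by ret) is my assumption of the typical x64 code, in AT&T syntax, for a function that returns the constant 2.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Mov:
    src: str  # source operand in AT&T syntax, e.g. "$2"
    dst: str  # destination operand, e.g. "%eax"


@dataclass
class Ret:
    pass


Instruction = Union[Mov, Ret]


@dataclass
class AsmFunction:
    name: str
    instructions: List[Instruction]


def emit(fn: AsmFunction) -> str:
    """Code emission: turn the assembly data structure into text."""
    lines = [f"    .globl {fn.name}", f"{fn.name}:"]
    for ins in fn.instructions:
        if isinstance(ins, Mov):
            lines.append(f"    movl {ins.src}, {ins.dst}")
        elif isinstance(ins, Ret):
            lines.append("    ret")
        else:
            raise TypeError(f"unknown instruction: {ins!r}")
    return "\n".join(lines) + "\n"


# Code generation would build this structure from the AST; here it is
# written by hand (assumed output for a program that returns 2).
program = AsmFunction("main", [Mov("$2", "%eax"), Ret()])
assembly_text = emit(program)
```

Keeping the instructions as objects until the very end means later passes can inspect or rewrite them without parsing text; only emit needs to know the textual syntax.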
Before you start coding, let’s take a quick look at how to compile C to
assembly with GCC, and how to read assembly programs.
Hello, Assembly!
The simplest possible C program looks like this:
int main() {
    return 2;
}
Listing 2-2 The program from Listing 2-1 translated into assembly.
NOTE All the assembly listings in this book use AT&T syntax. Elsewhere, you’ll sometimes
see x64 assembly written in Intel syntax. They’re just two different notations
for the same language; the biggest difference is that they put instruction
operands in a different order.
Your .s file might contain a few other assembler directives, but you can
safely ignore them for now. The four lines in Listing 2-2 are a complete
assembly program. Assembly programs have several kinds of statements.
The first line, .globl main, is an assembler directive, a statement
that provides directions for the assembler. Assembler directives always
start with a period. Here, main is a symbol, a placeholder for a memory
address. An assembly instruction can include a symbol when it needs to
refer to the address of a particular function or variable, but the compiler
doesn’t know where that function or variable will end up in memory. Later,
after the linker has combined the different object files that make up the
executable, it can associate each symbol with a memory address; this
process is called symbol resolution. Then the linker will update every place
that uses a symbol to use the corresponding address instead; this is called
relocation.
The .globl main directive tells the assembler that main is a global
symbol. By default, a symbol can only be used in the same assembly file
(and therefore the same object file) where it’s defined. But because main is
global, other object files can refer to it too. The assembler will record this
fact in a section of the object file called the symbol table. The symbol table
contains information about all the symbols in an object file or executable.
The linker relies on the symbol table during symbol resolution. If the
symbol table doesn’t list main as a global symbol, but another object file
tries to refer to it, linking will fail.
Writing a C Compiler (Early Access) © 2022 by Nora Sandler
Next, we use main as a label for the code that follows it. Labels
consist of a string or number followed by a colon. This label marks the
location that the symbol main refers to. For example, the instruction jmp
main should cause the program to jump to the instruction on line 3 of
Listing 2-2. But the label can’t indicate the final location of main; as I
mentioned earlier, we won’t know that until link time. Instead, it defines
main as an offset from the start of the current section in this object file.
(An object file includes different sections for machine instructions, global
variables, debug information, and so on, which are loaded into different
parts of the program's address space at runtime. The object file produced
from Listing 2-2 will only have one section: the text section, which contains
machine instructions.) Because line 3 holds the very first machine
instruction in this file, the offset of main will be 0. The assembler will
record this offset in the symbol table so the linker can use it to determine
the final address of main during symbol resolution.
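The following Python sketch is a toy model of that bookkeeping, not a description of any real object file format or linker. The names Symbol, ObjectFile, and resolve_symbols, the file names, the section sizes, and the base address 0x1000 are all made up for illustration. It shows a symbol recorded as an offset within its section, the linker turning that offset into an address once it knows where the section lands, and linking failing when a referenced symbol isn't global.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Symbol:
    offset: int      # offset from the start of the file's text section
    is_global: bool  # True if the symbol was declared with .globl


@dataclass
class ObjectFile:
    name: str
    text_size: int  # size of the text section in bytes (illustrative)
    symbols: Dict[str, Symbol] = field(default_factory=dict)
    references: List[str] = field(default_factory=list)  # symbols this file uses


def resolve_symbols(objects: List[ObjectFile], base: int = 0x1000) -> Dict[str, int]:
    """Place text sections back to back and assign each global symbol an address."""
    addresses: Dict[str, int] = {}
    section_start = base
    for obj in objects:
        for sym_name, sym in obj.symbols.items():
            if sym.is_global:
                addresses[sym_name] = section_start + sym.offset
        section_start += obj.text_size
    # Every reference must be satisfied by a global symbol or a local one.
    for obj in objects:
        for ref in obj.references:
            if ref not in addresses and ref not in obj.symbols:
                raise LookupError(f"undefined reference to {ref!r} in {obj.name}")
    return addresses


# A hypothetical startup object that refers to main, plus our program.
startup = ObjectFile("startup.o", text_size=0x40, references=["main"])
prog = ObjectFile("return_2.o", text_size=6, symbols={"main": Symbol(0, True)})
addresses = resolve_symbols([startup, prog])
```

In this model, main ends up at the base address plus the size of the section placed before it, plus its own offset of 0. If main were recorded with is_global set to False, resolve_symbols would raise an error, mirroring the link failure described above.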
• Ian Lance Taylor’s 20-part essay on linkers goes into a lot more depth. The first post is at
https://www.airs.com/blog/archives/38, and there’s a table of contents at
https://lwn.net/Articles/276782/.
• “Position Independent Code (PIC) in shared libraries,” a blog post by Eli Bendersky, provides an
overview of how compilers, linkers, and assemblers work together to produce
position-independent code, focusing on 32-bit machines
(https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/).
• “Position Independent Code (PIC) in shared libraries on x64,” also by Eli Bendersky, builds on
the previous article, focusing on 64-bit systems
(https://eli.thegreenplace.net/2011/11/11/position-independent-code-pic-in-shared-libraries-on-x64).
3. Invokes the exit system call, passing it the return value from main.
Then exit handles whatever work needs to happen inside the
operating system to terminate the process and turn the return value
into an exit code.
The bottom line is that you don’t need to worry about process startup or
teardown; you can treat main like a normal function.
To verify that the assembly in Listing 2-2 works correctly, you can
assemble and link it, run it, and check the exit code with the $? shell
operator:
$ gcc return_2.s -o return_2
$ ./return_2
$ echo $?
2
Note that you can pass an assembly file to GCC just like a regular
source file. GCC assumes any input files with a .s extension contain
assembly, so it will just assemble and link those files without trying to
compile them first.
3. Compile the preprocessed source file, and output an assembly file with
a .s extension. You’ll have to stub out this step, since you haven’t
written your compiler yet.
Before you can start writing the lexer, you need to know what tokens you might
encounter. Here are all the tokens in Listing 2-1:
int: a keyword
main: an identifier, whose value is “main”
( : an open parenthesis
) : a close parenthesis
{ : an open brace
return: a keyword
2: a constant, whose value is “2”
; : a semicolon
} : a close brace
I’ve used two lexer-specific terms here. An identifier is an ASCII letter
or underscore followed by a mix of letters, underscores, and digits;
identifiers are case sensitive. An (integer) constant consists of one or more
digits. (C supports hexadecimal and octal integer constants too, but you can
ignore them to keep things simple. We’ll add character and floating-point
constants in Part II.)
Note that identifiers and constants have values in the list of tokens
above, but the other types of tokens don’t. There are many possible
identifiers (foo, variable1, or my_cool_function), so each
identifier token produced by the lexer needs to retain its specific name.
Likewise, each constant token needs to hold an integer value. By contrast,
there's only one possible return keyword, so a return keyword token
doesn't need to store any extra information. Even though main is the only
identifier right now, it’s a good idea to build the lexer in a way that can
support arbitrary identifiers later on. Also note that there are no whitespace
tokens. If we were compiling a language like Python, where whitespace is
significant, we’d need to include whitespace tokens.
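One possible way to model this in code is sketched below in Python. The TokenKind and Token names are hypothetical, and your implementation language may offer a better fit (for example, a sum type). The point is only that identifier and constant tokens carry a value, while every other token is fully described by its kind.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class TokenKind(Enum):
    IDENTIFIER = auto()
    CONSTANT = auto()
    KW_INT = auto()
    KW_RETURN = auto()
    OPEN_PAREN = auto()
    CLOSE_PAREN = auto()
    OPEN_BRACE = auto()
    CLOSE_BRACE = auto()
    SEMICOLON = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    # Only identifiers (a name) and constants (an integer) carry a value.
    value: Union[str, int, None] = None


# The tokens of Listing 2-1, in order. There are no whitespace tokens.
listing_2_1_tokens = [
    Token(TokenKind.KW_INT),
    Token(TokenKind.IDENTIFIER, "main"),
    Token(TokenKind.OPEN_PAREN),
    Token(TokenKind.CLOSE_PAREN),
    Token(TokenKind.OPEN_BRACE),
    Token(TokenKind.KW_RETURN),
    Token(TokenKind.CONSTANT, 2),
    Token(TokenKind.SEMICOLON),
    Token(TokenKind.CLOSE_BRACE),
]
```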
You can define each token type with a regular expression. Table 2-1
gives the corresponding regular expression for each token in PCRE syntax:
Token              Regular expression
Open parenthesis   \(
Close parenthesis  \)
Open brace         {
Close brace        }
Semicolon          ;
Note that identifiers and constants must end at word boundaries. For
example, the first three digits of 123;bar match the regular expression for
a constant, and can be converted into the constant 123. That’s because ;
isn’t in the \w character class, so the boundary between 3 and ; is a word
boundary.
However, the first three digits of 123bar do not match the regular
expression for a constant, because those digits are followed by more
characters in the \w character class instead of a word boundary. If your
lexer sees a string like 123bar it should raise an error, because the start of
the string doesn’t match the regular expression for any token.
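You can check this behavior with a quick experiment. The Python snippet below assumes the constant pattern is one or more digits followed by a word boundary, written here as [0-9]+\b. Python's re module is not PCRE, but \b and \w behave the same way for ASCII input like this. The helper name match_constant is just for illustration.

```python
import re

# Assumed pattern for an integer constant: digits, then a word boundary.
CONSTANT = re.compile(r"[0-9]+\b")


def match_constant(text: str):
    """Return the constant at the start of text, or None if there isn't one."""
    m = CONSTANT.match(text)
    return m.group(0) if m else None


followed_by_semicolon = match_constant("123;bar")  # ';' is not in \w
followed_by_letters = match_constant("123bar")     # 'b' is in \w, so no boundary
```

The first call finds the constant 123, while the second finds nothing: the regex engine can't end the match after 3, 2, or 1, because each of those positions is followed by another \w character.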
You can assume that your C source file only contains ASCII characters.
The C standard provides a mechanism called universal character names to
include non-ASCII characters in identifiers, but we won’t implement them.
Many C implementations let you use Unicode characters directly, but you
don’t need to support that either.
This command just tests whether the lexer succeeds or fails. You may
want to write your own tests to validate that it produces the correct list of
tokens for valid programs and emits an appropriate error for invalid ones.
Implementation Tips
Treat keywords like other identifiers. The regex for identifiers also
matches keywords. Don’t try to simultaneously find the end of the next
token and figure out whether it’s a keyword or not. First, find the end of
the token. Then, if it looks like an identifier, check whether it matches
any of the keywords.
Don’t split on whitespace. It might seem like a good idea to start by
splitting the string on whitespace, but it’s not. It will just complicate
things, because whitespace isn’t the only boundary between tokens. For
example, main() has three tokens and no whitespace.
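Here is a small Python sketch that follows both tips. It is one possible shape for a lexer, not a prescribed design: the token representation (plain tuples), the names lex, KEYWORDS, and TOKEN_PATTERNS, and the identifier and constant patterns are all assumptions made for this example. It skips whitespace one character at a time instead of splitting on it, finds the end of each token first, and only then checks whether an identifier is really a keyword.

```python
import re

KEYWORDS = {"int", "return"}

# (kind, pattern) pairs. The identifier and constant patterns are assumed.
TOKEN_PATTERNS = [
    ("identifier", re.compile(r"[a-zA-Z_]\w*\b")),
    ("constant", re.compile(r"[0-9]+\b")),
    ("open_paren", re.compile(r"\(")),
    ("close_paren", re.compile(r"\)")),
    ("open_brace", re.compile(r"\{")),
    ("close_brace", re.compile(r"\}")),
    ("semicolon", re.compile(r";")),
]


def lex(source: str):
    """Return a list of (kind, value) tuples, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace separates tokens but produces none
            continue
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(source, pos)
            if m:
                break
        else:
            raise ValueError(f"no token matches at offset {pos}: {source[pos:pos + 10]!r}")
        text = m.group(0)
        if kind == "identifier" and text in KEYWORDS:
            tokens.append(("keyword", text))  # keyword check happens after matching
        elif kind in ("identifier", "constant"):
            tokens.append((kind, text))
        else:
            tokens.append((kind, None))
        pos = m.end()
    return tokens
```

With this sketch, lex("main()") yields three tokens even though the input contains no whitespace, and lex("123bar") raises an error because no pattern matches at the start of the string.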
tools.
Handwritten parsers also have some practical advantages over those
produced by parser generators; they can be faster and easier to debug, and
provide better support for error handling. In fact, both GCC and Clang use
handwritten parsers. So writing a parser by hand isn’t just an academic
exercise.
That said, if you’d rather use a parser generator, that’s fine too! It all
depends on what you’re hoping to get out of the book. But I won’t talk
about how to use them, so you’ll have to figure that out on your own. If you
decide to go that route, make sure to research what parsing libraries are
available in your implementation language of choice.
Whichever option you choose, the first step is designing the abstract
syntax tree you want your compiler to produce. It might help to see an
example of an AST first.
This is an if statement, so we’ll label the root of the AST if. The if
node will have two children: