ll1gen - recursive descent parser generator

SYNOPSIS

ll1gen [file]

DESCRIPTION

ll1gen generates a C function that parses input according to a
given grammar. The function, called ll1parse, does not read the
input itself; instead it relies on a function called yylex to
deliver the next "token" of input. The function yylex must be
provided by the programmer (or generated with another utility,
e.g., Lex). The grammar from which the function is generated may
contain "semantic actions": pieces of C code that are executed
whenever a certain part of the input has been recognized. This
way the generated parser can do more than just check the syntax
of the input; it can, e.g., translate it to another form.

The way the grammar is input to ll1gen is reminiscent of that of
yacc, but since ll1gen generates a so-called "top-down" parser
instead of a "bottom-up" parser, the semantic actions are
radically different, and so is the way in which the generated
parser deals with errors.

INPUT FORMAT

The input is read from file, or from stdin if no file is given.
It is divided into three sections. The first is the options
section, which contains %include directives for header files
that must be imported for the proper compilation of the semantic
actions. This section may be empty if no %include files are
needed.

The second section starts with the keyword %terminals. After the
keyword all terminal symbols are listed. These are symbols that
are not defined by the parser, but presumably they represent
chunks of input (maybe just a single character). The yylex()
function must return one of these symbols, or the predefined
symbol ENDMARK.

The third (and last) section starts with the keyword %rules and
contains the rules.

	%include <..>
	%include "..."
	...
	%terminals
	list of terminals...
	...
	%rules
	list of rules...

RULES

Basically, a rule looks like this:

	head: symbol symbol... | alternative... | alternative... ;

A rule starts with the symbol that is being defined (which may
not be a terminal symbol), followed by a colon (:). The rule
ends with a semicolon (;). Between the two there may be a single
list of symbols, or several lists separated with vertical bars
(|). The list may contain both terminal symbols and
non-terminals. The rule says that a certain part of the input
(here named "head", though the name itself carries no special
meaning) consists of either the first alternative, or the
second, and so on, where each alternative is a sequence of
parts that may in turn expand into finer-grained sequences.

SEMANTIC ACTIONS

When rules are being recognized, they may cause side effects by
means of embedded pieces of C code that are executed when
recognition reaches the place in the grammar where they appear.
Semantic actions are delimited by braces {} and may contain
arbitrary C code, as long as the braces are balanced. E.g.:

	adjust
	 : PLUS ONE {x++;}
	 | MINUS ONE {x--;}
	 ;

In this rule, when PLUS and ONE have been recognized, the
variable x is incremented. Since x is not defined in the rule
itself, it must be some global variable.

Rules can also pass arguments to each other. They look very much
like C functions in that respect (and indeed, that is how they
are implemented). Here is an example:

	stars(int *n)
	 : STAR stars(n) {(*n)++;}
	 | {*n = 0;}
	 ;
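To make the "rules are C functions" correspondence concrete,
here is a hand-written sketch of what ll1gen might generate for
the stars rule. The token set and the yylex() stub are
illustrative stand-ins; the real tsymbol comes from the
generated file.def, yylex() from the programmer, and the real
output also performs error recovery:

```c
/* Illustrative token set and scanner; the real tsymbol comes
 * from the generated file.def, and yylex() is supplied by the
 * programmer. */
typedef enum { ENDMARK, STAR } tsymbol;

tsymbol sym;                        /* most recently read token */
static const char *input = "***";   /* stand-in for real input */

tsymbol yylex(void)
{
    return (*input && *input++ == '*') ? STAR : ENDMARK;
}

/* Sketch of generated code for the rule
 *   stars(int *n) : STAR stars(n) {(*n)++;} | {*n = 0;} ;
 */
void stars(int *n)
{
    if (sym == STAR) {
        sym = yylex();   /* match the terminal STAR */
        stars(n);        /* a nonterminal becomes a function call */
        (*n)++;          /* the semantic action runs in place */
    } else {
        *n = 0;          /* the empty alternative */
    }
}
```

With the input "***", reading one token with sym = yylex() and
then calling stars(&n) leaves n equal to 3.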

The lexical scanner might make more information available than
just the current token. For example, it is customary to report
an identifier as the terminal symbol IDENT and store the actual
name in a global variable curstr. The semantic actions can then
use this global variable for various things. However, the
actions must be executed directly after the IDENT is recognized,
because when the next token is read, curstr may be overwritten
again. To ensure that an action is executed before the yylex
function is called again, the action must be `attached' to the
IDENT with a `+', like this:

	two_ids(char *s1, char *s2)
	 : IDENT + "strcpy(s1, curstr);" 
	   IDENT + "strcpy(s2, curstr);" 
	 ;

Note that the following rule does not work, because curstr may
already be overwritten by the time the actions are executed:

two_ids(char *s1, char *s2)
	 : IDENT "strcpy(s1, curstr)"	/* Incorrect! */
	   IDENT "strcpy(s2, curstr)"	/* Incorrect! */
	 ;

TYPES AND DECLARATIONS

ll1gen generates two files, one called file.c and one called
file.def (where file is the base name of the input file). file.c
contains a number of local functions and one exported function,
called ll1parse(). That function initializes the parser and
calls the start symbol, which is the first rule in the grammar.
The arguments to ll1parse are the same as those of the start
symbol.

file.c assumes the existence of a global variable sym of type
tsymbol (see below), and the following three functions:

	extern tsymbol yylex(void);
	extern void insertion(tsymbol);
	extern void deletion(tsymbol);

insertion() is called when the parser expected a certain token
but did not find it; the parser recovers by proceeding as if the
token had been present. This function should print an error
message to that effect. deletion() is called in a similar
situation, but in this case the parser decided that the best way
to recover from the error is to skip a token. deletion() should
print an error message to that effect.
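A minimal sketch of these three hooks, assuming a hypothetical
token set and a yylex() that scans a fixed string (a real
scanner would read actual input, and real messages would name
the symbols instead of printing their numeric values):

```c
#include <stdio.h>

/* Hypothetical symbols; the real tsymbol enum is generated
 * into file.def by ll1gen. */
typedef enum { ENDMARK, PLUS, MINUS, ONE } tsymbol;

tsymbol sym;                        /* the parser reads tokens from here */
static const char *input = "+1-1";  /* stand-in for real input */

/* Deliver the next token to the parser. */
tsymbol yylex(void)
{
    switch (*input ? *input++ : '\0') {
    case '+': return PLUS;
    case '-': return MINUS;
    case '1': return ONE;
    default:  return ENDMARK;
    }
}

/* The parser pretends the missing token s was present. */
void insertion(tsymbol s)
{
    fprintf(stderr, "syntax error: missing symbol %d\n", (int)s);
}

/* The parser skips the unexpected token s. */
void deletion(tsymbol s)
{
    fprintf(stderr, "syntax error: skipping symbol %d\n", (int)s);
}
```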

Some experimentation may be in order to determine exactly what
types of error cause these routines to be called. Writing
good, informative error messages is never easy!

file.def contains an enumerated type tsymbol, containing all
terminal and nonterminal symbols. It also declares the type
tset, an array of short ints just long enough to hold all
tsymbols at one bit per symbol. Finally, it declares the
variable sym as extern tsymbol sym.
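For the adjust example from SEMANTIC ACTIONS, the generated
file.def might contain declarations along these lines (a sketch:
the symbol names, the NSYMBOLS count, and the assumption of
16-bit shorts are all illustrative; the exact layout depends on
the grammar and on ll1gen itself):

```c
/* Illustrative sketch of a generated file.def. */
typedef enum {
    ENDMARK, PLUS, MINUS, ONE,      /* terminals */
    adjust,                         /* nonterminals */
    NSYMBOLS                        /* one past the last symbol */
} tsymbol;

/* One bit per symbol, packed into shorts (assuming 16 bits each). */
typedef short tset[(NSYMBOLS + 15) / 16];

extern tsymbol sym;                 /* most recently read token */
```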

FILES

file.def	#include this file in the lexical scanner.
file.c		contains the ll1parse() function.

SEE ALSO

yacc(1), lex(1)

AUTHOR

Bert Bos <bert@let.rug.nl>, February 1993
Department Alfa-informatica, Groningen University
Groningen, The Netherlands
