%\batchmode

\subsection{NAME}

ll1gen - recursive descent parser generator


\subsection{SYNOPSIS}

{\bf ll1gen [{\em file\/}]}


\subsection{DESCRIPTION}

{\bf ll1gen} generates a C function that parses input according
to a certain grammar. The function -- called {\em ll1parse\/} --
doesn't actually read the input; instead it relies on a function
called {\em yylex\/} to deliver the next ``token'' of input. The
function {\em yylex\/} must be provided by the programmer (or
generated with another utility, e.g., {\bf lex(1)}). The
grammar from which the function is generated may contain ``semantic
actions'', which are pieces of C code that are executed whenever a
certain part of the input has been recognized. This way the
generated parser can do more than just check the syntax of the
input -- e.g., translate it to another form.

The way the grammar is input to {\bf ll1gen} is reminiscent of
that of {\bf yacc(1)}, but since {\bf ll1gen} generates a
so-called ``top-down'' parser instead of a ``bottom-up'' parser, the
semantic actions are radically different, and so is the way in
which the generated parser deals with errors.

The input is read from {\em file\/} or, if no file is given, from {\em stdin.\/} It is divided
into three parts. The first is the options section. It contains
{\em \%include\/} directives for header files that must be
imported for the proper compilation of the semantic actions. This
section may be empty, if no {\em \%include\/} files are needed.

The second section starts with the keyword {\em \%terminals.\/}
After the keyword all terminal symbols are listed. These are
symbols that are not defined by the parser, but presumably they
represent chunks of input (maybe just a single character). The
{\em yylex\/} function must return one of these symbols, or the
predefined symbol {\em ENDMARK.\/}
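
As an illustration, a hand-written {\em yylex\/} for a grammar with
the terminals {\em STAR\/} and {\em PLUS\/} might look like the sketch
below. The enumeration only stands in for the symbol type that the
generated files would normally supply; its members, the variable
{\em input\/} and the decision to skip unknown characters are
assumptions made for this example.
\begin{verbatim}
/* Stand-in for the symbol type normally taken from the generated
   file.def (member names and order are assumed for this sketch). */
typedef enum { ENDMARK, STAR, PLUS } tsymbol;

/* Hypothetical input source: a NUL-terminated string. */
static const char *input = "";

/* Return the next token; ENDMARK once the input is exhausted. */
tsymbol yylex(void)
{
    while (*input != '\0') {
        char c = *input++;
        if (c == '*') return STAR;
        if (c == '+') return PLUS;
        /* anything else (e.g. white space) is skipped */
    }
    return ENDMARK;
}
\end{verbatim}
A real scanner would more likely read from a file and report unknown
characters instead of silently skipping them.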

The third (and last) section starts with the keyword
{\em \%rules\/} and contains the rules.
\begin{verbatim}
      %include <...>
      %include "..."
      ...
      %terminals
      list of terminals...
      ...
      %rules
      list of rules...
    \end{verbatim}
Basically, a rule looks like this:
\begin{verbatim}
      head: symbol symbol... | alternative... | alternative... ;
    \end{verbatim}
A rule starts with the symbol that is being defined (which may not
be a terminal symbol), followed by a colon (:). The rule ends with
a semicolon (;). Between the two there may be a single list of
symbols, or several lists separated with vertical bars ({\tt |}). Each
list may contain both terminal symbols and non-terminals. The rule
says that a certain part of the input (tentatively named ``head'',
but the name is unimportant) consists of either... or... or...,
where each branch is a sequence of parts, which may further expand
into finer grained sequences.

When rules are being recognized, they may cause side effects by
means of embedded pieces of C code that are executed when
recognition proceeds to the place in the grammar where they are
defined. Semantic actions are delimited by braces \{\} and they may
contain arbitrary C code, as long as the braces are balanced. E.g.:
\begin{verbatim}
      adjust
       : PLUS ONE {x++;}
       | MINUS ONE {x--;}
       ;
    \end{verbatim}
In this rule, when {\em PLUS\/} and {\em ONE\/} have been
recognized, the variable {\em x\/} is incremented. Since
{\em x\/} is not defined in the rule itself, it must be some
global variable.

Rules can also pass arguments to each other. They look very much
like C functions in that respect (and indeed, that is how they are
implemented). Here is an example:
\begin{verbatim}
      stars(int *n)
       : STAR stars(n) {(*n)++;}
       | {*n = 0;}
       ;
    \end{verbatim}
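To make the analogy with C functions concrete, the sketch below shows
a hand-written function with the same behaviour as the {\em stars\/}
rule. It is not the code that {\bf ll1gen} emits; the token
source {\em input\/}, the lookahead variable {\em sym\/}, the
simplified {\em yylex\/} and the driver {\em count\_stars\/} are all
assumptions made for this example.
\begin{verbatim}
typedef enum { ENDMARK, STAR } tsymbol;  /* stand-in symbol type */

static const char *input = "";  /* hypothetical token source */
tsymbol sym;                    /* current lookahead token */

/* Simplified scanner: '*' is a STAR, anything else ends the input. */
tsymbol yylex(void)
{
    return *input == '*' ? (input++, STAR) : ENDMARK;
}

/* stars(int *n) : STAR stars(n) {(*n)++;} | {*n = 0;} ; */
void stars(int *n)
{
    if (sym == STAR) {      /* first alternative */
        sym = yylex();      /* consume the STAR */
        stars(n);           /* recognize the remaining stars */
        (*n)++;             /* semantic action */
    } else {                /* empty alternative */
        *n = 0;             /* semantic action */
    }
}

/* Hypothetical driver: load the first token, call the start rule. */
int count_stars(const char *text)
{
    int n;
    input = text;
    sym = yylex();
    stars(&n);
    return n;
}
\end{verbatim}
With these definitions, {\em count\_stars("***")\/} yields 3: the
innermost call takes the empty alternative and sets the counter to 0,
and each enclosing call increments it once on the way back.
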
The lexical scanner might make more information available than
just the current token. For example, it is customary to report an
identifier as the terminal symbol {\em IDENT\/} and store the
actual name in a global variable {\em curstr.\/} The semantic
actions can then use this global variable for various things.
However, the actions must be executed directly after the
{\em IDENT\/} is recognized, because when the next token is read,
{\em curstr\/} may be overwritten again. To ensure that an action
is executed before the {\em yylex\/} function is called again,
the action must be `attached' to the {\em IDENT\/} with a `+',
like this:
\begin{verbatim}
      two_ids(char *s1, char *s2)
       : IDENT + "strcpy(s1, curstr);" 
         IDENT + "strcpy(s2, curstr);" 
       ;
    \end{verbatim}
Note that the following rule, which lacks the `+', does not work,
because the actions may be executed only after the next token has
been read:
\begin{verbatim}
      two_ids(char *s1, char *s2)
       : IDENT "strcpy(s1, curstr);"    /* Incorrect! */
         IDENT "strcpy(s2, curstr);"    /* Incorrect! */
       ;
    \end{verbatim}
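
The timing problem can be simulated in plain C. In the sketch below,
{\em scan\_ident\/} stands in for a call to {\em yylex\/} that returns
{\em IDENT\/} and overwrites {\em curstr;\/} the word list and the
function {\em demo\/} are hypothetical and serve only to show the
order of events.
\begin{verbatim}
#include <string.h>

char curstr[32];                /* filled by the scanner */

static const char *const words[] = { "alpha", "beta" };
static int next_word = 0;

/* Stand-in for yylex() returning IDENT: overwrites curstr. */
void scan_ident(void)
{
    strcpy(curstr, words[next_word++ % 2]);
}

/* Copy curstr once before and once after the next token is read. */
void demo(char *attached, char *late)
{
    next_word = 0;
    scan_ident();               /* first IDENT is read           */
    strcpy(attached, curstr);   /* attached action: copies it    */
    scan_ident();               /* next token overwrites curstr  */
    strcpy(late, curstr);       /* late action: sees second name */
}
\end{verbatim}
After a call to {\em demo\/}, the first buffer holds the name of the
first identifier, whereas the second buffer holds the name of the
second one, which is the mistake the `+' is meant to prevent.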

{\bf ll1gen} generates two files, one called
{\em file\/}{\bf .c} and one called {\em file\/}{\bf .def}
(where {\em file\/} is the base name of the input file).
{\em file\/}{\bf .c} contains a number of local functions and
one exported function, called {\em ll1parse.\/} That function
initializes the parser and calls the start symbol, which is the
first rule in the grammar. The arguments to {\em ll1parse\/} are
the same as those of the start symbol.

{\em file\/}{\bf .c} assumes the existence of a global
variable {\em sym\/} of type {\em tsymbol\/} (see below), and
the following three functions:
\begin{quote}
extern tsymbol yylex(void);\\\relax 
extern void insertion(tsymbol);\\\relax 
extern void deletion(tsymbol);\\\relax 

\end{quote}
{\em insertion\/} is called when the parser expects to see a
certain token but cannot find it. This function should print an
error message to that effect. {\em deletion\/} is called in a
similar situation, but in this case the parser has decided that the
best way to recover from the error is to skip a token.
{\em deletion\/} should likewise print an error message to that effect.
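
A minimal pair of error routines might look like the sketch below. It
assumes that the argument of {\em insertion\/} is the token that was
expected and that the argument of {\em deletion\/} is the token that is
skipped; this page does not state that explicitly. The enumeration, the
table {\em names\/} and the counter {\em error\_count\/} are
hypothetical.
\begin{verbatim}
#include <stdio.h>

/* Stand-in for the generated symbol type (members assumed). */
typedef enum { ENDMARK, IDENT, PLUS } tsymbol;

/* Printable names, indexed by symbol (hypothetical table). */
static const char *const names[] = {
    "end of input", "identifier", "'+'"
};

int error_count = 0;            /* number of errors reported */

/* Assumed meaning: the parser expected `expected' but did not see it. */
void insertion(tsymbol expected)
{
    fprintf(stderr, "error: expected %s\n", names[expected]);
    error_count++;
}

/* Assumed meaning: the parser skips the token `skipped'. */
void deletion(tsymbol skipped)
{
    fprintf(stderr, "error: unexpected %s skipped\n", names[skipped]);
    error_count++;
}
\end{verbatim}
In a real program the table would need an entry for every symbol in
the generated enumeration, and the messages would usually include a
line number kept by the scanner.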

Some experimentation may be in order to determine exactly what
types of errors cause these routines to be called. Writing good,
informative error messages is never easy!

{\em file\/}{\bf .def} contains an enumerated type
{\em tsymbol,\/} containing all terminal and nonterminal symbols.
It also declares the type {\em tset,\/} which is an array of
{\em short int\/}s, just long enough to hold all
{\em tsymbol\/}s at one bit per symbol. Finally, it declares
the variable {\em sym\/} as {\em extern tsymbol sym.\/}
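
The one-bit-per-symbol layout can be pictured with the following
sketch. It is only an assumption about how such a set could be
declared and used, not the contents of the generated file: the
enumeration, the sentinel {\em NSYMBOLS,\/} the helper functions and
the use of {\em unsigned short\/} (chosen here to keep the bit
operations free of sign issues) are all hypothetical.
\begin{verbatim}
#include <limits.h>

/* Stand-in symbol type; NSYMBOLS is a hypothetical sentinel that
   counts the symbols. */
typedef enum { ENDMARK, IDENT, PLUS, MINUS, expr, NSYMBOLS } tsymbol;

#define BITS_PER_WORD (sizeof(unsigned short) * CHAR_BIT)
#define NWORDS ((NSYMBOLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Just enough words to hold one bit per symbol. */
typedef unsigned short tset[NWORDS];

/* Add symbol t to set s. */
void set_add(tset s, tsymbol t)
{
    s[t / BITS_PER_WORD] |= (unsigned short)(1u << (t % BITS_PER_WORD));
}

/* Return 1 if symbol t is a member of set s, else 0. */
int set_has(const tset s, tsymbol t)
{
    return (s[t / BITS_PER_WORD] >> (t % BITS_PER_WORD)) & 1u;
}
\end{verbatim}
A set declared as {\em tset s = \{0\}\/} starts out empty; after
{\em set\_add(s, PLUS)\/} only the bit for {\em PLUS\/} is set.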


\subsection{FILES}

{\em file\/}{\bf .def}: include this file in the lexical
scanner.

{\em file\/}{\bf .c} contains the {\em ll1parse\/} function.


\subsection{SEE ALSO}

{\bf yacc(1),} {\bf lex(1)}


\subsection{AUTHOR}

Bert Bos $\langle$bert@let.rug.nl$\rangle$, February 1993\\\relax 
Department Alfa-informatica, Groningen University\\\relax  Groningen,
The Netherlands

