[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_introduction Introduction to __lex__]

Lexical scanning is the process of analyzing the stream of input characters and
separating it into strings called tokens, delimited by whitespace.
Most compiler texts start here, and devote several chapters to discussing
various ways to build scanners. __lex__ is a library built to take care of the
complexities of creating a lexer for your grammar (in this documentation we
will use the terms 'lexical analyzer', 'lexer' and 'scanner' interchangeably).
All that is needed to create a lexer is to know the set of patterns describing
the different tokens you want to recognize in the input. To make this a bit more
formal, here are some definitions:

* A token is a sequence of consecutive characters having a collective meaning.
  Tokens may have attributes specific to the token type, carrying additional
  information about the matched character sequence.
* A pattern is a rule expressed as a regular expression and describing how a
  particular token can be formed. For example, [^\[A-Za-z_\]\[A-Za-z_0-9\]*] is
  a pattern for a rule matching C++ identifiers (a small sketch of how such a
  pattern becomes a token definition follows this list).
* Characters between tokens are called whitespace; these include spaces, tabs,
  newlines, and formfeeds. Many people also count comments as whitespace,
  though since some tools such as lint look at comments, this method is not
  perfect.

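To give a first impression of how this looks in __lex__, the sketch below turns
the identifier pattern from above into a token definition. This is purely
illustrative: the names [^example_tokens] and [^identifier] are made up, and the
dynamic /lexertl/ model (explained later in this section) is assumed.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    // an illustrative lexer containing a single token definition built
    // from the identifier pattern shown above
    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
          : identifier("[A-Za-z_][A-Za-z_0-9]*")    // the pattern for this token
        {
            // associate the token definition with the lexer
            this->self = identifier;
        }

        // the token exposes the matched text as its attribute
        lex::token_def<std::string> identifier;
    };
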
[heading Why Use a Separate Lexer?]

Typically, lexical scanning is done in a separate module from the parser,
feeding the parser with a stream of input tokens only. Theoretically it is
not necessary to implement this separation, as in the end there is only one set
of syntactical rules defining the language, so in theory we could write the
whole parser in one module. In fact, __qi__ allows you to write parsers without
using a lexer, parsing the input character stream directly, and for the most
part this is the way __spirit__ has been used since its invention.

However, this separation has both a practical and a theoretical basis, and
proves to be very useful in practical applications. In 1956, Noam Chomsky
defined the "Chomsky Hierarchy" of grammars:

* Type 0: Unrestricted grammars (e.g., natural languages)
* Type 1: Context-Sensitive grammars
* Type 2: Context-Free grammars
* Type 3: Regular grammars

The complexity of these grammars increases from regular grammars being the
simplest to unrestricted grammars being the most complex. Similarly, the
complexity of pattern recognition for these grammars increases. Although a few
features of some programming languages (such as C++) are Type 1, fortunately
for the most part programming languages can be described using only Types 2
and 3. The neat part about these two types is that they are well known and the
ways to parse them are well understood. It has been shown that any regular
grammar can be parsed using a state machine (finite automaton). Similarly,
context-free grammars can always be parsed using a push-down automaton
(essentially a state machine augmented by a stack).

In real programming languages and practical grammars, the parts that can be
handled as regular expressions tend to be the lower-level pieces, such as the
definition of an identifier or of an integer value:

    letter     := [a-zA-Z]
    digit      := [0-9]

    identifier := letter [ letter | digit ]*
    integer    := digit+

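Expressed with __lex__, these lower-level pieces could be written directly as
token definitions. The rough sketch below is illustrative only (the names
[^number_and_id_tokens], [^identifier] and [^integer] are made up and the
dynamic /lexertl/ model is assumed); it uses named patterns for [^LETTER] and
[^DIGIT], corresponding to the pattern macros mentioned further below.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    template <typename Lexer>
    struct number_and_id_tokens : lex::lexer<Lexer>
    {
        number_and_id_tokens()
        {
            // named patterns (pattern macros) usable inside token definitions
            this->self.add_pattern("LETTER", "[a-zA-Z]");
            this->self.add_pattern("DIGIT", "[0-9]");

            // token definitions referring to the patterns defined above
            identifier = "{LETTER}({LETTER}|{DIGIT})*";
            integer    = "{DIGIT}+";

            // associate the token definitions with the lexer
            this->self = identifier | integer;
        }

        lex::token_def<std::string>  identifier;
        lex::token_def<unsigned int> integer;
    };
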
Higher level parts of practical grammars tend to be more complex and can't be
implemented using plain regular expressions. We need to store information on
the built-in hardware stack while recursing the grammar hierarchy, and that is
the preferred approach used for top-down parsing. Since it takes a different
kind of abstract machine to parse the two types of grammars, it proved to be
efficient to separate the lexical scanner into a separate module which is built
around the idea of a state machine. The goal here is to use the simplest
parsing technique needed for the job.

Another, more practical, reason for separating the scanner from the parser is
the need for backtracking during parsing. The input data is a stream of
characters, which is often thought to be processed left to right without any
backtracking. Unfortunately, in practice most of the time that isn't possible.
Almost every language has certain keywords such as IF, FOR, and WHILE. The
decision whether a certain character sequence actually comprises a keyword or
just an identifier often can be made only after seeing the first delimiter
/after/ it. In fact, this requires backtracking, since we need to store the
string long enough to be able to make the decision. The same is true for more
coarse-grained language features such as nested IF/ELSE statements, where the
decision about which IF the last ELSE belongs to can be made only after seeing
the whole construct.

So the structure of a conventional compiler often involves splitting up the
functions of the lower-level and higher-level parsing. The lexical scanner
deals with things at the character level, collecting characters into strings,
converting character sequences into other representations such as integers,
etc., and passing them along to the parser proper as indivisible tokens. It's
also considered normal to let the scanner do additional jobs, such as
identifying keywords, storing identifiers in tables, etc.

Now, __spirit__ follows this structure, where __lex__ can be used to implement
state machine based recognizers, while __qi__ can be used to build recognizers
for context-free grammars. Since both modules are seamlessly integrated with
each other and with the C++ target language, it is even possible to use the
provided functionality to build more complex grammar recognizers.

[heading Advantages of using __lex__]

The advantage of using __lex__ to create the lexical analyzer over using more
traditional tools such as __flex__ is its carefully crafted integration with
the __spirit__ library and the C++ host language. You don't need any external
tools to generate the code, and your lexer will be perfectly integrated with
the rest of your program, making it possible to freely access any context
information and data structures. Since the C++ compiler sees all the code, it
will generate optimal code no matter what configuration options have been
chosen by the user. __lex__ gives you the vast majority of features you could
get from a similar __flex__ program without the need to leave C++ as a host
language:

* The definition of tokens is done using regular expressions (patterns)
* The token definitions can refer to special substitution strings (pattern
  macros) simplifying pattern definitions
* The generated lexical scanner may have multiple start states
* It is possible to attach code to any of the token definitions; this code gets
  executed whenever the corresponding token pattern has been matched (see the
  sketch following this list)

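As an illustration of the last point, the sketch below attaches a small piece
of code to a token definition which simply counts the matched words. Again,
this is only a sketch with made-up names ([^counting_tokens], [^words]),
similar in spirit to the word counting examples shipped with the library.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <boost/spirit/include/phoenix_core.hpp>
    #include <cstddef>

    namespace lex = boost::spirit::lex;
    namespace phx = boost::phoenix;

    template <typename Lexer>
    struct counting_tokens : lex::lexer<Lexer>
    {
        counting_tokens()
          : words(0), word("[a-zA-Z]+"), eol("\n"), any(".")
        {
            // the code attached in [...] is run each time the token matches
            this->self
                =   word [ ++phx::ref(words) ]
                |   eol
                |   any
                ;
        }

        std::size_t words;                  // updated from the semantic action
        lex::token_def<> word, eol, any;
    };
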
Although it is possible to use __lex__ to generate C++ code representing the
lexical analyzer (we will refer to that as the /static/ model, described in
more detail in the section __sec_lex_static_model__) - a model very similar to
the way __flex__ operates - we will mainly focus on the opposite, the /dynamic/
model. You can directly integrate the token definitions into your C++ program,
building the lexical analyzer dynamically at runtime. The dynamic model is
something not supported by __flex__ or other lexical scanner generators (such
as __re2c__, __ragel__, etc.). This flexibility allows you to speed up the
development of your application.

[heading The Library Structure of __lex__]

The [link spirit.lexerflow figure] below shows a high level overview of how the
__lex__ library might be used in an application. __lex__ allows you to create
lexical analyzers based on patterns. These patterns are regular expression
based rules used to define the different tokens to be recognized in the
character input sequence. The input sequence is expected to be provided to the
lexical analyzer as an arbitrary standard forward iterator. The lexical
analyzer itself exposes a standard forward iterator as well. The difference
here is that the exposed iterator provides access to the token sequence instead
of to the character sequence. The tokens in this sequence are constructed on
the fly by analyzing the underlying character sequence and matching this to the
patterns as defined by the application.

[fig lexerflow.png..The Library structure and Common Flow of Information while using __lex__ in an application..spirit.lexerflow]
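To make this flow of information a little more concrete, the following purely
illustrative snippet drives the [^counting_tokens] lexer sketched earlier over
a character range, using the dynamic /lexertl/ model and the [^lex::tokenize()]
convenience function (the token iterators exposed by the lexer could be used
just as well).

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    int main()
    {
        // token and lexer types of the dynamic lexertl model; actor_lexer<>
        // is used here because the token definitions carry semantic actions
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::actor_lexer<token_type> lexer_type;

        counting_tokens<lexer_type> tokens;     // the lexer sketched earlier

        // the input is handed to the lexer as a pair of forward iterators
        std::string input("two words\n");
        char const* first = input.c_str();
        char const* last  = first + input.size();

        // tokenize() runs the lexer over the whole character range
        if (lex::tokenize(first, last, tokens))
            std::cout << "words: " << tokens.words << std::endl;
        return 0;
    }
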
[endsect]
|