[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_token_values About Tokens and Token Values]
As already discussed, lexical scanning is the process of analyzing the stream
of input characters and separating it into strings called tokens, most of the
time separated by whitespace. The different token types recognized by a lexical
analyzer often get assigned unique integer token identifiers (token ids). These
token ids are normally used by the parser to identify the current token without
having to look at the matched string again. The __lex__ library is no different
in this respect, as it uses token ids as the main means of identifying the
different token types defined for a particular lexical analyzer. However, it
differs from commonly used lexical analyzers in that it returns (references to)
instances of a (user defined) token class to the user. The only requirement
placed on this token class is that it must carry at least the token id of the
token it represents. For more information about the interface a user defined
token type has to expose, please see the __sec_ref_lex_token__ reference. The
library provides a default token type based on the __lexertl__ library, which
should be sufficient in most cases: the __class_lexertl_token__ type. This
section focuses on the general features a token class may implement and on how
these integrate with the other parts of the __lex__ library.
[heading The Anatomy of a Token]

It is very important to understand the difference between a token definition
(represented by the __class_token_def__ template) and a token itself (for
instance represented by the __class_lexertl_token__ template).

The token definition is used to describe the main features of a particular
token type (all of which are illustrated by the sketch following this list),
especially:

* to define a token type in terms of the regular expression pattern used to
  match it,
* to associate a token type with a particular lexer state,
* to optionally assign a token id to a token type,
* to optionally associate some code to execute whenever an instance of this
  token type has been matched,
* and to optionally specify the attribute type of the token value.
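
By way of illustration, here is a minimal sketch combining all of these
elements in one lexer definition. The class name `example_tokens`, the token
id `ID_IDENTIFIER`, and the patterns used are invented for this example only:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    // a made up token id, starting above the range of predefined ids
    enum token_ids { ID_IDENTIFIER = lex::min_token_id + 1 };

    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
            // regular expression pattern and explicit token id
          : identifier("[a-zA-Z_][a-zA-Z0-9_]*", ID_IDENTIFIER)
          , count(0)
        {
            // add the token to the default lexer state, attaching a
            // semantic action which counts the matched identifiers
            this->self = identifier [ ++boost::phoenix::ref(count) ];

            // whitespace is associated with a separate lexer state, "WS"
            this->self("WS") = lex::token_def<>("[ \\t\\n]+");
        }

        // the attribute type of the token value is std::string
        lex::token_def<std::string> identifier;
        std::size_t count;
    };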
The token itself is a data structure returned by the lexer iterators.
Dereferencing a lexer iterator returns a reference to the last matched token
instance. It encapsulates the part of the underlying input sequence matched by
the regular expression used during the definition of this token type.
Incrementing the lexer iterator invokes the lexical analyzer to match the next
token by advancing the underlying input stream. The token data structure
contains at least the token id of the matched token type, which identifies the
kind of character sequence that has been matched. Optionally, the token
instance may contain a token value and/or the lexer state this token instance
was matched in. The following [link spirit.lex.tokenstructure figure] shows the
schematic structure of a token.

[fig tokenstructure.png..The structure of a token..spirit.lex.tokenstructure]
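
Given a lexer iterator `iter` (here assumed to dereference to an instance
`token_type` of the default __class_lexertl_token__ type), the parts of the
token shown in the figure may be accessed as sketched below:

    // dereferencing the lexer iterator yields the last matched token
    token_type const& t = *iter;

    t.id();      // the token id of the matched token type
    t.value();   // the token value (initially an iterator range)
    t.state();   // the lexer state this token was matched in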
The token value and the lexer state the token has been recognized in may be
omitted for optimization reasons, avoiding the need for the token to carry more
data than actually required. This configuration can be achieved by supplying
appropriate template parameters for the __class_lexertl_token__ template while
defining the token type.

The lexer iterator returns the same token type for each of the different
matched token definitions. To accommodate the different token /value/ types
exposed by the various token types (token definitions), the general type of the
token value is a __boost_variant__. At a minimum (for the default
configuration), this token value variant will be configured to always hold a
__boost_iterator_range__ containing the pair of iterators pointing to the
matched input sequence for this token instance.

[note If the lexical analyzer is used in conjunction with a __qi__ parser, the
      stored __boost_iterator_range__ token value will be converted to the
      requested token type (parser attribute) exactly once. This happens at the
      time of the first access to the token value requiring the corresponding
      type conversion. The converted token value will be stored in the
      __boost_variant__, replacing the initially stored iterator range. This
      avoids converting the input sequence to the token value more than once,
      thus optimizing the integration of the lexer with __qi__, even during
      parser backtracking.
]
Here is the template prototype of the __class_lexertl_token__ template:

    template <
        typename Iterator = char const*,
        typename AttributeTypes = mpl::vector0<>,
        typename HasState = mpl::true_
    >
    struct lexertl_token;
[variablelist where:
    [[Iterator]       [This is the type of the iterator used to access the
                       underlying input stream. It defaults to a plain
                       `char const*`.]]
    [[AttributeTypes] [This is either an MPL sequence containing all
                       attribute types used for the token definitions or the
                       type `omit`. If the MPL sequence is empty (which is
                       the default), all token instances will store a
                       __boost_iterator_range__`<Iterator>` pointing to the
                       start and the end of the matched section in the input
                       stream. If the type is `omit`, the generated tokens
                       will contain no token value (attribute) at all.]]
    [[HasState]       [This is either `mpl::true_` or `mpl::false_`,
                       controlling whether the generated token instances will
                       contain the lexer state they were generated in. The
                       default is `mpl::true_`, so all token instances will
                       contain the lexer state.]]
]
During construction, a token instance always holds a __boost_iterator_range__
as its token value, unless it has been defined using the `omit` token value
type. This iterator range is then converted in place to the requested token
value type (attribute) when it is requested for the first time.
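
For example, a token type matching the description above may be defined as
shown below. The typedef names are arbitrary, and the sketch assumes the
`lex::lexertl::token` spelling of __class_lexertl_token__:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/mpl/vector.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;

    // tokens may carry a std::string or an unsigned int as their value and
    // remember the lexer state they were matched in (mpl::true_)
    typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string, unsigned int>,
        boost::mpl::true_
    > token_type;

    // the lexer type generating tokens of the type defined above
    typedef lex::lexertl::lexer<token_type> lexer_type;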
[heading The Physiognomy of a Token Definition]

The token definitions (represented by the __class_token_def__ template) are
normally used as part of the definition of the lexical analyzer. At the same
time a token definition instance may be used as a parser component in __qi__.

The template prototype of this class is shown here:

    template <
        typename Attribute = unused_type,
        typename Char = char
    >
    class token_def;
[variablelist where:
    [[Attribute] [This is the type of the token value (attribute)
                  supported by token instances representing this token
                  type. This attribute type is exposed to the __qi__
                  library whenever this token definition is used as a
                  parser component. The default attribute type is
                  `unused_type`, which means the token instance holds a
                  __boost_iterator_range__ pointing to the start
                  and the end of the matched section in the input stream.
                  If the attribute is `omit`, the token instance will
                  expose no token value at all. Any other type will be
                  used directly as the token value type.]]
    [[Char]      [This is the value type of the iterator for the
                  underlying input sequence. It defaults to `char`.]]
]
The semantics of the template parameters for the token type and the token
definition type are very similar and interdependent. As a rule of thumb, you
can think of the token definition type as the means of specifying everything
related to a single specific token type (such as `identifier` or `integer`).
On the other hand, the token type is used to define the general properties of
all token instances generated by the __lex__ library.
[important If you don't list any token value types in the token type
           declaration (resulting in the usage of the default
           __boost_iterator_range__ token value type), everything will compile
           and work just fine, albeit a bit less efficiently. This is because
           the token value will be converted from the matched input sequence
           every time it is requested.

           But as soon as you specify at least one token value type while
           defining the token type, you'll have to list /all/ of the value
           types used in your __class_token_def__ declarations there as well;
           otherwise compilation errors will occur.
]
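
A short sketch of this requirement, using illustrative names only: every
attribute type appearing in a __class_token_def__ declaration must also appear
in the MPL sequence supplied to the token type.

    // both value types used below are listed for the token type ...
    typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string, unsigned int>
    > token_type;

    // ... because the token definitions expose exactly these types
    lex::token_def<std::string>  identifier;   // ok: std::string is listed
    lex::token_def<unsigned int> constant;     // ok: unsigned int is listed
    //lex::token_def<double>     real;         // error: double is not listed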
[heading Examples of using __class_lexertl_token__]

Let's start with some examples. We refer to one of the __lex__ examples (for
the full source code of this example please see
[@../../example/lex/example4.cpp example4.cpp]).

[import ../example/lex/example4.cpp]

The first code snippet shows an excerpt of the token definition class: the
definition of a couple of token types. Some of the token types do not expose a
special token value (`if_`, `else_`, and `while_`); their token value will
always hold the iterator range of the matched input sequence. The token
definitions for the `identifier` and the integer `constant` are specialized
to expose an explicit token value type each: `std::string` and `unsigned int`.

[example4_token_def]

As the parsers generated by __qi__ are fully attributed, any __qi__ parser
component needs to expose a certain type as its parser attribute. Naturally,
the __class_token_def__ exposes the token value type as its parser attribute,
enabling a smooth integration with __qi__.

The next code snippet demonstrates how the required token value types are
specified while defining the token type to use. All of the token value types
used for at least one of the token definitions have to be listed when defining
the token type as well.

[example4_token]

To prevent a token from carrying any token value at all, the special tag
`omit` can be used: `token_def<omit>` and
`lexertl_token<base_iterator_type, omit>`.
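
The following sketch shows both forms together, assuming the
`lex::lexertl::token` and `lex::omit` spellings of the current library
interface:

    // tokens carrying no value at all, and no lexer state either
    typedef lex::lexertl::token<
        base_iterator_type, lex::omit, boost::mpl::false_
    > token_type;

    // a token definition exposing no token value
    lex::token_def<lex::omit> if_;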
[endsect]