
[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]
[section:lexer_quickstart3 Quickstart 3 - Counting Words Using a Parser]

The whole purpose of integrating __lex__ as part of the __spirit__ library was
to add a library allowing the merging of lexical analysis with the parsing
process as defined by a __spirit__ grammar. __spirit__ parsers read their
input from an input sequence accessed by iterators, so naturally we chose
iterators as the interface between the lexer and the parser. A second goal of
the lexer/parser integration was to enable the use of different lexical
analyzer libraries. The utilization of iterators seemed to be the right choice
from this standpoint as well, mainly because they can be used as an
abstraction layer hiding the implementation specifics of the underlying lexer
library. The [link spirit.lex.flowcontrol picture] below shows the common flow
of control implemented while parsing combined with lexical analysis.

[fig flowofcontrol.png..The common flow of control implemented while parsing combined with lexical analysis..spirit.lex.flowcontrol]
Another problem related to the integration of the lexical analyzer with the
parser was to find a way to blend the defined tokens syntactically with the
grammar definition syntax of __spirit__. For tokens defined as instances of
the `token_def<>` class, the most natural way of integration was to allow
their direct use as parser components. Semantically, these parser components
succeed in matching their input whenever the corresponding token type has been
matched by the lexer. This quickstart example will demonstrate this (and more)
by counting words again, simply by adding up the numbers inside the semantic
actions of a parser (for the full example code see here:
[@../../example/lex/word_count.cpp word_count.cpp]).

[import ../example/lex/word_count.cpp]
[heading Prerequisites]

This example uses two of the __spirit__ library components: __lex__ and
__qi__, so we have to `#include` the corresponding header files. Again, we
need to include a couple of header files from the __phoenix__ library. This
example shows how to attach functors to parser components, which could be
done using any C++ technique resulting in a callable object. Using __phoenix__
for this task simplifies things and avoids adding dependencies on other
libraries (__phoenix__ is already in use for __spirit__ anyway).

[wcp_includes]

To make all the code below more readable, we introduce the following
namespaces.

[wcp_namespaces]
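
For reference, the includes and namespace aliases referenced by the two
snippets above typically boil down to something like the following sketch (the
exact set of headers used by word_count.cpp may differ slightly):

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <boost/spirit/include/phoenix_statement.hpp>
    #include <boost/spirit/include/phoenix_container.hpp>

    namespace qi  = boost::spirit::qi;
    namespace lex = boost::spirit::lex;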
[heading Defining Tokens]

Compared to the two previous quickstart examples (__sec_lex_quickstart_1__
and __sec_lex_quickstart_2__), the token definition class for this example
does not reveal any surprises. However, it uses lexer token definition macros
to simplify the composition of the regular expressions, which will be
described in more detail in the section __fixme__. Generally, any token
definition is usable without modification either from a stand-alone lexical
analyzer or in conjunction with a parser.

[wcp_token_definition]
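
As a rough sketch (the names and the explicit `ID_ANY` id are illustrative,
assuming the includes and namespace aliases shown above), a token definition
class using all three styles could look like this:

    enum token_ids
    {
        ID_ANY = lex::min_token_id + 1
    };

    template <typename Lexer>
    struct word_count_tokens : lex::lexer<Lexer>
    {
        word_count_tokens()
        {
            // a named pattern (lexer macro) usable in token definitions
            this->self.add_pattern("WORD", "[^ \t\n]+");

            // token_def<>: the matched text becomes the token's value
            word = "{WORD}";

            this->self.add
                (word)          // token_def<> instance, id assigned automatically
                ('\n')          // single character, id is the ASCII value
                (".", ID_ANY)   // arbitrary regex with an explicitly given id
            ;
        }

        // exposes the matched input as a std::string attribute
        lex::token_def<std::string> word;
    };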
[heading Using Token Definition Instances as Parsers]

While the integration of lexer and parser in the flow of control is achieved
by using special iterators wrapping the lexical analyzer, we still need a
means of expressing in the grammar what tokens to match and where. The token
definition class above uses three different ways of defining a token:

* Using an instance of a `token_def<>`, which is handy whenever you need to
  specify a token attribute (for more information about lexer related
  attributes please look here: __sec_lex_attributes__).
* Using a single character as the token; in this case the character represents
  itself as a token, where the token id is the ASCII character value.
* Using a regular expression represented as a string, where the token id needs
  to be specified explicitly to make the token accessible from the grammar
  level.

Each of these token definition methods requires a different method of grammar
integration. But as you can see from the code snippet below, each of these
methods is straightforward and blends the corresponding token instances
naturally with the surrounding __qi__ grammar syntax.
[table
    [[Token definition]   [Parser integration]]
    [[`token_def<>`]      [The `token_def<>` instance is directly usable as a
                           parser component. Parsing of this component will
                           succeed if the regular expression used to define
                           this has been matched successfully.]]
    [[single character]   [The single character is directly usable in the
                           grammar. However, under certain circumstances it
                           needs to be wrapped by a `char_()` parser component.
                           Parsing of this component will succeed if the
                           single character has been matched.]]
    [[explicit token id]  [To use an explicit token id in a __qi__ grammar you
                           are required to wrap it with the special `token()`
                           parser component. Parsing of this component will
                           succeed if the current token has the same token id
                           as specified in the expression `token(<id>)`.]]
]
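
Put together, a grammar combining all three integration forms might look
roughly like the following sketch (a hypothetical rendering, assuming the
token class sketched above; the actual grammar used by the example is shown
below):

    template <typename Iterator>
    struct word_count_grammar : qi::grammar<Iterator>
    {
        template <typename TokenDef>
        word_count_grammar(TokenDef const& tok)
          : word_count_grammar::base_type(start)
          , c(0), w(0), l(0)
        {
            using boost::phoenix::ref;
            using boost::phoenix::size;

            start = *(  tok.word            // token_def<> used directly
                        [ ++ref(w), ref(c) += size(qi::_1) ]
                     |  qi::lit('\n')       // single character token
                        [ ++ref(c), ++ref(l) ]
                     |  qi::token(ID_ANY)   // explicit token id
                        [ ++ref(c) ]
                     );
        }

        std::size_t c, w, l;        // character, word, and line counters
        qi::rule<Iterator> start;
    };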
The grammar definition below uses each of the three methods, demonstrating
their usage.

[wcp_grammar_definition]
As already described (see: __sec_attributes__), the __qi__ parser library
builds upon a set of fully attributed parser components. Consequently, all
token definitions support this attribute model as well. The most natural way
of implementing this was to use the token values as the attributes exposed by
the parser component corresponding to the token definition (you can read more
about this topic here: __sec_lex_tokenvalues__). The example above takes
advantage of the full integration of the token values as the `token_def<>`'s
parser attributes: the `word` token definition is declared as a
`token_def<std::string>`, making every instance of a `word` token carry the
string representation of the matched input sequence as its value. The
semantic action attached to `tok.word` receives this string (represented by
the `_1` placeholder) and uses it to calculate the number of matched
characters: `ref(c) += size(_1)`.
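
To see what this __phoenix__ expression does in isolation, the following
standalone snippet (an illustration only, not part of the example) applies the
same kind of deferred function object to a plain string:

    #include <iostream>
    #include <string>
    #include <boost/spirit/include/phoenix_core.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>
    #include <boost/spirit/include/phoenix_container.hpp>

    int main()
    {
        using boost::phoenix::ref;
        using boost::phoenix::size;
        using boost::phoenix::arg_names::arg1;

        std::size_t c = 0;
        std::string word("hello");

        // the same kind of functor Spirit invokes for a matched 'word'
        // token: add the length of the matched string to the counter
        (ref(c) += size(arg1))(word);

        std::cout << c << std::endl;    // prints 5
    }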
[heading Pulling Everything Together]

The main function needs to implement a bit more logic now, as we have to
initialize and start not only the lexical analysis but the parsing process as
well. The three type definitions (`typedef` statements) simplify the creation
of the lexical analyzer and the grammar. After reading the contents of the
given file into memory, the code calls the function __api_tokenize_and_parse__
to initialize the lexical analysis and parsing processes.

[wcp_main]
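
For orientation, the driver shown above reduces to roughly the following
sketch (the `read_from_file()` helper is hypothetical; the exact code is in
word_count.cpp):

    int main(int argc, char* argv[])
    {
        // the token type: stores iterators into the input and may carry a
        // std::string value (needed for the 'word' token)
        typedef lex::lexertl::token<
            char const*, boost::mpl::vector<std::string>
        > token_type;

        // the lexer type, based on the lexertl engine
        typedef lex::lexertl::lexer<token_type> lexer_type;

        // the iterator type exposed by the lexer and consumed by the parser
        typedef word_count_tokens<lexer_type>::iterator_type iterator_type;

        word_count_tokens<lexer_type> word_count;           // our lexer
        word_count_grammar<iterator_type> g(word_count);    // our grammar

        // read_from_file() is a hypothetical helper loading the input file
        std::string str(read_from_file(argv[1]));
        char const* first = str.c_str();
        char const* last = first + str.size();

        // tokenize the input and feed the resulting token stream to the parser
        bool r = lex::tokenize_and_parse(first, last, word_count, g);

        if (r)
            std::cout << "lines: " << g.l << ", words: " << g.w
                      << ", characters: " << g.c << "\n";
        return r ? 0 : 1;
    }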
[endsect]