[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_introduction Introduction to __lex__]

Lexical scanning is the process of analyzing the stream of input characters
and separating it into strings called tokens, which are most often delimited
by whitespace. Most compiler texts start here, and devote several chapters to
discussing various ways to build scanners. __lex__ is a library built to take
care of the complexities of creating a lexer for your grammar (in this
documentation we will use the terms 'lexical analyzer', 'lexer' and 'scanner'
interchangeably). All that is needed to create a lexer is to know the set of
patterns describing the different tokens you want to recognize in the input.
To make this a bit more formal, here are some definitions:

* A token is a sequence of consecutive characters having a collective meaning.
  Tokens may have attributes specific to the token type, carrying additional
  information about the matched character sequence.
* A pattern is a rule, expressed as a regular expression, describing how a
  particular token can be formed. For example, [^\[A-Za-z_\]\[A-Za-z_0-9\]*] is
  a pattern for a rule matching C++ identifiers (a sketch of how such a pattern
  is written with __lex__ follows this list).
* Characters between tokens are called whitespace; these include spaces, tabs,
  newlines, and formfeeds. Many people also count comments as whitespace,
  though since some tools such as lint look at comments, this method is not
  perfect.
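
Here is that sketch: a minimal, hypothetical token definition (the variable
name [^identifier] is invented for illustration) expressing the identifier
pattern above with __lex__; the [^std::string] template argument is the
attribute type exposed for the matched text:

    #include <boost/spirit/include/lex_lexertl.hpp>

    #include <string>

    namespace lex = boost::spirit::lex;

    // The identifier pattern from above as a token definition; the matched
    // character sequence will be exposed as a std::string attribute.
    lex::token_def<std::string> identifier("[A-Za-z_][A-Za-z_0-9]*");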

[heading Why Use a Separate Lexer?]

Typically, lexical scanning is done in a separate module from the parser,
feeding the parser with a stream of input tokens only. Strictly speaking, it
is not necessary to implement this separation: in the end there is only one
set of syntactical rules defining the language, so in theory we could write
the whole parser in one module. In fact, __qi__ allows you to write parsers
without using a lexer, parsing the input character stream directly, and for
the most part this is the way __spirit__ has been used since its invention.

However, this separation has both a practical and a theoretical basis, and it
proves to be very useful in practical applications. In 1956, Noam Chomsky
defined the "Chomsky Hierarchy" of grammars:

* Type 0: Unrestricted grammars (e.g., natural languages)
* Type 1: Context-Sensitive grammars
* Type 2: Context-Free grammars
* Type 3: Regular grammars

The complexity of these grammars increases from regular grammars being the
simplest to unrestricted grammars being the most complex. Similarly, the
complexity of pattern recognition for these grammars increases. Although a
few features of some programming languages (such as C++) are Type 1,
fortunately programming languages can for the most part be described using
only Types 2 and 3. The neat part about these two types is that they are well
known and the ways to parse them are well understood. It has been shown that
any regular grammar can be parsed using a state machine (finite automaton).
Similarly, context-free grammars can always be parsed using a push-down
automaton (essentially a state machine augmented by a stack).

In real programming languages and practical grammars, the parts that can be
handled as regular expressions tend to be the lower-level pieces, such as the
definition of an identifier or of an integer value:

    letter     := [a-zA-Z]
    digit      := [0-9]

    identifier := letter [ letter | digit ]*
    integer    := digit+
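
These lower-level pieces map directly onto __lex__ token definitions. The
following is a minimal sketch (the token names and ids are invented for
illustration) showing how the identifier and integer rules above could be
added to a lexer:

    #include <boost/spirit/include/lex_lexertl.hpp>

    namespace lex = boost::spirit::lex;

    enum token_ids { ID_IDENTIFIER = lex::min_token_id + 1, ID_INTEGER };

    // A lexer holding the two regular-expression based token definitions
    // from above; 'self' refers to the lexer's initial state.
    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
        {
            this->self.add
                ("[a-zA-Z_][a-zA-Z0-9_]*", ID_IDENTIFIER)
                ("[0-9]+",                 ID_INTEGER)
            ;
        }
    };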

Higher level parts of practical grammars tend to be more complex and can't be
implemented using plain regular expressions. We need to store information on
the built-in hardware stack while recursing the grammar hierarchy, and that is
the preferred approach used for top-down parsing. Since it takes a different
kind of abstract machine to parse the two types of grammars, it proved to be
efficient to move lexical scanning into a separate module which is built
around the idea of a state machine. The goal here is to use the simplest
parsing technique needed for the job.

Another, more practical, reason for separating the scanner from the parser is
the need for backtracking during parsing. The input data is a stream of
characters, which is often thought to be processed left to right without any
backtracking. Unfortunately, in practice most of the time that isn't possible.
Almost every language has certain keywords such as IF, FOR, and WHILE. The
decision whether a certain character sequence actually comprises a keyword or
just an identifier often can be made only after seeing the first delimiter
/after/ it. This effectively makes the process a backtracking one, since we
need to store the character sequence long enough to be able to make the
decision. The same is true for more coarse-grained language features such as
nested IF/ELSE statements, where the decision about which IF the last ELSE
belongs to can be made only after seeing the whole construct.

So the structure of a conventional compiler often involves splitting up the
functions of the lower-level and higher-level parsing. The lexical scanner
deals with things at the character level, collecting characters into strings,
converting character sequences into other representations such as integers,
etc., and passing them along to the parser proper as indivisible tokens. It's
also considered normal to let the scanner do additional jobs, such as
identifying keywords, storing identifiers in tables, etc.

Now, __spirit__ follows this structure, where __lex__ can be used to implement
state machine based recognizers, while __qi__ can be used to build recognizers
for context-free grammars. Since both modules are seamlessly integrated with
each other and with the C++ target language, it is even possible to use the
provided functionality to build more complex grammar recognizers.

[heading Advantages of using __lex__]

The advantage of using __lex__ to create the lexical analyzer, compared to
more traditional tools such as __flex__, is its carefully crafted integration
with the __spirit__ library and the C++ host language. You don't need any
external tools to generate the code; your lexer will be perfectly integrated
with the rest of your program, making it possible to freely access any
context information and data structures. Since the C++ compiler sees all the
code, it will generate optimal code no matter what configuration options have
been chosen by the user. __lex__ gives you the vast majority of features you
could get from a similar __flex__ program without the need to leave C++ as
the host language:

* The definition of tokens is done using regular expressions (patterns)
* The token definitions can refer to special substitution strings (pattern
  macros) simplifying pattern definitions
* The generated lexical scanner may have multiple start states
* It is possible to attach code to any of the token definitions; this code gets
  executed whenever the corresponding token pattern has been matched (see the
  sketch after this list)
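
As an illustration of two of these points (pattern macros and attached code),
the following sketch defines a pattern macro and attaches code to a token
definition. The names used here (the [^IDENT] macro, the [^identifier] token
and the [^count] member) are invented for illustration only:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/phoenix_core.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>

    #include <cstddef>
    #include <string>

    namespace lex = boost::spirit::lex;
    namespace phx = boost::phoenix;

    template <typename Lexer>
    struct counting_tokens : lex::lexer<Lexer>
    {
        counting_tokens()
          : identifier("{IDENT}"), count(0)
        {
            // A pattern macro: '{IDENT}' may now be used inside other
            // token definitions.
            this->self.add_pattern("IDENT", "[a-zA-Z_][a-zA-Z0-9_]*");

            // Attach code to the token definition: the counter is
            // incremented whenever an identifier is matched.
            this->self = identifier[++phx::ref(count)];
        }

        lex::token_def<std::string> identifier;
        std::size_t count;
    };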

Even though it is possible to use __lex__ to generate C++ code representing
the lexical analyzer (we will refer to that as the /static/ model, described
in more detail in the section __sec_lex_static_model__) - a model very similar
to the way __flex__ operates - we will mainly focus on the opposite, the
/dynamic/ model. You can directly integrate the token definitions into your
C++ program, building the lexical analyzer dynamically at runtime. The dynamic
model is something not supported by __flex__ or other lexical scanner
generators (such as __re2c__, __ragel__, etc.). This dynamic flexibility
allows you to speed up the development of your application.
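
To illustrate what "dynamically at runtime" buys you, here is a hedged sketch
(the keyword list and the token id are invented) in which the patterns added
to the lexer are ordinary C++ strings assembled from data that is only known
at runtime:

    #include <boost/spirit/include/lex_lexertl.hpp>

    #include <cstddef>
    #include <string>
    #include <vector>

    namespace lex = boost::spirit::lex;

    enum token_ids { ID_KEYWORD = lex::min_token_id + 1 };

    // The set of keywords is not known before runtime; each one is added
    // to the lexer as an ordinary token definition.
    template <typename Lexer>
    struct keyword_tokens : lex::lexer<Lexer>
    {
        explicit keyword_tokens(std::vector<std::string> const& keywords)
        {
            for (std::size_t i = 0; i < keywords.size(); ++i)
                this->self.add(keywords[i], ID_KEYWORD);
        }
    };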

[heading The Library Structure of __lex__]

The [link spirit.lexerflow figure] below shows a high level overview of how
the __lex__ library might be used in an application. __lex__ allows you to
create lexical analyzers based on patterns. These patterns are regular
expression based rules used to define the different tokens to be recognized
in the character input sequence. The input sequence is expected to be
provided to the lexical analyzer as an arbitrary standard forward iterator.
The lexical analyzer itself exposes a standard forward iterator as well. The
difference here is that the exposed iterator provides access to the token
sequence instead of to the character sequence. The tokens in this sequence
are constructed on the fly by analyzing the underlying character sequence
and matching it against the patterns defined by the application.

[fig lexerflow.png..The Library structure and Common Flow of Information while using __lex__ in an application..spirit.lexerflow]
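
The following is a rough, self-contained sketch of this flow, closely modeled
on the library's word counting examples but with invented token names and no
error handling: a small lexer is fed a pair of character iterators and the
resulting token iterator is used to walk the token sequence.

    #include <boost/spirit/include/lex_lexertl.hpp>

    #include <cstddef>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    enum token_ids { ID_WORD = lex::min_token_id + 1, ID_WS };

    // A minimal lexer recognizing words and whitespace only.
    template <typename Lexer>
    struct word_tokens : lex::lexer<Lexer>
    {
        word_tokens()
        {
            this->self.add("[a-zA-Z]+", ID_WORD)("[ \t\n]+", ID_WS);
        }
    };

    int main()
    {
        // The lexer consumes a pair of forward iterators (plain pointers
        // here) and exposes a forward iterator over tokens in return.
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        word_tokens<lexer_type> tokens;

        std::string input("one two three");
        char const* first = input.c_str();
        char const* last = &first[input.size()];

        lexer_type::iterator_type iter = tokens.begin(first, last);
        lexer_type::iterator_type end = tokens.end();

        // The tokens are constructed on the fly while the iterator advances.
        std::size_t words = 0;
        for (/**/; iter != end && token_is_valid(*iter); ++iter)
        {
            if ((*iter).id() == ID_WORD)
                ++words;
        }

        std::cout << words << " words\n";
        return 0;
    }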

[endsect]