// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen
//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//

/*!
\page boundary_analysys Boundary analysis

- \ref boundary_analysys_basics
- \ref boundary_analysys_segments
    - \ref boundary_analysys_segments_basics
    - \ref boundary_analysys_segments_rules
    - \ref boundary_analysys_segments_search
- \ref boundary_analysys_break
    - \ref boundary_analysys_break_basics
    - \ref boundary_analysys_break_rules
    - \ref boundary_analysys_break_search

\section boundary_analysys_basics Basics

Boost.Locale provides a boundary analysis tool that lets you split text into characters,
words, or sentences, and find appropriate places for line breaks.

\note This is not a trivial task.
\par
A Unicode code point and a character are not equivalent. For example, the
Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 code points (4 base letters and 2 diacritical marks).
\par
In some languages, such as Japanese or Chinese, words may not be separated by space characters.
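
To make the difference between code points and characters concrete, here is a minimal sketch that uses only the
standard library, not Boost.Locale. It counts the code points of a UTF-8 string. The helper \c count_code_points and
the variable \c shalom are names made up for this illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
// Assumption of this sketch: the input is valid UTF-8.
std::size_t count_code_points(std::string const &utf8)
{
    std::size_t n = 0;
    for(unsigned char c : utf8) {
        if((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// The word "Shalom" in UTF-8, written with escapes for the code points
// U+05E9 U+05B8 U+05DC U+05D5 U+05B9 U+05DD (4 letters, 2 marks).
std::string const shalom = "\xD7\xA9\xD6\xB8\xD7\x9C\xD7\x95\xD6\xB9\xD7\x9D";
```

For this string \c shalom.size() is 12 bytes and \c count_code_points(shalom) is 6. Neither number is the 4
characters a reader sees; obtaining that count requires the character boundary analysis described on this page.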

Boost.Locale provides 2 major classes for boundary analysis:

- \ref boost::locale::boundary::segment_index - an object that holds an index of segments in the text (such as words, characters,
  or sentences). It provides access to \ref boost::locale::boundary::segment "segment" objects via iterators.
- \ref boost::locale::boundary::boundary_point_index - an object that holds an index of boundary points in the text.
  It allows iterating over \ref boost::locale::boundary::boundary_point "boundary_point" objects.

Each of these classes takes an iterator type as its template parameter.
Both classes accept the following constructor arguments:

- A \ref boost::locale::boundary::boundary_type "boundary_type" flag that defines the kind of boundary analysis.
- The pair of iterators that define the text range to be analysed.
- A locale parameter (if not given, the global locale is used).

For example:

\code
namespace ba=boost::locale::boundary;
std::string text= ... ;
std::locale loc = ... ;
ba::segment_index<std::string::const_iterator> map(ba::word,text.begin(),text.end(),loc);
\endcode

Each of them provides the member functions \c begin(), \c end() and \c find(), which allow you to iterate
over the selected segments or boundaries in the text, or to find the location of a segment or
boundary for a given iterator.

Convenience typedefs such as \ref boost::locale::boundary::ssegment_index "ssegment_index"
or \ref boost::locale::boundary::wcboundary_point_index "wcboundary_point_index" are provided as well,
where the "w", "u16" and "u32" prefixes select the character types \c wchar_t,
\c char16_t and \c char32_t, and the "s" and "c" prefixes select whether <tt>std::basic_string<CharType>::const_iterator</tt>
or <tt>CharType const *</tt> is used.

\section boundary_analysys_segments Iterating Over Segments
\subsection boundary_analysys_segments_basics Basic Iteration

Text segment analysis is done using the \ref boost::locale::boundary::segment_index "segment_index" class.

It provides a bidirectional iterator that returns \ref boost::locale::boundary::segment "segment" objects.
A segment object represents a pair of iterators that delimit the segment, together with the rule according to which it was selected.
It can be automatically converted to a \c std::basic_string object.

To perform boundary analysis, we first create an index object and then iterate over it.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using the en_US.UTF-8 locale.
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
\endverbatim

The sentence "生きるか死ぬか、それが問題だ。" (<a href="http://tatoeba.org/eng/sentences/show/868189">from the Tatoeba database</a>)
would be split into the following segments in the \c ja_JP.UTF-8 (Japanese) locale:

\verbatim
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
\endverbatim

The boundary analysis done by Boost.Locale
is much more complicated than just splitting the text at
white space characters, even though it is not perfect.

\subsection boundary_analysys_segments_rules Using Rules

Segment selection can be customized using the \ref boost::locale::boundary::segment_index::rule(rule_type) "rule()" and
\ref boost::locale::boundary::segment_index::full_select(bool) "full_select()" member functions.

By default, segment_index's iterator returns every text segment defined by two boundary points, regardless of
the way they were selected. Thus, in the example above, text segments like "." or " "
were selected as words.

Using the \c rule() member function we can specify a binary mask of the rules we want to use for selecting
boundary points, using the \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" boundary rules.

For example, by calling

\code
map.rule(word_any);
\endcode

before starting the iteration, we specify a selection mask that fetches numbers, letters, Kana letters and
ideographic characters, ignoring all non-word characters such as white space or punctuation marks.

So the code:

\code
using namespace boost::locale::boundary;
std::string text="To be or not to be, that is the question.";
// Create mapping of text for token iterator using global locale.
ssegment_index map(word,text.begin(),text.end());
// Define a rule
map.rule(word_any);
// Print all "words" -- chunks of word boundary
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout <<"\""<< * it << "\", ";
std::cout << std::endl;
\endcode

Would print:

\verbatim
"To", "be", "or", "not", "to", "be", "that", "is", "the", "question",
\endverbatim

And for the given text="生きるか死ぬか、それが問題だ。" and rule(\ref boost::locale::boundary::word_ideo "word_ideo"), the example above would print:

\verbatim
"生", "死", "問題",
\endverbatim

You can find out which rules a segment was selected by using the \ref boost::locale::boundary::segment::rule() "segment::rule()" member
function and testing its result against a bit-mask of rules.

For example:

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text="生きるか死ぬか、それが問題だ。";
ssegment_index map(word,text.begin(),text.end(),gen("ja_JP.UTF-8"));
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it) {
    std::cout << "Segment " << *it << " contains: ";
    if(it->rule() & word_none)
        std::cout << "white space or punctuation marks ";
    if(it->rule() & word_kana)
        std::cout << "kana characters ";
    if(it->rule() & word_ideo)
        std::cout << "ideographic characters";
    std::cout<< std::endl;
}
\endcode

Would print:

\verbatim
Segment 生 contains: ideographic characters
Segment きるか contains: kana characters
Segment 死 contains: ideographic characters
Segment ぬか contains: kana characters
Segment 、 contains: white space or punctuation marks
Segment それが contains: kana characters
Segment 問題 contains: ideographic characters
Segment だ contains: kana characters
Segment 。 contains: white space or punctuation marks
\endverbatim
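
The way a rule mask selects segments can be sketched without Boost.Locale. The example below is only a model of
the idea: the constants \c demo_none, \c demo_kana and \c demo_ideo, the type \c demo_segment and the function
\c select_segments are hypothetical names, and the bit values are not the ones Boost.Locale uses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical rule bits, for illustration only. The real values are
// defined by Boost.Locale and should not be hard-coded.
unsigned const demo_none = 0x1; // white space or punctuation
unsigned const demo_kana = 0x2; // kana characters
unsigned const demo_ideo = 0x4; // ideographic characters

struct demo_segment {
    std::string text;
    unsigned rule; // rule of the boundary that ends this segment
};

// Keep only the segments whose rule shares at least one bit with the mask.
std::vector<std::string> select_segments(std::vector<demo_segment> const &segs,
                                         unsigned mask)
{
    std::vector<std::string> out;
    for(demo_segment const &s : segs) {
        if(s.rule & mask)
            out.push_back(s.text);
    }
    return out;
}
```

Masks combine with bitwise OR, so in this model \c select_segments(segs, demo_kana | demo_ideo) keeps kana and
ideographic segments and drops punctuation.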

One important thing to note is that each segment is defined
by a pair of boundaries, and the rule of its ending point determines
whether it is selected or not.

In some cases this may not be what we are actually looking for.

For example, we have the text:

\verbatim
Hello! How
are you?
\endverbatim

And we want to fetch all sentences from the text.

The \ref bl_boundary_sentence_rules "sentence rules" have two options:

- Split the text at the point where a sentence terminator like ".!?" is detected: \ref boost::locale::boundary::sentence_term "sentence_term"
- Split the text at the point where a sentence separator like "line feed" is detected: \ref boost::locale::boundary::sentence_sep "sentence_sep"

Naturally, to ignore sentence separators we would call \ref boost::locale::boundary::segment_index::rule(rule_type v) "segment_index::rule(rule_type v)"
with the sentence_term parameter and then run the iterator.

\code
boost::locale::generator gen;
using namespace boost::locale::boundary;
std::string text=   "Hello! How\n"
                    "are you?\n";
ssegment_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));
map.rule(sentence_term);
for(ssegment_index::iterator it=map.begin(),e=map.end();it!=e;++it)
    std::cout << "Sentence [" << *it << "]" << std::endl;
\endcode

However, we would not get the expected segments:

\verbatim
Sentence [Hello! ]
Sentence [are you?
]
\endverbatim

The reason is that "How\n" is still considered a sentence, but it is selected by a different
rule.

This behavior can be changed by setting \ref boost::locale::boundary::segment_index::full_select(bool) "segment_index::full_select(bool)"
to \c true. It forces the iterator to join the current segment with all preceding segments that do not fit the required rule.

So we add this line:

\code
map.full_select(true);
\endcode

right after "map.rule(sentence_term);" and get the expected output:

\verbatim
Sentence [Hello! ]
Sentence [How
are you?
]
\endverbatim
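
The effect of \c full_select can be modelled in a few lines of standard C++. This is a sketch of the behavior
described above, not the library's implementation: \c demo_term, \c demo_sep, \c demo_sentence and
\c select_sentences are hypothetical names, and the sketch assumes that non-matching segments at the very end of
the text are simply dropped.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical rule bits for illustration; not Boost.Locale's values.
unsigned const demo_term = 0x1; // segment ended by a terminator like ".!?"
unsigned const demo_sep  = 0x2; // segment ended by a separator like "\n"

struct demo_sentence {
    std::string text;
    unsigned rule; // rule of the boundary that ends this segment
};

// Return the selected segments. With full == false, segments that do not
// match the mask are skipped. With full == true, they are joined to the
// front of the next matching segment instead.
std::vector<std::string> select_sentences(std::vector<demo_sentence> const &segs,
                                          unsigned mask, bool full)
{
    std::vector<std::string> out;
    std::string pending; // text of skipped segments since the last match
    for(demo_sentence const &s : segs) {
        if(s.rule & mask) {
            out.push_back(full ? pending + s.text : s.text);
            pending.clear();
        }
        else {
            pending += s.text;
        }
    }
    return out; // assumption: trailing non-matching text is dropped
}
```

Given the three segments "Hello! " (terminator), "How\n" (separator) and "are you?\n" (terminator), the model
returns two segments in both modes, but only with \c full set to \c true does the second one contain "How\n".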

\subsection boundary_analysys_segments_search Locating Segments

Sometimes it is useful to find the segment that a specific iterator points to.

For example, if a user has clicked at a specific point, we may want to select the word at this
location.

\ref boost::locale::boundary::segment_index "segment_index" provides the
\ref boost::locale::boundary::segment_index::find() "find(base_iterator p)"
member function for this purpose.

This function returns an iterator to the segment that \a p points to.

For example:

\code
boost::locale::generator gen;
std::string text="to be or ";
ssegment_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
ssegment_index::iterator p = map.find(text.begin() + 4);
if(p!=map.end())
    std::cout << *p << std::endl;
\endcode

Would print:

\verbatim
be
\endverbatim

\note
If the iterator lies inside a segment, that segment is returned. If the segment does
not fit the selection rules, then the segment following the requested position
is returned.

For example, for \ref boost::locale::boundary::word "word" boundary analysis with the \ref boost::locale::boundary::word_any "word_any" rule:

- "t|o be or " would point to "to" - the iterator is in the middle of the segment "to".
- "to |be or " would point to "be" - the iterator is at the beginning of the segment "be".
- "to| be or " would point to "be" - the iterator does not point to a segment with the required rule, so the next valid segment, "be", is selected.
- "to be or| " would point to the end, as no valid segment is found.
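
The four cases above follow one simple rule: return the first selected segment that ends after the requested
position. The sketch below models that rule on plain offsets, using only the standard library; the type
\c demo_word_segment and the function \c find_segment are hypothetical and are not part of Boost.Locale.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct demo_word_segment {
    std::size_t begin; // offset of the first character
    std::size_t end;   // offset one past the last character
    bool selected;     // whether the segment fits the selection rule
};

// Return the index of the selected segment that contains offset pos, or of
// the first selected segment after it. The segments must be ordered by
// position; segs.size() plays the role of end().
std::size_t find_segment(std::vector<demo_word_segment> const &segs,
                         std::size_t pos)
{
    for(std::size_t i = 0; i < segs.size(); ++i) {
        if(segs[i].selected && pos < segs[i].end)
            return i;
    }
    return segs.size();
}
```

For "to be or " the segments are "to" at [0,2), "be" at [3,5) and "or" at [6,8), with unselected spaces in
between. In this model positions 2 and 3 both resolve to "be", and position 8 resolves to the end.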

\section boundary_analysys_break Iterating Over Boundary Points
\subsection boundary_analysys_break_basics Basic Iteration

The \ref boost::locale::boundary::boundary_point_index "boundary_point_index" is similar to
\ref boost::locale::boundary::segment_index "segment_index" in its interface but has a different role.
Instead of returning text chunks (\ref boost::locale::boundary::segment "segment"s), it returns
\ref boost::locale::boundary::boundary_point "boundary_point" objects that
represent a position in the text - a base iterator that is used for
iterating over the characters of the source text.
The \ref boost::locale::boundary::boundary_point "boundary_point" object
also provides a \ref boost::locale::boundary::boundary_point::rule() "rule()" member
function that returns the rule according to which this boundary was selected.

\note The beginning and the end of the text are considered boundary points, so even
an empty text contains at least one boundary point.

Let's see an example of selecting the first two sentences from a text:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;

// our text sample
std::string const text="First sentence. Second sentence! Third one?";
// Create an index
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

// Count two boundary points
sboundary_point_index::iterator p = map.begin(),e=map.end();
int count = 0;
while(p!=e && count < 2) {
    ++count;
    ++p;
}

if(p!=e) {
    std::cout   << "First two sentences are: "
                << std::string(text.begin(),p->iterator())
                << std::endl;
}
else {
    std::cout   <<"There are fewer than two sentences in this "
                <<"text: " << text << std::endl;
}
\endcode

Would print:

\verbatim
First two sentences are: First sentence. Second sentence!
\endverbatim

\subsection boundary_analysys_break_rules Using Rules

Similarly to \ref boost::locale::boundary::segment_index "segment_index",
\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
a \ref boost::locale::boundary::boundary_point_index::rule(rule_type r) "rule(rule_type mask)"
member function to filter the boundary points that interest us.

It allows setting \ref bl_boundary_word_rules "word", \ref bl_boundary_line_rules "line"
and \ref bl_boundary_sentence_rules "sentence" rules for filtering boundary points.

Let's change the example above a little:

\code
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
\endcode

If we run our program as is on the sample above, we would get:

\verbatim
First two sentences are: First sentence. Second
\endverbatim

This is not what we really expected, because "Second\n"
is considered an independent sentence that was separated by
a line separator, "Line Feed".

However, we can set the rule \ref boost::locale::boundary::sentence_term "sentence_term"
so that the iterator uses only boundary points that are created
by sentence terminators like ".!?".

So by adding:

\code
map.rule(sentence_term);
\endcode

right after the generation of the index, we would get the desired output:

\verbatim
First two sentences are: First sentence. Second
sentence!
\endverbatim

You can also use the \ref boost::locale::boundary::boundary_point::rule() "boundary_point::rule()" member
function to learn why a boundary point was created, by comparing it with an appropriate
mask.

For example:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "First sentence. Second\n"
                        "sentence! Third one?";
sboundary_point_index map(sentence,text.begin(),text.end(),gen("en_US.UTF-8"));

for(sboundary_point_index::iterator p = map.begin(),e=map.end();p!=e;++p) {
    if(p->rule() & sentence_term)
        std::cout << "There is a sentence terminator: ";
    else if(p->rule() & sentence_sep)
        std::cout << "There is a sentence separator: ";
    if(p->rule()!=0) // print if some rule exists
        std::cout   << "[" << std::string(text.begin(),p->iterator())
                    << "|" << std::string(p->iterator(),text.end())
                    << "]\n";
}
\endcode

Would give the following output:

\verbatim
There is a sentence terminator: [First sentence. |Second
sentence! Third one?]
There is a sentence separator: [First sentence. Second
|sentence! Third one?]
There is a sentence terminator: [First sentence. Second
sentence! |Third one?]
There is a sentence terminator: [First sentence. Second
sentence! Third one?|]
\endverbatim

\subsection boundary_analysys_break_search Locating Boundary Points

Sometimes it is useful to find a specific boundary point according to a given
iterator.

\ref boost::locale::boundary::boundary_point_index "boundary_point_index" provides
an \ref boost::locale::boundary::boundary_point_index::find() "iterator find(base_iterator p)" member
function.

It returns an iterator to the boundary point at \a p's location, or at the
location following it if \a p does not point to an appropriate position.

For example, for word boundary analysis:

- If a base iterator points to "to |be", then the returned boundary point would be "to |be" (same position)
- If a base iterator points to "t|o be", then the returned boundary point would be "to| be" (next valid position)
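
Conceptually this is a lower-bound search over the sorted boundary positions. The sketch below models it on plain
offsets with \c std::lower_bound; the function \c find_boundary is a hypothetical helper for illustration and says
nothing about how Boost.Locale implements \c find().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Return the index of the first boundary offset that is not before pos.
// The offsets must be sorted in ascending order. The result equals
// boundaries.size() only if pos lies beyond the last boundary.
std::size_t find_boundary(std::vector<std::size_t> const &boundaries,
                          std::size_t pos)
{
    std::vector<std::size_t>::const_iterator it =
        std::lower_bound(boundaries.begin(), boundaries.end(), pos);
    return static_cast<std::size_t>(it - boundaries.begin());
}
```

For the text "to be" with word boundaries at offsets 0, 2, 3 and 5, position 3 maps to the boundary at offset 3
(same position) and position 1 maps to the boundary at offset 2 (next valid position).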

For example, if we want to select 6 words around a specific boundary point, we can use the following code:

\code
using namespace boost::locale::boundary;
boost::locale::generator gen;
// our text sample
std::string const text= "To be or not to be, that is the question.";
// Create a mapping
sboundary_point_index map(word,text.begin(),text.end(),gen("en_US.UTF-8"));
// Ignore white space
map.rule(word_any);

// define our arbitrary point
std::string::const_iterator pos = text.begin() + 12; // "not|", just after "not"

// Get the search range
sboundary_point_index::iterator
    begin = map.begin(),
    end = map.end(),
    it = map.find(pos); // find a boundary

// go 3 words backward
for(int count = 0;count <3 && it!=begin; count ++)
    --it;

// Save the start
std::string::const_iterator start = *it;

// go 6 words forward
for(int count = 0;count < 6 && it!=end; count ++)
    ++it;

// make sure we are at a valid position
if(it==end)
    --it;

// print the text
std::cout << std::string(start,it->iterator()) << std::endl;
\endcode

That would print:

\verbatim
be or not to be, that
\endverbatim

*/