- [/===========================================================================
- Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
- Distributed under the Boost Software License, Version 1.0
- See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt
- =============================================================================/]
- [section:block_indirect_sort 3.1- block_indirect_sort]
- [section:block_description Description]
- [:
- [*BLOCK_INDIRECT_SORT] is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.
- The most important characteristics of this algorithm are its *speed* and *low memory consumption*.
- [*[teletype]
- ``
- | | | |
- Algorithm |Stable | Additional memory |Best, average, and worst case |
- ----------------------+-------+------------------------+------------------------------+
- block_indirect_sort | no |block_size * num_threads| N, N LogN, N LogN |
- | | | |
- ``
- ]
- The block_size is an internal parameter of the algorithm which, in order to achieve the
- highest speed, changes according to the size of the objects being sorted, as shown in the following table. Strings use a block_size of 128.
- [*[teletype]
- ``
- | | | | | | | |
- object size (bytes) | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 - |
- --------------------------------+--------+---------+---------+---------+---------+---------+----------+
- block_size (number of elements) | 4096 | 2048 | 1024 | 768 | 512 | 256 | 128 |
- | | | | | | | |
- ``
- ]
- ]
- [endsect]
- [br]
- [section:block_benchmark Benchmark]
- [:
- When sorting 100 000 000 64-bit numbers, the measured time and memory used were:
- [*[teletype]
- ``
- | | |
- Algorithm | Time (secs) | Memory used |
- ----------------------------------+-------------+-------------+
- Open MP Parallel Sort | 1.1990 | 1564 MB |
- Threading Building Blocks (TBB) | 1.6411 | 789 MB |
- Block Indirect Sort | 0.9270 | 790 MB |
- | | |
- ``
- ]
- ]
- [endsect]
- [br]
- [section:block_programming Programming]
- [:
- [h4[_Thread specification]]
- [:
- This algorithm takes an integer parameter indicating the *number of threads* to use in the sorting process,
- which is always the last value in the call. The default value (if left unspecified) is the number of HW threads of
- the machine where the program is running, as reported by std::thread::hardware_concurrency().
- If the number is 0 or 1, the algorithm runs with only 1 thread.
- The number of threads is not fixed, but is evaluated in each execution. The number of threads passed can be greater
- than the number of hardware threads on the machine: we can pass 100 threads on a machine with 4 HW threads,
- and in the same way we can pass an expression such as (std::thread::hardware_concurrency() / 4).
- If that expression evaluates to 0, the program is executed with 1 thread.
- ]
- [h4[_Programming]]
- [:
- You only need to include the file boost/sort/sort.hpp to use this algorithm.
- The algorithm runs in the namespace boost::sort.
- [c++]
- ``
- #include <boost/sort/sort.hpp>
- template <class iter_t>
- void block_indirect_sort (iter_t first, iter_t last);
- template <class iter_t, typename compare>
- void block_indirect_sort (iter_t first, iter_t last, compare comp);
- template <class iter_t>
- void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);
- template <class iter_t, typename compare>
- void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
- ``
- This algorithm needs a *C++11 compliant compiler*; you don't need any other code or library. With older compilers correct operation is not guaranteed.
- If the number of threads is unspecified, the result of std::thread::hardware_concurrency() is used.
- This algorithm uses a *comparison object*, in the same way as the standard library sort
- algorithms. If not defined, the comparison object is std::less, which uses
- the < operator internally.
- This algorithm is [*exception safe], meaning that exceptions generated by the algorithm
- guarantee the integrity of the objects to sort, but not their relative order. If an exception
- is thrown inside the objects (in the move or copy constructor, for example), the results can be
- unpredictable.
- ]
- ]
- [endsect]
- [br]
- [section:block_internal Internal Description]
- [:
- There are two primary categories of parallelization in sorting algorithms.
- [h4[_Subdivision Algorithms]]
- [: These algorithms filter the data and generate two or more parts. Each part obtained is
- filtered and divided by other threads. This filter and division process continues
- until the size of a part is smaller than a predefined size, at which point it is sorted by a single thread.
- The algorithm most frequently used for the filtering and sorting is quicksort.
- These algorithms are fast with a small number of threads, but are inefficient
- with a large number of threads. Examples of this category are:
- * Intel Threading Building Blocks (TBB)
- * Microsoft PPL Parallel Sort
- ]
- [h4[_Merging Algorithms]]
- [:
- The data is divided into parts, and each part is sorted by a thread. When
- the parts are sorted, they are merged to obtain the final result. These algorithms need additional memory for the
- merge, usually of the same size as the data.
- With a small number of threads, these algorithms have a speed similar to that of
- the subdivision algorithms, but with many threads they are much faster.
- Examples of this category are:
- * GCC Parallel Sort (based on OpenMP)
- * Microsoft PPL Parallel Buffered Sort
- This generates an *undesirable duality*: the algorithm that is optimal for a small number of threads is not optimal for a large number of threads.
- For this reason, software designed for a *small machine* is *inadequate* for a *big machine*, and vice versa.
- Using only *merging algorithms* has the *problem of the additional memory* used, usually of the same size as the data.
- ]
- [h4[_New Parallel Sort Algorithm (Block Indirect Sort)]]
- [:
- This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm:
- * With a small number of threads, it is a subdivision algorithm.
- * With many threads, it is a merging algorithm with a small auxiliary memory (block_size * number of threads).
- This algorithm *eliminates the duality*: the same code has *optimal performance* with both a small and a large number of threads.
- The number of threads to use is evaluated in each execution.
- When the program runs with a *small number of threads*, the algorithm
- internally uses a *subdivision algorithm* and has performance similar to TBB; when run with *many threads*,
- it internally uses the *new algorithm* and has the performance of GCC Parallel Sort, with the additional advantage of *reduced memory consumption*.
- ]
- ]
- [endsect]
- [br]
- [section:design_process Design Process]
- [:
- [h4[_Initial Idea]]
- [:
- The *initial idea* was to build a *merge algorithm* that is *fast with many threads, with low additional memory*.
- This algorithm is *not based on any other idea or algorithm*. The technique used in the algorithm (indirect blocks) *is new, and was conceived and implemented for this algorithm*.
- As a sample of its results, we can see the sorting of 100 000 000 64-bit numbers, randomly generated,
- on a machine with 12 threads.
- [*[teletype]
- ``
- | | |
- Algorithm | Time (secs) | Memory used |
- ----------------------------------+-------------+-------------+
- Open MP Parallel Sort | 1.1990 | 1564 MB |
- Threading Building Blocks (TBB) | 1.6411 | 789 MB |
- Block Indirect Sort | 0.9270 | 790 MB |
- | | |
- ``
- ]
- The best words about this algorithm are expressed by its [*[link sort.parallel.linux_parallel Benchmark results]]
- ]
- [h4[_Design process]]
- [:
- The process was *long and very hard*, mainly because of the uncertainty about whether the ideas were correct and would run
- fast enough to be useful. This was complicated by the fact that we can't be sure of the efficiency until the last part
- of the code is done and the first benchmark has run.
- But it was a *very exciting process*: each time a problem was resolved, a new internal algorithm was designed and
- tested, and the process advanced, step by step.
- I discovered *many new problems* during this process, unknown until then, which forced me to design *new internal algorithms* to resolve them,
- and to divide the work into many parts to execute in parallel. Due to this, you can find many nice algorithms inside the sorting algorithm,
- written to resolve and parallelize the internal problems.
- If you are interested in a detailed description of the algorithm, you can find it here : [* [@../papers/block_indirect_sort_en.pdf block_indirect_sort_en.pdf]].
- ]
- ]
- [endsect]
- [endsect]