[/===========================================================================
  Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters

  Distributed under the Boost Software License, Version 1.0
  See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt
=============================================================================/]

[section:block_indirect_sort 3.1- block_indirect_sort]

[section:block_description Description]
[:

[*BLOCK_INDIRECT_SORT] is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.

The most important characteristics of this algorithm are its *speed* and *low memory consumption*.

[*[teletype]
``
                      |       |                        |                              |
 Algorithm            |Stable | Additional memory      |Best, average, and worst case |
----------------------+-------+------------------------+------------------------------+
 block_indirect_sort  | no    |block_size * num_threads| N, N LogN, N LogN            |
                      |       |                        |                              |
``
]

The block_size is an internal parameter of the algorithm which, in order to achieve the
highest speed, changes according to the size of the objects to sort, as shown in the next table. Strings use a block_size of 128.

[*[teletype]
``
                                |        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) | 4096   | 2048    | 1024    | 768     | 512     | 256     | 128      |
                                |        |         |         |         |         |         |          |
``
]

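As an orientative example (this is only a back-of-the-envelope calculation derived from the tables above, not a measured figure), the auxiliary memory needed to sort 64-bit integers (8 bytes, so block_size = 4096) with 12 threads can be estimated as block_size * num_threads elements:

[c++]
``
    #include <cstddef>

    // 8-byte objects fall in the "1 - 15" column  ->  block_size = 4096 elements
    constexpr std::size_t block_size  = 4096;
    constexpr std::size_t num_threads = 12;
    constexpr std::size_t object_size = 8;    // bytes per element

    // approximate additional memory:
    // 4096 * 12 * 8 bytes = 393 216 bytes (about 384 KiB)
    constexpr std::size_t aux_bytes = block_size * num_threads * object_size;
``
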
]
[endsect]
[br]

[section:block_benchmark Benchmark]
[:
Sorting 100 000 000 64-bit numbers, the measured times and memory used were:
[*[teletype]
``
                                  |             |             |
 Algorithm                        | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
 Open MP Parallel Sort            |   1.1990    |   1564 MB   |
 Threading Building Blocks (TBB)  |   1.6411    |    789 MB   |
 Block Indirect Sort              |   0.9270    |    790 MB   |
                                  |             |             |
``
]
]
[endsect]
[br]

[section:block_programming Programming]
[:
[h4[_Thread specification]]
[:
This algorithm has an integer parameter indicating the *number of threads* to use in the sorting process,
which is always the last parameter of the call. The default value (if left unspecified) is the number of hardware threads of
the machine where the program is running, as provided by std::thread::hardware_concurrency().

If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater
than the number of hardware threads on the machine. We can pass 100 threads on a machine with 4 hardware threads,
and in the same way we can pass an expression such as (std::thread::hardware_concurrency() / 4).
If this value is 0, the program is executed with 1 thread.
]
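
As a minimal usage sketch (the container contents and the thread counts below are only illustrative), the thread parameter is simply appended as the last argument of the call:

[c++]
``
    #include <boost/sort/sort.hpp>
    #include <cstdint>
    #include <thread>
    #include <vector>

    int main ()
    {
        std::vector<uint64_t> data {7, 3, 9, 1, 5, 8, 2, 6, 4, 0};

        // default: uses std::thread::hardware_concurrency() threads
        boost::sort::block_indirect_sort (data.begin (), data.end ());

        // explicit number of threads (may be greater than the hardware threads)
        uint32_t nthread = 8;
        boost::sort::block_indirect_sort (data.begin (), data.end (), nthread);

        // an expression is also valid; 0 or 1 runs with a single thread
        nthread = std::thread::hardware_concurrency () / 4;
        boost::sort::block_indirect_sort (data.begin (), data.end (), nthread);

        return 0;
    }
``
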
[h4[_Programming]]
[:
You only need to include the file boost/sort/sort.hpp to use this algorithm.

The algorithm is in the namespace boost::sort.

[c++]
``
    #include <boost/sort/sort.hpp>


    template <class iter_t>
    void block_indirect_sort (iter_t first, iter_t last);

    template <class iter_t, typename compare>
    void block_indirect_sort (iter_t first, iter_t last, compare comp);

    template <class iter_t>
    void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);

    template <class iter_t, typename compare>
    void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
``

This algorithm needs a *C++11 compliant compiler*. You don't need any other code or library. With older compilers, correct operation is not guaranteed.

If the number of threads is unspecified, the result of std::thread::hardware_concurrency() is used.

This algorithm uses a *comparison object*, in the same way as the standard library sort
algorithms. If not defined, the comparison object is std::less, which uses
the < operator internally.

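As a minimal sketch of a call with a comparison object (the struct and the descending-order comparator below are only illustrative, not part of the library):

[c++]
``
    #include <boost/sort/sort.hpp>
    #include <cstdint>
    #include <vector>

    struct item
    {
        uint32_t id;
        double weight;
    };

    // comparison object: orders the items by descending weight
    struct cmp_weight
    {
        bool operator () (const item &left, const item &right) const
        {
            return right.weight < left.weight;
        }
    };

    int main ()
    {
        std::vector<item> v = {{1, 2.5}, {2, 0.5}, {3, 7.0}};

        // sorted with cmp_weight and the default number of threads
        boost::sort::block_indirect_sort (v.begin (), v.end (), cmp_weight ());
        return 0;
    }
``
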
This algorithm is [*exception safe], meaning that exceptions generated by the algorithm
preserve the integrity of the objects to sort, but not their relative order. If the exception
is generated inside the objects (in the move or in the copy constructor, for example), the results can be
unpredictable.
]
]
[endsect]
[br]

[section:block_internal Internal Description]
[:

There are two primary categories of parallelization in sorting algorithms.


[h4[_Subdivision Algorithms]]

[: Filter the data and generate two or more parts. Each part obtained is
filtered and divided by other threads. This filtering and division process continues
until the size of a part is smaller than a predefined size, at which point it is sorted by a single thread.

The algorithm most frequently used for the filtering and sorting is quicksort.

These algorithms are fast with a small number of threads, but are inefficient
with a large number of threads. Examples of this category are

* Intel Threading Building Blocks (TBB)

* Microsoft PPL Parallel Sort.

]
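
The following is a rough sketch of the subdivision idea (it is NOT the code of any of the libraries mentioned; the function subdivision_sort and the small_size threshold are only illustrative):

[c++]
``
    #include <algorithm>
    #include <cstddef>
    #include <future>

    // filter (partition) the range, hand one half to another thread, and
    // recurse until a part is small enough to be sorted by a single thread
    template <class iter_t>
    void subdivision_sort (iter_t first, iter_t last, std::size_t small_size = 65536)
    {
        std::size_t nelem = std::size_t (last - first);
        if (nelem <= small_size)
        {
            std::sort (first, last);                 // small part: one thread
            return;
        }
        iter_t middle = first + nelem / 2;
        std::nth_element (first, middle, last);      // filter: split around the median
        auto half = std::async (std::launch::async,
                                subdivision_sort<iter_t>, first, middle, small_size);
        subdivision_sort (middle, last, small_size); // the other half in this thread
        half.get ();
    }
``
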
[h4[_Merging Algorithms]]

[:

Divide the data into parts, and each part is sorted by a thread. When
the parts are sorted, they are merged to obtain the final result. These algorithms need additional memory for the
merge, usually the same size as the data.

With a small number of threads, these algorithms have similar speed to
the subdivision algorithms, but with many threads they are much faster.

Examples of this category are

* GCC Parallel Sort (based on OpenMP)

* Microsoft PPL Parallel Buffered Sort

This generates an *undesirable duality*: the optimal algorithm for a small number of threads is not the optimal one for a large number of threads.

For this reason, software designed for a *small machine* is *inadequate* for a *big machine* and vice versa.

Using only *merging algorithms* has the *problem of the additional memory* used, usually the same size as the data.
]

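A rough sketch of the merging idea (again, NOT the code of any of the libraries mentioned; merging_sort and its chunk layout are only illustrative) could be:

[c++]
``
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // each thread sorts one part; the sorted parts are then merged using an
    // auxiliary buffer of the same size as the data
    void merging_sort (std::vector<uint64_t> &data, uint32_t nthreads)
    {
        if (nthreads == 0) nthreads = 1;
        std::size_t nelem = data.size ();
        std::size_t chunk = (nelem + nthreads - 1) / nthreads;

        std::vector<std::thread> workers;
        for (uint32_t i = 0; i < nthreads; ++i)
        {
            std::size_t low = std::min (i * chunk, nelem);
            std::size_t high = std::min (low + chunk, nelem);
            workers.emplace_back ([&data, low, high]
                { std::sort (data.begin () + low, data.begin () + high); });
        }
        for (std::thread &w : workers) w.join ();

        // merge the sorted parts pairwise, doubling the run size each pass
        std::vector<uint64_t> buffer (nelem);
        for (std::size_t width = chunk; width < nelem; width *= 2)
        {
            for (std::size_t low = 0; low < nelem; low += 2 * width)
            {
                std::size_t mid = std::min (low + width, nelem);
                std::size_t high = std::min (low + 2 * width, nelem);
                std::merge (data.begin () + low, data.begin () + mid,
                            data.begin () + mid, data.begin () + high,
                            buffer.begin () + low);
            }
            data.swap (buffer);
        }
    }
``
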
[h4[_New Parallel Sort Algorithm (Block Indirect Sort)]]
[:

This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.

* With a small number of threads, it is a subdivision algorithm.

* With many threads, it is a merging algorithm with a small auxiliary memory (block_size * number of threads).

This algorithm *eliminates the duality*. The same code has *optimal performance* with both a small and a large number of threads.

The number of threads to use is evaluated in each execution.
When the program runs with a *small number of threads*, the algorithm
internally uses a *subdivision algorithm* and has similar performance to TBB, and when run with *many threads*,
it internally uses the *new algorithm* and has the performance of GCC Parallel Sort, with the additional advantage of *reduced memory consumption*.

]
]
[endsect]
[br]

[section:design_process Design Process]

[:
[h4[_Initial Idea]]
[:
The *initial idea* was to build a *merge algorithm* that is *fast with many threads, with low additional memory*.

This algorithm is *not based on any other idea or algorithm*. The technique used in the algorithm (indirect blocks) *is new, and was conceived and implemented for this algorithm*.

As a sample of its results, we can see the sorting of 100 000 000 64-bit numbers, randomly generated,
on a machine with 12 threads.

[*[teletype]
``
                                  |             |             |
 Algorithm                        | Time (secs) | Memory used |
----------------------------------+-------------+-------------+
 Open MP Parallel Sort            |   1.1990    |   1564 MB   |
 Threading Building Blocks (TBB)  |   1.6411    |    789 MB   |
 Block Indirect Sort              |   0.9270    |    790 MB   |
                                  |             |             |
``
]

The best words about this algorithm are expressed by its [*[link sort.parallel.linux_parallel Benchmark results]]

]
[h4[_Design process]]
[:
The process was *long and very hard*, mainly because of the uncertainty about whether the ideas were correct and would run
fast enough to be useful. This is complicated by the fact that we can't be sure of the efficiency until the last part
of the code is done and the first benchmark has run.

But it was a *very exciting process*: each time a problem was resolved, a new internal algorithm was designed and
tested, and, step by step, we could see the process advance.

I discovered *many new problems* during this process, unknown until now, which forced me to design *new internal algorithms* to resolve them,
and to divide the work into many parts to execute in parallel. Because of this, you can find many nice algorithms inside the sorting algorithm,
written to resolve and parallelize the internal problems.


If you are interested in a detailed description of the algorithm, you can find it here: [* [@../papers/block_indirect_sort_en.pdf block_indirect_sort_en.pdf]].

]

]
[endsect]
[endsect]