Tests for inplace_reduce, radix_sort and radix_sort_by_key are disabled
for CPU devices on Apple platform because for CPU devices on Apple
platform when local memory is used in a kernel, local work group size
must be [1;1;1]. Those algorithms are designed for GPU devices,
therefore we do no lose any functionality.
This changes the vector<T> constructors which copy or initialize
data to take a queue argument used for performing the operations.
Previously they just took a context argument used to initialize the
buffer and then created a new command queue to use. This improves
performance by not requiring a new command queue and also fixes issues
when performing operations on a different command queue while the
vector was still being initialized.
refs kylelutz/compute#9
device, context, and queue are initialized statically in `context_setup.hpp`.
With this change all tests are able to complete when an NVIDIA GPU is in
exclusive compute mode.
Side effect of the change:
Time for all tests to complete reduced from 15.71 to 13.03 sec Tesla C2075.
This fixes a bug in the test for inplace_reduce() in which
the vector was being filled with data from two different
command queues leaving the data in an undefined state.