This adds a new macro which allows the user to adapt a C++ struct
or class for use with OpenCL given its type, name, and members.
This allows custom user-defined data types to be used with the
Boost.Compute containers and algorithms.
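For example (a minimal sketch; the struct and its members here are
illustrative):

    #include <boost/compute/types/struct.hpp>

    struct Particle
    {
        float x;
        float y;
    };

    // makes Particle usable in OpenCL kernels as "Particle"
    BOOST_COMPUTE_ADAPT_STRUCT(Particle, Particle, (x, y))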
This adds an experimental directory which contains various
experimental algorithms and functions. The files and APIs
under this directory are experimental and unstable.
This changes the vector<T> constructors which copy or initialize
data to take a queue argument used for performing the operations.
Previously they just took a context argument used to initialize the
buffer and then created a new command queue to use. This improves
performance by not requiring a new command queue, and also fixes
issues that arose when operations were performed on a different
command queue while the vector was still being initialized.
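For example (a sketch; `queue` is an assumed command queue):

    #include <vector>
    #include <boost/compute/container/vector.hpp>

    std::vector<float> host_data(1024, 1.0f);

    // copy-construct on the device using an existing queue
    boost::compute::vector<float> vec(
        host_data.begin(), host_data.end(), queue
    );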
This skips the generate_pair test on AMD platforms, which do not
properly support struct assignment. Before this patch the
test would fail with "UNREACHABLE executed!" and a SIGABRT.
The detail::getenv() function was not declared inline, which led to
`multiple definition` errors at link time when a program consisted of
multiple object files that included Boost.Compute headers.
This declares the function inline and adds the core.multiple_objects
test.
This adds interoperability support between Boost.Compute and various
other C/C++ libraries (Eigen, OpenCV, OpenGL, Qt and VTK). This eases
development for users using external libraries with Boost.Compute.
See kylelutz/compute#21
This adds program::build_with_source() function that both creates and
builds the program for the given context with supplied source and
compile options. When the BOOST_COMPUTE_USE_OFFLINE_CACHE macro is
defined, it also saves the compiled program binary for reuse in the
offline cache located in the $HOME/.boost_compute folder on UNIX-like
systems and in the %APPDATA%/boost_compute folder on Windows.
All internal uses of program::create_with_source() followed by
program::build() are replaced with program::build_with_source().
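For example (a sketch; `source` and `context` are assumed to be in
scope and the compile option is illustrative):

    #include <boost/compute/program.hpp>

    boost::compute::program prog =
        boost::compute::program::build_with_source(
            source, context, "-DFAST_MATH"
        );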
The find_device check in core.system is invalid. It could fail when
the same device is supported by several platforms. In my case this
happens for an Intel CPU when both the AMD and Intel platforms are
installed. The CPU
returned by boost::compute::system::default_device() is served by the
AMD platform, and the CPU returned by
boost::compute::system::find_device(name) is served by Intel SDK. The
only thing that could be safely asserted here is that both devices have
the same name.
This adds an improved reduce() algorithm implementation for
GPUs. Also adds checks to accumulate() which allow it to
use the higher-performance reduce() algorithm when possible.
This adds an overload of the reduce() function which uses
plus<T>() as the reduction operator. This simplifies the common
case of computing the sum of a range of values.
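For example, summing a device vector (a sketch; `vec` and `queue`
are assumed in scope):

    #include <boost/compute/algorithm/reduce.hpp>

    float sum = 0;

    // plus<float>() is used as the default reduction operator
    boost::compute::reduce(vec.begin(), vec.end(), &sum, queue);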
This removes the init argument from reduce. This simplifies the
implementation and avoids copying a value from the host to the
device on every call to reduce.
If an initial value is required, the accumulate function can be
called instead.
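For example (a sketch; `vec` is an assumed device vector):

    #include <boost/compute/algorithm/accumulate.hpp>

    // accumulate() still accepts an initial value
    float sum = boost::compute::accumulate(
        vec.begin(), vec.end(), 5.0f, queue
    );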
This adds an experimental algorithm like copy_if() which copies
the index of the values for which predicate returns true instead
of the values themselves.
This adds a new function which will return the named field
from a value. For example, this can be used to return one of
the components of a pair object or to swizzle a vector value.
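For example (a sketch; `pairs` and `firsts` are assumed device
vectors, and "first" is the member name Boost.Compute gives the
first component of a std::pair):

    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/functional/field.hpp>

    // extract the named "first" field from each pair
    boost::compute::transform(
        pairs.begin(), pairs.end(), firsts.begin(),
        boost::compute::field<int>("first"), queue
    );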
This adds a new macro to ease the definition of custom user
functions. The BOOST_COMPUTE_FUNCTION() macro creates a new
boost::compute::function<> object with the provided return
type, argument types, function name and OpenCL source code.
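For example (a sketch in the macro's present form; the exact
argument layout has varied between versions):

    #include <boost/compute/function.hpp>

    BOOST_COMPUTE_FUNCTION(float, add_two, (float x),
    {
        return x + 2;
    });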
This adds a new unpack() function adaptor which converts
a function with N arguments to a function which takes a
single tuple argument with N components.
This is useful for calling built-in functions with the tuple
values returned from zip_iterator. This also removes the now
unneeded binary_transform_iterator.
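For example, adding two ranges element-wise through a zip_iterator
(a sketch; unpack() lives in a detail header, so its exact location
and namespace may differ between versions):

    #include <boost/tuple/tuple.hpp>
    #include <boost/compute/functional.hpp>
    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/iterator/zip_iterator.hpp>
    #include <boost/compute/functional/detail/unpack.hpp>

    boost::compute::transform(
        boost::compute::make_zip_iterator(
            boost::make_tuple(a.begin(), b.begin())
        ),
        boost::compute::make_zip_iterator(
            boost::make_tuple(a.end(), b.end())
        ),
        result.begin(),
        boost::compute::detail::unpack(boost::compute::plus<float>()),
        queue
    );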
This adds a test for computing the minimum and maximum
values of a vector simultaneously using reduce() with a
custom reduction function.
Also fixes a bug in reduce() in which inplace_reduce() was
being used even when the input type and result type differed.
This adds a program cache which can be used by algorithms and other
functions to store programs which may be re-used. This improves
performance by reducing the need for costly recompilation of commonly
used programs.
Program caches are context specific and multiple copies of the same
context will use the same program cache. They are created and accessed
by the global get_program_cache() function.
For now, only a few algorithms and functions (radix sort, mersenne
twister, fixed size sorts) make use of the program cache.
This adds a sort_by_transform() algorithm which sorts a set of
values based on the result of a transform function.
For example, this can be used to sort a set of vectors by their
length (when used with the length<T>() function) or by a single
component (when used with the get<N>() function).
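For example, sorting by vector length (a sketch; the experimental
namespace is assumed from the directory this lives in):

    #include <boost/compute/experimental/sort_by_transform.hpp>

    // sort float4_ values by their length
    boost::compute::experimental::sort_by_transform(
        vectors.begin(), vectors.end(),
        boost::compute::length<boost::compute::float4_>(),
        boost::compute::less<float>(),
        queue
    );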
This adds a new sort_by_key() algorithm which sorts a range
of values by a range of keys with a comparison operator.
For now this is only implemented by the serial insertion sort
algorithm. In the future it will be ported to the other sorting
algorithms (e.g. radix sort).
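For example (a sketch; `keys` and `values` are assumed device
vectors of equal length):

    #include <boost/compute/algorithm/sort_by_key.hpp>

    // sort values by their keys in descending key order
    boost::compute::sort_by_key(
        keys.begin(), keys.end(), values.begin(),
        boost::compute::greater<int>(), queue
    );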
This adds an output iterator result argument to the reduce()
algorithm. Now, instead of returning the reduced result, the
result is written to an output iterator. This allows the value
to stay on the device and avoids a device-to-host copy in cases
where the result is not needed on the host (e.g. it is part of
a larger computation).
This is an API breaking change to users of reduce(). Affected code
should now declare a result variable and then pass a pointer to it
as the new result argument.
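For example (a sketch; follows the final call form after the init
argument was later removed, see above):

    float result = 0;

    // the reduced value is written through the output iterator;
    // a host pointer here preserves the old copy-to-host behavior
    boost::compute::reduce(
        vec.begin(), vec.end(), &result,
        boost::compute::plus<float>(), queue
    );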
This adds a copy() specialization for host-to-host transfers
which simply forwards the call to std::copy().
This is useful in templated algorithms which may in certain
circumstances copy() between data ranges on the host.
This fixes an issue in which comparison operators (e.g. <, ==)
in lambda expressions would return the wrong result type causing
compilation errors.
Also adds a few test cases to ensure the correct result type
and that lambda expressions can be properly used with count_if().
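For example (a sketch; `vec` is an assumed device vector of ints):

    #include <boost/compute/algorithm/count_if.hpp>
    #include <boost/compute/lambda.hpp>

    using boost::compute::lambda::_1;

    // the comparison now correctly yields a boolean result type
    size_t n = boost::compute::count_if(
        vec.begin(), vec.end(), _1 < 5, queue
    );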
This adds a random number distribution which generates random
numbers in a uniform distribution.
Also adds a convenience algorithm which fills a range with
uniformly distributed random numbers between two values.
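For example (a sketch against the current random API; class names
may have differed when this landed):

    #include <boost/compute/random/default_random_engine.hpp>
    #include <boost/compute/random/uniform_real_distribution.hpp>

    boost::compute::default_random_engine engine(queue);
    boost::compute::uniform_real_distribution<float> dist(0.0f, 1.0f);

    // fill vec with uniformly distributed values in [0, 1)
    dist.generate(vec.begin(), vec.end(), engine, queue);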
This adds a specialization for the get<N>() function when used
with zip_iterators. Now, only the N'th iterator in the expression
will be dereferenced instead of dereferencing all of the iterators
into a tuple and then extracting the N'th component.
This adds a check to skip tests which use fill() with pair and
tuple types on AMD platforms. There is a bug which crashes the
OpenCL compiler with an "UNREACHABLE executed!" message on AMD
platforms when using struct assignment in kernel code.
See: http://devgurus.amd.com/thread/166622
This adds a simple inplace_merge() algorithm which merges
two contiguous sorted ranges in-place.
For now, the implementation simply copies the ranges to
two temporary vectors and calls merge().
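For example (a sketch; the two sorted halves of `v` are assumed to
meet at index 3):

    #include <boost/compute/algorithm/inplace_merge.hpp>

    // v = { 1, 3, 5, 2, 4, 6 }
    boost::compute::inplace_merge(
        v.begin(), v.begin() + 3, v.end(), queue
    );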
This adds support for using the get<N>() function in lambda
expressions to extract a single component of an aggregate type.
Also adds a test of using boost::tuple<> to store a user-defined
data type on the device and sort them by their first component
using a lambda expression as the comparator.
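For example (a sketch; `v` is an assumed device vector of tuples):

    #include <boost/compute/algorithm/sort.hpp>
    #include <boost/compute/lambda.hpp>

    using boost::compute::lambda::_1;
    using boost::compute::lambda::_2;
    using boost::compute::lambda::get;

    // sort tuples by their first component
    boost::compute::sort(
        v.begin(), v.end(), get<0>(_1) < get<0>(_2), queue
    );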
This fixes a few issues encountered when using iterators with a
void value_type (e.g. std::insert_iterator<>).
The is_contiguous_iterator meta-function was refactored to always
return false for iterators with a void value_type and avoid
instantiating types for containers with a void value_type
(e.g. std::vector<void>::iterator) which previously resulted
in compilation errors.
This adds a system-wide default command queue. This queue is
accessible via the new static system::default_queue() method.
The default command queue is created for the default compute
device in the default context and is analogous to the default
stream in CUDA.
This changes how algorithms operate when invoked without an
explicit command queue. Previously, each algorithm had two
overloads, the first expected a command queue to be explicitly
passed and the second would create and use a temporary command
queue. Now, all algorithms take a command queue argument which
has a default value equal to system::default_queue().
This fixes a number of race conditions and performance issues
throughout the library associated with creating, using, and
destroying many separate command queues.
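For example (a sketch of the two equivalent call forms):

    #include <boost/compute/algorithm/sort.hpp>

    // explicit queue
    boost::compute::sort(v.begin(), v.end(), queue);

    // implicit queue: defaults to system::default_queue()
    boost::compute::sort(v.begin(), v.end());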
This adds a new macro for the unit-tests which checks a range of
values on the device against an array of values on the host. This
simplifies writing tests and removes the need to explicitly copy
values back to the host for verification.
This refactors the system::default_device() method. Now, the
default compute device for the system is only found once and
stored in a static variable. This eliminates many redundant
calls to clGetPlatformIDs() and clGetDeviceIDs().
Also, the default_cpu_device() and default_gpu_device() methods
have been removed and their usages replaced with default_device().
This adds checks to the device test-suite to ensure that the
current device supports the partitioning types before attempting
to use the corresponding device::partition_*() methods.
This adds a get<N>() function which returns the N'th element
of an aggregate type (e.g. vector type, pair, tuple).
This unifies the functionality of, and replaces, the get_pair()
and vector_component() functions.
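For example (a sketch; `points` and `ys` are assumed device vectors
of float4_ and float values respectively):

    #include <boost/compute/algorithm/transform.hpp>
    #include <boost/compute/functional/get.hpp>

    // extract the second (y) component of each float4_ value
    boost::compute::transform(
        points.begin(), points.end(), ys.begin(),
        boost::compute::get<1>(), queue
    );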
This changes the clamp_range() test to use float values instead
of int values. The OpenCL clamp() function is only defined for
float values and this test caused kernel compilation errors on
certain platforms.
Also updates the test to use the new global context.
This adds a clamp_range() algorithm which clamps a range
of values between a low and high value. This is based on
the algorithm of the same name in Boost.Algorithm.
refs kylelutz/compute#9
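For example (a sketch; the experimental namespace is assumed):

    #include <boost/compute/experimental/clamp_range.hpp>

    // clamp every value into [0, 100]
    boost::compute::experimental::clamp_range(
        input.begin(), input.end(), output.begin(),
        0.0f, 100.0f, queue
    );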
device, context, and queue are initialized statically in `context_setup.hpp`.
With this change all tests are able to complete when an NVIDIA GPU is in
exclusive compute mode.
Side effect of the change:
Time for all tests to complete dropped from 15.71 to 13.03 seconds
on a Tesla C2075.
This adds a test for the enqueue_write_buffer_rect() method
in the command_queue class. This method copies a rectangular
region of memory from the host to a device buffer.
This changes the enqueue_nd_range_kernel() method to return an
event object. This allows clients to monitor the progress of a
kernel executing on a device.
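For example (a sketch; `kernel` and `global_size` are assumed in
scope):

    boost::compute::event e = queue.enqueue_nd_range_kernel(
        kernel, 1, 0, &global_size, 0
    );

    // block until the kernel has finished executing
    e.wait();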
This adds a specialization of multiplies<T> for std::complex<T>
which implements complex number multiplication.
Also adds a simple test using transform() to verify the complex
multiplication works correctly.
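For example (a sketch; `a`, `b`, and `c` are assumed device vectors
of std::complex<float>):

    #include <complex>
    #include <boost/compute/types/complex.hpp>
    #include <boost/compute/algorithm/transform.hpp>

    // c[i] = a[i] * b[i] using complex multiplication
    boost::compute::transform(
        a.begin(), a.end(), b.begin(), c.begin(),
        boost::compute::multiplies<std::complex<float> >(), queue
    );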
This fixes a bug in the event_profiling test case in the
command_queue test. On AMD platforms, the event object
returned from clEnqueueMarker() has no profiling information
associated with it and returns an error code when accessed.
Now, profiling information for a simple write to a device
buffer is checked instead.
This adds a new set of methods to the device class allowing
device objects to be partitioned into multiple sub-devices
using the clCreateSubDevices() function.
For now, device partitioning is only supported on systems
with OpenCL version 1.2 (or later).
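For example (a sketch using one of the new methods):

    #include <boost/compute/device.hpp>

    // split the device into two equally-sized sub-devices
    std::vector<boost::compute::device> subdevices =
        device.partition_equally(2);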
This adds support for returning a std::vector<T> from the
various get_info<T>() methods. This provides a simpler
interface to get the values in an array returned from one
of the clGet*Info() functions.
This also adds a test using the new API to get the maximum
work item sizes in each dimension for a device.
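The new test looks roughly like this (a sketch):

    std::vector<size_t> max_sizes =
        device.get_info<std::vector<size_t> >(
            CL_DEVICE_MAX_WORK_ITEM_SIZES
        );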
This fixes a bug in the test for inplace_reduce() in which
the vector was being filled with data from two different
command queues leaving the data in an undefined state.
This removes the check for local_memory_size in test_kernel. The
local memory size differs between platforms and some (e.g. Intel)
don't report any local memory usage.
This fixes the check for the local memory size in the
get_work_group_info kernel test.
While the kernel only allocates 16 floats, some platforms
will use more local memory. This changes the test to check
for at least 16 floats' worth of local memory.
This adds a test for the sort() method which sorts a container
of 3D vectors by their length. This uses a lambda expression to
generate the compare function for the sort() algorithm.
This implements the merge() algorithm which merges two
ranges of sorted values into a single sorted range.
The current implementation uses a simple serial merge
algorithm. A GPU-optimized version is coming soon.
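For example (a sketch; `a` and `b` are assumed sorted device vectors
and `result` has room for both):

    #include <boost/compute/algorithm/merge.hpp>

    boost::compute::merge(
        a.begin(), a.end(),
        b.begin(), b.end(),
        result.begin(), queue
    );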