This changes the BOOST_COMPUTE_FUNCTION() macro (and the related
BOOST_COMPUTE_CLOSURE() macro) to use custom, user-provided argument
names instead of auto-generating them based on their index.
This is an API-breaking change. Users should now provide argument
names when using the BOOST_COMPUTE_FUNCTION() macro. The examples
and documentation have been updated to reflect the new API.
This should fix the following error seen on the Apple OpenCL
implementation when compiling the radix_sort program: "error:
definition of macro 'K' conflicts with an identifier used in
the precompiled header".
This updates the methods in command_queue to either return void
(for synchronous operations) or an event object (for asynchronous
operations). The caller will be notified of OpenCL errors via an
exception being thrown.
This removes the explicit call to finish() in the destructor
for the command_queue class.
The clFinish() function will be called automatically by the
clReleaseCommandQueue() function once the reference count for
the command queue drops to zero.
This adds a method to the event class which allows the user to
register a callback function to be invoked when the event reaches
the specified state (e.g. when it completes).
This adds a new macro which allows the user to adapt a C++ struct
or class for use with OpenCL given its type, name, and members.
This allows for custom user-defined data-types to be used with the
Boost.Compute containers and algorithms.
This adds an experimental directory which contains various
experimental algorithms and functions. The files and APIs
under this directory are experimental and unstable.
This adds third-party performance tests to use in comparing
Boost.Compute with other parallel/GPGPU frameworks like Intel's
TBB and NVIDIA's Thrust along with the C++ STL.
Also refactors the timing and profiling infrastructure and adds
a simple perf.py driver script for running performance tests.
This changes the vector<T> constructors which copy or initialize
data to take a queue argument used for performing the operations.
Previously they just took a context argument used to initialize the
buffer and then created a new command queue to use. This improves
performance by not requiring a new command queue and also fixes issues
when performing operations on a different command queue while the
vector was still being initialized.
This uses Boost.Preprocessor macros to allow zip iterators to work with
an arbitrary number of elements (the current limit is the maximum
boost::tuple size, which is 10 by default).
Refs #50
This makes the online cache use the SHA-1 hash of the program source as
its key. This also introduces the boost::compute::detail::sha1()
function, which is moved from compute::program into its own header file.
The detail::getenv() function was not declared inline, which led to
`multiple definition` errors at link time when a program consisted of
multiple objects that included Boost.Compute headers.
This fixes the problem and adds the core.multiple_objects test.
Instead of building the program from source with the added comment
block (used for distinction between different platforms and devices
when offline cache is in use), only use the altered source for the
hash computation. This way users will not get unexpected results from
program.source().
This adds interoperability support between Boost.Compute and various
other C/C++ libraries (Eigen, OpenCV, OpenGL, Qt and VTK). This eases
development for users using external libraries with Boost.Compute.
See kylelutz/compute#21
This adds a program::build_with_source() function that both creates and
builds the program for the given context with the supplied source and
compile options. If the BOOST_COMPUTE_USE_OFFLINE_CACHE macro is
defined, it also saves the compiled program binary for reuse in the
offline cache, located in the $HOME/.boost_compute folder on UNIX-like
systems and in the %APPDATA%/boost_compute folder on Windows.
All internal uses of program::create_with_source() followed by
program::build() are replaced with program::build_with_source().
This adds an improved reduce() algorithm implementation for
GPUs. Also adds checks to accumulate() which allow it to
use the higher-performance reduce() algorithm if possible.
This adds an overload of the reduce() function which
uses plus<T>() as the reductor. This simplifies the common
case of calculating the sum for a range of values.
This removes the init argument from reduce. This simplifies the
implementation and avoids copying a value from the host to the
device on every call to reduce.
If an initial value is required, the accumulate function can be
called instead.
This fixes a compilation error which occurs on Windows when
registering the default error handler callback when creating
a new context object.
In OpenCL 1.1 and later the callback function is expected to
use the __stdcall calling convention. This is optionally defined
by the CL_CALLBACK macro on WIN32 platforms. If available, it is
defined with the BOOST_COMPUTE_CL_CALLBACK macro which is then
used to annotate the callback functions.
This adds an experimental algorithm like copy_if() which copies
the index of the values for which predicate returns true instead
of the values themselves.
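
As a rough host-side illustration of the idea (not the experimental device algorithm itself), the sketch below collects the indices of the matching elements. The names copy_index_if_sketch and is_even are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch: return the indices of the elements for which
// pred returns true (hypothetical helper, not the library's API).
template<class T, class Predicate>
std::vector<std::size_t> copy_index_if_sketch(const std::vector<T>& input,
                                              Predicate pred)
{
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (pred(input[i])) {
            indices.push_back(i);  // store the position, not the value
        }
    }
    return indices;
}

// Example predicate used to exercise the sketch.
inline bool is_even(int x) { return x % 2 == 0; }
```
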
This adds an error handler function which is invoked when an OpenCL
context encounters an error condition. The context error is converted
to a C++ exception containing the error information and thrown.
This adds a new function which will return the named field
from a value. For example, this can be used to return one of
the components of a pair object or to swizzle a vector value.
This adds a new macro to ease the definition of custom user
functions. The BOOST_COMPUTE_FUNCTION() macro creates a new
boost::compute::function<> object with the provided return
type, argument types, function name and OpenCL source code.
This refactors the invoked_function<> classes. Previously each
function arity (e.g. unary, binary) had a separate invoked_function<>
template class. Now they all use the same class which simplifies the
logic in function<> and meta_kernel.
This fixes a bug in which type definitions were being inserted
into meta_kernels multiple times. Also forces zip_iterator to
insert its type definitions when used in a kernel.
This adds a macro for registering custom type names for C++ types
to be used in OpenCL kernel code. Internally the macro specializes
the type_name<T>() function.
This adds a new unpack() function adaptor which converts
a function with N arguments to a function which takes a
single tuple argument with N components.
This is useful for calling built-in functions with the tuple
values returned from zip_iterator. This also removes the now
unneeded binary_transform_iterator.
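
The adaptor pattern can be sketched on the host with std::tuple. This is only an analogy, assuming a C++14 compiler; unpacked_sketch, unpack_sketch and add_fn are hypothetical names and do not reflect how the library generates kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Wraps an N-ary callable so that it accepts a single std::tuple
// with N components (hypothetical host-side analogue).
template<class F>
struct unpacked_sketch
{
    F f;

    // Expands the tuple into N separate arguments.
    template<class Tuple, std::size_t... I>
    auto call(const Tuple& args, std::index_sequence<I...>) const
    {
        return f(std::get<I>(args)...);
    }

    template<class... Ts>
    auto operator()(const std::tuple<Ts...>& args) const
    {
        return call(args, std::index_sequence_for<Ts...>{});
    }
};

template<class F>
unpacked_sketch<F> unpack_sketch(F f)
{
    return unpacked_sketch<F>{f};
}

// Binary function object used to exercise the adaptor.
struct add_fn
{
    int operator()(int a, int b) const { return a + b; }
};
```

With these definitions, unpack_sketch(add_fn()) can be called with std::make_tuple(2, 3) in place of two separate arguments.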
This adds a test for computing the minimum and maximum
values of a vector simultaneously using reduce() with a
custom reduction function.
Also fixes a bug in reduce() in which inplace_reduce() was
being used even if the input type and result type differed.
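
A host-side sketch of the same idea, using std::accumulate as a stand-in for the device reduce(). It shows why the result type (a pair) can differ from the input type (int), which is the case an in-place reduction cannot handle. The names min_max_pair, min_max_step and min_max_sketch are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

typedef std::pair<int, int> min_max_pair;

// Reduction function: folds one more value into a running (min, max) pair.
inline min_max_pair min_max_step(min_max_pair acc, int x)
{
    return min_max_pair(std::min(acc.first, x), std::max(acc.second, x));
}

// Computes the minimum and maximum in a single pass over the data.
inline min_max_pair min_max_sketch(const std::vector<int>& values)
{
    assert(!values.empty());  // the sketch seeds the fold with the first element
    min_max_pair init(values.front(), values.front());
    return std::accumulate(values.begin() + 1, values.end(), init, min_max_step);
}
```
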
This fixes an issue in which the source strings for binary
and ternary functions were not being stored and thus not
being inserted into kernels when they were invoked.
This adds a program cache which can be used by algorithms and other
functions to store programs which may be re-used. This improves
performance by reducing the need for costly recompilation of commonly
used programs.
Program caches are context specific and multiple copies of the same
context will use the same program cache. They are created and accessed
by the global get_program_cache() function.
For now, only a few algorithms and functions (radix sort, mersenne
twister, fixed size sorts) make use of the program cache.
This adds a sort_by_transform() algorithm which sorts a set of
values based on the value of a transform function.
For example, this can be used to sort a set of vectors by their
length (when used with the length<T>() function) or by a single
component (when used with the get<N>() function).
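
The ordering rule can be pictured on the host as a std::sort whose comparator compares the transformed values. This is a minimal sketch under that assumption; vec2, length_of and sort_by_transform_sketch are hypothetical names, and the device implementation may work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Simple 2-component vector used for the example.
struct vec2
{
    float x, y;
};

// Transform function: Euclidean length of the vector.
inline float length_of(const vec2& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y);
}

// Sorts values so that transform(value) is in ascending order.
template<class T, class Transform>
void sort_by_transform_sketch(std::vector<T>& values, Transform transform)
{
    std::sort(values.begin(), values.end(),
              [transform](const T& a, const T& b) {
                  return transform(a) < transform(b);
              });
}
```
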
This adds a new sort_by_key() algorithm which sorts a range
of values by a range of keys with a comparison operator.
For now this is only implemented by the serial insertion sort
algorithm. In the future it will be ported to the other sorting
algorithms (e.g. radix sort).
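
A serial insertion sort that moves keys and values in tandem might look like the following host-side sketch. It operates on std::vector rather than device memory, and sort_by_key_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Insertion sort on keys; each value follows its key to the new position.
template<class Key, class Value, class Compare>
void sort_by_key_sketch(std::vector<Key>& keys,
                        std::vector<Value>& values,
                        Compare comp)
{
    assert(keys.size() == values.size());
    for (std::size_t i = 1; i < keys.size(); ++i) {
        Key k = keys[i];
        Value v = values[i];
        std::size_t j = i;
        // Shift larger keys (and their values) one slot to the right.
        while (j > 0 && comp(k, keys[j - 1])) {
            keys[j] = keys[j - 1];
            values[j] = values[j - 1];
            --j;
        }
        keys[j] = k;
        values[j] = v;
    }
}
```
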
This adds an output iterator result argument to the reduce()
algorithm. Now, instead of returning the reduced result, the
result is written to an output iterator. This allows the value
to stay on the device and avoids a device-to-host copy in cases
where the result is not needed on the host (e.g. it is part of
a larger computation).
This is an API-breaking change for users of reduce(). Affected code
should now declare a result variable and then pass a pointer to it
as the new result argument.
This adds a copy() specialization for host-to-host transfers
which simply forwards the call to std::copy().
This is useful in templated algorithms which may in certain
circumstances copy() between data ranges on the host.
This adds a new scan_on_cpu() algorithm which implements the scan()
algorithm for CPU devices. Also renames the existing scan() algorithm
to scan_on_gpu().
This fixes some test failures on POCL which were caused by the prior
GPU scan() algorithm not functioning properly with POCL.
This changes the checks for the device type to use the bitwise-and
operator instead of the equality operator. The returned type is a
bitset and this would cause errors when multiple bits were set.
This fixes a bug on POCL which returns the device type as a
combination of CL_DEVICE_TYPE_DEFAULT and CL_DEVICE_TYPE_CPU. Now
the correct device type (device::cpu) is detected for POCL.
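
The difference between the two checks can be shown with plain integers. The constants below are local stand-ins that mirror the bit layout of the OpenCL device-type flags; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for the OpenCL device-type bit flags.
const std::uint64_t device_type_default = 1u << 0;
const std::uint64_t device_type_cpu     = 1u << 1;
const std::uint64_t device_type_gpu     = 1u << 2;

// Old check: only true when CPU is the sole bit set.
inline bool is_cpu_by_equality(std::uint64_t type)
{
    return type == device_type_cpu;
}

// New check: true whenever the CPU bit is set, whatever else is set.
inline bool is_cpu_by_mask(std::uint64_t type)
{
    return (type & device_type_cpu) != 0;
}
```

For a device reporting both the default and CPU bits, the equality check fails while the mask check succeeds.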
This fixes an issue in which comparison operators (e.g. <, ==)
in lambda expressions would return the wrong result type causing
compilation errors.
Also adds a few test cases to ensure the correct result type
and that lambda expressions can be properly used with count_if().
This adds a random number distribution which generates random
numbers in a uniform distribution.
Also adds a convenience algorithm which fills a range with
uniformly distributed random numbers between two values.
This adds an enqueue_migrate_memory_objects() method to the
command_queue class which allows memory objects to be migrated
between compute devices and to the host.
This makes a few tweaks to the reduce() algorithm in order to
improve performance. An unnecessary barrier() has been removed
and now multiple values are reduced on the initial read.
This changes the meta_kernel::add_arg() overload with a name
and a value to a separate method. This fixes a conflict when
using add_arg() with string values.
This adds a specialization for the get<N>() function when used
with zip_iterators. Now, only the N'th iterator for the expression
will be dereferenced instead of dereferencing all of the iterators
into a tuple and then extracting the N'th component.
This removes the cv-qualifiers for the value-type returned from
get<N>() expressions. This fixes issues when specializing based
on the type (e.g. pair, tuple).
This fixes a bug in the meta_kernel streaming operators with
float values. Now, float scalar and vector literals are inserted
into the kernel source with the proper 'f' suffix.
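
One way to produce such a literal on the host is sketched below. The use of std::showpoint (to guarantee a decimal point before the suffix) is an assumption of this sketch, and float_literal_sketch is a hypothetical name; special values such as infinity and NaN are not handled.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Formats a float as source text with a trailing 'f' suffix.
inline std::string float_literal_sketch(float value)
{
    std::ostringstream source;
    // showpoint forces a decimal point so the suffix always follows
    // a floating-point literal rather than an integer.
    source << std::showpoint << value << 'f';
    return source.str();
}
```
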
This makes some improvements to the system::find_default_device()
method. Now, the devices on the system will only be queried once
when searching for the default device. This reduces the number of
calls to clGetPlatformIDs() and clGetDeviceIDs().
Also, in the case that no GPU or CPU devices are found, the first
device on the system will be selected as the default device. This
fixes issues when using Boost.Compute with pocl.
This adds a check to the reverse() algorithm to ensure that
the range contains at least two elements. Previously, passing
zero or one element ranges to reverse() would result in errors.
This fixes a compilation error which occurred when assigning
to a future<void> from a future<T>. For different future types
the event member variable is private and must be accessed via
the get_event() method.
This fixes issues when using char and unsigned char literals in
a meta_kernel. Previously the character values would be directly
inserted without quotes (e.g. c instead of 'c') which led to
kernel compilation errors.
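
The quoting step can be sketched as a small string helper. The escaping of the quote and backslash characters is an addition of this sketch rather than something the commit describes, non-printable characters are not handled, and char_literal_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Formats a character as a quoted literal suitable for source text.
inline std::string char_literal_sketch(char c)
{
    std::string literal = "'";
    if (c == '\'' || c == '\\') {
        literal += '\\';  // escape characters that would break the literal
    }
    literal += c;
    literal += '\'';
    return literal;
}
```
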
This fixes a bug when creating a temporary vector for use in the
in-place scan() algorithm. Previously, a separate command queue
was used to copy the input values to the temporary vector. Now,
the same command queue is used for copying the input values and
performing the scan.
This removes the timer class. The technique of measuring the time
difference between two different OpenCL markers on a command queue
is not portable to all OpenCL implementations (only works on NVIDIA).
A new internal timer class has been added which uses boost::chrono
(or std::chrono if BOOST_COMPUTE_TIMER_USE_STD_CHRONO is defined).
This new timer is used by the benchmarks to measure time elapsed
on the host.
This adds a simple inplace_merge() algorithm which merges
two contiguous sorted ranges in-place.
For now, the implementation simply copies the ranges to
two temporary vectors and calls merge().
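
The copy-then-merge approach can be sketched on the host with std::merge. Here inplace_merge_sketch is a hypothetical name and std::vector stands in for the device vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merges the sorted ranges [0, middle) and [middle, size) of values.
template<class T>
void inplace_merge_sketch(std::vector<T>& values, std::size_t middle)
{
    assert(middle <= values.size());
    // Copy both halves out so the merge can write over the original.
    const std::vector<T> left(values.begin(), values.begin() + middle);
    const std::vector<T> right(values.begin() + middle, values.end());
    std::merge(left.begin(), left.end(),
               right.begin(), right.end(),
               values.begin());
}
```
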
This adds support for using the get<N>() function in lambda
expressions to extract a single component of an aggregate type.
Also adds a test of using boost::tuple<> to store a user-defined
data type on the device and sort them by their first component
using a lambda expression as the comparator.
This fixes a few issues encountered when using iterators with a
void value_type (e.g. std::insert_iterator<>).
The is_contiguous_iterator meta-function was refactored to always
return false for iterators with a void value_type and avoid
instantiating types for containers with a void value_type
(e.g. std::vector<void>::iterator) which previously resulted
in compilation errors.
This adds a system-wide default command queue. This queue is
accessible via the new static system::default_queue() method.
The default command queue is created for the default compute
device in the default context and is analogous to the default
stream in CUDA.
This changes how algorithms operate when invoked without an
explicit command queue. Previously, each algorithm had two
overloads, the first expected a command queue to be explicitly
passed and the second would create and use a temporary command
queue. Now, all algorithms take a command queue argument which
has a default value equal to system::default_queue().
This fixes a number of race-conditions and performance issues
throughout the library associated with creating, using, and
destroying many separate command queues.
This fixes a few memory handling issues between device_ptr,
buffer_iterator, buffer_value, allocator, and malloc/free.
Previously, memory buffers that were allocated by allocator and
malloc were being retained (via clRetainMemObject() in buffer's
constructor) by device_ptr, buffer_iterator and buffer_value.
Now, false is passed for the retain parameter to buffer's
constructor so that the buffer's reference count is not
incremented. Furthermore, the classes now set the buffer to
null before being destructed so that they will not decrement its
reference count (which normally occurs in buffer's destructor).
The main effect of this change is that objects which refer to a
memory buffer but do not own it (e.g. device_ptr, buffer_iterator)
will not modify the reference count for the buffer. This fixes a
number of memory leaks which occurred in longer running programs.
This adds a new scalar<T> "container" which stores a single
value in a memory buffer. This simplifies memory handling in
algorithms which read and write a single value.
This refactors the system::default_device() method. Now, the
default compute device for the system is only found once and
stored in a static variable. This eliminates many redundant
calls to clGetPlatformIDs() and clGetDeviceIDs().
Also, the default_cpu_device() and default_gpu_device() methods
have been removed and their usages replaced with default_device().
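
The find-once pattern can be illustrated with a function-local static. This is a generic sketch, not the library's code: the device is represented by a string, and query_count, find_default_device_name and default_device_name are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Counts how often the (pretend) expensive device query runs.
inline int& query_count()
{
    static int count = 0;
    return count;
}

// Stand-in for enumerating platforms and devices.
inline std::string find_default_device_name()
{
    ++query_count();
    return "device-0";  // hypothetical device name
}

// The static is initialized on the first call only; later calls
// return the cached value without repeating the query.
inline const std::string& default_device_name()
{
    static const std::string name = find_default_device_name();
    return name;
}
```
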
This fixes a couple of narrowing conversion warnings in the
device partitioning methods which were seen when compiling
VexCL with Boost.Compute in C++11 mode.
This adds a get<N>() function which returns the N'th element
of an aggregate type (e.g. vector type, pair, tuple).
This unifies the functionality of, and replaces, the get_pair()
and vector_component() functions.
This changes the vector class to not auto-initialize values
when it is created or resized. This improves performance by
eliminating a call to fill(). If needed, user code can call
fill() explicitly on the newly allocated values.
This increases the work-group size for the copy() kernel to be
up to 32 items based on the size of the input. This increases the
performance of copy() and related algorithms (e.g. transform()).
This adds a clamp_range() algorithm which clamps a range
of values between a low and high value. This is based on
the algorithm of the same name in Boost.Algorithm.
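
The clamping rule itself is simple enough to sketch on the host with std::transform. Here clamp_range_sketch is a hypothetical name and std::vector stands in for device ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Writes each input value, limited to the interval [lo, hi], to output.
template<class T>
void clamp_range_sketch(const std::vector<T>& input,
                        std::vector<T>& output,
                        T lo,
                        T hi)
{
    assert(!(hi < lo));  // the bounds must describe a valid interval
    output.resize(input.size());
    std::transform(input.begin(), input.end(), output.begin(),
                   [lo, hi](const T& x) {
                       return std::min(std::max(x, lo), hi);
                   });
}
```
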
This changes the enqueue_nd_range_kernel() method to return an
event object. This allows clients to monitor the progress of a
kernel executing on a device.
boost::compute::system::default_device() supports the following
environment variables:
BOOST_COMPUTE_DEFAULT_DEVICE for device name
BOOST_COMPUTE_DEFAULT_PLATFORM for OpenCL platform name
BOOST_COMPUTE_DEFAULT_VENDOR for device vendor name
If one or more of these variables is set, then the device that satisfies
all conditions is selected. If such a device is unavailable, then
the first available GPU is selected. If there are no GPUs in the
system, then the first available CPU is selected. Otherwise,
default_device() returns a null device.
The hello_world example is modified to use default_device() instead
of default_gpu_device().
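
The selection order described above can be sketched as a plain function over a list of device descriptions. Everything here is illustrative: device_info, matches and select_default_device are hypothetical names, the filters are passed as strings rather than read from the environment, and substring matching is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal description of a device for the purposes of the sketch.
struct device_info
{
    std::string name;
    std::string platform;
    std::string vendor;
    bool is_gpu;
    bool is_cpu;
};

// An empty filter means "variable not set"; otherwise a substring
// match is required (assumption of this sketch).
inline bool matches(const std::string& value, const std::string& filter)
{
    return filter.empty() || value.find(filter) != std::string::npos;
}

// Returns the index of the selected device, or -1 for "null device".
inline int select_default_device(const std::vector<device_info>& devices,
                                 const std::string& name_filter,
                                 const std::string& platform_filter,
                                 const std::string& vendor_filter)
{
    const bool any_filter = !name_filter.empty() ||
                            !platform_filter.empty() ||
                            !vendor_filter.empty();
    if (any_filter) {
        // First choice: a device that satisfies every filter that is set.
        for (std::size_t i = 0; i < devices.size(); ++i) {
            if (matches(devices[i].name, name_filter) &&
                matches(devices[i].platform, platform_filter) &&
                matches(devices[i].vendor, vendor_filter)) {
                return static_cast<int>(i);
            }
        }
    }
    // Fallbacks: first GPU, then first CPU.
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i].is_gpu) return static_cast<int>(i);
    }
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i].is_cpu) return static_cast<int>(i);
    }
    return -1;
}
```

In a real program the three filter strings would typically come from std::getenv() lookups of the variables listed above, with an unset variable mapped to an empty string.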
This adds a specialization of multiplies<T> for std::complex<T>
which implements complex number multiplication.
Also adds a simple test using transform() to verify the complex
multiplication works correctly.
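
For reference, the product rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i can be written out on a two-component value. Treating a complex number as a pair of floats is an assumption of this sketch, and float2_sketch and complex_multiply are hypothetical names.

```cpp
#include <cassert>

// Two-component value standing in for (real, imaginary).
struct float2_sketch
{
    float x, y;
};

// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
inline float2_sketch complex_multiply(float2_sketch p, float2_sketch q)
{
    float2_sketch r;
    r.x = p.x * q.x - p.y * q.y;  // real part
    r.y = p.x * q.y + p.y * q.x;  // imaginary part
    return r;
}
```
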
This fixes an unused variable warning which occurs in the
get_base_iterator_buffer() function when the base iterator
is not a buffer iterator and thus the iter argument is not
used.
This fixes a bug in which boost::result_of() would return the
wrong result type for a function due to the new implementation
using decltype instead of the result_of protocol on compilers
that sufficiently support C++11 (such as clang >= 3.2).
Now, boost::tr1_result_of() is used to explicitly request that
the result_of protocol be used even when decltype is supported
by the compiler.
This cleans up the constructor methods for the OpenCL wrapper
classes and unifies the API used for creating a wrapper class
object from the underlying OpenCL objects.
Now, every wrapper class has a constructor taking the OpenCL
object and an optional boolean retain parameter which indicates
whether the constructor should increment the reference count.
This updates the constructors for the image2d and image3d
classes to use the new clCreateImage() function instead of
the deprecated clCreateImage2D/3D() functions.
This changes the enqueue_marker() method in the command_queue
class to use clEnqueueMarkerWithWaitList() instead of the
deprecated clEnqueueMarker() function when compiling with
OpenCL 1.2.
This changes the enqueue_barrier() method in the command_queue
class to use clEnqueueBarrierWithWaitList() instead of the
deprecated clEnqueueBarrier() function when compiling with
OpenCL 1.2.
This removes the enqueue_wait_for_event() method from the
command_queue class as the clEnqueueWaitForEvents() function
has been deprecated in OpenCL 1.2.
This moves the unload_compiler() method from the system class
to the platform class. Also changes the method to use the
clUnloadPlatformCompiler() function instead of the deprecated
clUnloadCompiler() when compiling with OpenCL 1.2.
This moves the get_extension_function_address() method from
the system class to the platform class. Also changes the method
to use the clGetExtensionFunctionAddressForPlatform() function
instead of the deprecated clGetExtensionFunctionAddress() when
compiling with OpenCL 1.2.
This fixes a bug in the move constructor for the vector<T>
class.
Previously, the moved-from object was also deallocating the
memory buffer leading to an error when the moved-to object
attempted to use it. Now, the destructor checks if the buffer
is non-empty before deallocating it.
This removes support for cl_half (typedef'd to half_).
The issue is that the cl_half type is indistinguishable
from the cl_ushort type (both are typedefs for uint16_t)
which caused the cl_khr_fp16 pragma to be injected into
kernels using cl_ushort which causes errors on platforms
that do not support the cl_khr_fp16 extension.
This adds a new set of methods to the device class allowing
device objects to be partitioned into multiple sub-devices
using the clCreateSubDevices() function.
For now, device partitioning is only supported on systems
with OpenCL version 1.2 (or later).
This adds support for returning a std::vector<T> from the
various get_info<T>() methods. This provides a simpler
interface to get the values in an array returned from one
of the clGet*Info() functions.
This also adds a test using the new API to get the maximum
work item sizes in each dimension for a device.
This fixes a bug in which the remove_if() function would overwrite
parts of the input before they were properly copied to the output
range. This is fixed by first copying the input values to a temporary
vector and then passing that as the input range to copy_if().
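
The fix can be pictured on the host as follows: snapshot the input, then copy back only the elements that should be kept. This is a sketch with std::vector in place of device memory; remove_if_sketch and is_even are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Removes the elements for which pred is true and returns the new end.
template<class T, class Predicate>
typename std::vector<T>::iterator
remove_if_sketch(std::vector<T>& values, Predicate pred)
{
    // Reading from a temporary copy means the writes below can never
    // clobber input that has not been read yet.
    const std::vector<T> temp(values.begin(), values.end());
    return std::copy_if(temp.begin(), temp.end(), values.begin(),
                        [pred](const T& x) { return !pred(x); });
}

// Example predicate used to exercise the sketch.
inline bool is_even(int x) { return x % 2 == 0; }
```
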
This fixes a bug in which the Intel OpenCL compiler would
fail to compile the count_if() and find_if() kernels for
vector types with the following error:
error: no matching function for call to 'all'
note: candidate function not viable: 1st argument ('__global int4')
is in address space 16776960, but parameter must be in address space 0
This is caused when the predicate compares a value from the input
buffer (in the global memory space) to a literal value (in the
private memory space).
This is fixed by first reading the value into a local variable in
the private memory space and then calling the predicate function.
This fixes a bug in which the fill() algorithm was called by
scan_impl() with an integer zero rather than zero of the value
type which caused issues when using scan() with vector values.
This adds a new method which allows for type definitions and
type pragmas to be added to a meta_kernel.
This provides a more generic and general interface and replaces
the previously used add_pair_type() method along with the special
case handling of half and double types.
This fixes a bug in which certain platforms would return
CL_INVALID_VALUE from clCreateProgramWithBinary() if the
binary_status argument was not provided.
This removes the default type_name_trait::value() function
implementation.
Previously, the default implementation would return a null
pointer leading to run-time errors if a type name was not
provided. Now, a compile-time error will occur if type_name()
is called for an unknown type.
This implements the merge() algorithm which merges two
ranges of sorted values into a single sorted range.
The current implementation uses a simple serial merge
algorithm. A GPU optimized version is coming soon.
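
A simple serial merge of the kind mentioned here can be sketched with two read cursors. The code runs on the host with std::vector, and merge_sketch is a hypothetical name rather than the library's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merges two sorted ranges into a single sorted result.
template<class T>
std::vector<T> merge_sketch(const std::vector<T>& a, const std::vector<T>& b)
{
    std::vector<T> result;
    result.reserve(a.size() + b.size());
    std::size_t i = 0;
    std::size_t j = 0;
    // Repeatedly take the smaller front element; ties favor the first range.
    while (i < a.size() && j < b.size()) {
        if (b[j] < a[i]) {
            result.push_back(b[j++]);
        } else {
            result.push_back(a[i++]);
        }
    }
    // Append whatever remains in either range.
    while (i < a.size()) result.push_back(a[i++]);
    while (j < b.size()) result.push_back(b[j++]);
    return result;
}
```
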