This adds a parameter_cache class that stores execution parameters
for an algorithm. It also updates some of the benchmark programs to
find and store optimal parameters.
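
The idea of finding and storing optimal parameters can be sketched in
plain C++ as below. This is only an illustration under stated
assumptions, not the parameter_cache interface added here: the names
simple_parameter_cache and find_and_store_best, the (algorithm,
parameter) key layout, and the cost callback are all hypothetical. A
real benchmark would pass a callback that times the algorithm on the
device.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; this is not the
// Boost.Compute parameter_cache API.
class simple_parameter_cache
{
public:
    // Returns the stored value for (algorithm, parameter), or
    // default_value if nothing has been stored yet.
    std::size_t get(const std::string &algorithm,
                    const std::string &parameter,
                    std::size_t default_value) const
    {
        auto it = m_values.find(std::make_pair(algorithm, parameter));
        return it != m_values.end() ? it->second : default_value;
    }

    // Stores (or overwrites) the value for (algorithm, parameter).
    void set(const std::string &algorithm,
             const std::string &parameter,
             std::size_t value)
    {
        m_values[std::make_pair(algorithm, parameter)] = value;
    }

private:
    std::map<std::pair<std::string, std::string>, std::size_t> m_values;
};

// Evaluates each candidate with cost(), stores the cheapest one in the
// cache, and returns it. The candidates vector must not be empty.
template<class Cost>
std::size_t find_and_store_best(simple_parameter_cache &cache,
                                const std::string &algorithm,
                                const std::string &parameter,
                                const std::vector<std::size_t> &candidates,
                                Cost cost)
{
    std::size_t best = candidates.front();
    double best_cost = cost(best);

    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const double c = cost(candidates[i]);
        if (c < best_cost) {
            best_cost = c;
            best = candidates[i];
        }
    }

    cache.set(algorithm, parameter, best);
    return best;
}
```

In this sketch a benchmark calls find_and_store_best() once with its
candidate values, and the algorithm later calls get() with a fallback
default so that it still works when nothing has been tuned.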

This adds third-party performance tests for comparing Boost.Compute
with other parallel/GPGPU frameworks, such as Intel's TBB and NVIDIA's
Thrust, as well as the C++ STL. It also refactors the timing and
profiling infrastructure and adds a simple perf.py driver script for
running performance tests.

This changes the vector<T> constructors that copy or initialize data
to take a queue argument, which is used to perform those operations.
Previously they took only a context argument, used it to initialize
the buffer, and then created a new command queue for the copy or
initialization. Using the caller's queue improves performance by
avoiding the creation of a new command queue, and it also fixes issues
that occurred when operations were performed on a different command
queue while the vector was still being initialized.