histogram/doc/rationale.qbk

184 lines
21 KiB
Plaintext

[/
Copyright Hans Dembinski 2018 - 2019.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
https://www.boost.org/LICENSE_1_0.txt)
]
[section:rationale Rationale]
[section:motivation Motivation]
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If a few more features required by scientific professionals are added onto the wish-list, then the implementation becomes non-trivial and a well-tested library solution desirable.
The [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)] and the [@https://root.cern.ch ROOT framework] from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. ROOT has well-tested implementations of histograms, but they are not customizable and they are not easy to use correctly. ROOT also has new implementations in beta-stage similar to this one, but they are still less flexible, not easy to use, and they cannot be used without the rest of ROOT, which is a huge library to install just to get histograms.
The templated histogram class in this library has a minimal interface and focuses on the core task of creating histograms from input data. It is very customizable and extensible through user-provided classes. A single implementation is used for one and multi-dimensional histograms. While being safe, customizable, and convenient, the histogram is also very fast. The static version, which has an axis configuration that is hard-coded at compile-time, is faster than any tested competitor.
One of the central design goals was to hide the implementation details of the internal counters of the histogram. The internal counting mechanism is encapsulated in a storage class, which can be switched out. The default storage uses an adaptive memory management which is safe to use, memory-efficient, and fast. The safety comes from the guarantee, that counts cannot overflow or be capped. This is a rare guarantee, hardly found in other libraries. In the standard configuration, the histogram /just works/ under any circumstance. Yet, users with special requirements can implement their own custom storage class or use an alternative builtin array-based storage.
[endsect]
[section:guidelines Guidelines]
This library was written based on a decade of experience collected in working with big data, more precisely in the field of particle physics and astroparticle physics. The design is guided by advice from people like Bjarne Stroustrup, Scott Meyers, Herb Sutter, and Andrei Alexandrescu, and Chandler Carruth. The [@https://www.python.org/dev/peps/pep-0020 Zen of Python] (also applies to other languages) was an inspiration and well as ideas from the [@https://eigen.tuxfamily.org/ Eigen library]. The feature set was designed to be a superset of what is offered by the [@https://root.cern.ch ROOT framework] and the [@https://www.gnu.org/software/gsl GNU Scientific Library (GSL)].
Design goals of the library:
* Provide a simple and convenient default behavior for the casual user, yet allow a maximum of customization for the power user. Follow the "Don't pay for what you don't use" principle. Features that you don't use should not affect your performance negatively.
* Provide the same interface for one-dimensional and multi-dimensional histograms. This makes the interface easier to learn, and makes it easier to move a project from one-dimensional to multi-dimensional analysis.
* Hide the details of how the bin counters work. This design allows for interesting implementations, such as the default storage that provides a no-overflow-guarantee, which no other library offers.
* Minimalism, STL and Boost compatibility. Focus the library on the task of creating histograms. Functionality on top of that (drawing, further processing...) should come from other libraries. This gives users maximum flexibility to mix and match libraries. The histogram provides iterators ranges that allow other libraries access to the histogram state. The library provides iterators to access its internal counters, making it compatible with STL algorithms and other Boost libraries. In addition, the library was made compatible with [@boost:/libs/accumulators/index.html Boost.Accumulators] and [@boost:/libs/range/index.html Boost.Range].
[endsect]
[section:no_lambdas No lambdas as axis types]
Lambdas were considered and rejected as a form of simple user-defined axis type, because they do not allow access to their state, such as the current axis size. Lambdas can be fully replaced by locally-defined structs. A local struct cannot be templated and cannot have templated methods, but this is not an issue. In the local context where the struct is created, all relevant types must be known already so that locally defined structs can simply use these concrete types and there is no need for templates.
[endsect]
[section:uoflow Under- and overflow bins]
Axis instances by default add extra bins that count values which fall below or above the range covered by the axis (for those types where that makes sense). These extra bins are called under- and overflow bins, respectively. The extra bins can be turned off individually for each axis to conserve memory, but it is generally recommended to have them. The normal bins, excluding under- and overflow, are called *inner bins*.
Under- and overflow bins are useful in one-dimensional histograms, and nearly essential in multi-dimensional histograms. Here are the advantages:
* No loss: The total sum over all bin counts is strictly equal to the number of times the histogram was filled. Even NaN values are counted, they are put in the overflow-bin by convention.
* Diagnosis: Unexpected extreme values show up in the extra bins, which otherwise may be overlooked.
* Ability to reduce histograms: In multi-dimensional histograms, an out-of-range value along one axis may be paired with an in-range value along another axis. If under- and overflow bins are missing, such a value pair is lost completely. If you apply a `reduce` operation on a histogram, which removes some axes by summing all counts along that dimension, this would lead to distortions of the histogram along the remaining axes. When under- and overflow bins are present, the `reduce` operation always produces a sub-histogram identical to one obtained, if it was filled with the original data.
The presence of the extra bins does not interfere with normal indexing. On an axis with `n` bins, the first bin has the index `0`, the last bin `n-1`, while the under- and overflow bins are accessible at the indices `-1` and `n`, respectively. This choice is optimized for users who are unaware of the existence of these extra bins. They would find the other indexing scheme surprising, where you start with `0` at the underflow bin and the first normal bin is at `1`. Also, the chosen scheme allows one to turn off the extra bins in the code where the histogram is created, without changing any code downstream that addresses inner bins with indices.
[endsect]
[section:index_type Size method of axis returns signed integer]
The standard library returns a container size as an unsigned integer, because a container size cannot be negative. The `size()` method of the histogram class follows this rule, but the `size()` methods of axis types return a signed integral type. Why?
As explained in the [link histogram.rationale.uoflow section about under- and overflow], a histogram axis may have an optional underflow bin, which is addressed by the index `-1`. It follows that the index type must be signed integer for all axis types.
The `size()` method of any axis returns the same signed integer type. The size of an axis cannot be negative, but this choice has two advantages. Firstly, the value returned by `size()` itself is guaranteed to be a valid index, which is good since it may address the overflow bin. Secondly, comparisons between an index and the value returned by `size()` are frequent. If `size()` returned an unsigned integral type, compilers would produce a warning for each comparisons, and rightly so. [@https://www.youtube.com/watch?v=wvtFGa6XJDU Something awful happens] on most machines when you compare `-1` with an unsigned integer, `-1 < 1u == false`, which causes a serious bug in the following innocent-looking loop:
```
auto my_axis = /* ... */;
// naive loop to iterate over all bins, including underflow and overflow
for (int i = -1; i <= my_axis.size(); ++i) {
// body is never executed if return value of my_axis.size() is an unsigned integral type
}
```
The advantages clearly override the disadvantages of this choice.
[endsect]
[section:real_index_type Continuous axis accepts real-valued cell index]
Each axis has a method called `value(index_type)` which converts an index into the equivalent value at that index. If the axis is continuous, there are many possible values in the interval between two adjacent integer indices. User often want to access the center of such an interval. An easy and very efficient way to access the center value is for this method to accept real-valued indices. Then, the center of the first bin between index `i` and `i+1` is simply obtained by passing `i+0.5`.
This scheme is computationally efficient and intuitive. Each continuous axis is required to accept a real-valued index, in fact, internal library code relies uses this to detect whether an axis is continuous or discrete.
[endsect]
[section:variance On variance estimates]
Once a histogram is filled, the bin counter can be accessed with the `at(...)` method. Some accumulators offer a `value()` method to return the cell value ['k] and a `variance()` method, which returns an estimate ['v] of the [@https://en.wikipedia.org/wiki/Variance variance] of that cell.
If the input values for the histogram come from a [@https://en.wikipedia.org/wiki/Stochastic_process stochastic process], the variance estimate provides useful additional information. Examples for a stochastic process are a physics experiment or a random person filling out a questionnaire [footnote The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions]. The variance ['v] is the square of the [@https://en.wikipedia.org/wiki/Standard_deviation standard deviation]. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else would repeat our experiment with new random input.
Variance estimates are useful in many ways:
* Error bars: Drawing an [@https://en.wikipedia.org/wiki/Error_bar error bar] over the interval ['(k - sqrt(v), k + sqrt(v))] is a simple visualization of the expected random scatter of the bin value ['k], if the histogram was cleared and filled again with another independent sample of the same size (e.g. by repeating the physics experiment or asking more people to fill a questionnaire). If you compare the result with a fitted model (see next item), about 2/3 of the error bars should overlap with the model, if the model is correct.
* Least-squares fitting: Often you have a model of the expected number of counts ['lambda] per bin, which is a function of parameters with unknown values. A simple method to find good (sometimes the best) estimates for those parameter values is to vary them until the sum of squared residuals ['(k - lambda)^2/v] is minimized. This is the [@https://en.wikipedia.org/wiki/Least_squares method of least squares], in which both the bin values ['k] and variance estimates ['v] enter.
* Pull distributions: If you have two histograms filled with the same number of samples and you want to know whether they are in agreement, you can compare the so-called pull distribution. It is formed by subtracting the counts and dividing by the square root of their variances ['(k1 - k2)/sqrt(v1 + v2)]. If the histograms are identical, the pull distribution randomly scatters around zero, and about 2/3 of the values are in the interval ['[ -1, 1]].
Why return the variance ['v] and not the standard deviation ['s = sqrt(v)]? The reason is that variances can be trivially added and it is computationally more efficient to return the variance. [@https://en.wikipedia.org/wiki/Variance#Properties Variances of independent samples can be added] like normal numbers ['v3 = v1 + v2]. This is not true for standard deviations, where the addition law is more complex ['s3 = sqrt(s1^2 + s2^2)]. In that sense, the variance is more straight-forward to use during data processing. The user can take the square-root at the end of the processing obtain the standard deviation as needed.
How is the variance estimate ['v] computed for a normal counting histogram? If we know the expected number of counts ['lambda] per bin, we could compute the variance as ['v = lambda], because counts in a histogram follow the [@https://en.wikipedia.org/wiki/Poisson_distribution Poisson distribution]
[footnote
The Poisson distribution is correct as far as the counts ['k] themselves are of interest. If the fractions per bin ['p = k / N] are of interest, where ['N] is the total number of counts, then the correct distribution to describe the fractions is the [@https://en.wikipedia.org/wiki/Multinomial_distribution multinomial distribution].
]. After filling a histogram, we do not know the expected number of counts ['lambda] for any particular bin, but we know the observed count ['k], which is not too far from ['lambda]. We therefore might be tempted to just replace ['lambda] with ['k] in the formula ['v = lambda = k]. This is in fact the so-called non-parametric estimate for the variance based on the [@https://en.wikipedia.org/wiki/Plug-in_principle plug-in principle]. It is the best (and only) estimate for the variance, if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
[endsect]
[section:weights Support of weighted fills]
A histogram sorts input values into bins and increments a bin counter if an input value falls into the range covered by that bin. The [classref boost::histogram::unlimited_storage standard storage] uses integer types to store these counts, see the [link histogram.overview.structure.storage storage section] how integer overflow is avoided. However, sometimes histograms need to be filled with values that have a weight ['w] attached to them. In this case, the corresponding bin counter is not increased by one, but by the weight value ['w].
[note
There are several use-cases for weighted increments. The main use in particle physics is to adapt simulated data of an experiment to real data. Simulations are needed to determine various corrections and efficiencies, but a simulated experiment is almost never a perfect replica of the real experiment. In addition, simulations are expensive to do. So, when deviations in a simulated distribution of a variable are found, one typically does not rerun the simulations, but assigns weights to match the simulated distribution to the real one.
]
When the [classref boost::histogram::weight_storage weight_storage] is used, histograms may be filled with weighted value tuples. Two real numbers per bin are stored in this case. The first keeps track of the sum of weights. The second keeps track of the sum of weights squared, which is the variance estimate in this case. The former is accessed with the `value()` method of the bin counter, and the latter with the `variance()` method.
Why the sum of weights squared is the variance estimate can be derived from the [@https://en.wikipedia.org/wiki/Variance#Properties mathematical properties of the variance]. Let us say a bin is filled ['k1] times with a fixed weight ['w1]. The sum of weights is then ['w1 k1]. It then follows from the variance properties that ['Var(w1 k1) = w1^2 Var(k1)]. Using the reasoning from before, the estimated variance of ['k1] is ['k1], so that ['Var(w1 k1) = w1^2 Var(k1) = w1^2 k1]. Variances of independent samples are additive. If the bin is further filled ['k2] times with weight ['w2], the sum of weights is ['w1 k1 + w2 k2], with variance ['w1^2 k1 + w2^2 k2]. This also holds for ['k1 = k2 = 1]. Therefore, the sum of weights ['w[i]] has variance sum of ['w[i]^2]. In other words, to incrementally keep track of the variance of the sum of weights, we need to keep track of the sum of weights squared.
[endsect]
[section Python support]
Python is a popular scripting language in the data science community. Thus, the library must be designed to support Python bindings, which are developed separately. The histogram should usable as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python. Users are able to define a histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
This is a major reason why a purely static design was rejected, where the histogram must be fully configured at compile-time. While this generates more efficient code, it does not work with Python, which requires one to configure histograms at run-time without recompiling the code.
[endsect]
[section Support of Boost.Accumulators]
Boost.Histogram can be configured to use arbitrary accumulators as cells, in particular the accumulators from [@boost:/libs/accumulators/index.html Boost.Accumulators]. Sample values can be passed to the cell accumulator, which it may use to compute the mean, median, variance or other statistics of the samples sorted into each cell.
[endsect]
[section Support of Boost.Range]
The histogram class is a valid range and can be used with the [@boost:/libs/range/index.html Boost.Range] library. This library provides a custom adaptor generator, `indexed`, analog to the corresponding adaptor generator in Boost.Range, but with a potentially multi-dimensional index.
[endsect]
[section Support of serialization]
Serialization is implemented using [@boost:/libs/serialization/index.html Boost.Serialization]. It would be great to have a portable binary archive with support for floating point data to store and retrieve histograms efficiently, which is currently not available. The library has to be open for other serialization libraries.
[endsect]
[section Comparison to Boost.Accumulators]
Boost.Histogram has a minor overlap with [@boost:/libs/accumulators/index.html Boost.Accumulators], but the scopes are rather different. The statistical accumulators `density` and `weighted_density` in Boost.Accumulators generate one-dimensional histograms. The axis range and the bin widths are determined automatically from a cached sample of initial values. They cannot be used for multi-dimensional data. Boost.Histogram focuses on multi-dimensional data and gives the user full control of how the binning should be done for each dimension.
Automatic binning is not an option for Boost.Histogram, because it does not scale well to many dimensions. Because of the Curse of Dimensionality, a prohibitive number of samples would need to be collected.
[note
There is no scientific consensus on how do automatic binning in an optimal way, mostly because there is no consensus over the cost function (there are many articles with different solutions in the literature). The problem is not solved for one-dimensional data, and even less so for multi-dimensional data.
]
Recommendation:
* Boost.Accumulators
* You have one-dimensional data of which you know nothing about, and you want a histogram quickly without worrying about binning details.
* Boost.Histogram
* You have multi-dimensional data or you suspect you will switch to multi-dimensional data later.
* You want to customize the binning by hand, for example, to make bin edges coincide with special values or to handle special properties of your values, like angles defined on a circle.
[endsect]
[section Why is Boost.Histogram not built on top of Boost.MultiArray?]
Boost.MultiArray implements a multi-dimensional array, it also converts an index tuple into a global index that is used to access an element in the array. Boost.Histogram and Boost.MultiArray share this functionality, but Boost.Histogram cannot use Boost.MultiArray as a back-end. Boost.MultiArray makes the rank of the array a compile-time property, while this library needs the rank to be dynamic.
Boost.MultiArray also does not allow to change the element type dynamically. This is needed to implement the adaptive storage mentioned further up. Using a variant type as the element type of a Boost.MultiArray would not work, because it creates this wasteful layout:
`[type-index 1][value 1][type-index 2][value 2]...`
A type index is stored for each cell. Moreover, the variant is always as large as the largest type in the union, so there is no way to safe memory by using a smaller type when the bin count is low, as it is done by the adaptive storage. The adaptive storage uses only one type-index for the whole array and allocates a homogeneous array of values of the same type that exactly matches their sizes, creating the following layout:
`[type-index][value 1][value 2][value 3]...`
There is only one type index and the number of allocated bytes for the array can adapted dynamically to the size of the value type.
[endsect]
[endsect]