histogram/doc/overview.qbk
2019-11-21 00:51:05 +01:00

111 lines
14 KiB
Plaintext

[/
Copyright Hans Dembinski 2018 - 2019.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
https://www.boost.org/LICENSE_1_0.txt)
]
[section:overview Overview]
[section:introduction Introduction]
[@https://en.wikipedia.org/wiki/Histogram Histograms] are a basic tool in statistical analysis. A histogram consists of a number of non-overlapping cells in data space. When an input value is passed to the histogram, the corresponding cell that envelopes the value is found and an associated counter is incremented.
When analyzing a large low-dimensional data set, it is more convenient to work with a histogram of the input values than the original values. Keeping the cell counts in memory for analysis and/or processing the counts requires far fewer resources than keeping the original values in memory and processing them. Information present in the original can also be extracted from the histogram[footnote Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; likewise, statistical models can be fitted to histograms.]. Some information is lost in this way, but if the cells are small enough[footnote What small enough means has to be decided case by case.], the loss is negligible. A histogram is a kind of lossy data-compression. It is also often used as a simple estimator for the [@https://en.wikipedia.org/wiki/Probability_density_function probability density function] of the input data. More complex density estimators exist, but histograms remain attractive because they are easy to reason about.
This library provides a histogram for multi-dimensional data. In the multi-dimensional case, the input consist of tuples of values which belong together and describing different aspects of the same entity. For example, when you make a digital image with a camera, photons hit a pixel detector. The photon is the entity and it has two coordinates values where it hit the detector. The camera only counts how often a photon hit each cell, so it is a real-life example of making a two-dimensional histogram. A two-dimensional histogram collects more information than two separate one-dimensional histograms, one for each coordinate. For example, if the two-dimensional image looks like a checker board, with high and low densities are alternating along each coordinate, then the one-dimensional histograms along each coordinate would look flat. There would be no hint that there is a complex structure in two dimensions.
The term /histogram/ is usually strictly used for something with cells over discrete or continuous data. This histogram class can also process categorical variables and it even allows for non-consecutive cells if that is desired. There is no restriction to numbers as input either. Any C++ type can be fed into the histogram, if the user provides a specialized axis class that maps values of this type to a cell index. The only remaining restriction is that cells are non-overlapping, since there must be a unique mapping from input value to cell. The library is not able to automatically ensure this for user-provided axis classes, so the responsibly is on the user.
Furthermore, the histogram can handle weighted input. Normally, the cell counter which is connected to an input tuple is incremented by one, but sometimes it is useful to increment by a weight, an integral or floating point number. Finally, the histogram can be configured to store any kind of accumulator in each cell. Arbitrary samples can be passed to this accumulator, which may compute the mean or other interesting quantities from the samples that are sorted into the cell. When the accumulator computes a mean, the result is called a /profile/. The feature set is informed by popular libraries for scientific computing, notably [@https://root.cern.ch CERN's ROOT framework] and the [@https://www.gnu.org/software/gsl GNU Scientific Library].
[endsect]
[section:structure Structure]
The library consists of three orthogonal components:
* [link histogram.overview.structure.host histogram host class]: The histogram host class defines the public user interface and holds axis objects (one for each dimension) and a storage object. The user can chose whether axis objects are stored in a static tuple, a vector, or a vector of variants.
* [link histogram.overview.structure.axis axis types]: Defines how input values are mapped to bins. Several axis types are provided which implement different specializations. Users can make their own axis types following the axis concept and use them with the library. A variant type is provided, which can hold one of several concrete axis types.
* [link histogram.overview.structure.storage storage types]: Manages a collection of bin counters. The requirements for a storage differ from those of an STL container, it needs to follow the storage concept. Two implementations are provided.
[section:host Histogram host class]
Histograms store axis objects and a storage object. A one-dimensional histogram has one axis, a multi-dimensional histogram has several. When you pass an input tuple, say (v1, v2, v3), then the first axis will map v1 onto index i1, the second axis v2 onto i2, and so on, to generate the index tuple (i1, i2, i3). The histogram host class then converts these indices into a linear global index that is used to address bin counter in the storage object.
[note
To understand the need for multi-dimensional histograms, think of point coordinates. If all points that you consider lie on a line, you need only one value to describe the point. If all points lie in a plane, you need two values to describe the position. Three values are needed for a point in space. A histogram puts a discrete grid over the line, the plane or the space, and counts how many points lie in each cell of the grid. To approximate a point distribution on a line, a 1d-histogram is sufficient. To do the same in 3d-space, one needs a 3d-histogram.
]
This library supports different axis types, so that the user can customize how the mapping is done exactly, see [link histogram.overview.structure.axis axis types]. Users can furthermore chose between several ways of storing axis types in the histogram.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a `std::tuple`. We call this a /static histogram/. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see [link histogram.benchmarks benchmark]), because there is no overhead and the compiler can inline code, unroll loops, and more. Also, many user errors are can be caught at compile-time rather than run-time.
Static histograms are the best kind, but cannot be used when histograms are to be created with an axis configuration that is only known at run-time. This is the case, for example, when histograms are created from Python or from a graphical user interface. Therefore also more dynamic kinds of histograms are supported.
There are two levels of dynamism. The histogram can hold instances of a single axis type in a `std::vector`. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a /semi-dynamic histogram/.
If also the axis types need to vary at run-time, one can place `boost::histogram::axis::variant` type in a `std::vector`, which can hold one of a set of different concrete axis types. We call this a /dynamic histogram/. The dynamic histogram is a single type that can store arbitrary sequences of different axes types, which may be generated at run-time. The polymorphic behavior of the generic `boost::histogram::axis::variant` type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
[note
The design decision to store axis types in the variant-like type `boost::histogram::axis::variant` has several advantages over forms of run-time polymorphism. Firstly, it guarantees that axis objects which belong to the same histogram are stored locally together in memory, which reduces cache misses when the histogram iterates over axis objects in a tight loop, which it often does. Secondly, each axis can accept a different value type in this scheme. Classic polymorphism with vtables requires that all classes share the same method call signatures, but we want different axis to allowed to accepted different types of arguments. Variants work well for this case.
]
[endsect]
[section:axis Axis types]
An axis defines an injective mapping of (a range of) input values to a bin. The logic is encapsulated in an axis type. Users can create their own axis classes and use them with the library, by implementing the [link histogram.concepts.Axis [*Axis] concept]. The library comes with four builtin types, which implement different specializations.
* [classref boost::histogram::axis::regular] sorts real numbers into bins with equal width. The regular axis also supports monotonic transforms, which are applied when the input values are passed to the axis. This can be used to make a fast logarithmic axis, where the bins have equal width in the logarithm of the variable.
* [classref boost::histogram::axis::variable] sorts real numbers into bins with varying width.
* [classref boost::histogram::axis::integer] is a specialization of a regular axis for a range of integers with unit bin width. It is much faster than a regular axis.
* [classref boost::histogram::axis::category] is a bijective mapping of unique values onto bin indices and vice versa. This can be used with discrete categorical data, like "red", "green", "blue", for example.
Each builtin axis type has a few compile-time options, which change its behavior.
* All axis types can have an optional overflow bin. When the overflow bin is enabled and an input value is above the range covered by the axis, it is not discarded but counted in the overflow bin.
* All axis types except the category axis can have an optional underflow bin. When the underflow bin is enabled and an input value is below the range covered by the axis, it is not discarded but counted in the underflow bin.
* All axis types except the category axis can be circular, meaning that the axis range is periodic. This is useful for periodic data like polar angles.
* All axis types can optionally grow. When an input value is outside of the range of an axis which is configured to grow, the range of the axis is extended until the value is in range. This option is incompatible with the circular option, only either can be active.
[endsect]
[section:storage Storage types]
A storage type holds the actual cell values. It uses a one-dimensional index for cell lookup, computed by the histogram host from the indices generated by its axes. The storage needs to know nothing about axes. Users can integrate their own storage classes with the library, by implementing the [link histogram.concepts.Storage storage concept]. Standard containers can be used as storage backends, the library adapts them with the [classref boost::histogram::storage_adaptor].
Cell lookup is often happening in a tight loop and is random-access. A normal `std::vector` works well as a storage backend. Sometimes this is the best solution, but there are some caveats to this approach. The user has to decide which type should represents the cell counts and it is not an obvious choice. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if its capacity is too large. This is a waste of memory then. When memory is wasted, more cache misses occur and performance is degraded (see the benchmarks). The performance of modern CPUs depends a lot on effective utilization of the CPU cache, which is still small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
The default storage used in the library is [classref boost::histogram::unlimited_storage]. It solves these issues with a dynamic counter type management, based on the following insight. The [@https://en.wikipedia.org/wiki/Curse_of_dimensionality curse of dimensionality] makes the total number of bins grow very fast as the dimension of the histogram grows. However, having many bins also reduces the typical number of counts per bin, since the input values are spread over many more bins now. This means a small counter is often sufficient for high-dimensional histograms.
The default storage therefore starts with a minimum amount of memory per cell, it uses an 1 byte. If the count in any cell is about to overflow, all cells switch to the next larger integer type simultaneously. This goes on, the capacity per cell is always doubled when it is used up, until 8 byte per bin are reached. The following images illustrate this progression for a storage of 3 bin counters. A new memory block is always allocated for all counters, when the first one of them hits its capacity limit.
[$../storage_3_uint8.svg]
[$../storage_3_uint16.svg]
[$../storage_3_uint32.svg]
[$../storage_3_uint64.svg]
When even that is not enough, the default storage switches to a multiprecision type similar to those in [@boost:/libs/multiprecision/index.html Boost.Multiprecision], whose capacity is limited only by available memory. The following image is not to scale:
[$../storage_3_cpp_int.svg]
This approach is not only memory conserving, but also provides the strong guarantee that bin counters cannot overflow.
[note
The no-overflow-guarantee only applies when the histogram is not using weighted fills or if all weights are integral numbers. When floating point weights are used, the default storage switches to a double counter per cell to store the sum of such weights. A double cannot provide the no-overflow-guarantee.
]
The best part: this approach is even faster for a histogram with sufficient size despite the run-time overheads of handling the counter type dynamically. The benchmarks show that the gains from better cache usage outweigh the run-time overheads of dynamic dispatching to the right bin counter type and the occasional allocation costs. Doubling the size of the bin counters each time helps, because the allocations happen only O(logN) times for N increments.
[endsect]
[endsect]
[endsect]