| Class | Description |
|---|---|
| KllDoublesSketch | |
| KllDoublesSketchIterator | Iterator over KllDoublesSketch. |
| KllFloatsSketch | |
| KllFloatsSketchIterator | Iterator over KllFloatsSketch. |
| KllSketch | This class is the root of the KLL sketch class hierarchy. |

| Enum | Description |
|---|---|
| KllSketch.SketchType | Used to define the variable type of the current instance of this class. |

This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using the getQuantile() or getQuantiles() functions or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).
Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N-1) in the hypothetical sorted stream of all N input values.
The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value in the interval [0.0, 1.0). In the documentation and Javadocs for this sketch, absolute rank is never used, so any reference to just rank should be interpreted to mean normalized rank.
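The two definitions above can be checked exactly on a tiny stream. The following is a minimal sketch, not part of the library: the class name RankDefinition and its methods are hypothetical, and it assumes that for duplicate values the absolute rank is the count of items strictly smaller than the queried value.

```java
// Illustration only: exact ranks computed by brute force, not by a KLL sketch.
final class RankDefinition {

    // Absolute rank: index of the value in the hypothetical sorted stream,
    // computed here as the number of items strictly smaller than the value.
    static int absoluteRank(double[] stream, double value) {
        int count = 0;
        for (double v : stream) {
            if (v < value) {
                count++;
            }
        }
        return count;
    }

    // Normalized rank: absolute rank divided by N, a value in [0.0, 1.0).
    static double normalizedRank(double[] stream, double value) {
        return (double) absoluteRank(stream, value) / stream.length;
    }

    public static void main(String[] args) {
        double[] stream = {40.0, 10.0, 30.0, 20.0}; // sorted: 10, 20, 30, 40
        System.out.println(absoluteRank(stream, 30.0));   // prints 2
        System.out.println(normalizedRank(stream, 30.0)); // prints 0.5
    }
}
```

Because the largest item has absolute rank N-1, its normalized rank is (N-1)/N, which is why the interval is open at 1.0.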
This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.
In the research literature, the estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. The epsilon error is always with respect to the rank and cannot be applied to the corresponding values.
The relationship between the normalized rank and the corresponding values can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the y-axis is specified as the value-axis and the x-axis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.
The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
The getPMF(...) function has about 13% to 47% worse rank error (depending on k) than the other queries. The mass of each "bin" of the PMF is obtained by a subtraction, so it carries "double-sided" error from the upper and lower edges of the bin, and the errors from the two edges can sometimes add.
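The translation between ranks and values, and the subtraction behind each PMF bin, can be illustrated with exact (non-sketch) computations on a small sorted array. This is a minimal sketch under stated assumptions: the class ExactQueries and its helper methods are hypothetical stand-ins that only borrow the names of the queries, and the convention of returning one more entry than there are split points is an assumption of this example.

```java
import java.util.Arrays;

// Illustration only: exact stand-ins for the rank, quantile, CDF and PMF queries.
final class ExactQueries {

    // Value -> rank: fraction of items strictly less than the value.
    static double rank(double[] sorted, double value) {
        int count = 0;
        for (double v : sorted) {
            if (v < value) {
                count++;
            }
        }
        return (double) count / sorted.length;
    }

    // Rank -> value: the item at index floor(rank * N) of the sorted array.
    static double quantile(double[] sorted, double rank) {
        int index = Math.min((int) (rank * sorted.length), sorted.length - 1);
        return sorted[index];
    }

    // CDF: rank of each split point, followed by 1.0 for the whole stream.
    static double[] cdf(double[] sorted, double[] splitPoints) {
        double[] out = new double[splitPoints.length + 1];
        for (int i = 0; i < splitPoints.length; i++) {
            out[i] = rank(sorted, splitPoints[i]);
        }
        out[splitPoints.length] = 1.0;
        return out;
    }

    // PMF: each bin mass is a difference of two CDF values, so in a sketch
    // the rank errors of both bin edges enter the result.
    static double[] pmf(double[] sorted, double[] splitPoints) {
        double[] c = cdf(sorted, splitPoints);
        double[] out = new double[c.length];
        out[0] = c[0];
        for (int i = 1; i < c.length; i++) {
            out[i] = c[i] - c[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sorted = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        double[] splits = {3.0, 7.0};
        System.out.println(rank(sorted, 5.0));                   // prints 0.5
        System.out.println(quantile(sorted, 0.5));               // prints 5.0
        System.out.println(Arrays.toString(cdf(sorted, splits))); // prints [0.25, 0.75, 1.0]
        System.out.println(Arrays.toString(pmf(sorted, splits))); // prints [0.25, 0.5, 0.25]
    }
}
```

In this exact setting the subtraction is harmless. In the sketch, each CDF value is itself an estimate, which is where the "double-sided" error of a PMF bin comes from.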
The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.
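To make these percentages concrete, the arithmetic below converts the quoted epsilons into a relative increase and into a number of positions in the sorted stream. It is a minimal sketch: the class RankErrorArithmetic, its methods, and the stream length of one million are hypothetical choices for illustration, while the two epsilon values are the ones quoted above for k = 200.

```java
// Illustration only: arithmetic on the epsilon values quoted for k = 200.
final class RankErrorArithmetic {

    // How much larger the double-sided (PMF) error is, as a fraction.
    static double relativeIncrease(double singleSidedEps, double doubleSidedEps) {
        return doubleSidedEps / singleSidedEps - 1.0;
    }

    // Rank error expressed as a count of positions in a sorted stream of n items.
    static long rankErrorInItems(double eps, long n) {
        return Math.round(eps * n);
    }

    public static void main(String[] args) {
        double singleSided = 0.0133; // about 1.33%
        double doubleSided = 0.0165; // about 1.65%

        // Roughly 0.24, which falls inside the 13% to 47% range.
        System.out.println(relativeIncrease(singleSided, doubleSided));

        // For N = 1,000,000 items, 1.33% of rank is 13,300 positions.
        System.out.println(rankErrorInItems(singleSided, 1_000_000L)); // prints 13300
    }
}
```

The second figure is an error in position within the sorted stream, not in the values themselves, consistent with epsilon applying only to rank.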
A getQuantile(rank) query has the following guarantees:
A getRank(value) query has the following guarantees:
A getPMF(...) query has the following guarantees:
A getCDF(...) query has the following guarantees:
From the above, it might seem that we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.
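One way to read "bracket a value" is to query the quantile at the rank minus epsilon and at the rank plus epsilon and treat the two results as an approximate interval. The following is a minimal sketch of that reading, not the library's procedure: the class ValueBracket is hypothetical, its quantile helper is an exact stand-in rather than a sketch query, and the epsilon of 0.125 is chosen only so the example stays small.

```java
// Illustration only: bracketing a value by querying at rank - eps and rank + eps.
final class ValueBracket {

    // Exact stand-in for a quantile query: item at index floor(rank * N).
    static double quantile(double[] sorted, double rank) {
        int index = Math.min((int) (rank * sorted.length), sorted.length - 1);
        return sorted[index];
    }

    // Returns {lower, upper}: the quantiles at the clamped ranks r - eps and r + eps.
    static double[] bracket(double[] sorted, double rank, double eps) {
        double lowRank = Math.max(0.0, rank - eps);
        double highRank = Math.min(1.0, rank + eps);
        return new double[] {quantile(sorted, lowRank), quantile(sorted, highRank)};
    }

    public static void main(String[] args) {
        // A distribution with a large gap in the middle.
        double[] sorted = {1.0, 2.0, 3.0, 4.0, 1000.0, 1001.0, 1002.0, 1003.0};
        double[] interval = bracket(sorted, 0.5, 0.125);
        System.out.println(interval[0]); // prints 4.0
        System.out.println(interval[1]); // prints 1001.0
    }
}
```

A small window in rank here spans almost the whole range of values, which shows how the interval can become quite large when the distribution has gaps.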
Please visit our website: DataSketches Home Page and the Javadocs for more information.
Copyright © 2015–2020 The Apache Software Foundation. All rights reserved.