Class KllFloatsSketch
public class KllFloatsSketch extends Object
This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using the getQuantile() or getQuantiles() functions or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).
Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N-1) in the hypothetical sorted stream of all N input values.
The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value between zero and one. In the documentation and Javadocs for this sketch, absolute rank is never used, so any reference to just rank should be interpreted to mean normalized rank.
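The two rank definitions above can be illustrated with a minimal exact (non-sketch) computation. This is only a sketch of the definitions, not of the library's internals; the class ExactRankDemo and its methods are hypothetical, and the tie rule (counting values strictly less than v) is an assumption, since the text above does not say how duplicates are handled.

```java
final class ExactRankDemo {
    /** Absolute rank of v: how many stream values are strictly less than v (assumed tie rule). */
    static int absoluteRank(float[] stream, float v) {
        int count = 0;
        for (float x : stream) {
            if (x < v) {
                count++;
            }
        }
        return count;
    }

    /** Normalized rank of v: absolute rank divided by the stream length N. */
    static double normalizedRank(float[] stream, float v) {
        return (double) absoluteRank(stream, v) / stream.length;
    }

    public static void main(String[] args) {
        float[] stream = {30f, 10f, 40f, 20f};           // sorted: 10, 20, 30, 40
        System.out.println(absoluteRank(stream, 30f));   // prints 2
        System.out.println(normalizedRank(stream, 30f)); // prints 0.5
    }
}
```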
This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.
The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding values.
The relationship between the normalized rank and the corresponding values can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the y-axis is specified as the value-axis and the x-axis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.
The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
The getPMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries. The mass of each "bin" of the PMF is the result of a subtraction between its upper and lower edges, so it has "double-sided" error: the errors from the two edges can sometimes add.
The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.
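To make these figures concrete, the following small calculation converts the two epsilon values quoted above into absolute rank positions for a stream of one million values, and checks that their ratio falls inside the stated 13 to 47% range. The class and method names are hypothetical, and the stream length is an arbitrary choice for illustration.

```java
final class RankErrorDemo {
    /** Converts a normalized rank error into a number of positions in a stream of length n. */
    static long errorInPositions(double eps, long n) {
        return Math.round(eps * n);
    }

    /** Fraction by which the double-sided error exceeds the single-sided error. */
    static double relativeIncrease(double singleSided, double doubleSided) {
        return doubleSided / singleSided - 1.0;
    }

    public static void main(String[] args) {
        double single = 0.0133; // "single-sided" epsilon quoted for k = 200
        double pmf = 0.0165;    // "double-sided" (PMF) epsilon quoted for k = 200
        long n = 1_000_000L;    // illustrative stream length (assumption)
        System.out.println(errorInPositions(single, n)); // prints 13300
        System.out.println(errorInPositions(pmf, n));    // prints 16500
        System.out.printf("%.0f%% worse%n", 100.0 * relativeIncrease(single, pmf)); // about 24%
    }
}
```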
A getQuantile(rank) query has the following guarantees:
- Let v = getQuantile(r) where r is the rank between zero and one.
- The value v will be a value from the input stream.
- Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
- Let eps = getNormalizedRankError(false).
- Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%. Note that the error is on the rank, not the value.
A getRank(value) query has the following guarantees:
- Let r = getRank(v) where v is a value between the min and max values of the input stream.
- Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
- Let eps = getNormalizedRankError(false).
- Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%.
A getPMF(...) query has the following guarantees:
- Let {r_1, r_2, ..., r_{m+1}} = getPMF(v_1, v_2, ..., v_m) where v_1, v_2, ..., v_m are monotonically increasing values supplied by the user that are part of the monotonic sequence {v_0 = min, v_1, v_2, ..., v_m, v_{m+1} = max}, and where min and max are the actual minimum and maximum values of the input stream automatically included in the sequence by the getPMF(...) function.
- Let r_i = mass_i = estimated mass between v_{i-1} and v_i, where v_0 = min and v_{m+1} = max.
- Let trueMass_i be the true mass between v_{i-1} and v_i derived from the hypothetical sorted stream of all N values.
- Let eps = getNormalizedRankError(true).
- Then mass_i - eps ≤ trueMass_i ≤ mass_i + eps with a confidence of 99%.
- r_1 includes the mass of all points between min = v_0 and v_1.
- r_{m+1} includes the mass of all points between v_m and max = v_{m+1}.
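As a point of comparison for the true mass in the list above, here is a minimal exact (non-sketch) PMF computed directly from the raw values. It follows the interval convention stated for the splitPoints parameter of getPMF: each interval includes its left split point and excludes its right one, and the last interval includes the maximum. The class ExactPmfDemo is hypothetical and is not part of the library.

```java
final class ExactPmfDemo {
    /** Returns m+1 exact masses for m strictly increasing split points. */
    static double[] exactPmf(float[] stream, float[] splitPoints) {
        long[] counts = new long[splitPoints.length + 1];
        for (float x : stream) {
            int bin = 0;
            // Move right while x is at or above the split point:
            // left edge inclusive, right edge exclusive.
            while (bin < splitPoints.length && x >= splitPoints[bin]) {
                bin++;
            }
            counts[bin]++;
        }
        double[] pmf = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            pmf[i] = (double) counts[i] / stream.length;
        }
        return pmf;
    }

    public static void main(String[] args) {
        float[] stream = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] pmf = exactPmf(stream, new float[] {3f, 7f});
        // Bins: [min, 3) holds 1,2; [3, 7) holds 3..6; [7, max] holds 7..10.
        System.out.println(java.util.Arrays.toString(pmf)); // prints [0.2, 0.4, 0.4]
    }
}
```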
A getCDF(...) query has the following guarantees:
- Let {r_1, r_2, ..., r_{m+1}} = getCDF(v_1, v_2, ..., v_m) where v_1, v_2, ..., v_m are monotonically increasing values supplied by the user that are part of the monotonic sequence {v_0 = min, v_1, v_2, ..., v_m, v_{m+1} = max}, and where min and max are the actual minimum and maximum values of the input stream automatically included in the sequence by the getCDF(...) function.
- Let r_i = mass_i = estimated mass between v_0 = min and v_i.
- Let trueMass_i be the true mass between v_0 = min and v_i derived from the hypothetical sorted stream of all N values.
- Let eps = getNormalizedRankError(false).
- Then mass_i - eps ≤ trueMass_i ≤ mass_i + eps with a confidence of 99%.
- r_1 includes the mass of all points between min = v_0 and v_1.
- r_{m+1} includes the mass of all points between min = v_0 and max = v_{m+1}.
From the above, it might seem like we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.
- Let v = getQuantile(r), the estimated quantile value of rank r.
- Let eps = getNormalizedRankError(false).
- Let vlo = estimated quantile value of rank (r - eps).
- Let vhi = estimated quantile value of rank (r + eps).
- Then vlo ≤ v ≤ vhi, with 99% confidence.
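The bracketing steps above can be sketched as follows. To keep the example self-contained, an exact quantile over a sorted array stands in for the sketch's getQuantile(); the class QuantileBracketDemo, its methods, and the clamping of r - eps and r + eps to the interval [0, 1] are all assumptions made for illustration.

```java
final class QuantileBracketDemo {
    /** Exact stand-in for getQuantile(): the value at the given normalized rank. */
    static float exactQuantile(float[] sorted, double rank) {
        int idx = (int) Math.min(sorted.length - 1, Math.floor(rank * sorted.length));
        return sorted[idx];
    }

    /** Returns {vlo, v, vhi} for rank r and rank error eps, clamping ranks to [0, 1]. */
    static float[] bracket(float[] sorted, double r, double eps) {
        double lo = Math.max(0.0, r - eps);
        double hi = Math.min(1.0, r + eps);
        return new float[] {
            exactQuantile(sorted, lo), exactQuantile(sorted, r), exactQuantile(sorted, hi)
        };
    }

    /** Builds the sorted values 0, 1, ..., n-1 as test data. */
    static float[] ramp(int n) {
        float[] a = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return a;
    }

    public static void main(String[] args) {
        float[] b = bracket(ramp(100), 0.5, 0.0133);
        System.out.println(b[0] + " <= " + b[1] + " <= " + b[2]); // prints 48.0 <= 50.0 <= 51.0
    }
}
```

On this uniform test data the bracket is narrow. As noted above, for other distributions the interval between vlo and vhi can be much wider, and it carries no error estimate of its own.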
- Author:
- Kevin Lang, Alexander Saydakov, Lee Rhodes
-
Field Summary
Fields (modifier and type, field: description)
- static int DEFAULT_K: The default value of K.
-
Constructor Summary
Constructors (constructor: description)
- KllFloatsSketch(): Heap constructor with the default k = 200, which has a rank error of about 1.33% for single-sided queries and about 1.65% for getPMF().
- KllFloatsSketch(int k): Heap constructor with a given parameter k.
-
Method Summary
Methods (modifier and type, method: description)
- double[] getCDF(float[] splitPoints): Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoint (values).
- int getK(): Returns the parameter k
- static int getKFromEpsilon(double epsilon, boolean pmf): Gets the approximate value of k to use given epsilon, the normalized rank error.
- static int getMaxSerializedSizeBytes(int k, long n): Returns upper bound on the serialized size of a sketch given a parameter k and stream length.
- float getMaxValue(): Returns the max value of the stream.
- float getMinValue(): Returns the min value of the stream.
- long getN(): Returns the length of the input stream.
- double getNormalizedRankError(boolean pmf): Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
- static double getNormalizedRankError(int k, boolean pmf): Gets the normalized rank error given k and pmf.
- int getNumRetained(): Returns the number of retained items (samples) in the sketch.
- double[] getPMF(float[] splitPoints): Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
- float getQuantile(double fraction): Returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
- float getQuantileLowerBound(double fraction): Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
- float[] getQuantiles(double[] fractions): This is a more efficient multiple-query version of getQuantile().
- float[] getQuantiles(int numEvenlySpaced): This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
- float getQuantileUpperBound(double fraction): Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
- double getRank(float value): Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1, inclusive.
- int getSerializedSizeBytes(): Returns the number of bytes this sketch would require to store.
- static KllFloatsSketch heapify(org.apache.datasketches.memory.Memory mem): Factory heapify takes the sketch image in Memory and instantiates an on-heap sketch.
- boolean isEmpty(): Returns true if this sketch is empty.
- boolean isEstimationMode(): Returns true if this sketch is in estimation mode.
- KllFloatsSketchIterator iterator()
- void merge(KllFloatsSketch other): Merges another sketch into this one.
- byte[] toByteArray(): Returns serialized sketch in a byte array form.
- String toString()
- String toString(boolean withLevels, boolean withData): Returns a summary of the sketch as a string.
- void update(float value): Updates this sketch with the given data item.
-
Field Details
-
DEFAULT_K
public static final int DEFAULT_K
The default value of K.
- See Also:
- Constant Field Values
-
-
Constructor Details
-
KllFloatsSketch
public KllFloatsSketch()
Heap constructor with the default k = 200, which has a rank error of about 1.33% for single-sided queries and about 1.65% for getPMF().
-
KllFloatsSketch
public KllFloatsSketch(int k)
Heap constructor with a given parameter k. k can be any value between 8 and 65535, inclusive. The default k = 200 results in a normalized rank error of about 1.33% for single-sided queries and about 1.65% for getPMF(). Higher values of k will have smaller error, but the sketch will be larger (and slower).
- Parameters:
- k - parameter that controls size of the sketch and accuracy of estimates
-
-
Method Details
-
heapify
public static KllFloatsSketch heapify(org.apache.datasketches.memory.Memory mem)
Factory heapify takes the sketch image in Memory and instantiates an on-heap sketch. The resulting sketch will not retain any link to the source Memory.
- Parameters:
- mem - a Memory image of a sketch serialized by this sketch. See Memory
- Returns:
- a heap-based sketch based on the given Memory.
-
getCDF
public double[] getCDF(float[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns null.
- Parameters:
- splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- Returns:
- an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
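The relationship between the two returned arrays described above (CDF position j is the sum of PMF positions 0 through j) amounts to a running sum. A minimal sketch, using a hypothetical helper class that is not part of the library:

```java
final class CdfFromPmfDemo {
    /** Builds the cumulative array: cdf[j] = pmf[0] + ... + pmf[j]. */
    static double[] cdfFromPmf(double[] pmf) {
        double[] cdf = new double[pmf.length];
        double running = 0.0;
        for (int j = 0; j < pmf.length; j++) {
            running += pmf[j];
            cdf[j] = running;
        }
        return cdf;
    }

    public static void main(String[] args) {
        double[] cdf = cdfFromPmf(new double[] {0.2, 0.4, 0.4});
        // Approximately 0.2, 0.6, 1.0 (subject to floating-point rounding).
        for (double c : cdf) {
            System.out.printf("%.3f%n", c);
        }
    }
}
```

The last entry always covers the whole stream, so it should be 1.0 up to floating-point rounding.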
-
getK
public int getK()
Returns the parameter k.
- Returns:
- parameter k
-
getKFromEpsilon
public static int getKFromEpsilon(double epsilon, boolean pmf)
Gets the approximate value of k to use given epsilon, the normalized rank error.
- Parameters:
- epsilon - the normalized rank error between zero and one.
- pmf - if true, this function returns the value of k assuming the input epsilon is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function returns the value of k assuming the input epsilon is the desired "single-sided" epsilon for all the other queries.
- Returns:
- the value of k given a value of epsilon.
- See Also:
KllFloatsSketch
-
getMaxValue
public float getMaxValue()
Returns the max value of the stream. If the sketch is empty this returns NaN.
- Returns:
- the max value of the stream
-
getMinValue
public float getMinValue()
Returns the min value of the stream. If the sketch is empty this returns NaN.
- Returns:
- the min value of the stream
-
getN
public long getN()
Returns the length of the input stream.
- Returns:
- stream length
-
getNormalizedRankError
public double getNormalizedRankError(boolean pmf)
Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
- Parameters:
- pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
- Returns:
- if pmf is true, returns the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
- See Also:
KllFloatsSketch
-
getNormalizedRankError
public static double getNormalizedRankError(int k, boolean pmf)
Gets the normalized rank error given k and pmf. Static method version of the getNormalizedRankError(boolean).
- Parameters:
- k - the configuration parameter
- pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
- Returns:
- if pmf is true, the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
- See Also:
KllFloatsSketch
-
getNumRetained
public int getNumRetained()
Returns the number of retained items (samples) in the sketch.
- Returns:
- the number of retained items (samples) in the sketch
-
getMaxSerializedSizeBytes
public static int getMaxSerializedSizeBytes(int k, long n)
Returns upper bound on the serialized size of a sketch given a parameter k and stream length. The resulting size is an overestimate to make sure actual sketches don't exceed it. This method can be used if allocation of storage is necessary beforehand, but it is not optimal.
- Parameters:
- k - parameter that controls size of the sketch and accuracy of estimates
- n - stream length
- Returns:
- upper bound on the serialized size
-
getPMF
public double[] getPMF(float[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.
If the sketch is empty this returns null.
- Parameters:
- splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- Returns:
- an array of m+1 doubles, each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
-
getQuantile
public float getQuantile(double fraction)
Returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
This method has a fairly large overhead (microseconds instead of nanoseconds), so it should not be called multiple times to get different quantiles from the same sketch. Instead, use getQuantiles(), which pays the overhead only once.
If the sketch is empty this returns NaN.
- Parameters:
- fraction - the specified fractional position in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. If fraction = 0.0, the true minimum value of the stream is returned. If fraction = 1.0, the true maximum value of the stream is returned.
- Returns:
- the approximation to the value at the given fraction
-
getQuantileUpperBound
public float getQuantileUpperBound(double fraction)
Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
- Parameters:
- fraction - the given normalized rank as a fraction
- Returns:
- the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
-
getQuantileLowerBound
public float getQuantileLowerBound(double fraction)
Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
- Parameters:
- fraction - the given normalized rank as a fraction
- Returns:
- the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
-
getQuantiles
public float[] getQuantiles(double[] fractions)
This is a more efficient multiple-query version of getQuantile().
This returns an array that could have been generated by calling getQuantile() with many different fractional ranks, but doing so would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommended that this method be used instead of multiple calls to getQuantile().
If the sketch is empty this returns null.
- Parameters:
- fractions - given array of fractional positions in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. These fractions must be in the interval [0.0, 1.0], inclusive.
- Returns:
- array of approximate quantile values for the given fractions, in the same order as the given fractions array.
-
getQuantiles
public float[] getQuantiles(int numEvenlySpaced)
This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
If the sketch is empty this returns null.
- Parameters:
- numEvenlySpaced - an integer that specifies the number of evenly spaced fractional ranks. This must be a positive integer greater than 0. A value of 1 will return the min value. A value of 2 will return the min and the max value. A value of 3 will return the min, the median and the max value, etc.
- Returns:
- array of approximate quantile values for the evenly spaced fractional ranks, in increasing order of rank.
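The examples in the numEvenlySpaced description (1 gives the min, 2 gives the min and max, 3 gives the min, median and max) are consistent with fractional ranks of the form i / (n - 1). The helper below is a hypothetical sketch of that mapping, written only from those examples; it is not taken from the library's source, and the exception thrown for non-positive input is an assumption.

```java
final class EvenlySpacedDemo {
    /** Returns n evenly spaced fractional ranks from 0.0 to 1.0 (just {0.0} when n is 1). */
    static double[] evenlySpaced(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("numEvenlySpaced must be greater than 0");
        }
        double[] fractions = new double[n];
        for (int i = 1; i < n; i++) {
            fractions[i] = (double) i / (n - 1); // fractions[0] stays 0.0, the min
        }
        return fractions;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(evenlySpaced(1))); // prints [0.0]
        System.out.println(java.util.Arrays.toString(evenlySpaced(3))); // prints [0.0, 0.5, 1.0]
    }
}
```

Under this reading, the resulting array could equally be passed to getQuantiles(double[] fractions).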
-
getRank
public double getRank(float value)
Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1, inclusive.
The resulting approximation has a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.
If the sketch is empty this returns NaN.
- Parameters:
- value - to be ranked
- Returns:
- an approximate rank of the given value
-
getSerializedSizeBytes
public int getSerializedSizeBytes()
Returns the number of bytes this sketch would require to store.
- Returns:
- the number of bytes this sketch would require to store.
-
isEmpty
public boolean isEmpty()
Returns true if this sketch is empty.
- Returns:
- empty flag
-
isEstimationMode
public boolean isEstimationMode()
Returns true if this sketch is in estimation mode.
- Returns:
- estimation mode flag
-
iterator
public KllFloatsSketchIterator iterator()
- Returns:
- the iterator for this class
-
merge
public void merge(KllFloatsSketch other)
Merges another sketch into this one.
- Parameters:
- other - sketch to merge into this one
-
toByteArray
public byte[] toByteArray()
Returns serialized sketch in a byte array form.
- Returns:
- serialized sketch in a byte array form.
-
toString
-
toString
public String toString(boolean withLevels, boolean withData)
Returns a summary of the sketch as a string.
- Parameters:
- withLevels - if true include information about levels
- withData - if true include sketch data
- Returns:
- string representation of sketch summary
-
update
public void update(float value)
Updates this sketch with the given data item.
- Parameters:
- value - an item from a stream of items. NaNs are ignored.
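The documented edge cases (NaN inputs are ignored by update, and getMinValue() and getMaxValue() return NaN while the sketch is empty) can be mimicked by a tiny exact tracker. The class MinMaxTracker is a hypothetical stand-in for illustration only; it keeps no quantile data and does not reflect how the sketch is implemented.

```java
final class MinMaxTracker {
    private long n = 0;
    private float min = Float.NaN; // NaN signals "empty", mirroring the documented behavior
    private float max = Float.NaN;

    /** Records one value; NaN inputs are ignored and do not count toward n. */
    void update(float value) {
        if (Float.isNaN(value)) {
            return;
        }
        if (n == 0) {
            min = value;
            max = value;
        } else {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        n++;
    }

    long getN() { return n; }
    boolean isEmpty() { return n == 0; }
    float getMinValue() { return min; }
    float getMaxValue() { return max; }

    public static void main(String[] args) {
        MinMaxTracker t = new MinMaxTracker();
        System.out.println(t.getMinValue()); // prints NaN (empty)
        t.update(Float.NaN);                 // ignored
        t.update(3f);
        t.update(1f);
        System.out.println(t.getN() + " " + t.getMinValue() + " " + t.getMaxValue()); // prints 2 1.0 3.0
    }
}
```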
-