Class KllFloatsSketch

java.lang.Object
org.apache.datasketches.kll.KllFloatsSketch

public class KllFloatsSketch
extends Object
Implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per retained item. See Optimal Quantile Approximation in Streams.

This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using the getQuantile() or getQuantiles() functions or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).

Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N-1) in the hypothetical sorted stream of all N input values.

The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value between zero and one. In the documentation and Javadocs for this sketch, absolute rank is never used, so any reference to just rank should be interpreted to mean normalized rank.
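
The two rank definitions above can be illustrated on a tiny stream that is small enough to sort outright. This is a minimal sketch with a hypothetical class name (RankDefinitionDemo); it computes exact ranks and does not use the sketch at all.

```java
import java.util.Arrays;

/** Illustrative only: exact ranks computed on a small, fully sorted stream. */
final class RankDefinitionDemo {

    /** Absolute rank: index (0 to N-1) of the first occurrence of value, or -1 if absent. */
    static int absoluteRank(float[] sorted, float value) {
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] == value) {
                return i;
            }
        }
        return -1;
    }

    /** Normalized rank: absolute rank divided by N. */
    static double normalizedRank(float[] sorted, float value) {
        int abs = absoluteRank(sorted, value);
        if (abs < 0) {
            throw new IllegalArgumentException("value not in stream: " + value);
        }
        return (double) abs / sorted.length;
    }

    public static void main(String[] args) {
        float[] sorted = {30f, 10f, 50f, 20f, 40f};
        Arrays.sort(sorted); // the hypothetical sorted stream: 10, 20, 30, 40, 50
        System.out.println(absoluteRank(sorted, 40f));   // prints 3
        System.out.println(normalizedRank(sorted, 40f)); // prints 0.6
    }
}
```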

This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.

The estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. Epsilon is always with respect to the rank and cannot be applied to the corresponding values.

The relationship between the normalized rank and the corresponding values can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the y-axis is specified as the value-axis and the x-axis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.

The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
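
The rank-to-value and value-to-rank directions can be seen side by side in an exact, non-streaming reference. The class below (ExactQuantiles) is hypothetical and keeps every value in memory. It uses one simple convention (rank is the fraction of items strictly less than the value), which is an assumption made for illustration and not necessarily the convention the sketch applies internally.

```java
import java.util.Arrays;

/** Exact, non-streaming reference for the rank/value relationship (hypothetical class). */
final class ExactQuantiles {
    private final float[] sorted;

    ExactQuantiles(float[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        sorted = values.clone();
        Arrays.sort(sorted);
    }

    /** Rank to value: the item preceded by about rank * N items; rank 1.0 maps to the max. */
    float quantile(double rank) {
        if (!(rank >= 0.0 && rank <= 1.0)) {
            throw new IllegalArgumentException("rank must be in [0, 1]: " + rank);
        }
        int index = (int) Math.min(sorted.length - 1, Math.floor(rank * sorted.length));
        return sorted[index];
    }

    /** Value to rank: the fraction of items strictly less than value. */
    double rank(float value) {
        int below = 0;
        for (float item : sorted) {
            if (item < value) {
                below++;
            }
        }
        return (double) below / sorted.length;
    }
}
```

For the values 1 through 5, quantile(0.5) yields 3 and rank(3) yields 0.4, and quantile(x) never decreases as x grows, which is the monotonic behavior described above.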

The getPMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries. The mass of each "bin" of the PMF is obtained by subtracting the estimates at its lower and upper edges, so it carries "double-sided" error: the errors from the two edges can sometimes add.

The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.

A getQuantile(rank) query has the following guarantees:

  • Let v = getQuantile(r) where r is the rank between zero and one.
  • The value v will be a value from the input stream.
  • Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(false).
  • Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%. Note that the error is on the rank, not the value.
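
To make "the error is on the rank, not the value" concrete, the following minimal sketch checks the interval r - eps ≤ trueRank ≤ r + eps and converts eps into a number of positions in the sorted stream. The class and method names are hypothetical, and the eps value is the approximate single-sided figure quoted above for k = 200.

```java
/** Hypothetical helper (not part of the library) for reasoning about the rank guarantee. */
final class RankGuarantee {

    /** True if trueRank lies within [r - eps, r + eps]. */
    static boolean holds(double r, double trueRank, double eps) {
        return trueRank >= r - eps && trueRank <= r + eps;
    }

    /** The rank error expressed as a number of positions in a sorted stream of n items. */
    static double positionsOff(double eps, long n) {
        return eps * n;
    }

    public static void main(String[] args) {
        double eps = 0.0133;   // approximate single-sided error quoted for k = 200
        long n = 1_000_000L;
        System.out.println(holds(0.50, 0.51, eps)); // true: 0.51 is within 0.50 +/- eps
        System.out.println(holds(0.50, 0.52, eps)); // false: 0.02 exceeds eps
        System.out.println(positionsOff(eps, n));   // roughly 13,300 positions
    }
}
```

Nothing here bounds how far apart the values at those positions are; that depends entirely on the distribution of the data.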

A getRank(value) query has the following guarantees:

  • Let r = getRank(v) where v is a value between the min and max values of the input stream.
  • Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(false).
  • Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%.

A getPMF(...) query has the following guarantees:

  • Let {r1, r2, ..., rm+1} = getPMF(v1, v2, ..., vm) where v1, v2, ..., vm are monotonically increasing values supplied by the user that are part of the monotonic sequence v0 = min, v1, v2, ..., vm, vm+1 = max, and where min and max are the actual minimum and maximum values of the input stream automatically included in the sequence by the getPMF(...) function.
  • Let ri = massi = estimated mass between vi-1 and vi where v0 = min and vm+1 = max.
  • Let trueMass be the true mass between the values of vi-1 and vi derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(true).
  • Then mass - eps ≤ trueMass ≤ mass + eps with a confidence of 99%.
  • r1 includes the mass of all points between min = v0 and v1.
  • rm+1 includes the mass of all points between vm and max = vm+1.
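
The bin layout can be checked against an exact computation over a small array. This is a minimal sketch with a hypothetical class (ExactPmf) that counts every value directly; it follows the interval convention documented for getPMF (left edge inclusive, right edge exclusive, last bin including the maximum) and returns null for an empty input, mirroring the documented behavior for an empty sketch.

```java
/** Exact PMF over m split points, producing m+1 masses (hypothetical reference class). */
final class ExactPmf {

    static double[] pmf(float[] values, float[] splitPoints) {
        if (values.length == 0) {
            return null; // mirrors the documented result for an empty sketch
        }
        for (int i = 1; i < splitPoints.length; i++) {
            if (!(splitPoints[i - 1] < splitPoints[i])) {
                throw new IllegalArgumentException("split points must be unique and increasing");
            }
        }
        long[] counts = new long[splitPoints.length + 1];
        for (float v : values) {
            int bin = 0;
            // Advance past every split point that is <= v, so each bin is [left, right).
            while (bin < splitPoints.length && v >= splitPoints[bin]) {
                bin++;
            }
            counts[bin]++;
        }
        double[] masses = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            masses[i] = (double) counts[i] / values.length;
        }
        return masses;
    }
}
```

With the values 1 through 10 and split points {3, 7}, the three masses are about 0.2, 0.4, and 0.4: the first bin holds 1 and 2, the second holds 3 through 6, and the last holds 7 through 10.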

A getCDF(...) query has the following guarantees:

  • Let {r1, r2, ..., rm+1} = getCDF(v1, v2, ..., vm) where v1, v2, ..., vm are monotonically increasing values supplied by the user that are part of the monotonic sequence {v0 = min, v1, v2, ..., vm, vm+1 = max}, and where min and max are the actual minimum and maximum values of the input stream automatically included in the sequence by the getCDF(...) function.
  • Let ri = massi = estimated mass between v0 = min and vi.
  • Let trueMass be the true mass between v0 = min and vi derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(false).
  • Then mass - eps ≤ trueMass ≤ mass + eps with a confidence of 99%.
  • r1 includes the mass of all points between min = v0 and v1.
  • rm+1 includes the mass of all points between min = v0 and max = vm+1.

From the above, it might seem like we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.

  • Let v = getQuantile(r), the estimated quantile value of rank r.
  • Let eps = getNormalizedRankError(false).
  • Let vlo = estimated quantile value of rank (r - eps).
  • Let vhi = estimated quantile value of rank (r + eps).
  • Then vlo ≤ v ≤ vhi, with 99% confidence.
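
The bracketing steps above can be sketched as follows. The QuantileQuery interface and the bracket method are hypothetical stand-ins introduced here so the example compiles without the library; the only assumption about the sketch is that it offers a rank-to-value query and a rank error, as described above. Ranks are clamped to [0, 1] before querying.

```java
/** Hypothetical helper that brackets a quantile estimate by querying at r - eps and r + eps. */
final class QuantileBracket {

    /** Minimal stand-in for a rank-to-value query such as getQuantile(rank). */
    interface QuantileQuery {
        float quantile(double rank);
    }

    /** Returns {vlo, v, vhi} for the given rank and rank error. */
    static float[] bracket(QuantileQuery query, double rank, double eps) {
        double lo = Math.max(0.0, rank - eps); // clamp so the rank stays in [0, 1]
        double hi = Math.min(1.0, rank + eps);
        return new float[] {query.quantile(lo), query.quantile(rank), query.quantile(hi)};
    }
}
```

With a sketch instance, the call would presumably look like bracket(sketch::getQuantile, r, sketch.getNormalizedRankError(false)). The class also lists getQuantileLowerBound(fraction) and getQuantileUpperBound(fraction), which serve the same purpose directly.
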
Author:
Kevin Lang, Alexander Saydakov, Lee Rhodes
  • Field Summary

    Fields 
    Modifier and Type Field Description
    static int DEFAULT_K
    The default value of K.
  • Constructor Summary

    Constructors 
    Constructor Description
    KllFloatsSketch()
    Heap constructor with the default k = 200, which has a rank error of about 1.65%.
    KllFloatsSketch​(int k)
    Heap constructor with a given parameter k.
  • Method Summary

    Modifier and Type Method Description
    double[] getCDF​(float[] splitPoints)
    Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
    int getK()
    Returns the parameter k.
    static int getKFromEpsilon​(double epsilon, boolean pmf)
    Gets the approximate value of k to use given epsilon, the normalized rank error.
    static int getMaxSerializedSizeBytes​(int k, long n)
    Returns upper bound on the serialized size of a sketch given a parameter k and stream length.
    float getMaxValue()
    Returns the max value of the stream.
    float getMinValue()
    Returns the min value of the stream.
    long getN()
    Returns the length of the input stream.
    double getNormalizedRankError​(boolean pmf)
    Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
    static double getNormalizedRankError​(int k, boolean pmf)
    Gets the normalized rank error given k and pmf.
    int getNumRetained()
    Returns the number of retained items (samples) in the sketch.
    double[] getPMF​(float[] splitPoints)
    Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
    float getQuantile​(double fraction)
    Returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
    float getQuantileLowerBound​(double fraction)
    Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    float[] getQuantiles​(double[] fractions)
    This is a more efficient multiple-query version of getQuantile().
    float[] getQuantiles​(int numEvenlySpaced)
    This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
    float getQuantileUpperBound​(double fraction)
    Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    double getRank​(float value)
    Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1, inclusive.
    int getSerializedSizeBytes()
    Returns the number of bytes required to store this sketch.
    static KllFloatsSketch heapify​(org.apache.datasketches.memory.Memory mem)
    Factory heapify takes the sketch image in Memory and instantiates an on-heap sketch.
    boolean isEmpty()
    Returns true if this sketch is empty.
    boolean isEstimationMode()
    Returns true if this sketch is in estimation mode.
    KllFloatsSketchIterator iterator()  
    void merge​(KllFloatsSketch other)
    Merges another sketch into this one.
    byte[] toByteArray()
    Returns the serialized sketch as a byte array.
    String toString()  
    String toString​(boolean withLevels, boolean withData)
    Returns a summary of the sketch as a string.
    void update​(float value)
    Updates this sketch with the given data item.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

  • Constructor Details

    • KllFloatsSketch

      public KllFloatsSketch()
      Heap constructor with the default k = 200, which has a rank error of about 1.65%.
    • KllFloatsSketch

      public KllFloatsSketch​(int k)
      Heap constructor with a given parameter k. k can be any value between 8 and 65535, inclusive. The default k = 200 results in a normalized rank error of about 1.65%. Larger values of k yield smaller error, but the sketch will be larger (and slower).
      Parameters:
      k - parameter that controls size of the sketch and accuracy of estimates
  • Method Details

    • heapify

      public static KllFloatsSketch heapify​(org.apache.datasketches.memory.Memory mem)
      Factory heapify takes the sketch image in Memory and instantiates an on-heap sketch. The resulting sketch will not retain any link to the source Memory.
      Parameters:
      mem - a Memory image of a sketch serialized by this sketch. See Memory
      Returns:
      a heap-based sketch based on the given Memory.
    • getCDF

      public double[] getCDF​(float[] splitPoints)
      Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
      Returns:
      an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
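
The relationship stated in the return description (CDF position j equals the sum of PMF positions 0 through j) amounts to a running sum. A minimal sketch with a hypothetical helper class:

```java
/** Hypothetical helper: derives a CDF array from a PMF array by a running sum. */
final class CdfFromPmf {

    static double[] cumulative(double[] pmf) {
        double[] cdf = new double[pmf.length];
        double running = 0.0;
        for (int j = 0; j < pmf.length; j++) {
            running += pmf[j]; // cdf[j] = pmf[0] + ... + pmf[j]
            cdf[j] = running;
        }
        return cdf;
    }
}
```

For a PMF of {0.2, 0.4, 0.4} the result is approximately {0.2, 0.6, 1.0}; the last entry is always close to 1 because the final bin extends to the maximum value.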
    • getK

      public int getK()
      Returns the parameter k.
      Returns:
      parameter k
    • getKFromEpsilon

      public static int getKFromEpsilon​(double epsilon, boolean pmf)
      Gets the approximate value of k to use given epsilon, the normalized rank error.
      Parameters:
      epsilon - the normalized rank error between zero and one.
      pmf - if true, this function returns the value of k assuming the input epsilon is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function returns the value of k assuming the input epsilon is the desired "single-sided" epsilon for all the other queries.
      Returns:
      the value of k given a value of epsilon.
      See Also:
      KllFloatsSketch
    • getMaxValue

      public float getMaxValue()
      Returns the max value of the stream. If the sketch is empty this returns NaN.
      Returns:
      the max value of the stream
    • getMinValue

      public float getMinValue()
      Returns the min value of the stream. If the sketch is empty this returns NaN.
      Returns:
      the min value of the stream
    • getN

      public long getN()
      Returns the length of the input stream.
      Returns:
      stream length
    • getNormalizedRankError

      public double getNormalizedRankError​(boolean pmf)
      Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
      Parameters:
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, returns the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      See Also:
      KllFloatsSketch
    • getNormalizedRankError

      public static double getNormalizedRankError​(int k, boolean pmf)
      Gets the normalized rank error given k and pmf. Static method version of the getNormalizedRankError(boolean).
      Parameters:
      k - the configuration parameter
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      See Also:
      KllFloatsSketch
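
For a rough feel of how k and epsilon trade off, the sketch below uses a power-law fit. The constants are an assumption: they are not stated in this document and may differ from what the library uses. They were chosen because they reproduce the figures quoted here (about 1.33% and 1.65% at k = 200, and a PMF penalty of roughly 13% to 47% over the allowed range of k). The class name KllErrorFit and both method names are hypothetical.

```java
/** Hypothetical power-law fit relating k and the normalized rank error (constants assumed). */
final class KllErrorFit {
    static final int MIN_K = 8;      // documented lower limit for k
    static final int MAX_K = 65535;  // documented upper limit for k

    /** Approximate normalized rank error for a given k. */
    static double normalizedRankError(int k, boolean pmf) {
        if (k < MIN_K || k > MAX_K) {
            throw new IllegalArgumentException("k must be in [8, 65535]: " + k);
        }
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)   // "double-sided" error, assumed fit
            : 2.296 / Math.pow(k, 0.9723);  // "single-sided" error, assumed fit
    }

    /** Inverts the fit: the smallest k in range whose fitted error does not exceed epsilon. */
    static int kFromEpsilon(double epsilon, boolean pmf) {
        if (!(epsilon > 0.0 && epsilon < 1.0)) {
            throw new IllegalArgumentException("epsilon must be in (0, 1): " + epsilon);
        }
        double k = pmf
            ? Math.pow(2.446 / epsilon, 1.0 / 0.9433)
            : Math.pow(2.296 / epsilon, 1.0 / 0.9723);
        long rounded = (long) Math.ceil(k);
        return (int) Math.max(MIN_K, Math.min(MAX_K, rounded));
    }
}
```

In real code, prefer the static getNormalizedRankError(k, pmf) and getKFromEpsilon(epsilon, pmf) methods documented on this page; the fit is only meant to show the shape of the relationship.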
    • getNumRetained

      public int getNumRetained()
      Returns the number of retained items (samples) in the sketch.
      Returns:
      the number of retained items (samples) in the sketch
    • getMaxSerializedSizeBytes

      public static int getMaxSerializedSizeBytes​(int k, long n)
      Returns upper bound on the serialized size of a sketch given a parameter k and stream length. The resulting size is an overestimate to make sure actual sketches don't exceed it. This method can be used if allocation of storage is necessary beforehand, but it is not optimal.
      Parameters:
      k - parameter that controls size of the sketch and accuracy of estimates
      n - stream length
      Returns:
      upper bound on the serialized size
    • getPMF

      public double[] getPMF​(float[] splitPoints)
      Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
      Returns:
      an array of m+1 doubles each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
    • getQuantile

      public float getQuantile​(double fraction)
      Returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.

      We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(), which pays the overhead only once.

      If the sketch is empty this returns NaN.

      Parameters:
      fraction - the specified fractional position in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. If fraction = 0.0, the true minimum value of the stream is returned. If fraction = 1.0, the true maximum value of the stream is returned.
      Returns:
      the approximation to the value at the given fraction
    • getQuantileUpperBound

      public float getQuantileUpperBound​(double fraction)
      Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
    • getQuantileLowerBound

      public float getQuantileLowerBound​(double fraction)
      Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
    • getQuantiles

      public float[] getQuantiles​(double[] fractions)
      This is a more efficient multiple-query version of getQuantile().

      This returns an array that could have been generated by calling getQuantile() once for each fractional rank, but doing so would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommended that this method be used instead of multiple calls to getQuantile().

      If the sketch is empty this returns null.

      Parameters:
      fractions - given array of fractional positions in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. These fractions must be in the interval [0.0, 1.0], inclusive.
      Returns:
      array of approximate values at the given fractions, in the same order as the given fractions array.
    • getQuantiles

      public float[] getQuantiles​(int numEvenlySpaced)
      This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.

      If the sketch is empty this returns null.

      Parameters:
      numEvenlySpaced - an integer that specifies the number of evenly spaced fractional ranks. This must be an integer greater than 0. A value of 1 will return the min value. A value of 2 will return the min and the max value. A value of 3 will return the min, the median and the max value, etc.
      Returns:
      array of approximate values at the evenly spaced fractional ranks, in order of increasing rank.
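
The parameter description implies fractional ranks spaced 1 / (numEvenlySpaced - 1) apart, starting at 0. A minimal sketch of that mapping follows; the helper class is hypothetical and only reproduces the cases spelled out above (1 gives the min, 2 gives min and max, 3 gives min, median, and max).

```java
/** Hypothetical helper: the fractional ranks implied by a count of evenly spaced points. */
final class EvenlySpacedFractions {

    static double[] evenlySpaced(int numEvenlySpaced) {
        if (numEvenlySpaced < 1) {
            throw new IllegalArgumentException("numEvenlySpaced must be greater than 0");
        }
        double[] fractions = new double[numEvenlySpaced];
        if (numEvenlySpaced == 1) {
            return fractions; // single rank 0.0, which corresponds to the min value
        }
        double delta = 1.0 / (numEvenlySpaced - 1);
        for (int i = 0; i < numEvenlySpaced; i++) {
            fractions[i] = i * delta;
        }
        fractions[numEvenlySpaced - 1] = 1.0; // pin the last rank against rounding drift
        return fractions;
    }
}
```

For example, evenlySpaced(5) yields 0.0, 0.25, 0.5, 0.75, and 1.0, which could then be passed to getQuantiles(double[] fractions).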
    • getRank

      public double getRank​(float value)
      Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1, inclusive.

      The resulting approximation has a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns NaN.

      Parameters:
      value - to be ranked
      Returns:
      an approximate rank of the given value
    • getSerializedSizeBytes

      public int getSerializedSizeBytes()
      Returns the number of bytes required to store this sketch.
      Returns:
      the number of bytes required to store this sketch.
    • isEmpty

      public boolean isEmpty()
      Returns true if this sketch is empty.
      Returns:
      empty flag
    • isEstimationMode

      public boolean isEstimationMode()
      Returns true if this sketch is in estimation mode.
      Returns:
      estimation mode flag
    • iterator

      public KllFloatsSketchIterator iterator()
      Returns:
      the iterator for this class
    • merge

      public void merge​(KllFloatsSketch other)
      Merges another sketch into this one.
      Parameters:
      other - sketch to merge into this one
    • toByteArray

      public byte[] toByteArray()
      Returns the serialized sketch as a byte array.
      Returns:
      the serialized sketch as a byte array.
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • toString

      public String toString​(boolean withLevels, boolean withData)
      Returns a summary of the sketch as a string.
      Parameters:
      withLevels - if true include information about levels
      withData - if true include sketch data
      Returns:
      string representation of sketch summary
    • update

      public void update​(float value)
      Updates this sketch with the given data item.
      Parameters:
      value - an item from a stream of items. NaNs are ignored.
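
The documented bookkeeping around update (NaNs are ignored, N counts the accepted items, and the min and max are NaN while the sketch is empty) can be mimicked by a small stand-alone class. StreamSummary is hypothetical and tracks only these exact statistics, with none of the quantile machinery.

```java
/** Hypothetical stand-in that mirrors the documented update, N, min, and max behavior. */
final class StreamSummary {
    private long n = 0;
    private float min = Float.NaN; // NaN while empty, as documented for getMinValue()
    private float max = Float.NaN; // NaN while empty, as documented for getMaxValue()

    void update(float value) {
        if (Float.isNaN(value)) {
            return; // NaNs are ignored and do not count toward N
        }
        if (n == 0) {
            min = value;
            max = value;
        } else {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        n++;
    }

    long getN() { return n; }

    boolean isEmpty() { return n == 0; }

    float getMinValue() { return min; }

    float getMaxValue() { return max; }
}
```

Callers that need to know how many NaNs were skipped would have to count them separately before calling update.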