Class ItemsSketch<T>

java.lang.Object
org.apache.datasketches.quantiles.ItemsSketch<T>
Type Parameters:
T - type of item

public final class ItemsSketch<T>
extends Object
This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of comparable items from a very large stream in a single pass. The analysis is obtained using the getQuantiles(*) functions or their inverse functions: the Probability Mass Function from getPMF(*) and the Cumulative Distribution Function from getCDF(*).

The documentation for DoublesSketch applies here, except that the size of an ItemsSketch depends heavily on the items input into the sketch, so there is no size table comparable to the one for DoublesSketch.

There is more documentation available on datasketches.apache.org.

Author:
Kevin Lang, Alexander Saydakov
  • Field Summary

    Fields 
    Modifier and Type Field Description
    static Random rand
    Setting the seed makes the results of the sketch deterministic if the input values are received in exactly the same order.
  • Method Summary

    Modifier and Type Method Description
    ItemsSketch<T> downSample​(int newK)
    From an existing sketch, this creates a new sketch that can have a smaller value of K.
    double[] getCDF​(T[] splitPoints)
    Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
    static <T> ItemsSketch<T> getInstance​(int k, Comparator<? super T> comparator)
    Obtains a new instance of an ItemsSketch.
    static <T> ItemsSketch<T> getInstance​(Comparator<? super T> comparator)
    Obtains a new instance of an ItemsSketch using the DEFAULT_K.
    static <T> ItemsSketch<T> getInstance​(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
    Heapifies the given srcMem, which must be a Memory image of an ItemsSketch.
    int getK()
    Returns the configured value of K
    static int getKFromEpsilon​(double epsilon, boolean pmf)
    Gets the approximate value of k to use given epsilon, the normalized rank error.
    T getMaxValue()
    Returns the max value of the stream
    T getMinValue()
    Returns the min value of the stream
    long getN()
    Returns the length of the input stream so far.
    double getNormalizedRankError​(boolean pmf)
    Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
    static double getNormalizedRankError​(int k, boolean pmf)
    Gets the normalized rank error given k and pmf.
    double[] getPMF​(T[] splitPoints)
    Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
    T getQuantile​(double fraction)
    This returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
    T getQuantileLowerBound​(double fraction)
    Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    T[] getQuantiles​(double[] fRanks)
    This is a more efficient multiple-query version of getQuantile().
    T[] getQuantiles​(int evenlySpaced)
    This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
    T getQuantileUpperBound​(double fraction)
    Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    double getRank​(T value)
    Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1 inclusive.
    int getRetainedItems()
    Computes the number of retained entries (samples) in the sketch
    boolean isDirect()
    Returns true if this sketch is off-heap.
    boolean isEmpty()
    Returns true if this sketch is empty
    boolean isEstimationMode()
    Returns true if this sketch is in estimation mode.
    ItemsSketchIterator<T> iterator()
    Returns the iterator for this class.
    void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem, ArrayOfItemsSerDe<T> serDe)
    Puts the current sketch into the given Memory if there is sufficient space.
    void reset()
    Resets this sketch to a virgin state, but retains the original value of k.
    byte[] toByteArray​(boolean ordered, ArrayOfItemsSerDe<T> serDe)
    Serialize this sketch to a byte array form.
    byte[] toByteArray​(ArrayOfItemsSerDe<T> serDe)
    Serialize this sketch to a byte array form.
    String toString()
    Returns summary information about this sketch.
    String toString​(boolean sketchSummary, boolean dataDetail)
    Returns summary information about this sketch.
    static String toString​(byte[] byteArr)
    Returns a human readable string of the preamble of a byte array image of an ItemsSketch.
    static String toString​(org.apache.datasketches.memory.Memory mem)
    Returns a human readable string of the preamble of a Memory image of an ItemsSketch.
    void update​(T dataItem)
    Updates this sketch with the given data item.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • rand

      public static final Random rand
      Setting the seed makes the results of the sketch deterministic if the input values are received in exactly the same order. This is only useful when performing test comparisons; otherwise it is not recommended.
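
The effect of seeding can be illustrated with a plain java.util.Random, the same type as the rand field. This is a minimal sketch, not the sketch's internal code: the class SeedDemo and its method coinFlips are hypothetical stand-ins for a randomized algorithm that draws from a shared generator.

```java
import java.util.Random;

/** Hypothetical stand-in for a randomized algorithm that draws from a shared Random. */
final class SeedDemo {
  // Plays the role of a shared static generator such as ItemsSketch.rand.
  static final Random RAND = new Random();

  /** Draws n random booleans from the shared generator. */
  static boolean[] coinFlips(int n) {
    boolean[] flips = new boolean[n];
    for (int i = 0; i < n; i++) {
      flips[i] = RAND.nextBoolean();
    }
    return flips;
  }
}
```

Calling SeedDemo.RAND.setSeed(42L) before each run makes coinFlips return the same sequence. For the sketch, the analogous call would be ItemsSketch.rand.setSeed(...), in test code only.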
  • Method Details

    • getInstance

      public static <T> ItemsSketch<T> getInstance​(Comparator<? super T> comparator)
      Obtains a new instance of an ItemsSketch using the DEFAULT_K.
      Type Parameters:
      T - type of item
      Parameters:
      comparator - to compare items
      Returns:
      a new ItemsSketch instance
    • getInstance

      public static <T> ItemsSketch<T> getInstance​(int k, Comparator<? super T> comparator)
      Obtains a new instance of an ItemsSketch.
      Type Parameters:
      T - type of item
      Parameters:
      k - Parameter that controls space usage of sketch and accuracy of estimates. Must be greater than 2 and less than 65536 and a power of 2.
      comparator - to compare items
      Returns:
      a new ItemsSketch instance
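
The constraint on k stated above can be checked before calling getInstance. The following is a minimal sketch of such a check. KCheck and isValidK are hypothetical names, not part of the library, and the bounds simply mirror the wording of the parameter description.

```java
/** Hypothetical helper (not part of the library) mirroring the documented constraint on k. */
final class KCheck {
  /** Returns true if k is a power of 2, greater than 2 and less than 65536. */
  static boolean isValidK(int k) {
    // A positive int is a power of 2 exactly when it has a single bit set.
    boolean powerOfTwo = k > 0 && Integer.bitCount(k) == 1;
    return powerOfTwo && k > 2 && k < 65536;
  }
}
```

For example, isValidK(128) is true, while isValidK(100) is false because 100 is not a power of 2.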
    • getInstance

      public static <T> ItemsSketch<T> getInstance​(org.apache.datasketches.memory.Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
      Heapifies the given srcMem, which must be a Memory image of an ItemsSketch.
      Type Parameters:
      T - type of item
      Parameters:
      srcMem - a Memory image of a sketch. See Memory
      comparator - to compare items
      serDe - an instance of ArrayOfItemsSerDe
      Returns:
      an ItemsSketch on the Java heap.
    • update

      public void update​(T dataItem)
      Updates this sketch with the given data item.
      Parameters:
      dataItem - an item from a stream of items. Null items are ignored.
    • getQuantile

      public T getQuantile​(double fraction)
      This returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.

      Note that this method has a fairly large overhead (microseconds instead of nanoseconds), so it should not be called multiple times to get different quantiles from the same sketch. Instead, use getQuantiles(), which pays the overhead only once.

      Parameters:
      fraction - the specified fractional position in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. If fraction = 0.0, the true minimum value of the stream is returned. If fraction = 1.0, the true maximum value of the stream is returned.
      Returns:
      the approximation to the value at the above fraction
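
To make the meaning of fraction concrete, here is a minimal sketch of an exact quantile computed by fully sorting a small list, which is what the sketch approximates without storing the whole stream. ExactQuantile is a hypothetical class, and the indexing rule (floor(fraction * n), clamped to the last index) is an assumption chosen for illustration. The sketch itself may use a different convention.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical exact reference for a quantile query; not the sketch's algorithm. */
final class ExactQuantile {
  /** Returns the item preceded by roughly the given fraction of the sorted items. */
  static <T> T quantile(List<T> items, double fraction, Comparator<? super T> cmp) {
    if (items.isEmpty()) {
      throw new IllegalArgumentException("items must not be empty");
    }
    if (fraction < 0.0 || fraction > 1.0) {
      throw new IllegalArgumentException("fraction must be in [0.0, 1.0]");
    }
    List<T> sorted = new ArrayList<>(items);
    sorted.sort(cmp);
    int n = sorted.size();
    // Assumed rule: index floor(fraction * n), clamped so that 1.0 maps to the maximum.
    int index = (int) Math.min(n - 1, Math.floor(fraction * n));
    return sorted.get(index);
  }
}
```

With this rule, a fraction of 0.0 yields the minimum and 1.0 yields the maximum, matching the parameter description above.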
    • getQuantileUpperBound

      public T getQuantileUpperBound​(double fraction)
      Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns null if the sketch is empty.
    • getQuantileLowerBound

      public T getQuantileLowerBound​(double fraction)
      Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns null if the sketch is empty.
    • getQuantiles

      public T[] getQuantiles​(double[] fRanks)
      This is a more efficient multiple-query version of getQuantile().

      This returns an array that could have been generated by calling getQuantile() with many different fractional ranks, but doing so would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommended that this method be used instead of multiple calls to getQuantile().

      If the sketch is empty this returns null.

      Parameters:
      fRanks - the given array of fractional (or normalized) ranks in the hypothetical sorted stream of all the input values seen so far. These fRanks must all be in the interval [0.0, 1.0] inclusively.
      Returns:
      array of approximate quantiles of the given fRanks in the same order as in the given fRanks array.
    • getQuantiles

      public T[] getQuantiles​(int evenlySpaced)
      This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
      Parameters:
      evenlySpaced - an integer that specifies the number of evenly spaced fractional ranks. This must be a positive integer greater than 1. A value of 2 will return the min and the max value. A value of 3 will return the min, the median and the max value, etc.
      Returns:
      array of approximate quantiles at the evenly spaced fractional ranks, in order of increasing rank.
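
The fractional ranks implied by evenlySpaced can be sketched as follows. EvenRanks and evenlySpacedRanks are hypothetical names. The formula i / (n - 1) is an assumption that is consistent with the documented cases (2 gives min and max, 3 gives min, median and max), not a statement of the library's internal code.

```java
/** Hypothetical helper showing the ranks implied by an evenlySpaced argument. */
final class EvenRanks {
  /** Returns n fractional ranks evenly spaced over [0.0, 1.0], both ends included. */
  static double[] evenlySpacedRanks(int n) {
    if (n < 2) {
      throw new IllegalArgumentException("n must be greater than 1");
    }
    double[] ranks = new double[n];
    for (int i = 0; i < n; i++) {
      ranks[i] = (double) i / (n - 1);
    }
    return ranks;
  }
}
```

For example, evenlySpacedRanks(3) yields 0.0, 0.5 and 1.0, which correspond to the min, the median and the max.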
    • getRank

      public double getRank​(T value)
      Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1 inclusive.

      The resulting approximation has a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns NaN.

      Parameters:
      value - to be ranked
      Returns:
      an approximate rank of the given value
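
As a point of comparison, an exact normalized rank can be computed directly on a small list. This is a minimal sketch with hypothetical names (ExactRank, rank). It assumes the rank is the fraction of items that compare strictly less than the given value, a convention this page does not state, so treat it as illustrative only.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical exact reference for a rank query; not the sketch's algorithm. */
final class ExactRank {
  /** Fraction of items that compare strictly less than value (assumed convention). */
  static <T> double rank(List<T> items, T value, Comparator<? super T> cmp) {
    if (items.isEmpty()) {
      return Double.NaN; // mirrors the documented behavior for an empty sketch
    }
    long below = 0;
    for (T item : items) {
      if (cmp.compare(item, value) < 0) {
        below++;
      }
    }
    return (double) below / items.size();
  }
}
```

A sketch's getRank result would be expected to differ from such an exact value by no more than the normalized rank error, subject to the probabilistic guarantee described above.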
    • getPMF

      public double[] getPMF​(T[] splitPoints)
      Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing item values that divide the ordered space into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these splitPoints.
      Returns:
      an array of m+1 doubles, each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
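
The interval convention can be illustrated with an exact PMF computed over a small list. This is a minimal sketch under stated assumptions: ExactPmf and pmf are hypothetical names, and the code counts every item exactly, whereas the sketch returns approximations from its retained samples.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical exact reference for the PMF interval convention; not the sketch's algorithm. */
final class ExactPmf {
  /** Returns m+1 masses for m split points, or null for empty input. */
  static <T> double[] pmf(List<T> items, T[] splitPoints, Comparator<? super T> cmp) {
    if (items.isEmpty()) {
      return null; // mirrors the documented behavior for an empty sketch
    }
    double[] mass = new double[splitPoints.length + 1];
    for (T item : items) {
      // Bucket index = number of split points that are <= item, so each interval
      // includes its left split point and excludes its right one.
      int bucket = 0;
      while (bucket < splitPoints.length && cmp.compare(splitPoints[bucket], item) <= 0) {
        bucket++;
      }
      mass[bucket] += 1.0;
    }
    for (int i = 0; i < mass.length; i++) {
      mass[i] /= items.size();
    }
    return mass;
  }
}
```

For the integers 1 through 10 and split points {4, 8}, the three intervals hold {1, 2, 3}, {4, 5, 6, 7} and {8, 9, 10}, giving masses of 0.3, 0.4 and 0.3.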
    • getCDF

      public double[] getCDF​(T[] splitPoints)
      Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing item values that divide the ordered space into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these splitPoints.
      Returns:
      an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
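
The relationship between the two returned arrays, as described above, is a running sum. A minimal sketch, with the hypothetical names CdfFromPmf and cumulate:

```java
/** Hypothetical helper showing how CDF values relate to PMF values. */
final class CdfFromPmf {
  /** Returns an array whose entry j is the sum of pmf[0] through pmf[j]. */
  static double[] cumulate(double[] pmf) {
    double[] cdf = new double[pmf.length];
    double running = 0.0;
    for (int j = 0; j < pmf.length; j++) {
      running += pmf[j];
      cdf[j] = running;
    }
    return cdf;
  }
}
```

Given PMF values of 0.3, 0.4 and 0.3, the corresponding CDF values are approximately 0.3, 0.7 and 1.0, up to floating-point rounding.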
    • getK

      public int getK()
      Returns the configured value of K
      Returns:
      the configured value of K
    • getMinValue

      public T getMinValue()
      Returns the min value of the stream
      Returns:
      the min value of the stream
    • getMaxValue

      public T getMaxValue()
      Returns the max value of the stream
      Returns:
      the max value of the stream
    • getN

      public long getN()
      Returns the length of the input stream so far.
      Returns:
      the length of the input stream so far
    • getNormalizedRankError

      public double getNormalizedRankError​(boolean pmf)
      Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
      Parameters:
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, returns the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
    • getNormalizedRankError

      public static double getNormalizedRankError​(int k, boolean pmf)
      Gets the normalized rank error given k and pmf. This is the static version of getNormalizedRankError(boolean).
      Parameters:
      k - the configuration parameter
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
    • getKFromEpsilon

      public static int getKFromEpsilon​(double epsilon, boolean pmf)
      Gets the approximate value of k to use given epsilon, the normalized rank error.
      Parameters:
      epsilon - the normalized rank error between zero and one.
      pmf - if true, this function returns the value of k assuming the input epsilon is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function returns the value of k assuming the input epsilon is the desired "single-sided" epsilon for all the other queries.
      Returns:
      the value of k given a value of epsilon.
    • isEmpty

      public boolean isEmpty()
      Returns true if this sketch is empty
      Returns:
      true if this sketch is empty
    • isDirect

      public boolean isDirect()
      Returns:
      true if this sketch is off-heap
    • isEstimationMode

      public boolean isEstimationMode()
      Returns:
      true if in estimation mode
    • reset

      public void reset()
      Resets this sketch to a virgin state, but retains the original value of k.
    • toByteArray

      public byte[] toByteArray​(ArrayOfItemsSerDe<T> serDe)
      Serialize this sketch to a byte array form.
      Parameters:
      serDe - an instance of ArrayOfItemsSerDe
      Returns:
      byte array of this sketch
    • toByteArray

      public byte[] toByteArray​(boolean ordered, ArrayOfItemsSerDe<T> serDe)
      Serialize this sketch to a byte array form.
      Parameters:
      ordered - if true the base buffer will be ordered (default == false).
      serDe - an instance of ArrayOfItemsSerDe
      Returns:
      this sketch in a byte array form.
    • toString

      public String toString()
      Returns summary information about this sketch.
      Overrides:
      toString in class Object
    • toString

      public String toString​(boolean sketchSummary, boolean dataDetail)
      Returns summary information about this sketch. Used for debugging.
      Parameters:
      sketchSummary - if true includes sketch summary
      dataDetail - if true includes data detail
      Returns:
      summary information about the sketch.
    • toString

      public static String toString​(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of an ItemsSketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human readable string of the preamble of a byte array image of an ItemsSketch.
    • toString

      public static String toString​(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of an ItemsSketch.
      Parameters:
      mem - the given Memory
      Returns:
      a human readable string of the preamble of a Memory image of an ItemsSketch.
    • downSample

      public ItemsSketch<T> downSample​(int newK)
      From an existing sketch, this creates a new sketch that can have a smaller value of K. The original sketch is not modified.
      Parameters:
      newK - the new value of K, which must be smaller than the current value of K. It is required that this.getK() = newK * 2^(nonnegative integer).
      Returns:
      the new sketch.
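
The requirement on newK can be checked ahead of the call. The following is a minimal sketch that follows the stated formula literally. DownSampleCheck and isValidNewK are hypothetical names, not part of the library, and the check does not separately validate that either value is a legal k.

```java
/** Hypothetical helper (not part of the library) mirroring the documented newK requirement. */
final class DownSampleCheck {
  /** Returns true if currentK == newK * 2^j for some nonnegative integer j. */
  static boolean isValidNewK(int currentK, int newK) {
    if (newK <= 0 || newK > currentK || currentK % newK != 0) {
      return false;
    }
    int ratio = currentK / newK;
    // The ratio is a power of 2 exactly when it has a single bit set.
    return Integer.bitCount(ratio) == 1;
  }
}
```

For a sketch with K = 128, a newK of 32 satisfies the formula (128 = 32 * 2^2), while 48 does not.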
    • getRetainedItems

      public int getRetainedItems()
      Computes the number of retained entries (samples) in the sketch
      Returns:
      the number of retained entries (samples) in the sketch
    • putMemory

      public void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem, ArrayOfItemsSerDe<T> serDe)
      Puts the current sketch into the given Memory if there is sufficient space. Otherwise, throws an error.
      Parameters:
      dstMem - the given memory.
      serDe - an instance of ArrayOfItemsSerDe
    • iterator

      public ItemsSketchIterator<T> iterator()
      Returns:
      the iterator for this class