Class ReqSketch

java.lang.Object
org.apache.datasketches.req.ReqSketch

public class ReqSketch
extends Object
This Relative Error Quantiles Sketch is the Java implementation based on the paper "Relative Error Streaming Quantiles", https://arxiv.org/abs/2004.01668, and loosely derived from a Python prototype written by Pavel Vesely.

This implementation differs from the algorithm described in the paper in the following ways:

  • The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via the variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS sections. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one, the variable state may not correspond to the number of compactions performed at a particular level. However, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
  • The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a value set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
  • The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.
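The doubling rule in the first two items can be replayed with a few lines of standalone Java. This is only an illustrative sketch, not the library's internal code: the class and method names are hypothetical, the starting value of INIT_NUMBER_OF_SECTIONS is an assumption (the text above does not state it), and rounding of sectionSize is ignored.

```java
// Illustrative only: replays the section-doubling rule for a single compactor.
final class SectionSchedule {
    // Assumed starting value; the description names the constant but not its value.
    static final int INIT_NUMBER_OF_SECTIONS = 3;

    /**
     * Returns {numSections, sectionSize} after the given number of compactions,
     * starting from a user-chosen k.
     */
    static double[] afterCompactions(int k, long compactions) {
        int numSections = INIT_NUMBER_OF_SECTIONS;
        double sectionSize = k; // kept as a double; rounding is ignored in this sketch
        for (long state = 1; state <= compactions; state++) {
            // Double the number of sections once state exceeds 2^(numSections - 1),
            // and shrink each section by a factor of sqrt(2).
            if (state > (1L << (numSections - 1))) {
                numSections *= 2;
                sectionSize /= Math.sqrt(2.0);
            }
        }
        return new double[] {numSections, sectionSize};
    }

    public static void main(String[] args) {
        for (long c : new long[] {4, 5, 33}) {
            double[] s = afterCompactions(12, c);
            System.out.printf("compactions=%d numSections=%.0f sectionSize=%.3f product=%.3f%n",
                c, s[0], s[1], s[0] * s[1]);
        }
    }
}
```

The product numSections * sectionSize stands in for the nominal compactor size up to a constant factor, so each doubling step multiplies it by roughly sqrt(2), as described above.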

This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.

  • The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.
  • The Python prototype only implemented a comparison criterion of "≤". This implementation allows the user to switch back and forth between the "≤" criterion and the "<" criterion.
  • This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
Author:
Edo Liberty, Pavel Vesely, Lee Rhodes
  • Method Summary

    Modifier and Type Method Description
    static ReqSketchBuilder builder()
    Returns a new ReqSketchBuilder
    double[] getCDF​(float[] splitPoints)
    Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
    boolean getHighRankAccuracy()
    If true, the high ranks are prioritized for better accuracy.
    float getMaxValue()
    Gets the largest value seen by this sketch
    float getMinValue()
    Gets the smallest value seen by this sketch
    long getN()
    Gets the total number of items offered to the sketch.
    double[] getPMF​(float[] splitPoints)
    Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
    float getQuantile​(double normRank)
    Gets the approximate quantile of the given normalized rank based on the lteq criterion.
    float[] getQuantiles​(double[] normRanks)
    Gets an array of quantiles that correspond to the given array of normalized ranks.
    double getRank​(float value)
    Computes the normalized rank of the given value in the stream.
    double getRankLowerBound​(double rank, int numStdDev)
    Returns an approximate lower bound rank of the given normalized rank.
    double[] getRanks​(float[] values)
    Gets an array of normalized ranks that correspond to the given array of values.
    double getRankUpperBound​(double rank, int numStdDev)
    Returns an approximate upper bound rank of the given rank.
    int getRetainedItems()
    Gets the number of retained items of this sketch
    double getRSE​(int k, double rank, boolean hra, long totalN)
    Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
    int getSerializationBytes()
    Gets the number of bytes when serialized.
    static ReqSketch heapify​(org.apache.datasketches.memory.Memory mem)
    Returns a ReqSketch on the heap from a Memory image of the sketch.
    boolean isEmpty()
    Returns true if this sketch is empty.
    boolean isEstimationMode()
    Returns true if this sketch is in estimation mode.
    boolean isLessThanOrEqual()
    Returns the current comparison criterion.
    ReqIterator iterator()
    Returns an iterator for all the items in this sketch.
    ReqSketch merge​(ReqSketch other)
    Merge other sketch into this one.
    ReqSketch reset()
    Resets this sketch by removing all data and setting all data related variables to their virgin state.
    ReqSketch setLessThanOrEqual​(boolean ltEq)
    Sets the chosen criterion for value comparison
    byte[] toByteArray()
    Returns a byte array representation of this sketch.
    String toString()
    Returns a summary of the key parameters of the sketch.
    void update​(float item)
    Updates this sketch with the given item.
    String viewCompactorDetail​(String fmt, boolean allData)
    A detailed, human readable view of the sketch compactors and their data.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Method Details

    • heapify

      public static ReqSketch heapify​(org.apache.datasketches.memory.Memory mem)
      Returns a ReqSketch on the heap from a Memory image of the sketch.
      Parameters:
      mem - The Memory object holding a valid image of a ReqSketch
      Returns:
      a ReqSketch on the heap from a Memory image of the sketch.
    • builder

      public static final ReqSketchBuilder builder()
      Returns a new ReqSketchBuilder
      Returns:
      a new ReqSketchBuilder
    • getCDF

      public double[] getCDF​(float[] splitPoints)
      Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained, a priori, from the getRSE(int, double, boolean, long) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
      Returns:
      an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
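The relationship between the two returned arrays (CDF position j is the sum of PMF positions 0 through j) amounts to a running sum. The following standalone sketch shows only that arithmetic; the class and method names are hypothetical and it does not call the sketch itself.

```java
// Illustrative only: derives a CDF array from a PMF array by a running sum.
final class PmfToCdf {
    /** cdf[j] = pmf[0] + pmf[1] + ... + pmf[j]. */
    static double[] toCdf(double[] pmf) {
        double[] cdf = new double[pmf.length];
        double sum = 0.0;
        for (int j = 0; j < pmf.length; j++) {
            sum += pmf[j];
            cdf[j] = sum;
        }
        return cdf;
    }

    public static void main(String[] args) {
        // Three interval masses, as would result from two split points.
        double[] cdf = toCdf(new double[] {0.25, 0.25, 0.5});
        for (double c : cdf) {
            System.out.println(c);
        }
    }
}
```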
    • getHighRankAccuracy

      public boolean getHighRankAccuracy()
      If true, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. This state is chosen during sketch construction.
      Returns:
      the high ranks accuracy state.
    • getMaxValue

      public float getMaxValue()
      Gets the largest value seen by this sketch
      Returns:
      the largest value seen by this sketch
    • getMinValue

      public float getMinValue()
      Gets the smallest value seen by this sketch
      Returns:
      the smallest value seen by this sketch
    • getN

      public long getN()
      Gets the total number of items offered to the sketch.
      Returns:
      the total number of items offered to the sketch.
    • getPMF

      public double[] getPMF​(float[] splitPoints)
      Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained, a priori, from the getRSE(int, double, boolean, long) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
      Returns:
      an array of m+1 doubles each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
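To make the interval convention concrete, the sketch below computes the exact interval masses for a small in-memory array, which is the quantity getPMF approximates for a stream. It is a standalone illustration with hypothetical names, not the sketch's algorithm; it assumes the split points are already sorted in increasing order and mirrors the documented behavior of returning null for empty input.

```java
// Illustrative only: exact interval masses for a small array of values.
final class ExactPmf {
    /**
     * m split points give m+1 intervals. Each interval includes its left
     * split point and excludes its right one; the last interval includes the maximum.
     */
    static double[] exactPmf(float[] values, float[] splitPoints) {
        if (values.length == 0) {
            return null; // mirrors the documented result for an empty sketch
        }
        double[] pmf = new double[splitPoints.length + 1];
        for (float v : values) {
            int bucket = 0;
            // The bucket index is the number of split points that are <= v.
            while (bucket < splitPoints.length && splitPoints[bucket] <= v) {
                bucket++;
            }
            pmf[bucket] += 1.0;
        }
        for (int i = 0; i < pmf.length; i++) {
            pmf[i] /= values.length;
        }
        return pmf;
    }

    public static void main(String[] args) {
        float[] values = {5f, 10f, 15f, 20f, 25f};
        float[] splits = {10f, 20f};
        // 5 falls below 10; 10 and 15 fall in [10, 20); 20 and 25 fall in [20, max].
        for (double mass : exactPmf(values, splits)) {
            System.out.println(mass);
        }
    }
}
```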
    • getQuantile

      public float getQuantile​(double normRank)
      Gets the approximate quantile of the given normalized rank based on the lteq criterion. The normalized rank must be in the range [0.0, 1.0] (inclusive, inclusive).
      Parameters:
      normRank - the given normalized rank
      Returns:
      the approximate quantile given the normalized rank.
    • getQuantiles

      public float[] getQuantiles​(double[] normRanks)
      Gets an array of quantiles that correspond to the given array of normalized ranks.
      Parameters:
      normRanks - the given array of normalized ranks.
      Returns:
      the array of quantiles that correspond to the given array of normalized ranks. See getQuantile(double)
    • getRank

      public double getRank​(float value)
      Computes the normalized rank of the given value in the stream. The normalized rank is the fraction of values less than the given value; or if lteq is true, the fraction of values less than or equal to the given value.
      Parameters:
      value - the given value
      Returns:
      the normalized rank of the given value in the stream.
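The difference between the two comparison criteria is easiest to see on exact data. The sketch below computes the exact normalized rank of a value in a small array under either criterion; getRank approximates this quantity for a stream. The names are hypothetical and the code does not use the sketch.

```java
// Illustrative only: exact normalized rank under the "<" or "<=" criterion.
final class ExactRank {
    /** Fraction of values below the given value, or at or below it when ltEq is true. */
    static double rank(float[] values, float value, boolean ltEq) {
        int count = 0;
        for (float v : values) {
            if (ltEq ? v <= value : v < value) {
                count++;
            }
        }
        return (double) count / values.length;
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 2f, 3f};
        System.out.println(rank(values, 2f, false)); // only 1 is < 2
        System.out.println(rank(values, 2f, true));  // 1, 2 and 2 are <= 2
    }
}
```

Duplicated values are where the two criteria diverge: with no values equal to the query, both give the same rank.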
    • getRanks

      public double[] getRanks​(float[] values)
      Gets an array of normalized ranks that correspond to the given array of values.
      Parameters:
      values - the given array of values.
      Returns:
      the array of normalized ranks that correspond to the given array of values. See getRank(float)
    • getRankLowerBound

      public double getRankLowerBound​(double rank, int numStdDev)
      Returns an approximate lower bound rank of the given normalized rank.
      Parameters:
      rank - the given rank, a value between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate lower bound rank.
    • getRankUpperBound

      public double getRankUpperBound​(double rank, int numStdDev)
      Returns an approximate upper bound rank of the given rank.
      Parameters:
      rank - the given rank, a value between 0 and 1.0.
      numStdDev - the number of standard deviations. Must be 1, 2, or 3.
      Returns:
      an approximate upper bound rank.
    • getRetainedItems

      public int getRetainedItems()
      Gets the number of retained items of this sketch
      Returns:
      the number of retained items of this sketch
    • getRSE

      public double getRSE​(int k, double rank, boolean hra, long totalN)
      Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were modified based on empirical measurements.
      Parameters:
      k - the given value of k
      rank - the given normalized rank, a number in [0,1].
      hra - if true High Rank Accuracy mode is being selected, otherwise, Low Rank Accuracy.
      totalN - an estimate of the total number of items submitted to the sketch.
      Returns:
      an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
    • getSerializationBytes

      public int getSerializationBytes()
      Gets the number of bytes when serialized.
      Returns:
      the number of bytes when serialized.
    • isEmpty

      public boolean isEmpty()
      Returns true if this sketch is empty.
      Returns:
      empty flag
    • isEstimationMode

      public boolean isEstimationMode()
      Returns true if this sketch is in estimation mode.
      Returns:
      estimation mode flag
    • isLessThanOrEqual

      public boolean isLessThanOrEqual()
      Returns the current comparison criterion. If true, the value comparison criterion is ≤; otherwise it is the default, which is <.
      Returns:
      the current comparison criterion
    • iterator

      public ReqIterator iterator()
      Returns an iterator for all the items in this sketch.
      Returns:
      an iterator for all the items in this sketch.
    • merge

      public ReqSketch merge​(ReqSketch other)
      Merge other sketch into this one. The other sketch is not modified.
      Parameters:
      other - sketch to be merged into this one.
      Returns:
      this
    • reset

      public ReqSketch reset()
      Resets this sketch by removing all data and setting all data related variables to their virgin state. The parameters k, highRankAccuracy, reqDebug and LessThanOrEqual will not change.
      Returns:
      this
    • setLessThanOrEqual

      public ReqSketch setLessThanOrEqual​(boolean ltEq)
      Sets the chosen criterion for value comparison
      Parameters:
      ltEq - (Less-than-or Equals) If true, the sketch will use the ≤ criterion for comparing values. Otherwise, the criterion is strictly <, the default. This can be set anytime prior to a getRank(float) or getQuantile(double) or equivalent query.
      Returns:
      this
    • toByteArray

      public byte[] toByteArray()
      Returns a byte array representation of this sketch.
      Returns:
      a byte array representation of this sketch.
    • toString

      public String toString()
      Returns a summary of the key parameters of the sketch.
      Returns:
      a summary of the key parameters of the sketch.
    • update

      public void update​(float item)
      Updates this sketch with the given item.
      Parameters:
      item - the given item
    • viewCompactorDetail

      public String viewCompactorDetail​(String fmt, boolean allData)
      A detailed, human readable view of the sketch compactors and their data. Each compactor string is prepended by the compactor lgWeight, the current number of retained items of the compactor and the current nominal capacity of the compactor.
      Parameters:
      fmt - the format string for the data items; example: "%4.0f".
      allData - if true, all the retained items for the sketch will be output by compactor level. Otherwise, just a summary will be output.
      Returns:
      a detailed view of the compactors and their data