Class DoublesSketch

java.lang.Object
org.apache.datasketches.quantiles.DoublesSketch
Direct Known Subclasses:
CompactDoublesSketch, UpdateDoublesSketch

public abstract class DoublesSketch
extends Object
This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of real values from a very large stream in a single pass. The analysis is obtained using the getQuantiles(*) functions or their inverse functions: the Probability Mass Function from getPMF(*) and the Cumulative Distribution Function from getCDF(*).

Consider a large stream of one million values such as packet sizes coming into a network node. The absolute rank of any specific size value is simply its index in the hypothetical sorted array of values. The normalized rank (or fractional rank) is the absolute rank divided by the stream size, in this case one million. The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median value of the distribution, or getQuantile(0.5). Similarly, the 95th percentile is obtained from getQuantile(0.95). Calling getQuantiles(0.0, 1.0) returns the min and max values seen by the sketch.
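To make the rank definitions concrete, the following is a minimal sketch of the exact computation that the sketch approximates. It is not the sketch's algorithm: it sorts all values in memory, which the sketch is designed to avoid. The class name ExactQuantiles and its methods are hypothetical and are not part of the library.

```java
import java.util.Arrays;

/** Exact reference computation of ranks and quantiles. Hypothetical helper, not library code. */
final class ExactQuantiles {

  /** Normalized rank: the fraction of items that are strictly less than the given value. */
  static double normalizedRank(double[] sorted, double value) {
    int absoluteRank = 0;
    for (double v : sorted) {
      if (v < value) { absoluteRank++; }
    }
    return (double) absoluteRank / sorted.length;
  }

  /** The item preceded by the given fraction of the sorted array; 1.0 maps to the max. */
  static double quantile(double[] sorted, double fraction) {
    int index = (int) Math.min(sorted.length - 1, Math.floor(fraction * sorted.length));
    return sorted[index];
  }

  public static void main(String[] args) {
    double[] sizes = {40, 1500, 64, 576, 128, 1000, 256, 512, 900, 100};
    Arrays.sort(sizes);
    System.out.println("median = " + quantile(sizes, 0.5));
    System.out.println("min    = " + quantile(sizes, 0.0));
    System.out.println("max    = " + quantile(sizes, 1.0));
    System.out.println("rank of 512 = " + normalizedRank(sizes, 512));
  }
}
```

With ten items, the item at index 5 of the sorted array is preceded by half of the stream, so it is the value at normalized rank 0.5.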

Given min and max values of, for example, 1 and 1000 bytes, you can obtain the PMF from getPMF(100, 500, 900), which returns an array of 4 fractional values such as {.4, .3, .2, .1}, meaning that

  • 40% of the values were < 100,
  • 30% of the values were ≥ 100 and < 500,
  • 20% of the values were ≥ 500 and < 900, and
  • 10% of the values were ≥ 900.
A frequency histogram can be obtained by simply multiplying these fractions by getN(), which is the total count of values received. The getCDF(*) works similarly, but produces the cumulative distribution instead.
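As an illustration of the histogram step, this minimal sketch multiplies the example PMF fractions by a stream length of one million. The class name PmfHistogram and the method toCounts are hypothetical; in real use the fractions would come from getPMF(*) and the length from getN().

```java
import java.util.Arrays;

/** Converts PMF fractions into approximate bucket counts. Hypothetical helper, not library code. */
final class PmfHistogram {

  /** Multiplies each fraction by n and rounds to the nearest whole count. */
  static long[] toCounts(double[] pmf, long n) {
    long[] counts = new long[pmf.length];
    for (int i = 0; i < pmf.length; i++) {
      counts[i] = Math.round(pmf[i] * n);
    }
    return counts;
  }

  public static void main(String[] args) {
    double[] pmf = {0.4, 0.3, 0.2, 0.1}; // example fractions from the text above
    long n = 1_000_000L;                 // assumed stream length
    // prints [400000, 300000, 200000, 100000]
    System.out.println(Arrays.toString(toCounts(pmf, n)));
  }
}
```

Because the fractions are approximations, the resulting counts are estimates and need not sum exactly to n.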

The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 128 produces a normalized rank error of about 1.7%. For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetical sorted array of input values at normalized ranks of 0.483 and 0.517, with a confidence of about 99%.
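The rank interval in this example is simple arithmetic on the normalized rank error. The following minimal sketch reproduces it, assuming an error of 0.017 for k = 128 as quoted above. The class name RankInterval and the clamping to [0, 1] are illustrative assumptions, not library behavior.

```java
/** Computes the normalized-rank interval implied by a rank error. Hypothetical helper. */
final class RankInterval {

  /** Returns {lower, upper} normalized ranks around the given rank, clamped to [0, 1]. */
  static double[] bounds(double rank, double epsilon) {
    double lower = Math.max(0.0, rank - epsilon);
    double upper = Math.min(1.0, rank + epsilon);
    return new double[] {lower, upper};
  }

  public static void main(String[] args) {
    double[] b = bounds(0.5, 0.017); // median with about 1.7% rank error
    System.out.printf("[%.3f, %.3f]%n", b[0], b[1]);
  }
}
```

Note that the interval is expressed in ranks, not in values. How far apart the corresponding values are depends on the data distribution.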

Table Guide for DoublesSketch Size in Bytes and Approximate Error:
          K => |      16      32      64     128     256     512   1,024
    ~ Error => | 12.145%  6.359%  3.317%  1.725%  0.894%  0.463%  0.239%
             N | Size in Bytes ->
------------------------------------------------------------------------
             0 |       8       8       8       8       8       8       8
             1 |      72      72      72      72      72      72      72
             3 |      72      72      72      72      72      72      72
             7 |     104     104     104     104     104     104     104
            15 |     168     168     168     168     168     168     168
            31 |     296     296     296     296     296     296     296
            63 |     424     552     552     552     552     552     552
           127 |     552     808   1,064   1,064   1,064   1,064   1,064
           255 |     680   1,064   1,576   2,088   2,088   2,088   2,088
           511 |     808   1,320   2,088   3,112   4,136   4,136   4,136
         1,023 |     936   1,576   2,600   4,136   6,184   8,232   8,232
         2,047 |   1,064   1,832   3,112   5,160   8,232  12,328  16,424
         4,095 |   1,192   2,088   3,624   6,184  10,280  16,424  24,616
         8,191 |   1,320   2,344   4,136   7,208  12,328  20,520  32,808
        16,383 |   1,448   2,600   4,648   8,232  14,376  24,616  41,000
        32,767 |   1,576   2,856   5,160   9,256  16,424  28,712  49,192
        65,535 |   1,704   3,112   5,672  10,280  18,472  32,808  57,384
       131,071 |   1,832   3,368   6,184  11,304  20,520  36,904  65,576
       262,143 |   1,960   3,624   6,696  12,328  22,568  41,000  73,768
       524,287 |   2,088   3,880   7,208  13,352  24,616  45,096  81,960
     1,048,575 |   2,216   4,136   7,720  14,376  26,664  49,192  90,152
     2,097,151 |   2,344   4,392   8,232  15,400  28,712  53,288  98,344
     4,194,303 |   2,472   4,648   8,744  16,424  30,760  57,384 106,536
     8,388,607 |   2,600   4,904   9,256  17,448  32,808  61,480 114,728
    16,777,215 |   2,728   5,160   9,768  18,472  34,856  65,576 122,920
    33,554,431 |   2,856   5,416  10,280  19,496  36,904  69,672 131,112
    67,108,863 |   2,984   5,672  10,792  20,520  38,952  73,768 139,304
   134,217,727 |   3,112   5,928  11,304  21,544  41,000  77,864 147,496
   268,435,455 |   3,240   6,184  11,816  22,568  43,048  81,960 155,688
   536,870,911 |   3,368   6,440  12,328  23,592  45,096  86,056 163,880
 1,073,741,823 |   3,496   6,696  12,840  24,616  47,144  90,152 172,072
 2,147,483,647 |   3,624   6,952  13,352  25,640  49,192  94,248 180,264
 4,294,967,295 |   3,752   7,208  13,864  26,664  51,240  98,344 188,456

 

There is more documentation available on datasketches.apache.org.

This is an implementation of the Low Discrepancy Mergeable Quantiles Sketch, using double values, described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.

This algorithm is independent of the distribution of values, which can be anywhere in the range of the IEEE-754 64-bit doubles.

This algorithm intentionally inserts randomness into the sampling process for values that ultimately get retained in the sketch. The results produced by this algorithm are not deterministic. For example, if the same stream is inserted into two different instances of this sketch, the answers obtained from the two sketches may not be identical.

Similarly, there may be directional inconsistencies. For example, the resulting array of values obtained from getQuantiles(fractions[]) input into the reverse directional query getPMF(splitPoints[]) may not result in the original fractional values.

Author:
Kevin Lang, Lee Rhodes, Jon Malkin
  • Method Summary

    Modifier and Type Method Description
    static DoublesSketchBuilder builder()
    Returns a new builder
    DoublesSketch downSample​(DoublesSketch srcSketch, int smallerK, org.apache.datasketches.memory.WritableMemory dstMem)
    From a source sketch, create a new sketch that must have a smaller value of K.
    double[] getCDF​(double[] splitPoints)
    Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
    int getCompactStorageBytes()
    Returns the number of bytes this sketch would require to store in compact form, which is not updatable.
    static int getCompactStorageBytes​(int k, long n)
    Returns the number of bytes a DoublesSketch would require to store in compact form given the values of k and n.
    int getK()
    Returns the configured value of K
    static int getKFromEpsilon​(double epsilon, boolean pmf)
    Gets the approximate value of k to use given epsilon, the normalized rank error.
    abstract double getMaxValue()
    Returns the max value of the stream.
    abstract double getMinValue()
    Returns the min value of the stream.
    abstract long getN()
    Returns the length of the input stream so far.
    double getNormalizedRankError​(boolean pmf)
    Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
    static double getNormalizedRankError​(int k, boolean pmf)
    Gets the normalized rank error given k and pmf.
    double[] getPMF​(double[] splitPoints)
    Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
    double getQuantile​(double fraction)
    This returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.
    double getQuantileLowerBound​(double fraction)
    Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    double[] getQuantiles​(double[] fRanks)
    This is a more efficient multiple-query version of getQuantile().
    double[] getQuantiles​(int evenlySpaced)
    This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.
    double getQuantileUpperBound​(double fraction)
    Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
    double getRank​(double value)
    Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1 inclusive.
    int getRetainedItems()
    Computes the number of retained items (samples) in the sketch
    int getStorageBytes()
    Returns the number of bytes this sketch would require to store in native form: compact for a CompactDoublesSketch, non-compact for an UpdateDoublesSketch.
    int getUpdatableStorageBytes()
    Returns the number of bytes this sketch would require to store in updatable form.
    static int getUpdatableStorageBytes​(int k, long n)
    Returns the number of bytes a sketch would require to store in updatable form.
    static DoublesSketch heapify​(org.apache.datasketches.memory.Memory srcMem)
    Heapify takes the sketch image in Memory and instantiates an on-heap Sketch.
    abstract boolean isDirect()
    Returns true if this sketch is direct
    boolean isEmpty()
    Returns true if this sketch is empty
    boolean isEstimationMode()
    Returns true if this sketch is in estimation mode.
    boolean isSameResource​(org.apache.datasketches.memory.Memory that)
    Returns true if the backing resource of this is identical with the backing resource of that.
    DoublesSketchIterator iterator()  
    void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem)
    Puts the current sketch into the given Memory in compact form if there is sufficient space, otherwise, it throws an error.
    void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem, boolean compact)
    Puts the current sketch into the given Memory if there is sufficient space, otherwise, throws an error.
    byte[] toByteArray()
    Serialize this sketch to a byte array.
    byte[] toByteArray​(boolean compact)
    Serialize this sketch in a byte array form.
    String toString()
    Returns summary information about this sketch.
    String toString​(boolean sketchSummary, boolean dataDetail)
    Returns summary information about this sketch.
    static String toString​(byte[] byteArr)
    Returns a human readable string of the preamble of a byte array image of a DoublesSketch.
    static String toString​(org.apache.datasketches.memory.Memory mem)
    Returns a human readable string of the preamble of a Memory image of a DoublesSketch.
    static DoublesSketch wrap​(org.apache.datasketches.memory.Memory srcMem)
    Wrap this sketch around the given Memory image of a DoublesSketch, compact or non-compact.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Method Details

    • builder

      public static final DoublesSketchBuilder builder()
      Returns a new builder
      Returns:
      a new builder
    • heapify

      public static DoublesSketch heapify​(org.apache.datasketches.memory.Memory srcMem)
      Heapify takes the sketch image in Memory and instantiates an on-heap Sketch. The resulting sketch will not retain any link to the source Memory.
      Parameters:
      srcMem - a Memory image of a Sketch. See Memory
      Returns:
      a heap-based Sketch based on the given Memory
    • wrap

      public static DoublesSketch wrap​(org.apache.datasketches.memory.Memory srcMem)
      Wrap this sketch around the given Memory image of a DoublesSketch, compact or non-compact.
      Parameters:
      srcMem - the given Memory image of a DoublesSketch that may have data.
      Returns:
      a sketch that wraps the given srcMem
    • getQuantile

      public double getQuantile​(double fraction)
      This returns an approximation to the value of the data item that would be preceded by the given fraction of a hypothetical sorted version of the input stream so far.

      We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(), which pays the overhead only once.

      If the sketch is empty this returns Double.NaN.

      Parameters:
      fraction - the specified fractional position in the hypothetical sorted stream. These are also called normalized ranks or fractional ranks. If fraction = 0.0, the true minimum value of the stream is returned. If fraction = 1.0, the true maximum value of the stream is returned.
      Returns:
      the approximation to the value at the above fraction
    • getQuantileUpperBound

      public double getQuantileUpperBound​(double fraction)
      Gets the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the upper bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
    • getQuantileLowerBound

      public double getQuantileLowerBound​(double fraction)
      Gets the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%.
      Parameters:
      fraction - the given normalized rank as a fraction
      Returns:
      the lower bound of the value interval in which the true quantile of the given rank exists with a confidence of at least 99%. Returns NaN if the sketch is empty.
    • getQuantiles

      public double[] getQuantiles​(double[] fRanks)
      This is a more efficient multiple-query version of getQuantile().

      This returns an array that could have been generated by calling getQuantile() with many different fractional ranks, but doing so would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommended that this method be used instead of multiple calls to getQuantile().

      If the sketch is empty this returns null.

      Parameters:
      fRanks - the given array of fractional (or normalized) ranks in the hypothetical sorted stream of all the input values seen so far. These fRanks must all be in the interval [0.0, 1.0] inclusively.
      Returns:
      array of approximate quantiles of the given fRanks in the same order as in the given fRanks array.
    • getQuantiles

      public double[] getQuantiles​(int evenlySpaced)
      This is also a more efficient multiple-query version of getQuantile() and allows the caller to specify the number of evenly spaced fractional ranks.

      If the sketch is empty this returns null.

      Parameters:
      evenlySpaced - an integer that specifies the number of evenly spaced fractional ranks. This must be a positive integer greater than 1. A value of 2 will return the min and the max value. A value of 3 will return the min, the median and the max value, etc.
      Returns:
      array of approximate quantiles at the evenly spaced fractional ranks, in order of increasing rank.
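The evenlySpaced parameter can be pictured as generating fractional ranks from 0.0 to 1.0 in equal steps. The following minimal sketch shows one way to compute such ranks. It is an assumption about the spacing based on the parameter description (2 gives min and max, 3 adds the median), not a copy of the library's internal code, and the class name EvenlySpacedRanks is hypothetical.

```java
import java.util.Arrays;

/** Generates evenly spaced fractional ranks in [0, 1]. Hypothetical helper, not library code. */
final class EvenlySpacedRanks {

  /** Returns evenlySpaced ranks, starting at 0.0 and ending at 1.0. */
  static double[] ranks(int evenlySpaced) {
    if (evenlySpaced < 2) {
      throw new IllegalArgumentException("evenlySpaced must be greater than 1");
    }
    double[] out = new double[evenlySpaced];
    for (int i = 0; i < evenlySpaced; i++) {
      out[i] = (double) i / (evenlySpaced - 1);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(ranks(2))); // ranks for min and max
    System.out.println(Arrays.toString(ranks(3))); // ranks for min, median, max
    System.out.println(Arrays.toString(ranks(5))); // adds the quartiles
  }
}
```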
    • getRank

      public double getRank​(double value)
      Returns an approximation to the normalized (fractional) rank of the given value from 0 to 1 inclusive.

      The resulting approximation has a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns NaN.

      Parameters:
      value - to be ranked
      Returns:
      an approximate rank of the given value
    • getPMF

      public double[] getPMF​(double[] splitPoints)
      Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(true) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing double values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these splitpoints.
      Returns:
      an array of m+1 doubles each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
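The interval convention can be checked against an exact computation on a small data set. The following minimal sketch counts values into left-inclusive, right-exclusive intervals and normalizes by the total count. It is an exact reference for what getPMF approximates, not the sketch's algorithm, and the class name ExactPmf is hypothetical.

```java
import java.util.Arrays;

/** Exact PMF over left-inclusive, right-exclusive intervals. Hypothetical helper. */
final class ExactPmf {

  /** Returns m+1 fractions for m unique, increasing split points. */
  static double[] pmf(double[] values, double[] splitPoints) {
    long[] counts = new long[splitPoints.length + 1];
    for (double v : values) {
      int bucket = 0;
      // Advance past every split point that is <= v, so a value equal to a
      // split point lands in the interval to the right of it.
      while (bucket < splitPoints.length && v >= splitPoints[bucket]) {
        bucket++;
      }
      counts[bucket]++;
    }
    double[] out = new double[counts.length];
    for (int i = 0; i < counts.length; i++) {
      out[i] = (double) counts[i] / values.length;
    }
    return out;
  }

  public static void main(String[] args) {
    double[] values = {50, 99, 100, 250, 499, 500, 700, 900, 950, 1000};
    double[] splitPoints = {100, 500, 900};
    // Bucket counts are 2, 3, 2, 3 out of 10 values.
    System.out.println(Arrays.toString(pmf(values, splitPoints)));
  }
}
```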
    • getCDF

      public double[] getCDF​(double[] splitPoints)
      Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).

      The resulting approximations have a probabilistic guarantee that can be obtained from the getNormalizedRankError(false) function.

      If the sketch is empty this returns null.

      Parameters:
      splitPoints - an array of m unique, monotonically increasing double values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these splitpoints.
      Returns:
      an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
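The stated relationship between the two arrays is a running sum. The following minimal sketch derives a CDF array from a PMF array under that definition. The class name PmfToCdf is hypothetical, and the input fractions are example values rather than output from a real sketch.

```java
import java.util.Arrays;

/** Derives a CDF array from a PMF array by cumulative summation. Hypothetical helper. */
final class PmfToCdf {

  /** Position j of the result is the sum of pmf[0] through pmf[j]. */
  static double[] cumulate(double[] pmf) {
    double[] cdf = new double[pmf.length];
    double sum = 0.0;
    for (int j = 0; j < pmf.length; j++) {
      sum += pmf[j];
      cdf[j] = sum;
    }
    return cdf;
  }

  public static void main(String[] args) {
    double[] pmf = {0.4, 0.3, 0.2, 0.1};
    // Approximately 0.4, 0.7, 0.9, 1.0, subject to floating-point rounding.
    System.out.println(Arrays.toString(cumulate(pmf)));
  }
}
```

The last position of a CDF computed this way is 1.0 up to rounding, since the PMF fractions cover the whole number line.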
    • getK

      public int getK()
      Returns the configured value of K
      Returns:
      the configured value of K
    • getMinValue

      public abstract double getMinValue()
      Returns the min value of the stream. If the sketch is empty this returns Double.NaN.
      Returns:
      the min value of the stream
    • getMaxValue

      public abstract double getMaxValue()
      Returns the max value of the stream. If the sketch is empty this returns Double.NaN.
      Returns:
      the max value of the stream
    • getN

      public abstract long getN()
      Returns the length of the input stream so far.
      Returns:
      the length of the input stream so far
    • getNormalizedRankError

      public double getNormalizedRankError​(boolean pmf)
      Gets the approximate rank error of this sketch normalized as a fraction between zero and one.
      Parameters:
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, returns the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
    • getNormalizedRankError

      public static double getNormalizedRankError​(int k, boolean pmf)
      Gets the normalized rank error given k and pmf. Static method version of the getNormalizedRankError(boolean).
      Parameters:
      k - the configuration parameter
      pmf - if true, returns the "double-sided" normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      Returns:
      if pmf is true, the normalized rank error for the getPMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries.
      See Also:
      KllFloatsSketch
    • getKFromEpsilon

      public static int getKFromEpsilon​(double epsilon, boolean pmf)
      Gets the approximate value of k to use given epsilon, the normalized rank error.
      Parameters:
      epsilon - the normalized rank error between zero and one.
      pmf - if true, this function returns the value of k assuming the input epsilon is the desired "double-sided" epsilon for the getPMF() function. Otherwise, this function returns the value of k assuming the input epsilon is the desired "single-sided" epsilon for all the other queries.
      Returns:
      the value of k given a value of epsilon.
      See Also:
      KllFloatsSketch
    • isEmpty

      public boolean isEmpty()
      Returns true if this sketch is empty
      Returns:
      true if this sketch is empty
    • isDirect

      public abstract boolean isDirect()
      Returns true if this sketch is direct
      Returns:
      true if this sketch is direct
    • isEstimationMode

      public boolean isEstimationMode()
      Returns true if this sketch is in estimation mode.
      Returns:
      true if this sketch is in estimation mode.
    • isSameResource

      public boolean isSameResource​(org.apache.datasketches.memory.Memory that)
      Returns true if the backing resource of this is identical with the backing resource of that. The capacities must be the same. If this is a region, the region offset must also be the same.
      Parameters:
      that - A different non-null object
      Returns:
      true if the backing resource of this is the same as the backing resource of that.
    • toByteArray

      public byte[] toByteArray()
      Serialize this sketch to a byte array. An UpdateDoublesSketch will be serialized in an unordered, non-compact form; a CompactDoublesSketch will be serialized in ordered, compact form. A DirectUpdateDoublesSketch can only wrap a non-compact array, and a DirectCompactDoublesSketch can only wrap a compact array.
      Returns:
      byte array of this sketch
    • toByteArray

      public byte[] toByteArray​(boolean compact)
      Serialize this sketch in a byte array form.
      Parameters:
      compact - if true the sketch will be serialized in compact form. DirectCompactDoublesSketch can wrap() only a compact byte array; DirectUpdateDoublesSketch can wrap() only a non-compact byte array.
      Returns:
      this sketch in a byte array form.
    • toString

      public String toString()
      Returns summary information about this sketch.
      Overrides:
      toString in class Object
    • toString

      public String toString​(boolean sketchSummary, boolean dataDetail)
      Returns summary information about this sketch. Used for debugging.
      Parameters:
      sketchSummary - if true includes sketch summary
      dataDetail - if true includes data detail
      Returns:
      summary information about the sketch.
    • toString

      public static String toString​(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of a DoublesSketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human readable string of the preamble of a byte array image of a DoublesSketch.
    • toString

      public static String toString​(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of a DoublesSketch.
      Parameters:
      mem - the given Memory
      Returns:
      a human readable string of the preamble of a Memory image of a DoublesSketch.
    • downSample

      public DoublesSketch downSample​(DoublesSketch srcSketch, int smallerK, org.apache.datasketches.memory.WritableMemory dstMem)
      From a source sketch, create a new sketch that must have a smaller value of K. The original sketch is not modified.
      Parameters:
      srcSketch - the sourcing sketch
      smallerK - the new sketch's value of K that must be smaller than this value of K. It is required that this.getK() = smallerK * 2^(nonnegative integer).
      dstMem - the destination Memory. It must not overlap the Memory of this sketch. If null, a heap sketch will be returned; otherwise the returned sketch will be off-heap.
      Returns:
      the new sketch.
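The constraint on smallerK can be checked before calling downSample. The following minimal sketch tests whether the ratio of the two K values is a power of two. It follows the stated equation literally, so it accepts a ratio of 2^0 = 1 even though the description also says the new K must be smaller. The class name DownSampleCheck is hypothetical and the library may validate differently.

```java
/** Pre-check for the downSample K constraint. Hypothetical helper, not library code. */
final class DownSampleCheck {

  /** True if currentK == smallerK * 2^j for some nonnegative integer j. */
  static boolean isValidSmallerK(int currentK, int smallerK) {
    if (smallerK < 1 || smallerK > currentK || currentK % smallerK != 0) {
      return false;
    }
    int ratio = currentK / smallerK;
    // A positive integer is a power of two when it has exactly one bit set.
    return (ratio & (ratio - 1)) == 0;
  }

  public static void main(String[] args) {
    System.out.println(isValidSmallerK(128, 32)); // true: 128 = 32 * 2^2
    System.out.println(isValidSmallerK(128, 48)); // false: 48 does not divide 128
    System.out.println(isValidSmallerK(96, 32));  // false: ratio 3 is not a power of two
  }
}
```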
    • getRetainedItems

      public int getRetainedItems()
      Computes the number of retained items (samples) in the sketch
      Returns:
      the number of retained items (samples) in the sketch
    • getCompactStorageBytes

      public int getCompactStorageBytes()
      Returns the number of bytes this sketch would require to store in compact form, which is not updatable.
      Returns:
      the number of bytes this sketch would require to store in compact form.
    • getCompactStorageBytes

      public static int getCompactStorageBytes​(int k, long n)
      Returns the number of bytes a DoublesSketch would require to store in compact form given the values of k and n. The compact form is not updatable.
      Parameters:
      k - the size configuration parameter for the sketch
      n - the number of items input into the sketch
      Returns:
      the number of bytes required to store this sketch in compact form.
    • getStorageBytes

      public int getStorageBytes()
      Returns the number of bytes this sketch would require to store in native form: compact for a CompactDoublesSketch, non-compact for an UpdateDoublesSketch.
      Returns:
      the number of bytes this sketch would require to store in its native form.
    • getUpdatableStorageBytes

      public int getUpdatableStorageBytes()
      Returns the number of bytes this sketch would require to store in updatable form. This uses roughly 2X the storage of the compact form.
      Returns:
      the number of bytes this sketch would require to store in updatable form.
    • getUpdatableStorageBytes

      public static int getUpdatableStorageBytes​(int k, long n)
      Returns the number of bytes a sketch would require to store in updatable form given the values of k and n. This uses roughly 2X the storage of the compact form.
      Parameters:
      k - the size configuration parameter for the sketch
      n - the number of items input into the sketch
      Returns:
      the number of bytes this sketch would require to store in updatable form.
    • putMemory

      public void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem)
      Puts the current sketch into the given Memory in compact form if there is sufficient space, otherwise, it throws an error.
      Parameters:
      dstMem - the given memory.
    • putMemory

      public void putMemory​(org.apache.datasketches.memory.WritableMemory dstMem, boolean compact)
      Puts the current sketch into the given Memory if there is sufficient space, otherwise, throws an error.
      Parameters:
      dstMem - the given memory.
      compact - if true, compacts and sorts the base buffer, which optimizes merge performance at the cost of slightly increased serialization time.
    • iterator

      public DoublesSketchIterator iterator()
      Returns:
      the iterator for this class