Package org.apache.datasketches.req
Class ReqSketch
java.lang.Object
org.apache.datasketches.req.ReqSketch
public class ReqSketch extends Object
This Relative Error Quantiles Sketch is the Java implementation based on the paper
"Relative Error Streaming Quantiles", https://arxiv.org/abs/2004.01668, and loosely derived from
a Python prototype written by Pavel Vesely.
This implementation differs from the algorithm described in the paper in the following ways:
- The algorithm requires no upper bound on the stream length. Instead, each relative-compactor counts the number of compaction operations performed so far (via the variable state). Initially, the relative-compactor starts with INIT_NUMBER_OF_SECTIONS sections. Each time the number of compactions (variable state) exceeds 2^{numSections - 1}, we double numSections. Note that after merging the sketch with another one, the variable state may not correspond to the number of compactions performed at a particular level. However, since the state variable never exceeds the number of compactions, the guarantees of the sketch remain valid.
- The size of each section (variable k and sectionSize in the code and parameter k in the paper) is initialized with a value set by the user via variable k. When the number of sections doubles, we decrease sectionSize by a factor of sqrt(2). This is applied at each level separately. Thus, when we double the number of sections, the nominal compactor size increases by a factor of approx. sqrt(2) (+/- rounding).
- The merge operation here does not perform "special compactions", which are used in the paper to allow for a tight mathematical analysis of the sketch.
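The section-doubling rule described in the first two items above can be sketched as a small simulation. This is a minimal illustration and not the library's code: the class SectionSchedule, its method names, the value 3 used for INIT_NUMBER_OF_SECTIONS, the plain rounding of sectionSize, and the product numSections * sectionSize used as a stand-in for the nominal compactor size are all assumptions made for this example.

```java
/** Illustrative model of how one relative-compactor adapts its section count. */
public class SectionSchedule {
    // Assumed value for illustration; the text above only names the constant.
    static final int INIT_NUMBER_OF_SECTIONS = 3;

    int numSections = INIT_NUMBER_OF_SECTIONS;
    long state = 0;                  // number of compactions counted so far
    private double exactSectionSize; // unrounded section size, starts at k

    SectionSchedule(int k) {
        exactSectionSize = k;
    }

    /** Counts one compaction; returns true if numSections was doubled. */
    boolean recordCompaction() {
        state++;
        // The numSections < 63 guard only keeps the long shift well defined.
        if (numSections < 63 && state > (1L << (numSections - 1))) {
            numSections *= 2;
            exactSectionSize /= Math.sqrt(2.0); // shrink by a factor of sqrt(2)
            return true;
        }
        return false;
    }

    /** Section size after simple rounding (an assumption of this sketch). */
    int sectionSize() {
        return (int) Math.round(exactSectionSize);
    }

    /** Quantity proportional to the nominal compactor size in this model. */
    int sizeProxy() {
        return numSections * sectionSize();
    }

    public static void main(String[] args) {
        SectionSchedule s = new SectionSchedule(12);
        System.out.println("start: numSections=" + s.numSections
            + " sectionSize=" + s.sectionSize() + " sizeProxy=" + s.sizeProxy());
        for (int i = 0; i < 40; i++) {
            if (s.recordCompaction()) {
                System.out.println("compaction " + s.state + ": numSections=" + s.numSections
                    + " sectionSize=" + s.sectionSize() + " sizeProxy=" + s.sizeProxy());
            }
        }
    }
}
```

With k = 12 and the assumed initial value of 3, the first doubling in this model happens on the fifth compaction (numSections goes from 3 to 6, sectionSize from 12 to 8) and the second on the thirty-third (numSections 12, sectionSize 6), so the size proxy grows 36, 48, 72, which is roughly a factor of sqrt(2) per step up to rounding.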
This implementation provides a number of capabilities not discussed in the paper or provided in the Python prototype.
- The Python prototype only implemented high accuracy for low ranks. This implementation provides the user with the ability to choose either high rank accuracy or low rank accuracy at the time of sketch construction.
- The Python prototype only implemented a comparison criterion of "≤". This implementation allows the user to switch back and forth between the "≤" criterion and the "<" criterion.
- This implementation provides extensive debug visibility into the operation of the sketch with two levels of detail output. This is not only useful for debugging, but is a powerful tool to help users understand how the sketch works.
- Author:
- Edo Liberty, Pavel Vesely, Lee Rhodes
-
Method Summary
Modifier and Type | Method | Description
static ReqSketchBuilder | builder() | Returns a new ReqSketchBuilder.
double[] | getCDF(float[] splitPoints) | Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
boolean | getHighRankAccuracy() | If true, the high ranks are prioritized for better accuracy.
float | getMaxValue() | Gets the largest value seen by this sketch.
float | getMinValue() | Gets the smallest value seen by this sketch.
long | getN() | Gets the total number of items offered to the sketch.
double[] | getPMF(float[] splitPoints) | Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
float | getQuantile(double normRank) | Gets the approximate quantile of the given normalized rank based on the lteq criterion.
float[] | getQuantiles(double[] normRanks) | Gets an array of quantiles that correspond to the given array of normalized ranks.
double | getRank(float value) | Computes the normalized rank of the given value in the stream.
double | getRankLowerBound(double rank, int numStdDev) | Returns an approximate lower bound rank of the given normalized rank.
double[] | getRanks(float[] values) | Gets an array of normalized ranks that correspond to the given array of values.
double | getRankUpperBound(double rank, int numStdDev) | Returns an approximate upper bound rank of the given rank.
int | getRetainedItems() | Gets the number of retained items of this sketch.
double | getRSE(int k, double rank, boolean hra, long totalN) | Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
int | getSerializationBytes() | Gets the number of bytes when serialized.
static ReqSketch | heapify(org.apache.datasketches.memory.Memory mem) | Returns an ReqSketch on the heap from a Memory image of the sketch.
boolean | isEmpty() | Returns true if this sketch is empty.
boolean | isEstimationMode() | Returns true if this sketch is in estimation mode.
boolean | isLessThanOrEqual() | Returns the current comparison criterion.
ReqIterator | iterator() | Returns an iterator for all the items in this sketch.
ReqSketch | merge(ReqSketch other) | Merge other sketch into this one.
ReqSketch | reset() | Resets this sketch by removing all data and setting all data related variables to their virgin state.
ReqSketch | setLessThanOrEqual(boolean ltEq) | Sets the chosen criterion for value comparison.
byte[] | toByteArray() | Returns a byte array representation of this sketch.
String | toString() | Returns a summary of the key parameters of the sketch.
void | update(float item) | Updates this sketch with the given item.
String | viewCompactorDetail(String fmt, boolean allData) | A detailed, human readable view of the sketch compactors and their data.
-
Method Details
-
heapify
Returns an ReqSketch on the heap from a Memory image of the sketch.
- Parameters:
mem - The Memory object holding a valid image of an ReqSketch
- Returns:
- an ReqSketch on the heap from a Memory image of the sketch.
-
builder
Returns a new ReqSketchBuilder.
- Returns:
- a new ReqSketchBuilder
-
getCDF
public double[] getCDF(float[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
The resulting approximations have a probabilistic guarantee that can be obtained, a priori, from the getRSE(int, double, boolean, long) function.
If the sketch is empty this returns null.
- Parameters:
splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- Returns:
- an array of m+1 double values, which are a consecutive approximation to the CDF of the input stream given the splitPoints. The value at array position j of the returned CDF array is the sum of the returned values in positions 0 through j of the returned PMF array.
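The relationship between the two returned arrays can be illustrated with exact (non-sketch) arithmetic on a small data set. The following is a minimal sketch that only demonstrates the interval convention and the cumulative sum described above; the class PmfCdfExample and its methods exactPmf and cdfFromPmf are hypothetical names for this example and are not part of the library, whose results are approximations.

```java
public class PmfCdfExample {

    /** Exact PMF over the m+1 intervals defined by m increasing split points. */
    static double[] exactPmf(float[] values, float[] splitPoints) {
        if (values.length == 0) {
            return null; // mirrors the documented behavior for an empty sketch
        }
        long[] counts = new long[splitPoints.length + 1];
        for (float v : values) {
            int interval = 0;
            // Left-inclusive intervals: a value equal to a split point belongs
            // to the interval that starts at that split point.
            while (interval < splitPoints.length && v >= splitPoints[interval]) {
                interval++;
            }
            counts[interval]++;
        }
        double[] pmf = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            pmf[i] = (double) counts[i] / values.length;
        }
        return pmf;
    }

    /** CDF at position j is the sum of PMF positions 0 through j. */
    static double[] cdfFromPmf(double[] pmf) {
        double[] cdf = new double[pmf.length];
        double running = 0.0;
        for (int j = 0; j < pmf.length; j++) {
            running += pmf[j];
            cdf[j] = running;
        }
        return cdf;
    }

    public static void main(String[] args) {
        float[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        float[] splitPoints = {3f, 7f};
        double[] pmf = exactPmf(values, splitPoints);
        double[] cdf = cdfFromPmf(pmf);
        System.out.println(java.util.Arrays.toString(pmf));
        System.out.println(java.util.Arrays.toString(cdf));
    }
}
```

For the values 1 through 10 and split points 3 and 7, the three intervals hold 2, 4 and 4 values, so the exact PMF is approximately 0.2, 0.4, 0.4 and the last CDF entry is approximately 1.0.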
-
getHighRankAccuracy
public boolean getHighRankAccuracy()
If true, the high ranks are prioritized for better accuracy. Otherwise the low ranks are prioritized for better accuracy. This state is chosen during sketch construction.
- Returns:
- the high ranks accuracy state.
-
getMaxValue
public float getMaxValue()
Gets the largest value seen by this sketch.
- Returns:
- the largest value seen by this sketch
-
getMinValue
public float getMinValue()
Gets the smallest value seen by this sketch.
- Returns:
- the smallest value seen by this sketch
-
getN
public long getN()
Gets the total number of items offered to the sketch.
- Returns:
- the total number of items offered to the sketch.
-
getPMF
public double[] getPMF(float[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream given a set of splitPoints (values).
The resulting approximations have a probabilistic guarantee that can be obtained, a priori, from the getRSE(int, double, boolean, long) function.
If the sketch is empty this returns null.
- Parameters:
splitPoints - an array of m unique, monotonically increasing float values that divide the real number line into m+1 consecutive disjoint intervals. The definition of an "interval" is inclusive of the left splitPoint (or minimum value) and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value. It is not necessary to include either the min or max values in these split points.
- Returns:
- an array of m+1 doubles, each of which is an approximation to the fraction of the input stream values (the mass) that fall into one of those intervals. The definition of an "interval" is inclusive of the left splitPoint and exclusive of the right splitPoint, with the exception that the last interval will include the maximum value.
-
getQuantile
public float getQuantile(double normRank)
Gets the approximate quantile of the given normalized rank based on the lteq criterion. The normalized rank must be in the range [0.0, 1.0] (inclusive, inclusive).
- Parameters:
normRank - the given normalized rank
- Returns:
- the approximate quantile given the normalized rank.
-
getQuantiles
public float[] getQuantiles(double[] normRanks)
Gets an array of quantiles that correspond to the given array of normalized ranks.
- Parameters:
normRanks - the given array of normalized ranks.
- Returns:
- the array of quantiles that correspond to the given array of normalized ranks. See getQuantile(double)
-
getRank
public double getRank(float value)
Computes the normalized rank of the given value in the stream. The normalized rank is the fraction of values less than the given value; or if lteq is true, the fraction of values less than or equal to the given value.
- Parameters:
value - the given value
- Returns:
- the normalized rank of the given value in the stream.
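The two definitions of normalized rank can be made concrete with an exact computation on a small array. This is a minimal sketch of the definition only, under the assumption of a hypothetical helper class RankCriterionExample with a method exactRank; it is not the sketch's algorithm, which returns an approximation of this quantity.

```java
public class RankCriterionExample {

    /**
     * Exact normalized rank of v among the values:
     * the fraction of values < v, or the fraction of values <= v when ltEq is true.
     */
    static double exactRank(float[] values, float v, boolean ltEq) {
        int count = 0;
        for (float x : values) {
            if (ltEq ? x <= v : x < v) {
                count++;
            }
        }
        return (double) count / values.length;
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 2f, 3f, 5f};
        // Duplicates of the queried value are where the two criteria differ.
        System.out.println("<  criterion: " + exactRank(values, 2f, false));
        System.out.println("<= criterion: " + exactRank(values, 2f, true));
    }
}
```

For the values 1, 2, 2, 3, 5 and the query value 2, one of five values is strictly less than 2 and three of five are less than or equal to 2, so the two criteria give ranks of about 0.2 and 0.6 respectively.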
-
getRanks
public double[] getRanks(float[] values)
Gets an array of normalized ranks that correspond to the given array of values.
- Parameters:
values - the given array of values.
- Returns:
- the array of normalized ranks that correspond to the given array of values. See getRank(float)
-
getRankLowerBound
public double getRankLowerBound(double rank, int numStdDev)
Returns an approximate lower bound rank of the given normalized rank.
- Parameters:
rank - the given rank, a value between 0 and 1.0.
numStdDev - the number of standard deviations. Must be 1, 2, or 3.
- Returns:
- an approximate lower bound rank.
-
getRankUpperBound
public double getRankUpperBound(double rank, int numStdDev)
Returns an approximate upper bound rank of the given rank.
- Parameters:
rank - the given rank, a value between 0 and 1.0.
numStdDev - the number of standard deviations. Must be 1, 2, or 3.
- Returns:
- an approximate upper bound rank.
-
getRetainedItems
public int getRetainedItems()
Gets the number of retained items of this sketch.
- Returns:
- the number of retained entries of this sketch
-
getRSE
public double getRSE(int k, double rank, boolean hra, long totalN)
Returns an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]). Derived from Lemma 12 in https://arxiv.org/abs/2004.01668v2, but the constant factors were modified based on empirical measurements.
- Parameters:
k - the given value of k
rank - the given normalized rank, a number in [0,1].
hra - if true, High Rank Accuracy mode is being selected; otherwise, Low Rank Accuracy.
totalN - an estimate of the total number of items submitted to the sketch.
- Returns:
- an a priori estimate of relative standard error (RSE, expressed as a number in [0,1]).
-
getSerializationBytes
public int getSerializationBytes()
Gets the number of bytes when serialized.
- Returns:
- the number of bytes when serialized.
-
isEmpty
public boolean isEmpty()
Returns true if this sketch is empty.
- Returns:
- empty flag
-
isEstimationMode
public boolean isEstimationMode()
Returns true if this sketch is in estimation mode.
- Returns:
- estimation mode flag
-
isLessThanOrEqual
public boolean isLessThanOrEqual()
Returns the current comparison criterion. If true, the value comparison criterion is ≤; otherwise it will be the default, which is <.
- Returns:
- the current comparison criterion
-
iterator
Returns an iterator for all the items in this sketch.
- Returns:
- an iterator for all the items in this sketch.
-
merge
Merge other sketch into this one. The other sketch is not modified.
- Parameters:
other - sketch to be merged into this one.
- Returns:
- this
-
reset
Resets this sketch by removing all data and setting all data related variables to their virgin state. The parameters k, highRankAccuracy, reqDebug and LessThanOrEqual will not change.
- Returns:
- this
-
setLessThanOrEqual
Sets the chosen criterion for value comparison.
- Parameters:
ltEq - (less-than-or-equals) If true, the sketch will use the ≤ criterion for comparing values. Otherwise, the criterion is strictly <, the default. This can be set anytime prior to a getRank(float) or getQuantile(double) or equivalent query.
- Returns:
- this
-
toByteArray
public byte[] toByteArray()
Returns a byte array representation of this sketch.
- Returns:
- a byte array representation of this sketch.
-
toString
Returns a summary of the key parameters of the sketch.
- Returns:
- a summary of the key parameters of the sketch.
-
update
public void update(float item)
Updates this sketch with the given item.
- Parameters:
item - the given item
-
viewCompactorDetail
A detailed, human readable view of the sketch compactors and their data. Each compactor string is prepended by the compactor lgWeight, the current number of retained items of the compactor and the current nominal capacity of the compactor.
- Parameters:
fmt - the format string for the data items; example: "%4.0f".
allData - if true, all the retained items for the sketch will be output by compactor level. Otherwise, just a summary will be output.
- Returns:
- a detailed view of the compactors and their data
-