Package org.apache.datasketches.kll

Implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per retained item. See Optimal Quantile Approximation in Streams.

This is a stochastic streaming sketch that enables near-real-time analysis of the approximate distribution of values from a very large stream in a single pass, requiring only that the values are comparable. The analysis is obtained using the getQuantile() or getQuantiles() functions, or the inverse functions getRank(), getPMF() (Probability Mass Function), and getCDF() (Cumulative Distribution Function).

Given an input stream of N numeric values, the absolute rank of any specific value is defined as its index (0 to N-1) in the hypothetical sorted stream of all N input values.

The normalized rank (rank) of any specific value is defined as its absolute rank divided by N. Thus, the normalized rank is a value in the interval [0.0, 1.0). In the documentation and Javadocs for this sketch, absolute rank is never used, so any reference to just "rank" should be interpreted as normalized rank.
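
As a concrete reading of these definitions, the following minimal sketch computes an exact normalized rank by sorting a small stream. The class ExactRank and its method are hypothetical names used only for illustration and are not part of the library. It assumes the rank of a value is the fraction of items strictly less than it, which matches the index-based definition above when the values are distinct.

```java
import java.util.Arrays;

// Illustration only: exact ranks computed by sorting, not a sketch.
final class ExactRank {
  // Normalized rank of v: count of items strictly less than v, divided by N.
  // Assumes a non-empty stream.
  static double normalizedRank(double[] stream, double v) {
    double[] sorted = stream.clone();
    Arrays.sort(sorted);
    int below = 0; // becomes the absolute rank (index in the sorted stream)
    while (below < sorted.length && sorted[below] < v) {
      below++;
    }
    return (double) below / sorted.length;
  }

  public static void main(String[] args) {
    double[] stream = {50, 10, 40, 20, 30};
    System.out.println(normalizedRank(stream, 10)); // smallest item: absolute rank 0 of 5
    System.out.println(normalizedRank(stream, 50)); // largest item: absolute rank 4 of 5
  }
}
```

Because the largest item has absolute rank N-1, its normalized rank is (N-1)/N, which is why the interval is open at 1.0.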

This sketch is configured with a parameter k, which affects the size of the sketch and its estimation error.

In the research literature, the estimation error is commonly called epsilon (or eps) and is a fraction between zero and one. Larger values of k result in smaller values of epsilon. The epsilon error is always with respect to the rank and cannot be applied to the corresponding values.

The relationship between the normalized rank and the corresponding values can be viewed as a two-dimensional monotonic plot with the normalized rank on one axis and the corresponding values on the other axis. If the y-axis is specified as the value-axis and the x-axis as the normalized rank, then y = getQuantile(x) is a monotonically increasing function.

The functions getQuantile(rank) and getQuantiles(...) translate ranks into corresponding values. The functions getRank(value), getCDF(...) (Cumulative Distribution Function), and getPMF(...) (Probability Mass Function) perform the opposite operation and translate values into ranks.
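
The inverse relationship between the two families of functions can be seen with exact stand-ins on a tiny sorted stream. This is only a sketch under stated assumptions: ExactInverse, quantile, and rank are hypothetical names, the index convention floor(rank * N) is an assumption chosen for the illustration, and none of this is the library's implementation.

```java
import java.util.Arrays;

// Illustration only: exact stand-ins for getQuantile(rank) and getRank(value).
final class ExactInverse {
  // Rank -> value: the item at index floor(rank * N) of the sorted stream
  // (assumed convention), clamped to a valid index.
  static double quantile(double[] sorted, double rank) {
    int idx = (int) Math.floor(rank * sorted.length);
    idx = Math.max(0, Math.min(sorted.length - 1, idx));
    return sorted[idx];
  }

  // Value -> rank: fraction of items strictly less than the value.
  static double rank(double[] sorted, double value) {
    int below = 0;
    while (below < sorted.length && sorted[below] < value) {
      below++;
    }
    return (double) below / sorted.length;
  }

  public static void main(String[] args) {
    double[] sorted = {3, 1, 4, 15, 9, 2, 6, 5};
    Arrays.sort(sorted);
    for (int i = 0; i < sorted.length; i++) {
      double r = (double) i / sorted.length;
      double v = quantile(sorted, r);
      // Translating the value back to a rank recovers r for distinct values.
      System.out.println("rank " + r + " -> value " + v + " -> rank " + rank(sorted, v));
    }
  }
}
```

With a real sketch the round trip is only approximate, because each direction carries its own rank error.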

The getPMF(...) function has about 13 to 47% worse rank error (depending on k) than the other queries. The mass of each "bin" of the PMF is obtained by subtracting the estimates at its lower and upper edges, so it carries "double-sided" error: the errors from the two edges can sometimes add.

The default k of 200 yields a "single-sided" epsilon of about 1.33% and a "double-sided" (PMF) epsilon of about 1.65%.
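
The figures quoted above are consistent with a simple power-law fit of epsilon against k. The sketch below is an assumption, not something stated in this documentation: the constants are recalled from the library's implementation and may differ between versions, and KllErrorEstimate is a hypothetical class name. The sketch's own getNormalizedRankError(...) remains the authoritative source.

```java
// Illustration only: an assumed empirical fit of rank error as a function of k.
final class KllErrorEstimate {
  // ASSUMPTION: fitted constants recalled from the library source; not stated
  // in this documentation and possibly version-dependent.
  // pmf = false gives the "single-sided" error, pmf = true the "double-sided" error.
  static double normalizedRankError(int k, boolean pmf) {
    return pmf
        ? 2.446 / Math.pow(k, 0.9433)
        : 2.296 / Math.pow(k, 0.9723);
  }

  public static void main(String[] args) {
    int k = 200;
    System.out.printf("single-sided: %.2f%%%n", 100 * normalizedRankError(k, false));
    System.out.printf("double-sided: %.2f%%%n", 100 * normalizedRankError(k, true));
  }
}
```

Under this assumed fit, the ratio of double-sided to single-sided error grows slowly with k, which agrees with the "13 to 47% worse" range given for getPMF(...).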

A getQuantile(rank) query has the following guarantees:

  • Let v = getQuantile(r) where r is the rank between zero and one.
  • The value v will be a value from the input stream.
  • Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(false).
  • Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%. Note that the error is on the rank, not the value.

A getRank(value) query has the following guarantees:

  • Let r = getRank(v) where v is a value between the min and max values of the input stream.
  • Let trueRank be the true rank of v derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(false).
  • Then r - eps ≤ trueRank ≤ r + eps with a confidence of 99%.

A getPMF(...) query has the following guarantees:

  • Let {r_1, r_2, ..., r_(m+1)} = getPMF(v_1, v_2, ..., v_m) where v_1, v_2, ..., v_m are monotonically increasing values supplied by the user that are part of the monotonic sequence {v_0 = min, v_1, v_2, ..., v_m, v_(m+1) = max}, and where min and max are the actual minimum and maximum values of the input stream, automatically included in the sequence by the getPMF(...) function.
  • Let r_i = mass_i = estimated mass between v_(i-1) and v_i, where v_0 = min and v_(m+1) = max.
  • Let trueMass_i be the true mass between v_(i-1) and v_i derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(true).
  • Then mass_i - eps ≤ trueMass_i ≤ mass_i + eps with a confidence of 99%.
  • r_1 includes the mass of all points between min = v_0 and v_1.
  • r_(m+1) includes the mass of all points between v_m and max = v_(m+1).
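
The shape of the PMF result (m split points producing m+1 masses) can be illustrated with an exact computation on a small stream. This is a sketch under stated assumptions: ExactPmf is a hypothetical name, and the choice that each bin includes its lower edge and excludes its upper edge is an assumption made for the illustration, since the text above only says "between".

```java
import java.util.Arrays;

// Illustration only: exact PMF over user-supplied split points, not a sketch.
final class ExactPmf {
  // Returns m+1 masses for m strictly increasing split points.
  // ASSUMPTION: bin i holds items x with splits[i-1] <= x < splits[i];
  // the first bin holds x < splits[0] and the last bin holds x >= splits[m-1].
  // Assumes a non-empty stream.
  static double[] pmf(double[] stream, double[] splits) {
    double[] mass = new double[splits.length + 1];
    for (double x : stream) {
      int bin = 0;
      while (bin < splits.length && x >= splits[bin]) {
        bin++;
      }
      mass[bin] += 1.0;
    }
    for (int i = 0; i < mass.length; i++) {
      mass[i] /= stream.length; // convert counts to fractions of the stream
    }
    return mass;
  }

  public static void main(String[] args) {
    double[] stream = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    double[] splits = {3, 7};
    // Three bins: [min, 3), [3, 7), [7, max]
    System.out.println(Arrays.toString(pmf(stream, splits)));
  }
}
```

The masses always sum to 1.0, since every item falls into exactly one bin.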

A getCDF(...) query has the following guarantees:

  • Let {r_1, r_2, ..., r_(m+1)} = getCDF(v_1, v_2, ..., v_m) where v_1, v_2, ..., v_m are monotonically increasing values supplied by the user that are part of the monotonic sequence {v_0 = min, v_1, v_2, ..., v_m, v_(m+1) = max}, and where min and max are the actual minimum and maximum values of the input stream, automatically included in the sequence by the getCDF(...) function.
  • Let r_i = mass_i = estimated mass between v_0 = min and v_i.
  • Let trueMass_i be the true mass between v_0 = min and v_i derived from the hypothetical sorted stream of all N values.
  • Let eps = getNormalizedRankError(true).
  • Then mass_i - eps ≤ trueMass_i ≤ mass_i + eps with a confidence of 99%.
  • r_1 includes the mass of all points between min = v_0 and v_1.
  • r_(m+1) includes the mass of all points between min = v_0 and max = v_(m+1).
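
The cumulative form can be illustrated the same way with an exact computation. This is only a sketch: ExactCdf is a hypothetical name, and treating each entry as the fraction of items strictly less than the corresponding split point is an assumption made for the illustration.

```java
import java.util.Arrays;

// Illustration only: exact CDF over user-supplied split points, not a sketch.
final class ExactCdf {
  // Returns m+1 cumulative masses for m strictly increasing split points.
  // ASSUMPTION: entry i is the fraction of items strictly less than splits[i];
  // the final entry spans min to max and is therefore 1.0.
  // Assumes a non-empty stream.
  static double[] cdf(double[] stream, double[] splits) {
    double[] out = new double[splits.length + 1];
    for (int i = 0; i < splits.length; i++) {
      int below = 0;
      for (double x : stream) {
        if (x < splits[i]) {
          below++;
        }
      }
      out[i] = (double) below / stream.length;
    }
    out[splits.length] = 1.0;
    return out;
  }

  public static void main(String[] args) {
    double[] stream = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    double[] splits = {3, 7};
    System.out.println(Arrays.toString(cdf(stream, splits)));
  }
}
```

Differences between adjacent entries of this result give the per-bin masses that a PMF over the same split points would report.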

From the above, it might seem like we could make some estimates to bound the value returned from a call to getQuantile(). The sketch, however, does not let us derive error bounds or confidences around values. Because errors are independent, we can approximately bracket a value as shown below, but there are no error estimates available. Additionally, the interval may be quite large for certain distributions.

  • Let v = getQuantile(r), the estimated quantile value of rank r.
  • Let eps = getNormalizedRankError(false).
  • Let vlo = estimated quantile value of rank (r - eps).
  • Let vhi = estimated quantile value of rank (r + eps).
  • Then vlo ≤ v ≤ vhi, with 99% confidence.
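
The bracketing steps above can be sketched with an exact stand-in for getQuantile(). QuantileBracket and its methods are hypothetical names, the index convention floor(rank * N) is an assumption, and the eps value is passed in by the caller rather than obtained from a sketch. The skewed example stream shows how a narrow rank window can map to a wide value window.

```java
import java.util.Arrays;

// Illustration only: bracketing a quantile value using a rank error eps.
final class QuantileBracket {
  // Exact stand-in for getQuantile(rank): item at index floor(rank * N)
  // of the sorted stream (assumed convention), clamped to a valid index.
  static double quantile(double[] sorted, double rank) {
    int idx = (int) Math.floor(rank * sorted.length);
    idx = Math.max(0, Math.min(sorted.length - 1, idx));
    return sorted[idx];
  }

  // Returns {vlo, v, vhi}: the quantiles at ranks r - eps, r, and r + eps,
  // with the shifted ranks clamped to [0, 1].
  static double[] bracket(double[] stream, double r, double eps) {
    double[] sorted = stream.clone();
    Arrays.sort(sorted);
    double lo = Math.max(0.0, r - eps);
    double hi = Math.min(1.0, r + eps);
    return new double[] {quantile(sorted, lo), quantile(sorted, r), quantile(sorted, hi)};
  }

  public static void main(String[] args) {
    // Skewed stream: each item is double the previous one.
    double[] stream = {128, 1, 64, 2, 32, 4, 16, 8};
    double[] b = bracket(stream, 0.5, 0.125);
    System.out.println("vlo=" + b[0] + " v=" + b[1] + " vhi=" + b[2]);
  }
}
```

Here a rank window of ±0.125 around the median spans values from 8 to 32, a factor of four, even though the rank error itself is fixed.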

Please visit our website: DataSketches Home Page and the Javadocs for more information.

Author:
Kevin Lang, Alexander Saydakov, Lee Rhodes