Class HllSketch

java.lang.Object
org.apache.datasketches.hll.HllSketch

public class HllSketch
extends Object
This is a high-performance implementation of Philippe Flajolet’s HLL sketch, but with significantly improved error behavior. If the ONLY use case for sketching is counting uniques and merging, the HLL sketch is a reasonable choice, although the highest performing in terms of accuracy for storage space consumed is CPC (Compressed Probabilistic Counting). For large enough counts, this HLL version (with HLL_4) can be 2 to 16 times smaller than the Theta sketch family for the same accuracy.

This implementation offers three different types of HLL sketch, each with different trade-offs among accuracy, space and performance. These types are specified with the TgtHllType parameter.

In terms of accuracy, all three types, for the same lgConfigK, have the same error distribution as a function of n, the number of unique values fed to the sketch. The configuration parameter lgConfigK is the log-base-2 of K, where K is the number of buckets or slots for the sketch.

During warmup, when the sketch has only received a small number of unique items (up to about 10% of K), this implementation leverages a new class of estimator algorithms with significantly better accuracy.
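As a rough illustration of the relationship between lgConfigK and K described above, the following sketch computes K and an approximate warmup boundary. It uses only the Java standard library; the class and method names (LgConfigKExample, numBuckets, approxWarmupLimit) are hypothetical and are not part of the HllSketch API. The 4 to 21 range is the one stated for lgConfigK in the constructor documentation, and the 10% figure is the approximation given in the text, not an exact threshold.

```java
final class LgConfigKExample {
    // Range for lgConfigK as stated in the constructor documentation.
    static final int MIN_LG_K = 4;
    static final int MAX_LG_K = 21;

    /** Returns K = 2^lgConfigK, the number of buckets, after a range check. */
    public static int numBuckets(int lgConfigK) {
        if (lgConfigK < MIN_LG_K || lgConfigK > MAX_LG_K) {
            throw new IllegalArgumentException(
                "lgConfigK must be between 4 and 21 inclusive: " + lgConfigK);
        }
        return 1 << lgConfigK;
    }

    /** Approximate warmup boundary: about 10% of K unique items (illustrative only). */
    public static int approxWarmupLimit(int lgConfigK) {
        return numBuckets(lgConfigK) / 10;
    }

    public static void main(String[] args) {
        for (int lgK : new int[] {4, 12, 21}) {
            System.out.println("lgConfigK=" + lgK
                + " K=" + numBuckets(lgK)
                + " warmup up to about " + approxWarmupLimit(lgK) + " uniques");
        }
    }
}
```

Each increment of lgConfigK doubles K, so it roughly doubles the space used by the sketch for a given TgtHllType.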

This sketch also offers the capability of operating off-heap. Given a WritableMemory object created by the user, the sketch will perform all of its updates and internal phase transitions in that object, which can actually reside either on-heap or off-heap based on how it is configured. In large systems that must update and merge many millions of sketches, having the sketch operate off-heap avoids the serialization and deserialization costs of moving sketches to and from off-heap memory-mapped files, for example, and eliminates big garbage collection delays.

Author:
Lee Rhodes, Kevin Lang
  • Field Summary

    Fields 
    Modifier and Type Field Description
    static TgtHllType DEFAULT_HLL_TYPE
    The default HLL-TYPE is HLL_4
    static int DEFAULT_LG_K
    The default Log_base2 of K
  • Constructor Summary

    Constructors 
    Constructor Description
    HllSketch()
    Constructs a new on-heap sketch with the default lgConfigK and tgtHllType.
    HllSketch​(int lgConfigK)
    Constructs a new on-heap sketch with the default tgtHllType.
    HllSketch​(int lgConfigK, TgtHllType tgtHllType)
    Constructs a new on-heap sketch with the type of HLL sketch to configure.
    HllSketch​(int lgConfigK, TgtHllType tgtHllType, org.apache.datasketches.memory.WritableMemory dstMem)
    Constructs a new sketch with the type of HLL sketch to configure and the given WritableMemory as the destination for the sketch.
  • Method Summary

    Modifier and Type Method Description
    HllSketch copy()
    Return a copy of this sketch onto the Java heap.
    HllSketch copyAs​(TgtHllType tgtHllType)
    Return a deep copy of this sketch onto the Java heap with the specified TgtHllType.
    int getCompactSerializationBytes()
    Gets the size in bytes of the current sketch when serialized using toCompactByteArray().
    double getCompositeEstimate()
    This is less accurate than the getEstimate() method and is automatically used when the sketch has gone through union operations where the more accurate HIP estimator cannot be used.
    double getEstimate()
    Return the cardinality estimate
    int getLgConfigK()
    Gets the lgConfigK.
    double getLowerBound​(int numStdDev)
    Gets the approximate lower error bound given the specified number of Standard Deviations.
    static int getMaxUpdatableSerializationBytes​(int lgConfigK, TgtHllType tgtHllType)
    Returns the maximum size in bytes that this sketch can grow to given lgConfigK.
    double getRelErr​(boolean upperBound, boolean unioned, int lgConfigK, int numStdDev)
    Gets the current (approximate) Relative Error (RE) asymptotic values given several parameters.
    static int getSerializationVersion()
    Returns the current serialization version.
    static int getSerializationVersion​(org.apache.datasketches.memory.Memory mem)
    Returns the current serialization version of the given Memory.
    TgtHllType getTgtHllType()
    Gets the TgtHllType
    int getUpdatableSerializationBytes()
    Gets the size in bytes of the current sketch when serialized using toUpdatableByteArray().
    double getUpperBound​(int numStdDev)
    Gets the approximate upper error bound given the specified number of Standard Deviations.
    static HllSketch heapify​(byte[] byteArray)
    Heapify the given byte array, which must be a valid HllSketch image and may have data.
    static HllSketch heapify​(org.apache.datasketches.memory.Memory srcMem)
    Heapify the given Memory, which must be a valid HllSketch image and may have data.
    boolean isCompact()
    Returns true if the backing memory of this sketch is in compact form.
    boolean isEmpty()
    Returns true if empty
    boolean isEstimationMode()
    This HLL family of sketches and operators is always estimating, even for very small values.
    boolean isMemory()
    Returns true if this sketch was created using Memory.
    boolean isOffHeap()
    Returns true if the backing memory for this sketch is off-heap.
    boolean isSameResource​(org.apache.datasketches.memory.Memory mem)
    Returns true if the given Memory refers to the same underlying resource as this sketch.
    void reset()
    Resets to empty, but does not change the configured values of lgConfigK and tgtHllType.
    byte[] toCompactByteArray()
    Serializes this sketch as a byte array in compact form.
    String toString()
    Human readable summary as a string.
    String toString​(boolean summary, boolean detail, boolean auxDetail)
    Human readable summary with optional detail.
    String toString​(boolean summary, boolean detail, boolean auxDetail, boolean all)
    Human readable summary with optional detail
    static String toString​(byte[] byteArr)
    Returns a human readable string of the preamble of a byte array image of an HllSketch.
    static String toString​(org.apache.datasketches.memory.Memory mem)
    Returns a human readable string of the preamble of a Memory image of an HllSketch.
    byte[] toUpdatableByteArray()
    Serializes this sketch as a byte array in an updatable form.
    void update​(byte[] data)
    Present the given byte array as a potential unique item.
    void update​(char[] data)
    Present the given char array as a potential unique item.
    void update​(double datum)
    Present the given double (or float) datum as a potential unique item.
    void update​(int[] data)
    Present the given integer array as a potential unique item.
    void update​(long datum)
    Present the given long as a potential unique item.
    void update​(long[] data)
    Present the given long array as a potential unique item.
    void update​(String datum)
    Present the given String as a potential unique item.
    void update​(ByteBuffer data)
    Present the given byte buffer as a potential unique item.
    static HllSketch wrap​(org.apache.datasketches.memory.Memory srcMem)
    Wraps the given read-only Memory that must be an image of a valid sketch, which may be in compact or updatable form, and should have data.
    static HllSketch writableWrap​(org.apache.datasketches.memory.WritableMemory srcWmem)
    Wraps the given WritableMemory, which must be an image of a valid updatable sketch, and may have data.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

  • Constructor Details

    • HllSketch

      public HllSketch()
      Constructs a new on-heap sketch with the default lgConfigK and tgtHllType.
    • HllSketch

      public HllSketch​(int lgConfigK)
      Constructs a new on-heap sketch with the default tgtHllType.
      Parameters:
      lgConfigK - The Log2 of K for the target HLL sketch. This value must be between 4 and 21 inclusively.
    • HllSketch

      public HllSketch​(int lgConfigK, TgtHllType tgtHllType)
      Constructs a new on-heap sketch with the type of HLL sketch to configure.
      Parameters:
      lgConfigK - The Log2 of K for the target HLL sketch. This value must be between 4 and 21 inclusively.
      tgtHllType - the desired Hll type.
    • HllSketch

      public HllSketch​(int lgConfigK, TgtHllType tgtHllType, org.apache.datasketches.memory.WritableMemory dstMem)
      Constructs a new sketch with the type of HLL sketch to configure and the given WritableMemory as the destination for the sketch. This WritableMemory is usually configured for off-heap memory. What remains on the java heap is a thin wrapper object that reads and writes to the given WritableMemory.

      The given dstMem is checked for the required capacity as determined by getMaxUpdatableSerializationBytes(int, TgtHllType).

      Parameters:
      lgConfigK - The Log2 of K for the target HLL sketch. This value must be between 4 and 21 inclusively.
      tgtHllType - the desired Hll type.
      dstMem - the destination memory for the sketch.
  • Method Details

    • heapify

      public static final HllSketch heapify​(byte[] byteArray)
      Heapify the given byte array, which must be a valid HllSketch image and may have data.
      Parameters:
      byteArray - the given byte array. This byteArray is not modified and is not retained by the on-heap sketch.
      Returns:
      an HllSketch on the java heap.
    • heapify

      public static final HllSketch heapify​(org.apache.datasketches.memory.Memory srcMem)
      Heapify the given Memory, which must be a valid HllSketch image and may have data.
      Parameters:
      srcMem - the given Memory, which is read-only.
      Returns:
      an HllSketch on the java heap.
    • writableWrap

      public static final HllSketch writableWrap​(org.apache.datasketches.memory.WritableMemory srcWmem)
      Wraps the given WritableMemory, which must be an image of a valid updatable sketch, and may have data. What remains on the Java heap is a thin wrapper object that reads and writes to the given WritableMemory, which, depending on how the user configures the WritableMemory, may actually reside on the Java heap or off-heap.

      The given srcWmem is checked for the required capacity as determined by getMaxUpdatableSerializationBytes(int, TgtHllType).

      Parameters:
      srcWmem - a writable image of a valid source sketch with data.
      Returns:
      an HllSketch where the sketch data is in the given srcWmem.
    • wrap

      public static final HllSketch wrap​(org.apache.datasketches.memory.Memory srcMem)
      Wraps the given read-only Memory that must be an image of a valid sketch, which may be in compact or updatable form, and should have data. Any attempt to update the given source Memory will throw an exception.
      Parameters:
      srcMem - a read-only image of a valid source sketch.
      Returns:
      an HllSketch, where the read-only data of the sketch is in the given srcMem.
    • copy

      public HllSketch copy()
      Return a copy of this sketch onto the Java heap.
      Returns:
      a copy of this sketch onto the Java heap.
    • copyAs

      public HllSketch copyAs​(TgtHllType tgtHllType)
      Return a deep copy of this sketch onto the Java heap with the specified TgtHllType.
      Parameters:
      tgtHllType - the TgtHllType enum
      Returns:
      a deep copy of this sketch with the specified TgtHllType.
    • getCompositeEstimate

      public double getCompositeEstimate()
      This is less accurate than the getEstimate() method and is automatically used when the sketch has gone through union operations where the more accurate HIP estimator cannot be used. This is made public only for error characterization software that exists in separate packages and is not intended for normal use.
      Returns:
      the composite estimate
    • getEstimate

      public double getEstimate()
      Return the cardinality estimate
      Returns:
      the cardinality estimate
    • getLgConfigK

      public int getLgConfigK()
      Gets the lgConfigK.
      Returns:
      the lgConfigK.
    • getCompactSerializationBytes

      public int getCompactSerializationBytes()
      Gets the size in bytes of the current sketch when serialized using toCompactByteArray().
      Returns:
      the size in bytes of the current sketch when serialized using toCompactByteArray().
    • getLowerBound

      public double getLowerBound​(int numStdDev)
      Gets the approximate lower error bound given the specified number of Standard Deviations.
      Parameters:
      numStdDev - This must be an integer between 1 and 3, inclusive. See Number of Standard Deviations.
      Returns:
      the lower bound.
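To make the numStdDev parameter more concrete, the sketch below maps the allowed values 1, 2 and 3 to the two-sided coverage of a normal distribution. This is a generic statistical illustration under the assumption that the estimation error is approximately Gaussian; it is not the library's bound computation, and the class and method names are hypothetical.

```java
final class NumStdDevExample {
    // Two-sided normal coverage for 1, 2 and 3 standard deviations (rounded).
    private static final double[] COVERAGE = {0.6827, 0.9545, 0.9973};

    /** Returns the approximate confidence level for the given number of standard deviations. */
    public static double coverage(int numStdDev) {
        if (numStdDev < 1 || numStdDev > 3) {
            throw new IllegalArgumentException(
                "numStdDev must be an integer between 1 and 3, inclusive: " + numStdDev);
        }
        return COVERAGE[numStdDev - 1];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println("numStdDev=" + n + " covers about " + (coverage(n) * 100) + "% of cases");
        }
    }
}
```

Under that assumption, the interval between getLowerBound(2) and getUpperBound(2) would be expected to contain the true count in roughly 95% of cases.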
    • getMaxUpdatableSerializationBytes

      public static final int getMaxUpdatableSerializationBytes​(int lgConfigK, TgtHllType tgtHllType)
      Returns the maximum size in bytes that this sketch can grow to given lgConfigK. However, for the HLL_4 sketch type, this value can be exceeded in extremely rare cases. If exceeded, it will be larger by only a few percent.
      Parameters:
      lgConfigK - The Log2 of K for the target HLL sketch. This value must be between 4 and 21 inclusively.
      tgtHllType - the desired Hll type
      Returns:
      the maximum size in bytes that this sketch can grow to.
    • getTgtHllType

      public TgtHllType getTgtHllType()
      Gets the TgtHllType
      Returns:
      the TgtHllType enum value
    • getUpdatableSerializationBytes

      public int getUpdatableSerializationBytes()
      Gets the size in bytes of the current sketch when serialized using toUpdatableByteArray().
      Returns:
      the size in bytes of the current sketch when serialized using toUpdatableByteArray().
    • getUpperBound

      public double getUpperBound​(int numStdDev)
      Gets the approximate upper error bound given the specified number of Standard Deviations.
      Parameters:
      numStdDev - This must be an integer between 1 and 3, inclusive. See Number of Standard Deviations.
      Returns:
      the upper bound.
    • isCompact

      public boolean isCompact()
      Returns true if the backing memory of this sketch is in compact form.
      Returns:
      true if the backing memory of this sketch is in compact form.
    • isEmpty

      public boolean isEmpty()
      Returns true if empty
      Returns:
      true if empty
    • isMemory

      public boolean isMemory()
      Returns true if this sketch was created using Memory.
      Returns:
      true if this sketch was created using Memory.
    • isOffHeap

      public boolean isOffHeap()
      Returns true if the backing memory for this sketch is off-heap.
      Returns:
      true if the backing memory for this sketch is off-heap.
    • isSameResource

      public boolean isSameResource​(org.apache.datasketches.memory.Memory mem)
      Returns true if the given Memory refers to the same underlying resource as this sketch. The capacities must be the same. If this is a region, the region offset must also be the same.

      This is only relevant for HLL_4 sketches that have been configured for off-heap using WritableMemory or Memory. For on-heap sketches or unions this will return false.

      It is rare, but possible, that the off-heap memory that has been allocated to an HLL_4 sketch may not be large enough. If this should happen, the sketch makes a request for more memory from the owner of the resource and then moves itself to this new location. This all happens transparently to the user. This method provides a means for the user to inquire of the sketch if it has, in fact, moved itself.

      Parameters:
      mem - the given Memory
      Returns:
      true if the given Memory refers to the same underlying resource as this sketch or union.
    • reset

      public void reset()
      Resets to empty, but does not change the configured values of lgConfigK and tgtHllType.
    • toCompactByteArray

      public byte[] toCompactByteArray()
      Serializes this sketch as a byte array in compact form. The compact form is smaller in size than the updatable form and read-only. It can be used in union operations as follows:
      
           Union union; HllSketch sk, sk2;
           int lgK = 12;
           sk = new HllSketch(lgK, TgtHllType.HLL_4); //can be 4, 6, or 8
           for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
           byte[] arr = sk.toCompactByteArray(); //instance method, called on the sketch
           //...
           union = Union.heapify(arr); //initializes the union using data from the array.
           //OR, if used in an off-heap environment:
           union = Union.heapify(Memory.wrap(arr)); //same as above, except from Memory object.
      
           //To recover an updatable heap sketch:
           sk2 = HllSketch.heapify(arr);
           //OR, if used in an off-heap environment:
           sk2 = HllSketch.heapify(Memory.wrap(arr));
       

      The sketch "wrapping" operation skips actual deserialization and is therefore quite fast. However, any attempt to update the derived HllSketch will result in a Read-only exception.

      Note that in some cases, based on the state of the sketch, the compact form is indistinguishable from the updatable form. In these cases the updatable form is returned and the compact flag bit will not be set.

      Returns:
      this sketch as a compact byte array.
    • toUpdatableByteArray

      public byte[] toUpdatableByteArray()
      Serializes this sketch as a byte array in an updatable form. The updatable form is larger than the compact form. The use of this form is primarily in environments that support updating sketches in off-heap memory. If the sketch is constructed using HLL_8, sketch updating and union updating operations can actually occur in WritableMemory, which can be off-heap:
      
           Union union; HllSketch sk;
           int lgK = 12;
           sk = new HllSketch(lgK, TgtHllType.HLL_8); //must be 8
           for (int i = 0; i < (2 << lgK); i++) { sk.update(i); }
           byte[] arr = sk.toUpdatableByteArray();
           WritableMemory wmem = WritableMemory.wrap(arr);
           //...
           union = Union.writableWrap(wmem); //no deserialization!
       
      Returns:
      this sketch as an updatable byte array.
    • toString

      public String toString​(boolean summary, boolean detail, boolean auxDetail, boolean all)
      Human readable summary with optional detail
      Parameters:
      summary - if true, output the sketch summary
      detail - if true, output the internal data array
      auxDetail - if true, output the internal Aux array, if it exists.
      all - if true, outputs all entries including empty ones
      Returns:
      human readable string with optional detail.
    • toString

      public static String toString​(byte[] byteArr)
      Returns a human readable string of the preamble of a byte array image of an HllSketch.
      Parameters:
      byteArr - the given byte array
      Returns:
      a human readable string of the preamble of a byte array image of an HllSketch.
    • toString

      public static String toString​(org.apache.datasketches.memory.Memory mem)
      Returns a human readable string of the preamble of a Memory image of an HllSketch.
      Parameters:
      mem - the given Memory object
      Returns:
      a human readable string of the preamble of a Memory image of an HllSketch.
    • getSerializationVersion

      public static final int getSerializationVersion()
      Returns the current serialization version.
      Returns:
      the current serialization version.
    • getSerializationVersion

      public static final int getSerializationVersion​(org.apache.datasketches.memory.Memory mem)
      Returns the current serialization version of the given Memory.
      Parameters:
      mem - the given Memory containing a serialized HllSketch image.
      Returns:
      the current serialization version.
    • getRelErr

      public double getRelErr​(boolean upperBound, boolean unioned, int lgConfigK, int numStdDev)
      Gets the current (approximate) Relative Error (RE) asymptotic values given several parameters. This is used primarily for testing.
      Parameters:
      upperBound - return the RE for the Upper Bound, otherwise for the Lower Bound.
      unioned - set true if the sketch is the result of a union operation.
      lgConfigK - the configured value for the sketch.
      numStdDev - the given number of Standard Deviations. This must be an integer between 1 and 3, inclusive. See Number of Standard Deviations.
      Returns:
      the current (approximate) RelativeError
    • isEstimationMode

      public boolean isEstimationMode()
      This HLL family of sketches and operators is always estimating, even for very small values.
      Returns:
      true
    • toString

      public String toString()
      Human readable summary as a string.
      Overrides:
      toString in class Object
      Returns:
      Human readable summary as a string.
    • toString

      public String toString​(boolean summary, boolean detail, boolean auxDetail)
      Human readable summary with optional detail. Does not list empty entries.
      Parameters:
      summary - if true, output the sketch summary
      detail - if true, output the internal data array
      auxDetail - if true, output the internal Aux array, if it exists.
      Returns:
      human readable string with optional detail.
    • update

      public void update​(long datum)
      Present the given long as a potential unique item.
      Parameters:
      datum - The given long datum.
    • update

      public void update​(double datum)
      Present the given double (or float) datum as a potential unique item. The double will be converted to a long using Double.doubleToLongBits(datum), which normalizes all NaN values to a single NaN representation. Plus and minus zero will be normalized to plus zero. The special floating-point values NaN and +/- Infinity are treated as distinct.
      Parameters:
      datum - The given double datum.
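The normalization described for update(double) can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the library's code: it shows that Double.doubleToLongBits collapses all NaN bit patterns to one value, and that plus and minus zero need an explicit step to map to the same bits, since Double.doubleToLongBits by itself keeps them distinct.

```java
final class DoubleBitsExample {
    /** Returns a canonical 64-bit pattern for the given double (illustrative helper). */
    public static long canonicalBits(double datum) {
        // -0.0 == 0.0 evaluates to true, so both zeros become +0.0 here.
        double d = (datum == 0.0) ? 0.0 : datum;
        // doubleToLongBits maps every NaN to the single pattern 0x7ff8000000000000L.
        return Double.doubleToLongBits(d);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(canonicalBits(-0.0)));
        System.out.println(Long.toHexString(canonicalBits(Double.NaN)));
        System.out.println(Long.toHexString(canonicalBits(Double.POSITIVE_INFINITY)));
        System.out.println(Long.toHexString(canonicalBits(Double.NEGATIVE_INFINITY)));
    }
}
```

With this kind of normalization, 0.0 and -0.0 count as one item, while NaN, +Infinity and -Infinity remain three distinct items.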
    • update

      public void update​(String datum)
      Present the given String as a potential unique item. The string is converted to a byte array using UTF8 encoding. If the string is null or empty no update attempt is made and the method returns.

      Note: About 2X faster performance can be obtained by first converting the String to a char[] and updating the sketch with that. This bypasses the complexity of the Java UTF_8 encoding. This, of course, will not produce the same internal hash values as updating directly with a String. So be consistent! Unioning two sketches, one fed with strings and the other fed with char[] will be meaningless.

      Parameters:
      datum - The given String.
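The reason update(String) and update(char[]) produce different hash values for the same text can be seen from the bytes each form presents. The sketch below uses only the standard library, with hypothetical helper names; it compares the UTF-8 encoding of a string with the size of its char[] form (two bytes per Java char) and does not reproduce the library's hashing.

```java
import java.nio.charset.StandardCharsets;

final class StringVsCharArrayExample {
    /** Bytes presented when a String is encoded as UTF-8. */
    public static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Number of bytes occupied by the char[] form: two bytes per UTF-16 code unit. */
    public static int charArrayByteLength(String s) {
        return s.toCharArray().length * Character.BYTES;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "\u00e9"}) {
            System.out.println("chars=" + s.length()
                + " utf8Bytes=" + utf8Bytes(s).length
                + " charArrayBytes=" + charArrayByteLength(s));
        }
    }
}
```

Because the two forms differ in both length and content, a given data stream should be fed to all sketches that will later be unioned using the same update method.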
    • update

      public void update​(ByteBuffer data)
      Present the given byte buffer as a potential unique item. Bytes are read from the current position of the buffer until its limit. If the byte buffer is null or has no bytes remaining, no update attempt is made and the method returns.

      This method will not modify the position, mark, limit, or byte order of the buffer.

      Little-endian order is preferred, but not required. This method may perform better if the provided byte buffer is in little-endian order.

      Parameters:
      data - The given byte buffer.
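One way to read bytes from the current position to the limit without disturbing the buffer's state, as described for update(ByteBuffer), is to use absolute gets. The helper below is a hypothetical illustration of that contract using only java.nio; it is an assumption about how such a read could be done, not the library's implementation.

```java
import java.nio.ByteBuffer;

final class ByteBufferReadExample {
    /**
     * Copies the bytes between position and limit using absolute gets,
     * so the buffer's position, mark, limit and byte order are left unchanged.
     */
    public static byte[] remainingBytes(ByteBuffer buf) {
        if (buf == null || !buf.hasRemaining()) {
            return new byte[0]; // mirrors the "no update attempt" case
        }
        int pos = buf.position();
        byte[] out = new byte[buf.remaining()];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.get(pos + i); // absolute get: does not move the position
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
        buf.position(1);
        buf.limit(4);
        byte[] seen = remainingBytes(buf);
        System.out.println("bytes read=" + seen.length
            + " position=" + buf.position() + " limit=" + buf.limit());
    }
}
```

Only the bytes between position and limit form the item, so two buffers with the same backing array but different positions or limits represent different items.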
    • update

      public void update​(byte[] data)
      Present the given byte array as a potential unique item. If the byte array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given byte array.
    • update

      public void update​(char[] data)
      Present the given char array as a potential unique item. If the char array is null or empty no update attempt is made and the method returns.

      Note: this will not produce the same output hash values as the update(String) method but will be a little faster as it avoids the complexity of the UTF8 encoding.

      Parameters:
      data - The given char array.
    • update

      public void update​(int[] data)
      Present the given integer array as a potential unique item. If the integer array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given int array.
    • update

      public void update​(long[] data)
      Present the given long array as a potential unique item. If the long array is null or empty no update attempt is made and the method returns.
      Parameters:
      data - The given long array.