Class FdtSketch


public class FdtSketch
extends ArrayOfStringsSketch
A Frequent Distinct Tuples sketch.

Suppose our data is a stream of pairs {IP address, User ID} and we want to identify the IP addresses that have the most distinct User IDs. Or conversely, we would like to identify the User IDs that have the most distinct IP addresses. This is a common challenge in the analysis of big data and the FDT sketch helps solve this problem using probabilistic techniques.

More generally, given a multiset of tuples with N dimensions {d1, d2, d3, ..., dN} and a primary subset of M < N of those dimensions, our task is to identify the combinations of the M primary dimensions that have the largest number of distinct combinations of the remaining N-M non-primary dimensions.
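
To make the problem statement concrete, the following sketch computes the answer exactly for the {IP address, User ID} case using ordinary hash sets. It is only a reference for what the FDT sketch estimates probabilistically, not how the sketch works internally. The class and method names (ExactDistinctCounts, distinctUsersPerIp) are hypothetical and not part of the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Exact (non-sketch) reference computation. Hypothetical helper, not library code.
class ExactDistinctCounts {

  // Maps each IP address (tuple[0]) to its number of distinct user IDs (tuple[1]).
  static Map<String, Integer> distinctUsersPerIp(List<String[]> stream) {
    Map<String, Set<String>> usersByIp = new HashMap<>();
    for (String[] tuple : stream) {
      usersByIp.computeIfAbsent(tuple[0], ip -> new HashSet<>()).add(tuple[1]);
    }
    Map<String, Integer> counts = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : usersByIp.entrySet()) {
      counts.put(e.getKey(), e.getValue().size());
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String[]> stream = new ArrayList<>();
    stream.add(new String[] {"10.0.0.1", "alice"});
    stream.add(new String[] {"10.0.0.1", "bob"});
    stream.add(new String[] {"10.0.0.1", "alice"}); // repeated pair, counted once
    stream.add(new String[] {"10.0.0.2", "carol"});
    System.out.println(distinctUsersPerIp(stream));
  }
}
```

The exact approach keeps every distinct pair in memory, which is what becomes impractical on large streams and motivates a bounded-size sketch.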

Please refer to the web page https://datasketches.apache.org/docs/Frequency/FrequentDistinctTuplesSketch.html for a more complete discussion about this sketch.

Author:
Lee Rhodes
  • Field Summary

    Fields inherited from class org.apache.datasketches.tuple.Sketch

    PREAMBLE_LONGS
  • Constructor Summary

    Constructors 
    Constructor Description
    FdtSketch​(double threshold, double rse)
    Create a new instance of Frequent Distinct Tuples sketch with a size determined by the given threshold and rse.
    FdtSketch​(int lgK)
    Create a new instance of Frequent Distinct Tuples sketch with the given log-base2 of the required nominal entries.
  • Method Summary

    Modifier and Type Method Description
    CompactSketch<S> compact()
    Converts the current state of the sketch into a compact sketch
    int getCountLessThanThetaLong​(long thetaLong)
    Gets the number of hash values less than the given theta expressed as a long.
    int getCurrentCapacity()
    Get current capacity
    int getLgK()
    Get log_base2 of Nominal Entries
    int getNominalEntries()
    Get configured nominal number of entries
    PostProcessor getPostProcessor()
    Returns the PostProcessor that enables multiple queries against the sketch results.
    PostProcessor getPostProcessor​(Group group, char sep)
    Returns the PostProcessor that enables multiple queries against the sketch results.
    ResizeFactor getResizeFactor()
    Get configured resize factor
    List<Group> getResult​(int[] priKeyIndices, int limit, int numStdDev, char sep)
    Returns an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
    int getRetainedEntries()  
    float getSamplingProbability()
    Get configured sampling probability
    protected void insertSummary​(int index, S summary)  
    SketchIterator<S> iterator()
    Returns a SketchIterator
    void reset()
    Resets this sketch to an empty state.
    byte[] toByteArray()
    This is to serialize an instance to a byte array.
    void trim()
    Rebuilds the sketch, reducing the actual number of entries to the nominal number of entries if needed.
    void update​(String[] tuple)
    Update the sketch with the given string array tuple.

    Methods inherited from class org.apache.datasketches.tuple.strings.ArrayOfStringsSketch

    update

    Methods inherited from class org.apache.datasketches.tuple.UpdatableSketch

    update, update, update, update, update, update

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • FdtSketch

      public FdtSketch​(int lgK)
      Create a new instance of Frequent Distinct Tuples sketch with the given log-base2 of the required nominal entries.
      Parameters:
      lgK - Log-base2 of required nominal entries.
    • FdtSketch

      public FdtSketch​(double threshold, double rse)
      Create a new instance of Frequent Distinct Tuples sketch with a size determined by the given threshold and rse.
      Parameters:
      threshold - the fraction, between zero and 1.0, of the total distinct stream length that defines a "Frequent" (or heavy) item.
      rse - the maximum Relative Standard Error for the estimate of the distinct population of a reported tuple (selected with a primary key) at the threshold.
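
This page does not state how threshold and rse are turned into a sketch size. As a back-of-envelope sketch, assume a group sitting exactly at the threshold occupies about threshold × k of the k nominal entries, so its relative standard error is roughly 1 / sqrt(threshold × k). Solving for k gives k ≥ 1 / (threshold × rse²), rounded up to a power of two. The helper below (FdtSizingSketch.estimateLgK) is hypothetical; the library's actual computation and any minimum or maximum limits it applies may differ.

```java
// Back-of-envelope sizing under the stated assumption. Hypothetical helper, not library code.
class FdtSizingSketch {

  // Smallest lgK such that 2^lgK >= 1 / (threshold * rse^2).
  static int estimateLgK(double threshold, double rse) {
    if (!(threshold > 0.0 && threshold <= 1.0)) {
      throw new IllegalArgumentException("threshold must be in (0, 1]");
    }
    if (!(rse > 0.0 && rse < 1.0)) {
      throw new IllegalArgumentException("rse must be in (0, 1)");
    }
    double k = Math.ceil(1.0 / (threshold * rse * rse));
    int lgK = 0;
    double capacity = 1.0; // tracks 2^lgK exactly as a double
    while (capacity < k) {
      capacity *= 2.0;
      lgK++;
    }
    return lgK;
  }

  public static void main(String[] args) {
    // A 1% threshold at 5% RSE needs about 40,000 entries, so lgK = 16.
    System.out.println(estimateLgK(0.01, 0.05));
  }
}
```

Under this assumption, halving rse quadruples the required entries, while halving threshold only doubles them.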
  • Method Details

    • update

      public void update​(String[] tuple)
      Update the sketch with the given string array tuple.
      Parameters:
      tuple - the given string array tuple.
    • getResult

      public List<Group> getResult​(int[] priKeyIndices, int limit, int numStdDev, char sep)
      Returns an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
      Parameters:
      priKeyIndices - these indices define the dimensions used for the Primary Keys.
      limit - the maximum number of groups to return. If this value is ≤ 0, all groups will be returned.
      numStdDev - the number of standard deviations for the upper and lower error bounds. This value is an integer and must be 1, 2, or 3. See Number of Standard Deviations.
      sep - the separator character
      Returns:
      an ordered List of Groups of the most frequent distinct population of subset tuples represented by the count of entries of each group.
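
The roles of priKeyIndices, limit, and sep can be illustrated with an exact stand-in that groups distinct tuples by primary key. This is a sketch under assumptions: it presumes the primary key is formed by joining the selected dimensions with sep, and it returns plain counts, whereas the real method returns Group objects carrying estimates and error bounds. PrimaryKeyGrouping and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Exact illustration of the getResult parameters. Hypothetical helper, not library code.
class PrimaryKeyGrouping {

  // Joins the dimensions selected by priKeyIndices with the separator character.
  static String primaryKey(String[] tuple, int[] priKeyIndices, char sep) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < priKeyIndices.length; i++) {
      if (i > 0) { sb.append(sep); }
      sb.append(tuple[priKeyIndices[i]]);
    }
    return sb.toString();
  }

  // Distinct-tuple counts per primary key, largest first; limit <= 0 returns all groups.
  static List<Map.Entry<String, Integer>> topGroups(
      List<String[]> tuples, int[] priKeyIndices, int limit, char sep) {
    Set<List<String>> seen = new HashSet<>();
    Map<String, Integer> counts = new HashMap<>();
    for (String[] t : tuples) {
      if (seen.add(Arrays.asList(t))) { // count each distinct tuple once
        counts.merge(primaryKey(t, priKeyIndices, sep), 1, Integer::sum);
      }
    }
    List<Map.Entry<String, Integer>> groups = new ArrayList<>(counts.entrySet());
    groups.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    if (limit > 0 && limit < groups.size()) {
      return new ArrayList<>(groups.subList(0, limit));
    }
    return groups;
  }

  public static void main(String[] args) {
    List<String[]> tuples = new ArrayList<>();
    tuples.add(new String[] {"a", "x", "1"});
    tuples.add(new String[] {"a", "x", "2"});
    tuples.add(new String[] {"a", "y", "1"});
    tuples.add(new String[] {"b", "x", "1"});
    // Dimensions 0 and 1 form the primary key; keep the top 2 groups.
    System.out.println(topGroups(tuples, new int[] {0, 1}, 2, '|'));
  }
}
```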
    • getPostProcessor

      public PostProcessor getPostProcessor()
      Returns the PostProcessor that enables multiple queries against the sketch results. This assumes the default Group and the default separator character '|'.
      Returns:
      the PostProcessor
    • getPostProcessor

      public PostProcessor getPostProcessor​(Group group, char sep)
      Returns the PostProcessor that enables multiple queries against the sketch results.
      Parameters:
      group - the Group class to use during post processing.
      sep - the separator character.
      Returns:
      the PostProcessor
    • getRetainedEntries

      public int getRetainedEntries()
      Specified by:
      getRetainedEntries in class Sketch<S extends Summary>
      Returns:
      number of retained entries
    • getCountLessThanThetaLong

      public int getCountLessThanThetaLong​(long thetaLong)
      Description copied from class: Sketch
      Gets the number of hash values less than the given theta expressed as a long.
      Specified by:
      getCountLessThanThetaLong in class Sketch<S extends Summary>
      Parameters:
      thetaLong - the given theta as a long between zero and Long.MAX_VALUE.
      Returns:
      the number of hash values less than the given thetaLong.
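
As a minimal illustration of "theta expressed as a long", the sketch below assumes that retained hash values are positive longs, that zero marks an empty slot, and that a fractional theta in (0, 1] maps to theta × Long.MAX_VALUE. These are assumptions about the representation, and the names ThetaCount, countLessThanThetaLong, and thetaLongFor are hypothetical, not the library's implementation.

```java
// Illustration of counting hashes below a theta threshold. Hypothetical helper, not library code.
class ThetaCount {

  // Counts non-empty (positive) hash values strictly less than thetaLong.
  static int countLessThanThetaLong(long[] hashes, long thetaLong) {
    int count = 0;
    for (long h : hashes) {
      if (h > 0 && h < thetaLong) { count++; }
    }
    return count;
  }

  // Converts a fractional theta in (0, 1] to its long form (assumed mapping).
  static long thetaLongFor(double theta) {
    return (long) (theta * Long.MAX_VALUE);
  }

  public static void main(String[] args) {
    long[] hashes = {10L, 20L, 30L, 0L}; // 0 is treated as an empty slot
    System.out.println(countLessThanThetaLong(hashes, 25L));
  }
}
```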
    • getNominalEntries

      public int getNominalEntries()
      Get configured nominal number of entries
      Returns:
      nominal number of entries
    • getLgK

      public int getLgK()
      Get log_base2 of Nominal Entries
      Returns:
      log_base2 of Nominal Entries
    • getSamplingProbability

      public float getSamplingProbability()
      Get configured sampling probability
      Returns:
      sampling probability
    • getCurrentCapacity

      public int getCurrentCapacity()
      Get current capacity
      Returns:
      current capacity
    • getResizeFactor

      public ResizeFactor getResizeFactor()
      Get configured resize factor
      Returns:
      resize factor
    • trim

      public void trim()
      Rebuilds the sketch, reducing the actual number of entries to the nominal number of entries if needed.
    • reset

      public void reset()
      Resets this sketch to an empty state.
    • compact

      public CompactSketch<S> compact()
      Converts the current state of the sketch into a compact sketch
      Specified by:
      compact in class Sketch<S extends Summary>
      Returns:
      compact sketch
    • toByteArray

      public byte[] toByteArray()
      Description copied from class: Sketch
      This is to serialize an instance to a byte array.
      Specified by:
      toByteArray in class Sketch<S extends Summary>
      Returns:
      serialized representation of the sketch
    • insertSummary

      protected void insertSummary​(int index, S summary)
    • iterator

      public SketchIterator<S> iterator()
      Description copied from class: Sketch
      Returns a SketchIterator
      Specified by:
      iterator in class Sketch<S extends Summary>
      Returns:
      a SketchIterator