Log Sampling

Overall Description

The Ad Forecaster engine works on a sample of the log files. This page describes how to create that sample from the source log files so that it is a statistically accurate, user-based sample.

To create your sample, download the jar file containing the class Sharding. This class takes two parameters, alpha and beta, whose values will be provided by ShiftForward. The default values are 0.03 for alpha and 1024 for beta.

The class provides a single method that tests whether each user ID belongs to the sample. It has the following signature: public boolean filter(String userId). If the user is selected for the sample, the method returns true and all of that user's events should be present in the sampled log files.

Since the sampling is based on the user ID, users without an ID or with a dummy ID must be handled differently. See the Log Files page for how to deal with such cases.

Note that this class is not thread-safe, which allows optimizations for better performance. Do not share a single instance across threads without external synchronization.
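
One way to respect this constraint in a multi-threaded pipeline is to give each worker thread its own filter instance. The following is a minimal sketch of that pattern using ThreadLocal. UserFilter and StandInFilter are hypothetical stand-ins so the snippet compiles without the jar, and the hash-based rule inside StandInFilter is purely illustrative; it is not the logic used by Sharding. With the real class, the supplier would presumably be () -> new Sharding(alpha, beta).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

class PerThreadFilterSketch {

  // Hypothetical stand-in mirroring the filter(String userId) method of Sharding.
  interface UserFilter {
    boolean filter(String userId);
  }

  // Illustrative stand-in only: selects a user based on the hash of the ID.
  // This is NOT the rule used by Sharding.
  static class StandInFilter implements UserFilter {
    @Override
    public boolean filter(String userId) {
      return Math.floorMod(userId.hashCode(), 4) == 0;
    }
  }

  // Counts the selected user IDs, giving every worker thread its own filter instance.
  static int countSelected(List<String> userIds, Supplier<UserFilter> factory, int threads) {
    // Each thread lazily creates its own instance on first use.
    ThreadLocal<UserFilter> perThread = ThreadLocal.withInitial(factory);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (String id : userIds) {
        results.add(pool.submit(() -> perThread.get().filter(id)));
      }
      int selected = 0;
      for (Future<Boolean> result : results) {
        if (result.get()) {
          selected++;
        }
      }
      return selected;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("Filtering failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> ids = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      ids.add("user-" + i);
    }
    // With the real class, replace StandInFilter::new with a supplier that builds a Sharding.
    System.out.println("Selected users: " + countSelected(ids, StandInFilter::new, 4));
  }
}
```

Creating one instance per thread avoids locking on every call, which keeps the performance benefit the class was designed for. Since every instance is built with the same parameters, the result for a given user ID should not depend on which thread handles it.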

Sample size calculator

Use the form below to estimate the required user sample size for your acceptable sampling error.
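
The formula behind the form is not given on this page. As a rough cross-check only, the sketch below uses a common textbook estimate (Cochran's formula for a proportion, with a finite population correction). The class name, method name and parameters are hypothetical, and the result may differ from what the form reports.

```java
class SampleSizeSketch {

  // Textbook estimate of the number of users needed for a given margin of error.
  // ASSUMPTION: this is a generic formula, not necessarily the one used by the form.
  //   z          z-score for the confidence level (1.96 for 95%)
  //   p          expected proportion (0.5 is the most conservative choice)
  //   error      acceptable margin of error, e.g. 0.01 for 1%
  //   population total number of distinct users in the source logs
  static long requiredUsers(double z, double p, double error, long population) {
    // Sample size for an infinite population.
    double n0 = (z * z * p * (1 - p)) / (error * error);
    // Finite population correction.
    double n = n0 / (1 + (n0 - 1) / population);
    return (long) Math.ceil(n);
  }

  public static void main(String[] args) {
    long users = requiredUsers(1.96, 0.5, 0.01, 1_000_000L);
    System.out.println("Estimated users needed: " + users);
  }
}
```

Halving the acceptable error roughly quadruples the required sample while the population is much larger than the sample, so it is worth settling on the error target before generating the sampled logs.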

Example Usage

import com.velocidi.cocker.Sharding;

public class LogFilter {

  public static class LogRow {
    public String userID;

    public long timestamp;

    public double attributeX;

    // Creates a random representation of an original log file
    public static LogRow[] generateRandom(int size) {

      LogRow[] logs = new LogRow[size];

      for (int i = 0; i < size; i++) {
        logs[i] = new LogRow();
        logs[i].userID = String.valueOf((int) (Math.random() * 100000d));
        logs[i].timestamp = (long) (Math.random() * 10000L);
        logs[i].attributeX = Math.random();
      }

      return logs;
    }
  }

  public static void main(String[] args) {

    LogRow[] logs = LogRow.generateRandom(1000);

    double alpha = 0.5;
    int beta = 4;

    Sharding sharding = new Sharding(alpha, beta);

    for (LogRow log : logs) {

      // When filter returns true, this user belongs to the sample.
      if (sharding.filter(log.userID)) {

        // All log entries from this user should be provided in the sampled logs.
        System.out.println("This user will be considered: " + log.userID);

      }
    }
  }
}