Log Files

Selecting Data

A log row should contain all the information relevant to the event it describes. It should also include any information needed for targeting, as well as any field by which forecasts will be grouped.

There are two mandatory fields: a unique user or browser ID (or a hashed version of it) and a date in ISO8601 format with timezone information (e.g. 2016-05-02T21:14:07+00:00). It is up to the client to identify which remaining information should be included in their logs, potentially opting to send all available data. Some examples of relevant data are presented below.

Inventory source information
  • Inventory source (e.g. RTB provider ID)
  • Publisher ID
  • Site ID
  • Placement ID
  • Placement format / Format ID
  • Page URL
  • Ad request keywords / key-values
Browser information
  • Browser Agent (full string or normalised)
  • Platform (OS, Mobile, etc.)
  • Relevant cookies (targetable)
  • Geo location
  • Behavioural / profiling segments
Response / delivery information
  • Campaign, Ad or Bid Request ID
  • Clearing price for the ad request (for RTB)
  • Fixed CPM paid by the campaign (for Direct campaigns)
  • Highest bid made for the ad request (for RTB)
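
To make this concrete, the hypothetical log row below (shown as JSON; all field names are illustrative rather than required) combines the two mandatory fields with some of the fields listed above:

{
  "userId": "3f2a9c1e8b7d4a56",
  "eventDateTime": "2016-05-02T21:14:07+00:00",
  "inventorySource": "rtb-provider-42",
  "publisherId": "pub-123",
  "siteId": "site-987",
  "placementId": "placement-55",
  "placementFormat": "300x250",
  "pageUrl": "http://www.example.com/sports/article",
  "platform": "mobile",
  "geoLocation": "PT",
  "segments": ["sports", "auto"],
  "clearingPrice": 0.42
}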

Log Format

The recommended log format is AVRO. AVRO is an open-source data serialization system from the Apache Software Foundation. Besides its simplicity and maturity, ShiftForward recommends adopting AVRO because it automatically validates that every log row complies with the adopted schema. This prevents invalid data from being serialized, easing the integration process.

While modeling your data using AVRO, you can use all primitive types, as well as Arrays, Maps and Unions. Unions can only be used to specify nullable values; for a nullable String, for example, the type would be ["null", "string"]. Read more about the AVRO field types in the AVRO documentation.

The example below shows a valid AVRO schema, leveraging the multiple data types available.

{
  "fields": [
    {
      "name": "userId",
      "type": "string"
    },
    {
      "name": "bidPrice",
      "type": [
        "null",
        "double"
      ]
    },
    {
      "name": "bidDateTime",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "stringArray",
      "type": [
        "null",
        {
          "type": "array",
          "items": "string"
        }
      ]
    },
    {
      "name": "stringMap",
      "type": [
        "null",
        {
          "type": "map",
          "values": {
            "type": "string"
          }
        }
      ]
    }
  ],
  "name": "AdForecasterData version 1",
  "namespace": "com.client.v1",
  "type": "record"
}
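
As a sketch of how such a schema can be used in practice, the snippet below serializes a couple of rows and rejects any row that does not comply with the schema. It assumes the fastavro Python library; any other AVRO implementation works equivalently, and the file names and values are illustrative.

# Sketch only: serializing log rows against the schema above.
# Assumes the fastavro Python library; file names and values are illustrative.
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "AdForecasterData",
    "namespace": "com.client.v1",
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "bidPrice", "type": ["null", "double"]},
        {"name": "bidDateTime", "type": ["null", "long"]},
        {"name": "stringArray", "type": ["null", {"type": "array", "items": "string"}]},
        {"name": "stringMap", "type": ["null", {"type": "map", "values": "string"}]},
    ],
})

rows = [
    {"userId": "3f2a9c1e8b7d4a56", "bidPrice": 0.42, "bidDateTime": 1462223647000,
     "stringArray": ["sports", "auto"], "stringMap": {"siteId": "site-987"}},
    # Nullable fields are set to None when no value is available.
    {"userId": "a0b1c2d3e4f5a6b7", "bidPrice": None, "bidDateTime": None,
     "stringArray": None, "stringMap": None},
]

# Writing with validation enabled rejects rows that do not comply with the
# schema, which is the validation property mentioned above.
with open("logs.avro", "wb") as out:
    writer(out, schema, rows, validator=True)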

Handling users without a User ID

When users have ad-blocking software installed or don't persist cookies, it might not be possible to create a stable user ID for them. A common approach in these scenarios is to set the user ID to 0 or some other placeholder value, which collapses many distinct users into a single one.

Instead, impressions from these users should be made available by replacing the hard-coded placeholder with a random UUID. The ID generation should be performed before feeding the log files to the sampler function, to ensure that the proper number of these users is picked up.

If there is a need to keep this information during forecasting, a new field should be used to identify that the user lacked a proper ID. An example would be to create a noCookieAllowed or hasAdBlocker field and set it to True on every such impression. This makes it possible to later identify which parts of the inventory come from these users by aggregating results by the newly created field.
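
A minimal sketch of this approach is shown below; the function and field names are illustrative, not a required API:

import uuid

def assign_fallback_user_id(row):
    # Illustrative only: rows whose user ID is missing or set to the "0"
    # placeholder get a fresh random UUID, so each impression counts as a
    # distinct user, plus a flag so these users remain identifiable when
    # aggregating forecast results.
    if not row.get("userId") or row["userId"] == "0":
        row["userId"] = str(uuid.uuid4())
        row["hasAdBlocker"] = True
    return row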

Uploading Log Files

Log files should be shared using an Amazon S3 bucket. Amazon provides detailed documentation on how to do so. The client can choose to create the bucket and give ShiftForward access to it, or have ShiftForward set it up and provide the credentials required for uploading the log files.

In the bucket, log files should be organized into folders using the structure /logs/year/month/day/hour/, e.g. /logs/2016/05/02/21/. Always use four digits for the year and two digits for the month, day and hour. The hour should be in 24h format. The date and time in the folder name must correspond to the timestamps of the log rows, converted to the UTC timezone.
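
For instance, the folder for a given log row can be derived from its timestamp as in the Python sketch below (the example timestamp is illustrative):

from datetime import datetime, timezone

# A row timestamp as it would appear in the logs (ISO8601 with timezone).
row_timestamp = "2016-05-02T23:14:07+02:00"

# Convert to UTC and build the /logs/year/month/day/hour/ prefix.
utc_time = datetime.fromisoformat(row_timestamp).astimezone(timezone.utc)
folder = utc_time.strftime("/logs/%Y/%m/%d/%H/")
print(folder)  # /logs/2016/05/02/21/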

Logs can optionally be compressed using gzip or zip.
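
A sketch of the compression and upload steps, assuming the boto3 Python client and an illustrative bucket name:

import gzip
import shutil
import boto3

# Optionally gzip the hourly log file before uploading.
with open("logs.avro", "rb") as src, gzip.open("logs.avro.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload it under the folder structure described above.
s3 = boto3.client("s3")
s3.upload_file("logs.avro.gz", "client-log-bucket", "logs/2016/05/02/21/logs.avro.gz")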