Selecting Data
A log row should contain all the relevant information about the event it describes. It should also include any information needed for targeting, as well as any field by which forecasts should be grouped.
There are two mandatory fields: a unique user or browser ID (or a hashed version of it) and a date in ISO 8601 format with timezone information. It is up to the client to decide which other information to include in their logs, potentially opting to send all available data. Some examples of relevant data are listed below.
- Inventory source (e.g. RTB provider ID)
- Publisher ID
- Site ID
- Placement ID
- Placement format / Format ID
- Page URL
- Ad request keywords / key-values
- Browser Agent (full string or normalised)
- Platform (OS, Mobile, etc)
- Relevant cookies (targetable)
- Geo location
- Behavioural / profiling segments
- Campaign, Ad or Bid Request ID
- Clearing price for the ad request (for RTB)
- Fixed CPM paid by the campaign (for Direct campaigns)
- Highest bid made for the ad request (for RTB)
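As a rough illustration of the two mandatory fields, the sketch below builds a single log row in Python. The field names (`userId`, `timestamp`, `siteId`, `placementId`), the helper `build_log_row` and the choice of SHA-256 for hashing are assumptions made for this example, not names or algorithms required by ShiftForward.

```python
import hashlib
from datetime import datetime, timezone


def build_log_row(raw_user_id, event_time, **extra_fields):
    """Build one log row with the two mandatory fields plus optional data."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must carry timezone information")
    row = {
        # Hashed user or browser ID (SHA-256 is an assumption of this sketch).
        "userId": hashlib.sha256(raw_user_id.encode("utf-8")).hexdigest(),
        # ISO 8601 date with an explicit UTC offset.
        "timestamp": event_time.isoformat(),
    }
    # Any targeting or grouping fields are added as-is.
    row.update(extra_fields)
    return row


row = build_log_row(
    "browser-1234",
    datetime(2016, 5, 2, 21, 15, 0, tzinfo=timezone.utc),
    siteId="site-42",
    placementId="top-banner",
)
print(row["timestamp"])  # 2016-05-02T21:15:00+00:00
```

Rejecting timestamps without timezone information early keeps ambiguous dates out of the logs.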
Log Format
The recommended log format is AVRO. AVRO is an open-source data serialization system from the Apache Software Foundation. Besides its simplicity and maturity, AVRO automatically validates that every log row complies with the adopted schema, which is why ShiftForward recommends it. This prevents invalid data from being serialized and eases the integration process.
While modeling your data using AVRO, you can use all primitive types, Arrays, Maps and Unions. Unions can only be used for specifying nullable values; for a nullable String, that would be ["null", "string"]. Read more about the AVRO Field types.
The example below shows a valid AVRO schema that uses several of the available data types.
{
  "fields": [
    {
      "name": "userId",
      "type": "string"
    },
    {
      "name": "bidPrice",
      "type": [
        "null",
        "double"
      ]
    },
    {
      "name": "bidDateTime",
      "type": [
        "null",
        "long"
      ]
    },
    {
      "name": "stringArray",
      "type": [
        "null",
        {
          "type": "array",
          "items": "string"
        }
      ]
    },
    {
      "name": "stringMap",
      "type": [
        "null",
        {
          "type": "map",
          "values": {
            "type": "string"
          }
        }
      ]
    }
  ],
  "name": "AdForecasterDataV1",
  "namespace": "com.client.v1",
  "type": "record"
}
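To make the validation behaviour concrete, the following Python sketch checks rows against a trimmed-down version of the schema above. It is a simplified, hand-written checker that covers only primitives, arrays, maps, nullable unions and records. It is not the AVRO library, and in practice an AVRO implementation performs this kind of check during serialization. The names `conforms`, `PRIMITIVE_CHECKS` and the record name `ExampleRow` are hypothetical.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return _is_int(value) or isinstance(value, float)


PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": _is_int,
    "long": _is_int,
    "float": _is_number,
    "double": _is_number,
    "string": lambda v: isinstance(v, str),
    "bytes": lambda v: isinstance(v, bytes),
}


def conforms(value, schema):
    """Return True if value matches the (simplified) schema."""
    if isinstance(schema, str):  # primitive type name
        return PRIMITIVE_CHECKS[schema](value)
    if isinstance(schema, list):  # union: any branch may match
        return any(conforms(value, branch) for branch in schema)
    kind = schema["type"]
    if kind == "array":
        return isinstance(value, list) and all(
            conforms(item, schema["items"]) for item in value
        )
    if kind == "map":
        return isinstance(value, dict) and all(
            isinstance(key, str) and conforms(item, schema["values"])
            for key, item in value.items()
        )
    if kind == "record":
        # Simplification: a missing field is treated like a null value.
        return isinstance(value, dict) and all(
            conforms(value.get(field["name"]), field["type"])
            for field in schema["fields"]
        )
    return conforms(value, kind)  # e.g. {"type": "string"}


SCHEMA = {
    "type": "record",
    "name": "ExampleRow",  # hypothetical record name
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "bidPrice", "type": ["null", "double"]},
        {"name": "stringArray", "type": ["null", {"type": "array", "items": "string"}]},
        {"name": "stringMap", "type": ["null", {"type": "map", "values": {"type": "string"}}]},
    ],
}

good_row = {"userId": "u1", "bidPrice": 1.25, "stringArray": ["a", "b"]}
bad_row = {"userId": "u1", "bidPrice": "1.25"}  # price sent as a string

print(conforms(good_row, SCHEMA))  # True
print(conforms(bad_row, SCHEMA))   # False
```

The nullable unions let optional fields be absent or null, while the mandatory `userId` must always be a string.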
Handling users without a User ID
When users have ad-blocking software installed or don’t persist cookies, it might not be possible to create a stable user ID for them. A common approach in these scenarios is to set the user ID to 0 or some other placeholder value. This is discouraged, since all such impressions would then appear to come from a single user.
Instead, impressions from these users should be made available by replacing the hard-coded user ID with a random UUID. The ID generation should be performed before feeding the log files to the sampler function, to ensure that the proper number of these users is picked up.
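A minimal sketch of this replacement step, assuming rows are Python dictionaries with a `userId` field and that `""` and `"0"` are the placeholder values in use. The helper `replace_placeholder_ids` and the `PLACEHOLDER_IDS` set are hypothetical and should be adapted to your own logs.

```python
import uuid

# Placeholder values that mean "no stable ID" (an assumption of this sketch).
PLACEHOLDER_IDS = {"", "0"}


def replace_placeholder_ids(rows, id_field="userId"):
    """Return copies of rows where placeholder IDs become random UUIDs."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is not mutated
        user_id = row.get(id_field)
        if user_id is None or str(user_id) in PLACEHOLDER_IDS:
            row[id_field] = str(uuid.uuid4())
        fixed.append(row)
    return fixed


rows = [{"userId": "0"}, {"userId": "0"}, {"userId": "abc"}]
fixed_rows = replace_placeholder_ids(rows)
```

Each placeholder row receives its own UUID, so the two `"0"` rows above no longer look like impressions from the same user.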
If this information needs to be kept during forecasting, a new field should be used to identify that the users lacked a proper ID. For example, create a noCookieAllowed or hasAdBlocker field and set it to True on every impression from these users. This makes it possible to later identify which parts of the inventory come from these users by aggregating results by the newly created fields.
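The flagging and aggregation described above could look roughly like the following Python sketch. It assumes rows are dictionaries, that `""` and `"0"` are the placeholder IDs, and that the flag is set to False for all other users; the helper `flag_missing_ids` is hypothetical. The flag has to be computed before the placeholder ID is overwritten with a random UUID, since afterwards the two kinds of users can no longer be told apart.

```python
from collections import Counter

# Placeholder values that mean "no stable ID" (an assumption of this sketch).
PLACEHOLDER_IDS = {"", "0"}


def flag_missing_ids(rows, id_field="userId", flag_field="noCookieAllowed"):
    """Return copies of rows with a boolean flag marking placeholder IDs."""
    flagged = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is not mutated
        user_id = row.get(id_field)
        row[flag_field] = user_id is None or str(user_id) in PLACEHOLDER_IDS
        flagged.append(row)
    return flagged


rows = [{"userId": "0"}, {"userId": "abc"}, {"userId": "0"}]
flagged_rows = flag_missing_ids(rows)

# Aggregate impressions by the new field.
impressions_by_flag = Counter(row["noCookieAllowed"] for row in flagged_rows)
print(impressions_by_flag[True], impressions_by_flag[False])  # 2 1
```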
Uploading Log Files
Log files are shared through an Amazon S3 bucket; Amazon provides detailed documentation on how to use one. The client can either create the bucket and give ShiftForward access to it, or have ShiftForward set it up and provide the credentials required for uploading the log files.
In the bucket, log files should be organized into folders using the structure /logs/year/month/day/hour/, for example /logs/2016/05/02/21. Always use four digits for the year and two digits for the month, day and hour. The hour should be in 24-hour format. The date and time in the folder name must correspond to the timestamp in the log rows, converted to the UTC timezone.
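The folder naming rule can be sketched in Python as follows. The helper name `log_folder` is hypothetical, and the sketch assumes timestamps carry an explicit numeric offset such as `+01:00`, which `datetime.fromisoformat` accepts. A trailing `Z` is only parsed by `fromisoformat` from Python 3.11 onwards.

```python
from datetime import datetime, timezone


def log_folder(timestamp):
    """Map an ISO 8601 timestamp to its /logs/year/month/day/hour/ folder."""
    dt = datetime.fromisoformat(timestamp)
    if dt.tzinfo is None:
        raise ValueError("timestamp must include timezone information")
    utc = dt.astimezone(timezone.utc)  # folder names are always in UTC
    # %Y is the four-digit year; %m, %d and %H are zero-padded (24-hour clock).
    return f"/logs/{utc:%Y/%m/%d/%H}/"


print(log_folder("2016-05-02T22:30:00+01:00"))  # /logs/2016/05/02/21/
```

Note that the UTC conversion can change the day as well as the hour: a row stamped shortly after midnight in a timezone ahead of UTC belongs in the previous day's folder.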
Logs can optionally be compressed using gzip or zip.
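As a small sketch of the gzip option, the Python snippet below compresses a log file using only the standard library. The helper name `gzip_file` and the file name `rows.avro` are hypothetical; the temporary directory is only there to keep the example self-contained.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(source_path):
    """Compress source_path to source_path + '.gz' and return the new path."""
    target_path = source_path + ".gz"
    with open(source_path, "rb") as src, gzip.open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large logs are not loaded at once
    return target_path


# Usage example with a throwaway file standing in for a real log file.
work_dir = tempfile.mkdtemp()
log_path = os.path.join(work_dir, "rows.avro")
payload = b"example log contents\n" * 100
with open(log_path, "wb") as handle:
    handle.write(payload)

compressed_path = gzip_file(log_path)

# Read the archive back to confirm it round-trips.
with gzip.open(compressed_path, "rb") as handle:
    restored = handle.read()
```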