Best Practices for Data Collection at Scale

Our economy is becoming data-driven. Businesses are leveraging big data and machine learning to generate insights and make better decisions. There are a lot of awesome things you can do with big data. You can track disease outbreaks, recommend movies and songs, predict the outcome of the next election, or optimize your marketing spend on the highest-performing channels. But before you can do any of that, there is one basic thing you need to do: collect the data!

Here are four things you should be aware of when collecting data at scale:

1) Data Completeness

When collecting data at high scale, you will be using a distributed system of multiple servers, each collecting a part of the data. Later on, when you process the data, you want to be sure that when you process data for a certain period of time (e.g. all the data that came in between 8:00am and 12:00pm), you have the complete dataset for that period. If one of your servers failed to deliver the data it collected, you will have “a hole” in your data.

You can avoid that situation by using service discovery. It can be a tool like Consul or even a simple database table. The service discovery will keep track of all running collection servers. This allows you to hold off processing a time window until every server that was active during that window has delivered its data.
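The check above can be sketched as follows. This is a minimal sketch, not Convertro's implementation: the server registry and the `delivered_servers` lookup are hypothetical stand-ins for a Consul query (or database table) and an inspection of your landing area.

```python
from datetime import datetime

# Hypothetical registry of active collection servers. In practice this
# would come from service discovery (e.g. Consul) or a database table.
ACTIVE_SERVERS = {"collector-1", "collector-2", "collector-3"}


def delivered_servers(window_start: datetime, window_end: datetime) -> set:
    """Return the set of server IDs that have delivered data for the window.

    Stub for illustration: a real system would check its landing area,
    such as storage prefixes or a deliveries table.
    """
    return {"collector-1", "collector-2"}


def is_window_complete(window_start: datetime, window_end: datetime) -> bool:
    """Only process a time window once every active server has delivered."""
    missing = ACTIVE_SERVERS - delivered_servers(window_start, window_end)
    if missing:
        # Hold the window; processing now would leave "a hole" in the data.
        return False
    return True
```

With the stub above, the 8:00am–12:00pm window is held back because `collector-3` has not delivered yet.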

2) Handling “Bad” Data

The internet is a wild, wild place. Your collection servers could be probed by bots. The server that was just provisioned to you might have served some other system in the past, and stale DNS entries could still be sending traffic your way. It could even be packet loss or just human error. You should expect to collect data that is not usable, and your system must not fail when such data arrives.

Now you should decide: what is the right place to filter out bad data? Filtering bad data at the collection layer is cost-efficient, since you don’t have to pay for storing and processing it. There is one caveat to that approach, though: if you did not collect the data, it’s gone for good. At Convertro, we have recovered usable data out of bad data many times. Sometimes the value of recovering data from a badly tagged marketing campaign is worth the extra cost of storing it and filtering it further downstream.
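The trade-off above suggests quarantining rather than dropping. Here is a hedged sketch of that idea; the validation rules and record shape are assumptions made for illustration, not Convertro's actual schema.

```python
def is_valid(record: dict) -> bool:
    """Hypothetical validation: require a client id and a timestamp."""
    return bool(record.get("client_id")) and "timestamp" in record


def route(record: dict, good: list, quarantine: list) -> None:
    """Route each record instead of silently dropping it.

    Quarantined records cost extra storage, but they can sometimes be
    recovered later (e.g. data from a badly tagged campaign).
    """
    if is_valid(record):
        good.append(record)
    else:
        quarantine.append(record)
```

The key design choice is that the collection layer never throws data away; the decision to discard is deferred to a downstream step, where it is reversible.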

3) Traceability

When you collect thousands of incoming data streams, it’s very hard to notice that something may be wrong with one of them.

At Convertro we collect data from multiple sources for multiple clients. We want to be able to notice when there is a drop or spike in one of those thousands of streams. We run anomaly detection algorithms on the data we collect and compare traffic levels to trends observed on similar periods in the past. This allows us to detect issues very quickly and proactively help our clients fix them before they become significant.
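A very simple stand-in for the anomaly detection described above is a z-score check: flag a stream whose current volume deviates too far from comparable past periods. This is a sketch of the general idea under that assumption, not the algorithm Convertro actually runs.

```python
import statistics


def is_anomalous(current_count: int,
                 historical_counts: list,
                 threshold: float = 3.0) -> bool:
    """Flag a stream whose volume deviates more than `threshold` standard
    deviations from volumes observed in similar past periods."""
    mean = statistics.mean(historical_counts)
    stdev = statistics.stdev(historical_counts)
    if stdev == 0:
        return current_count != mean
    return abs(current_count - mean) / stdev > threshold
```

Run per stream per period, a check like this catches both drops and spikes; a production system would also account for seasonality, which is why the historical counts should come from similar periods (same hour of day, same day of week).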

4) Belated Data

When you break the barriers of cyberspace and wish to combine the data you collected online with data from the real world, you will realize that different streams of data arrive at different cadences.

At Convertro, our data sets contain online ad views, website visits, online conversions, TV spot logs, in-store purchases, mailed catalogues, weather, and many other data points. Online data is collected in real time, but TV data or in-store purchase data can arrive with a delay of hours to weeks. When we get belated data, we learn of something that happened in the past, and we must update our calculations accordingly.

There are several ways to handle belated data:

In some use cases, data that arrives late is no longer relevant. If you have already performed irreversible actions based on the data you had at the time, you cannot change your actions and there is no point in adding the new data. An example of this is a real-time bidding algorithm: once it has made its bid, there is no point in recalculating it.

If you are storing raw data and analyzing it on the fly, you can simply add the new data; it will be picked up the next time someone analyzes it. This approach works when there is no tight Service Level Agreement on data query time. If you wish to provide your clients with a highly responsive dashboard, you must pre-calculate your data and cannot use this method.

When the calculated metrics are additive (e.g. the sum of conversions per day), you can update the aggregated metric with the new data that has been introduced.
You can also apply this method to non-additive metrics, like averages, by keeping track of the number of items that went into the aggregation.
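The average case can be sketched as follows: by storing the running total and the item count instead of the average itself, belated data folds into the aggregate without rebuilding it from raw data. This is a minimal illustration of the technique, not production code.

```python
class RunningAverage:
    """An average that can absorb belated data incrementally.

    Keeping (total, count) instead of the computed average means a late
    batch only requires one cheap update, never a full rebuild.
    """

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def add(self, values: list) -> None:
        """Fold a batch of values (possibly belated) into the aggregate."""
        self.total += sum(values)
        self.count += len(values)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0
```

For example, an average of 20 over three items becomes 25 when a belated value of 40 arrives, with no need to touch the original three items.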

If you require a complex metric such as a ranking (e.g. the ‘n’ highest-spending customers), you will have to rebuild your data set.
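To see why, consider the ranking example: a belated purchase can change any customer's position, so the ranking must be recomputed from the full per-customer totals rather than patched in place. A hypothetical sketch:

```python
import heapq


def top_spenders(totals_by_customer: dict, n: int) -> list:
    """Return the n highest-spending customer IDs.

    Rankings are not incrementally updatable in general: after belated
    purchases update `totals_by_customer`, re-rank from the full totals.
    """
    return heapq.nlargest(n, totals_by_customer, key=totals_by_customer.get)
```

The per-customer totals themselves are additive and can be updated incrementally; it is only the derived ranking that needs recomputing each time.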

At Convertro, we have built our system in a way that allows us to go back in time. When a feed of belated data arrives, we can go back and “replay” all the data we collected since the earliest data point in that feed to our algorithms. By doing this, we can incorporate the new knowledge we just obtained to better understand what happened in the past and to better predict the outcome of our future actions. This ability is crucial if you wish to provide one model which accurately combines the online world with the real world out there. Otherwise, you will be modeling each data set separately and combining them using duct tape.
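The replay idea can be sketched as follows. This is an assumption-laden illustration, not Convertro's system: `model` is any object with an `observe(event)` method, and events are dicts with a `"timestamp"` key.

```python
def replay(belated_feed: list, event_log: list, model) -> None:
    """Replay history so the model sees events in true time order.

    Find the earliest data point in the belated feed, merge it with
    everything already collected since then, and feed the combined
    stream to the model in timestamp order.
    """
    earliest = min(event["timestamp"] for event in belated_feed)
    merged = belated_feed + [
        event for event in event_log if event["timestamp"] >= earliest
    ]
    for event in sorted(merged, key=lambda event: event["timestamp"]):
        model.observe(event)
```

The prerequisite, as the text notes, is that raw events are retained and the processing pipeline is deterministic enough to be re-run from an arbitrary point in the past.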

Want to learn more about Marketing Mix Models + Multi-Touch Attribution best practices? Get in touch with one of our experts.

This post was written by Iddo Rachlewski, VP of Research and Development, Convertro

For additional insights on MMM + MTA, check out the following resources: