Data exhaust is the modern data scientist's version of the old adage "one man's trash is another man's treasure."
To keep things simple, we’ll consider data exhaust to be any data which is the byproduct of doing business. As an example, sometimes I use the Postmates application to order fried chicken from Popeye’s for delivery. (For those unfamiliar with Postmates, they are a U.S. courier service that will basically deliver anything from anywhere to you on demand.)
While the business transaction involves Postmates purchasing and picking up the fried chicken from Popeye's and delivering it to me, the data exhaust generated by the transaction is all the rich extra information: the products I ordered, the merchant I ordered from (in this case, Popeye's), the GPS routes the couriers used, and my individual user preferences within the Postmates application.
So how can this data exhaust prove useful?
A new wave of quant data
Consider the case of a quantitative analyst (or “quant”) — like me. Quants are motivated to distill alpha (excess return against a benchmark) and/or risk insight from data to help inform trading decisions. We quants were basically data scientists before it was cool.
The first wave of quant data could easily fit into regular database tables and covered finance-specific areas such as prices, fundamentals, estimates and ownership.
Then, a second wave of quant data arrived as a departure from structured numerical data to unstructured text. That’s when text mining became possible — and popular — with the advent of big data technologies such as Hadoop and Spark. These technologies empower users to analyze hundreds of millions of news articles, tweets, and other text documents in a very short period of time.
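To make the second wave concrete, here is a toy sketch of the kind of computation involved: counting company mentions across a corpus of documents. The corpus, the company watchlist, and the `mention_counts` helper are all invented for illustration; at real scale, frameworks like Hadoop or Spark parallelize exactly this sort of per-document work across hundreds of millions of articles and tweets.

```python
from collections import Counter

# Toy corpus standing in for millions of news articles and tweets.
documents = [
    "Popeyes sales surge as chicken sandwich demand grows",
    "Analysts cut estimates for MegaCorp after weak earnings",
    "Popeyes opens 100 new stores; MegaCorp shares slide",
]

# Hypothetical watchlist of company names (an assumption for illustration).
companies = ["Popeyes", "MegaCorp"]

def mention_counts(docs, names):
    """Count how often each company name appears across all documents."""
    counts = Counter()
    for doc in docs:
        for name in names:
            counts[name] += doc.count(name)
    return counts

print(mention_counts(documents, companies))
# e.g. Counter({'Popeyes': 2, 'MegaCorp': 2})
```

In a real pipeline, the per-document loop would be the "map" step distributed across a cluster, with the counts merged in a "reduce" step.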
The third wave of quant data may well be data exhaust: mining a largely unexplored big data content set for a potential edge.
Five lessons we’ve learned about data exhaust
At StarMine Quantitative Analytics, our job is to research and create quant models to sell to hedge funds and asset managers. So we’ve spent a good amount of time on the first and second waves of quant data, and are now exploring this third wave of data exhaust.
Just like hedge funds are doing nowadays, we’re constantly speaking with potential content partners and exploring their data exhaust for alpha. Here are five lessons we’ve learned so far from those interactions:
- Always ask for anonymized data! Individuals are justifiably concerned about how their personal data is used. Quants don't care what a particular individual is doing; they are interested in behavior aggregated to the company or macro level. It is in everyone's best interest that your potential partner anonymize (and perhaps even aggregate) their user data. Avoid the headache of privacy concerns by addressing them sensibly at the outset.
- Carefully consider your potential partner’s stakeholders. A credit card company has two main stakeholders: the credit card holders and the merchants which accept the credit cards. As a credit card holder, I personally don’t care how my credit card company uses my anonymized purchase data. However, a huge corporate merchant would understandably be displeased if a credit card company resold its transaction data to financial investors. Best to explore these stakeholder concerns early on rather than having them crop up much later and maybe even halt the partnership, after you’ve already invested a lot of time into your research.
- Think creatively about proxies. So maybe you can’t obtain the holy grail of financial transactions. Instead, think about what types of data would be a proxy for purchasing behavior. For example, GPS data is pretty accurate nowadays and can tell you where a person was and how long they were there. If a person spends a significant amount of time at a store or restaurant, usually they are buying something.
- Brace yourself for a lot of data janitorial work. Chances are your potential partner didn't envision monetizing their data exhaust when their business was originally built, so you can't expect it to arrive seamlessly (like our I/B/E/S feed of sell-side analyst estimates). You're also going to need to map their company identifiers to the rest of the content in your data lake. Intelligent Tagging can be helpful for parsing company entities from text documents (for some examples, read my previous blog post: Intelligent Tagging for traders: Can celebrities raise the share price?), while Record Matching lets you concord by company name and address. Furthermore, Organization Authority, which is available in more than 25 Thomson Reuters products, can help you understand parent/subsidiary relationships, and QA Direct helps connect identifiers across many different content sets.
- Data exhaust without sufficient history may not be immediately useful. An unfortunate side effect of data exhaust's recent popularity is that it often comes with a very limited history. Either a potential partner recently came into existence (so hasn't been around long enough to build a deep history), or an established company only recently started retaining its data exhaust. At StarMine, one or two years' worth of data is just not robust enough for us to develop a quant model. For other firms, that amount of data might be sufficient to chase the alpha they're seeking. But be cautious: with a limited history, we have no idea how that data behaves during different market regimes.
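The GPS-proxy idea above can be sketched in a few lines. Everything here is invented for illustration: the anonymized pings, the assumption that raw coordinates have already been geofenced into venue labels upstream, and the 10-minute dwell threshold for a "likely purchase" are all arbitrary choices, not a real methodology.

```python
from datetime import datetime

# Hypothetical anonymized pings: (timestamp, venue) pairs. Mapping raw
# lat/lon onto venue labels is assumed to happen upstream (geofencing).
pings = [
    (datetime(2018, 6, 1, 12, 0), "Popeyes #1042"),
    (datetime(2018, 6, 1, 12, 14), "Popeyes #1042"),
    (datetime(2018, 6, 1, 12, 31), "parking lot"),
]

def dwell_minutes(pings, venue):
    """Minutes between the first and last ping observed at a venue."""
    times = [t for t, v in pings if v == venue]
    if len(times) < 2:
        return 0.0
    return (max(times) - min(times)).total_seconds() / 60.0

# Treat a dwell of 10+ minutes as a likely purchase (an arbitrary threshold).
likely_purchase = dwell_minutes(pings, "Popeyes #1042") >= 10
```

Aggregated across many anonymized users and many store locations, dwell counts like this could serve as a rough stand-in for transaction volume at a merchant.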
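The identifier-mapping chore from the janitorial-work lesson often starts with something as humble as name normalization. The sketch below is a toy stand-in for what products like Record Matching do properly: the `normalize` rules, the master list, and the "PLKI" identifier are all assumptions made up for this example.

```python
import re

def normalize(name):
    """Crude normalization: lowercase, strip punctuation and legal suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    for suffix in (" inc", " corp", " llc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

# Hypothetical internal master list: normalized name -> our identifier.
master = {normalize("Popeyes Louisiana Kitchen, Inc."): "PLKI"}

def concord(partner_name):
    """Map a partner's company name onto our identifier, if possible."""
    return master.get(normalize(partner_name))

print(concord("POPEYES LOUISIANA KITCHEN INC."))  # -> PLKI
```

Exact matching on normalized names is only a first pass; real concordance work layers on address matching, fuzzy matching, and parent/subsidiary resolution of the kind Organization Authority provides.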
Read more from Exchange Magazine in the Know 360 app