With artificial intelligence, there’s nothing sexy about dirty data

Geoff Horrell  Director of Product Incubation

Artificial intelligence capabilities need a careful allocation of time, strategy and resources to avoid a "garbage in, garbage out" style of operating.

Machine learning and artificial intelligence (AI) can safely be classified as the terms of 2017. Everyone from business consultants to technologists to the media is talking about them, and it’s no wonder why. These capabilities promise to find groundbreaking answers amid the chaos of the Findustrial Revolution, where the promise of digital transformation is overwhelmed by legacy technology systems, massive amounts of data and emerging innovations that don’t quite hit the spot. However sexy machine learning and AI might seem, they’re only as powerful as the expertise and content behind them. After all, for a machine to learn, it needs to be trained. And this training requires clean, organized data and experts who know how the data and the technology work.

A need for organization, structure

Teaching a machine to learn requires feeding it information in an organized and structured manner. A systematic approach is essential to measuring whether the output is strong enough to be used. Ultimately, the success of any machine learning capability links directly back to getting the training data right from the start. Get this wrong and you have a “garbage in, garbage out” situation. Even the smartest capability can’t process a random mass of unverified text strings and log tables to give a meaningful answer. Or, even worse, the capability does process that dirty data and delivers the wrong answer entirely.
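As a rough illustration of what getting the training data right can look like in practice, here is a minimal Python sketch of a cleaning gate that runs before any training happens. The Record type, the ALLOWED_LABELS set and the three rejection rules are assumptions made up for this example, not a description of any particular system; a real pipeline would use checks chosen by people who know the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    label: str


# Hypothetical label set, assumed for this sketch only.
ALLOWED_LABELS = {"positive", "negative"}


def clean_training_data(records):
    """Split raw records into (clean, rejected) lists.

    Each rejected entry is a (record, reason) pair so that a person can
    review what was thrown away and why.
    """
    clean, rejected, seen = [], [], set()
    for rec in records:
        text = rec.text.strip()
        if not text:
            rejected.append((rec, "empty text"))
        elif rec.label not in ALLOWED_LABELS:
            rejected.append((rec, "unknown label"))
        elif text.lower() in seen:
            rejected.append((rec, "duplicate"))
        else:
            seen.add(text.lower())
            clean.append(Record(text, rec.label))
    return clean, rejected


raw = [
    Record("Profits rise", "positive"),
    Record("   ", "positive"),           # blank text
    Record("profits rise", "positive"),  # same text, different case
    Record("Shares fall", "neutral"),    # label outside the allowed set
]
clean, rejected = clean_training_data(raw)
for rec, reason in rejected:
    print(f"rejected {rec!r}: {reason}")
```

Keeping the rejected records and their reasons, rather than silently dropping them, is a deliberate choice here: it gives a subject-matter expert something concrete to inspect.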

Data scientists spend most of their time cleaning up disparate data sets in order to train machines to provide the answers that their business sponsors desperately need. Surprisingly, though, it is common to forget that the real world is much messier than the data science lab. There have been many cases of individuals picking up historical data sets and spending a lot of time cleaning them up, over-fitting their models to prove exactly what they’re looking for, but never training those models properly to reflect external conditions. Backtesting results until they look promising is, in a way, a type of epidemic in the industry, but it’s also a product of not having the right subject-matter expertise in the lab.
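One common guard against a flattering backtest is to hold back the most recent slice of history and score the model only on data it has never seen. The Python sketch below is illustrative only: MemorizingModel is a deliberately over-fitted toy, and the daily "up"/"down" rows are invented for the example rather than taken from any real market history.

```python
def chronological_split(rows, train_percent=70):
    """Split time-ordered rows into an earlier training part and a later test part."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


class MemorizingModel:
    """A toy model that memorizes every training row (an extreme over-fit)."""

    def fit(self, rows):
        self.table = {x: y for x, y in rows}
        labels = [y for _, y in rows]
        # Fall back to the most common training label for unseen inputs.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


# Invented history: (day number, market direction).
history = [(day, "down" if day % 3 == 0 else "up") for day in range(30)]

train, test = chronological_split(history)
model = MemorizingModel().fit(train)

print("in-sample accuracy:    ", accuracy(model, train))
print("out-of-sample accuracy:", accuracy(model, test))
```

The in-sample score is perfect because the model has simply memorized the past; the out-of-sample score on the later days is lower, and that gap is the signal a backtest tuned only on history would hide.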

Investing in the wrong place

Training machines to understand the critical external conditions and influences that affect their models in practice is essential to their success. Say, for example, that a data scientist is trying to train a system to understand the meaning of a news article based solely on its headline. If they feed it data from Twitter, a social media platform that doesn’t have headlines, the whole process won’t work. The system could also mistake advertisements for articles, or even mix up languages, if these distinctions aren’t specified in the training. Machine learning has to be robust enough to cope with the vagaries of the real world and the specifics of different domains, all of which are reflected in the different classes of data.
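To make the headline example concrete, here is a small Python sketch of separating data classes before training. It assumes each item arrives as a dictionary with hypothetical "kind", "lang", "headline" and "body" fields; real feeds rarely label themselves this neatly, which is exactly where domain expertise comes in.

```python
from collections import Counter


def select_headline_examples(items, language="en"):
    """Keep only articles in one language that actually have a headline.

    Returns (kept, skipped): kept is a list of (headline, body) pairs,
    skipped counts the reasons other items were left out.
    """
    kept, skipped = [], Counter()
    for item in items:
        headline = (item.get("headline") or "").strip()
        if item.get("kind") != "article":
            skipped["not an article"] += 1   # tweets, advertisements, etc.
        elif item.get("lang") != language:
            skipped["wrong language"] += 1
        elif not headline:
            skipped["no headline"] += 1
        else:
            kept.append((headline, item.get("body", "")))
    return kept, skipped


# Invented sample items, one per data class discussed above.
items = [
    {"kind": "article", "lang": "en", "headline": "Rates hold steady", "body": "..."},
    {"kind": "tweet", "lang": "en", "body": "..."},
    {"kind": "ad", "lang": "en", "headline": "Buy now", "body": "..."},
    {"kind": "article", "lang": "fr", "headline": "Les taux restent stables", "body": "..."},
    {"kind": "article", "lang": "en", "headline": "  ", "body": "..."},
]
kept, skipped = select_headline_examples(items)
print(len(kept), "usable examples;", dict(skipped))
```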

Most machine learning business cases fail because they invest in the technology tools but not in the data management expertise. Machine learning and AI are only as good as the data they learn from, and this requires domain experts who can train systems from the ground up: individuals who can find the answers by connecting the right data to the right models.

The technology solutions necessary for any business’s success are the ones that make data digestible, not overwhelming. As every company goes through its digital transformation, what it needs is less frivolity and more practicality. Practicality means clean, structured data managed by trusted experts. After all, machine learning and AI are only as sexy as the clean data and expertise that train them.

Learn more

Visit Innovation @ to learn more about how we are pairing smart data with human expertise and how you can get involved.
