Thomson Reuters


Is dirty data spoiling your AI progress?

05 Sep 2017

A visitor looks at the T-800 Terminator android from the movie 'The Terminator'. Photographer: Herwig Prammer
Machine learning and artificial intelligence (AI) are only as powerful as the expertise and content behind them, heightening the need for clean, organized data.

Machines don’t learn; they’re trained. Teaching a machine means feeding it information in an organized and structured manner.

Ultimately, the success of any machine learning capability directly links back to getting the training data right from the start.

Get this wrong and you have the classic situation: garbage in, garbage out.

Even the smartest capability can’t process a random mass of unverified text strings and log tables to give a meaningful answer.

Or even worse, the capabilities do manage to process that dirty data and deliver the wrong answer entirely.

AI data challenges

These days data scientists spend most of their time trying to clean up disparate data sets in order to train machines to provide the answers that their business sponsors desperately need.

Surprisingly though, it is common to forget that the real world is much messier than the data science lab.

There have been many cases of individuals picking up historical data sets and spending a lot of time cleaning them up, over-fitting their models to prove exactly what they’re looking for, but never training those models properly to reflect external conditions.

Back-testing results like this until they look promising is something of an epidemic in the industry, but it’s also a product of not having the right subject matter expertise in the lab.
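
One way to see why a flattering back test can mislead is to hold back the most recent slice of history and score the model only on data it never saw. The sketch below is illustrative only: the series is synthetic noise, and `fit_lookup`, `mean_abs_error` and the 150/50 split are hypothetical choices made for this example, not anything taken from a real system.

```python
import random

def fit_lookup(train):
    """Deliberately over-fitted 'model': it memorizes every training pair."""
    return dict(train)

def predict(model, x, default):
    # Unseen inputs fall back to a default, here the training mean.
    return model.get(x, default)

def mean_abs_error(model, data, default):
    return sum(abs(predict(model, x, default) - y) for x, y in data) / len(data)

# Synthetic history: pure noise, so there is nothing real to learn.
rng = random.Random(0)
history = [(t, rng.gauss(0.0, 1.0)) for t in range(200)]

# Chronological split: train on the past, hold out the most recent points.
split = 150
train, holdout = history[:split], history[split:]

model = fit_lookup(train)
default = sum(y for _, y in train) / len(train)

in_sample = mean_abs_error(model, train, default)        # the flattering back test
out_of_sample = mean_abs_error(model, holdout, default)  # closer to live conditions

print(f"in-sample error:     {in_sample:.3f}")
print(f"out-of-sample error: {out_of_sample:.3f}")
```

Because the model simply memorizes its training data, the in-sample error is exactly zero, while the held-out error is not. The gap between the two numbers is the part of the result that the back test alone would hide.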

External conditions

Training machines to understand those critical external conditions and influences that affect their models when put into practice is essential to their success.

Say, for example, that a data scientist is trying to train a system to understand the meaning of a news article solely based on its headline.

If they feed it data from Twitter, a social media platform that doesn’t have headlines, the whole process won’t work.

The system could also mistake advertisements for articles, or even mix up languages, if these distinctions aren’t specified in the training.
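
The checks described here can be made explicit before any training starts. The following is a minimal sketch, assuming each record already carries a content type and a language label from some upstream step; the `Document` fields, the `rejection_reason` and `screen` helpers, and the sample records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Document:
    source: str              # e.g. "newswire", "twitter" (illustrative values)
    kind: str                # e.g. "article", "advertisement", "post"
    language: str            # assumed to be a language code such as "en"
    headline: Optional[str]
    body: str

def rejection_reason(doc: Document, expected_language: str = "en") -> Optional[str]:
    """Return why a record is unfit for headline-based training, or None if it is fit."""
    if not doc.headline or not doc.headline.strip():
        return "missing headline"
    if doc.kind != "article":
        return f"not an article ({doc.kind})"
    if doc.language != expected_language:
        return f"unexpected language ({doc.language})"
    return None

def screen(docs: List[Document], expected_language: str = "en"
           ) -> Tuple[List[Document], List[Tuple[Document, str]]]:
    """Split records into accepted ones and (record, reason) rejections."""
    accepted, rejected = [], []
    for doc in docs:
        reason = rejection_reason(doc, expected_language)
        if reason is None:
            accepted.append(doc)
        else:
            rejected.append((doc, reason))
    return accepted, rejected

# Made-up sample records covering each failure case.
SAMPLES = [
    Document("newswire", "article", "en", "Central bank holds rates", "..."),
    Document("twitter", "post", "en", None, "..."),
    Document("newswire", "advertisement", "en", "Open an account today", "..."),
    Document("newswire", "article", "fr", "La banque centrale maintient ses taux", "..."),
]

accepted, rejected = screen(SAMPLES)
for doc, reason in rejected:
    print(f"{doc.source}: {reason}")
```

Keeping the rejection reasons, rather than silently dropping records, gives the domain expert something to review: a spike in "missing headline" rejections, for instance, suggests the wrong kind of source is being fed in.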

Machine learning has to be robust enough to cope with the vagaries of the real world and different domain specificities, which are all reflected in the different classes of data.

Connecting the right data

Most machine learning business cases fail because the companies behind them invest in technology tools but not in data management expertise.

Machine learning and AI are only as good as the data they learn from, and this requires domain experts who can train systems from the ground up: individuals who can find the answers by connecting the right data to the right models.

The technology solutions that are necessary for any business’s success are the ones that make data digestible, not overwhelming. As every company goes through its digital transformation, what it needs is less frivolity and more practicality.

This means clean, structured data managed by trusted experts. After all, machine learning and AI are only as attractive as the clean data and expertise that train them.
