According to Forrester, by 2017, 25% of enterprises will have implemented a graph database, while Gartner states that “graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions.”
When Microsoft® announced the acquisition of LinkedIn®, there was another big clue about the future importance of graph: the transcript of the interview between Satya Nadella and Jeff Weiner mentioned “graph” nine times!
So what is a graph database, and how and why is it suddenly emerging as the latest killer application in the “big data landscape”? What are the use cases for graph, and how can customers dip their toes in without having to build a 30-strong data science and engineering team?
Graphs are becoming an increasingly popular and useful tool in the information world, but they are by no means new. In fact, the first graph dates back to the Königsberg bridge problem, which was subsequently solved by Swiss mathematician Leonhard Euler in 1736. More recently, the notion of a graph as a way of representing relationships between people was popularised by the observation that Kevin Bacon (the actor) is, on average, three degrees of separation from every other actor in the IMDb database. Mathematical theory and practical research (e.g., on Facebook®) have shown that, on average, people have no more than six degrees of separation from each other.
Graph 1: Sean Connery – Kevin Bacon Number = 2 (they have never worked on the same movie)
The algorithm that calculates the shortest path between any two people (their actual separation) was developed in the 1950s by Dutch mathematician Edsger Dijkstra, but it wasn’t until the advent of big data and the explosion in cheap computing power that such algorithms could really be put to work for use cases like Facebook’s “social graph,” which connects interests and friends so you can find restaurants in Barcelona that your friends like. Google’s PageRank is another algorithm that leverages graph data (representing the hyperlinks between Web pages) to derive search results.
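To make the idea of degrees of separation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used by any of the services named above: the names in CO_STARS are invented placeholders, and because every link counts equally the sketch uses a plain breadth-first search, which on such a graph gives the same answer as Dijkstra's algorithm.

```python
from collections import deque

# Hypothetical co-appearance graph: a link means two people worked on the same film.
# All names are invented placeholders, not real data.
CO_STARS = {
    "Ann": {"Bob"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Bob", "Dan"},
    "Dan": {"Cat"},
    "Eve": set(),  # Eve has never worked with anyone in this graph
}


def degrees_of_separation(graph, start, goal):
    """Return the fewest links between start and goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # no chain of links joins the two people
```

Calling degrees_of_separation(CO_STARS, "Ann", "Cat") walks outward from Ann one link at a time, so the first time Cat is reached is guaranteed to be by a shortest chain. A graph with weighted links (for example, strength of relationship) would need Dijkstra's algorithm proper.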
Just like lists and tables, graphs are a means of organising and representing information. A graph comprises objects and relations between those objects, such that any pair of objects connected by a relation forms a simple information “sentence” such as “Dog bites Pat.” So one can think of a graph as being a “map” of many such sentences involving a superset of objects and types of relation. For example, “Pat works at the Royal Mail” and “Dog belongs to Alice” might be held alongside “Dog bites Pat,” so we can infer associations not explicitly stated between objects, and follow relationship “signposts” to related information.
Graph 2: Dog bites Pat
A graph method makes it easy to aggregate data from multiple sources that may differ widely in precision, accuracy and meaning. Anyone can add new information to a graph without affecting or being constrained by what is already there: adding to the sum of knowledge. Conversely, lists and tables are designed before any information is added so that the set of things represented and the information held against each thing is clear. This has the effect of constraining what can be represented.
A table designed to capture a “bites” relation cannot be used to represent a “works at” relationship. Lists and tables inherently reduce the available knowledge to fit a design set in advance, and so also determine in advance the questions that can be answered. On the other hand, the same graph can be used to answer questions in various contexts, whatever they may be, regardless of who created the graph or for what purpose. Whether you’re interested in the dangers of being a postman or in the behaviour of Alice’s pets, the previous graph can provide answers, even if its original purpose was to document Pat’s day.
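The “Dog bites Pat” graph can be sketched in a few lines of Python to show how one set of sentences answers two unrelated questions. This is only an illustration: the function names (match, employers_of_bite_victims, victims_of_pets_owned_by) are invented for this example and do not belong to any particular graph database.

```python
# Each fact is a (subject, relation, object) "sentence" from the example above.
TRIPLES = [
    ("Dog", "bites", "Pat"),
    ("Pat", "works at", "Royal Mail"),
    ("Dog", "belongs to", "Alice"),
]


def match(triples, subject=None, relation=None, obj=None):
    """Return every triple that agrees with the given fields; None is a wildcard."""
    return [
        (s, r, o)
        for (s, r, o) in triples
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


def employers_of_bite_victims(triples):
    """The postman question: where do the people who get bitten work?"""
    victims = {o for (_, _, o) in match(triples, relation="bites")}
    return {
        o
        for victim in victims
        for (_, _, o) in match(triples, subject=victim, relation="works at")
    }


def victims_of_pets_owned_by(triples, owner):
    """The pet-behaviour question: whom have this owner's animals bitten?"""
    pets = {s for (s, _, _) in match(triples, relation="belongs to", obj=owner)}
    return {
        o
        for pet in pets
        for (_, _, o) in match(triples, subject=pet, relation="bites")
    }
```

Neither question was anticipated when the three sentences were written down, and a new kind of relation can be added by appending another tuple, with no redesign of what is already there.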
In the world of financial data, Thomson Reuters’ considerable data assets are contributing to the formation of a Thomson Reuters Knowledge Graph. This will help our customers to identify both inferred and factual relationships that were previously unknown. For example, Thomson Reuters has been tracking movements of officers and directors of companies for over 30 years. Our Deals database spans a similar time period. By mapping organisations and people in both data sets to common permanent identifiers (PermIDs), a graph representation is formed that shows which executives are associated with which deals through time. Graphs like this can also be easily connected to other graphs as long as the graph databases share some common standards – typically around how entities (like people or companies) and relationships are represented.
For example, while the IMDb website has not adopted PermIDs to uniquely identify actors, a small handful of individuals in that database are or have been officers or directors of companies that are in Thomson Reuters entity databases, and as such they have PermIDs. Ashton Kutcher, for instance, is an actor and is on the board of Katalyst Media, the firm he founded with Jason Goldberg.
Graph 3: When two worlds collide
Graph 4: Connecting the dots
So by traversing nodes common to graphs, it’s possible to join two otherwise separated data sets. Then the resulting knowledge base (“The Graph”, as in “The Web”) allows users maximum access to information and the ability to individually tailor queries and views, subject only to rights and regulations rather than to technical and physical separation. For example, what is the relationship between Qantas and Kevin Bacon? Well, Australian businessman James Packer was on the board of Qantas, and his planned nuptials with Mariah Carey (singer and occasional actress) provide the essential connection between the business and entertainment worlds.
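Joining two graphs through a shared node can be sketched as follows. This is a simplified illustration, not a description of how Thomson Reuters builds its graph: the identifier strings are invented placeholders rather than real PermIDs, and "film:F1" and "person:P2" are hypothetical entities added only so there is something to traverse.

```python
from collections import deque

# Business graph: officers and directors, keyed by identifier.
# P1 stands for Ashton Kutcher and O1 for Katalyst Media; the IDs are placeholders.
BUSINESS_EDGES = [
    ("person:P1", "director of", "org:O1"),
]

# Entertainment graph, with people mapped to the same identifiers.
# film:F1 and person:P2 are hypothetical placeholders.
ENTERTAINMENT_EDGES = [
    ("person:P1", "acted in", "film:F1"),
    ("person:P2", "acted in", "film:F1"),
]


def build_adjacency(*edge_lists):
    """Merge any number of edge lists into one undirected adjacency map."""
    adjacency = {}
    for edges in edge_lists:
        for source, relation, target in edges:
            adjacency.setdefault(source, []).append((relation, target))
            adjacency.setdefault(target, []).append((relation, source))
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search returning the nodes on a shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for _, neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None
```

With only BUSINESS_EDGES loaded, find_path cannot get from "org:O1" to "person:P2". Once both edge lists are passed to build_adjacency, the shared node "person:P1" bridges the two data sets and a path appears. The join works only because both graphs use the same identifier for the same person.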
It’s this ability to connect graphs that really drove the acquisition of LinkedIn by Microsoft. At the time Jeff Weiner stated: “What Satya and I are most excited about is when you combine Microsoft’s corporate graph with LinkedIn’s professional graph.”
Customers’ use of the graph
Its cumulative nature makes the graph an especially useful method when sharing and combining data. If everyone in a large organisation, for example, shares what they know by contributing to a graph, the resulting knowledge base can be used in a much more nuanced and flexible way than had they all been forced to contribute to a centrally pre-designed database. The graph method marks a shift in emphasis from data only being created and managed for specific needs, to data being connected to form the collective knowledge of the organisation.
In order to widen the set of questions that might be answered from an organisation’s graph, relations can be established to other external graphs. These relations connect knowledge together so that answering subsequent questions can make use of what is, in information terms, now a bigger graph made up of the smaller ones.
The good news is that Thomson Reuters has been working on the foundational building blocks to establish perhaps one of the largest high-precision graph databases in the professional world. Leveraging the firm’s vast content assets, plus the high-definition entity identity enabled by Open PermID, Thomson Reuters plans to launch a feed that will expose up to 30 billion relationships linking entity types including securities, people, organisations and events. Connecting this feed with their own organisational graphs will open up huge opportunities for our customers, combining a global authoritative perspective with their own organisational knowledge and yielding high-value, locally contextualised answers and insights.
Recent engagements have revealed that many customers have already embarked on their own journey across the graph world – some are investigating, some experimenting, and a select few have implemented full-scale, big-data environments optimised for graph data. Use cases are almost too numerous to list, but range from relationship management and business development, to alpha and idea generation – and of course, risk analytics.
Risk is perhaps the biggest category, since graph databases help to identify hidden or complex relationships, which go to the core of fraud detection, supply chain risk analysis and exposure to sanctioned entities. The Panama Papers helped expose such hidden connections and the importance of modelling and connecting entity data as part of the research process.
To get our customers started as quickly and easily as possible, Thomson Reuters’ award-winning Data Fusion platform is being optimised to enable them to rapidly shape their graph-shaped world and leverage the Thomson Reuters Knowledge Graph.
Read more from Exchange Magazine in the Know 360 app