Data quality is one of the biggest challenges facing companies today. Many companies have access to large quantities of unstructured data (whether public or private) but lack the proper tools to understand it. Poor-quality data can waste resources, because employees spend their time cleaning and organizing it rather than analyzing and acting on it. And if a company relies on inaccurate or incomplete data to make important decisions, those decisions can suffer and ultimately harm the company's performance.
At Cognistx, we’re hard at work building AI solutions to automatically extract useful information from large unstructured data sources. Information extraction is commonly used in a variety of applications, such as identifying entities (e.g. names of people, organizations, or locations) and their relationships in a text, extracting key phrases or terms from a document, or summarizing the main points of a text. This is typically done using natural language processing (NLP) techniques, which allow a computer to understand and interpret human language.
To demonstrate the importance of information extraction, let’s pretend we’re a company that wants to create a new board game. Developing a game costs a lot of money, so this can be a risky decision. Before we jump into writing the mechanics, our company decides to review many other board games already on the market. This data can be valuable for several reasons. First, it can help us understand the current state of the market and identify trends and gaps that our game could fill, which leads to more informed decisions about the game's design, theme, and mechanics. Second, it can help us identify potential competitors and understand their strengths and weaknesses, which can inform our marketing and sales strategies and let us position our game so that it stands apart from the competition and appeals to our target audience. Finally, it gives us a benchmark against which to track our own game's performance over time, showing what is working well and what needs improvement.
Since many board games post their rules online, it is pretty easy for us to assemble a large collection of PDF documents. Let’s focus on three familiar games: Monopoly, Candy Land, and The Game of LIFE.
As you can see, these rule documents have very different formats. Monopoly is a single column, but Candy Land and The Game of LIFE have two columns with images in and around the text.
Before we can extract information from these documents, we need to use a technique called optical character recognition (OCR) to extract the text.
Using advanced machine learning techniques, we can not only get the text for these documents, but ensure that our text is sorted properly when there are multiple columns. Everything looks good, so let’s move on to information extraction.
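To make the column-ordering problem concrete, here is a minimal sketch of one simple way to sort OCR text blocks into reading order. It is a plain geometric heuristic, not the machine learning approach described above, and the `TextBlock` type, its fields, and the `reading_order` function are hypothetical names introduced only for this illustration. It assumes each block comes with a bounding box and that a page has at most two columns.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A hypothetical OCR output unit: some text plus its bounding box."""
    text: str
    x: float      # left edge of the block
    y: float      # top edge of the block (0 is the top of the page)
    width: float


def reading_order(blocks, page_width):
    """Sort blocks top to bottom, reading the left column before the right.

    Blocks that cross the page midline (titles, single-column text) are
    treated as full width and split the page into horizontal bands.
    """
    mid = page_width / 2

    def column(block):
        if block.x + block.width <= mid:
            return 0          # entirely in the left half
        if block.x >= mid:
            return 1          # entirely in the right half
        return None           # spans both halves

    spanning_tops = [b.y for b in blocks if column(b) is None]

    def sort_key(block):
        # Band = number of full-width blocks at or above this block.
        band = sum(1 for top in spanning_tops if top <= block.y)
        col = column(block)
        # A full-width block comes first within the band it opens.
        return (band, -1 if col is None else col, block.y)

    return sorted(blocks, key=sort_key)


if __name__ == "__main__":
    page = [
        TextBlock("right top", x=55, y=10, width=40),
        TextBlock("Title", x=5, y=0, width=90),
        TextBlock("left bottom", x=5, y=30, width=40),
        TextBlock("left top", x=5, y=10, width=40),
        TextBlock("right bottom", x=55, y=30, width=40),
    ]
    for block in reading_order(page, page_width=100):
        print(block.text)
```

A heuristic like this breaks down on pages with three or more columns or with images that push text off the midline, which is one reason a learned layout model can be the better choice for documents like these.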
Since we may have many documents, the first thing we need to know about a game is the name. Let’s use our information extraction pipeline to automatically extract this information from each document.
Since we are using machine learning to automatically do this extraction, we might want to know where it got that information from and how confident our model is that this answer is right. For each of these games, we see that the correct answer is the prediction with the highest confidence!
Let’s try understanding something a little bit more complicated about these games. What if we want to know how these games decide which player goes first?
Cool! Our models predict the correct answer in the top 1 or 2 predictions for each game. For Candy Land, our model predicted “Silent” as the first answer, which is just its way of saying that it doesn’t think the answer can be found in the document. However, it is easy for us to verify that “The youngest” is actually the correct answer by looking at the context.
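The idea of looking past a “Silent” prediction to the next candidate can be sketched in a few lines. This is only an illustration under stated assumptions: the `Prediction` type, its fields, and both helper functions are hypothetical, confidences are assumed to lie between 0 and 1, and the numbers in the example are made up rather than taken from the actual model output.

```python
from dataclasses import dataclass
from typing import List, Optional

NO_ANSWER = "Silent"  # label the post uses when no answer is found


@dataclass
class Prediction:
    """A hypothetical extraction result."""
    answer: str
    confidence: float  # assumed to lie in [0, 1]
    context: str       # passage the answer was taken from


def top_answers(predictions: List[Prediction], k: int = 2) -> List[Prediction]:
    """Return the k predictions with the highest confidence."""
    ranked = sorted(predictions, key=lambda p: p.confidence, reverse=True)
    return ranked[:k]


def best_supported_answer(
    predictions: List[Prediction], k: int = 2
) -> Optional[Prediction]:
    """Return the most confident real answer among the top k, if any."""
    for prediction in top_answers(predictions, k):
        if prediction.answer != NO_ANSWER:
            return prediction
    return None  # every top prediction said the answer is not in the text


if __name__ == "__main__":
    # Made-up confidences and a placeholder context, for illustration only.
    candy_land = [
        Prediction(NO_ANSWER, 0.55, ""),
        Prediction("The youngest", 0.40, "(passage from the rules document)"),
    ]
    result = best_supported_answer(candy_land)
    if result is not None:
        print(result.answer, result.confidence, result.context)
```

Keeping the context next to each answer is what makes the manual check described above quick: a reviewer can read the quoted passage instead of searching the whole document.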
Let’s keep going! Obviously, we might want to know how a player can actually win the game.
Wow! For Monopoly, our models find two correct answers: “The last player left in the game” for the short version of the game and “The richest” for the time limit version. For Candy Land and The Game of LIFE, our models again predict the correct answer in the top 1 or 2 predictions.
Finally, we might want to know what kind of pieces each of these games uses. Let’s try asking whether or not these games use dice. Note that for these predictions, our answers should be Yes or No, or should indicate that the answer cannot be found.
Monopoly is the only one of these games that uses dice, and its second prediction is the correct one. For the other two games, our models suggest that the rules make no mention of dice, and they even provide context for what the games use instead (i.e., drawing cards or using a spinner).
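One way to enforce the Yes / No / not found constraint on a dice-style question is to map each raw prediction onto a small fixed set of values. The sketch below is an assumption about how that could be done, not a description of the pipeline itself: the `YesNo` enum, the `normalize_yes_no` function, and the accepted spellings are all hypothetical.

```python
from enum import Enum


class YesNo(Enum):
    YES = "Yes"
    NO = "No"
    NOT_FOUND = "Not found"


def normalize_yes_no(raw: str) -> YesNo:
    """Map a raw model answer onto Yes, No, or Not found.

    Anything that is not clearly a yes or a no, including the "Silent"
    label, is treated as Not found. This is a design choice made for
    this sketch.
    """
    text = raw.strip().lower().rstrip(".!")
    if text in {"yes", "y", "true"}:
        return YesNo.YES
    if text in {"no", "n", "false"}:
        return YesNo.NO
    return YesNo.NOT_FOUND


if __name__ == "__main__":
    for raw in ["Yes", " no. ", "Silent"]:
        print(repr(raw), "->", normalize_yes_no(raw).value)
```

Falling back to Not found for anything unrecognized keeps an unexpected answer from being silently counted as a Yes or a No in downstream data.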
Doing all of these extractions manually would have been super tedious and time-consuming, especially across many documents. By using Cognistx’s information extraction pipeline, we can automatically turn poor-quality, unstructured data into high-quality, structured data that can be used for many downstream tasks.