the cognistx blog

Data Quality Engine (DQE)

December 11, 2019

An AI Tool to Help Clean Your Data and Find Known and Unknown Anomalies

Introduction:
In any company that handles large volumes of data, it is critical to keep that data at the highest level of quality so that business decisions rest on accurate information.

Anomalous data is information that is inaccurate, unreliable, or lacking in integrity. The harsh truth is that no industry, organization, or business is safe from anomalies in its data. If they are not identified and corrected early, certain anomalies can cause serious problems for the business and its operations.

Cognistx has developed an AI-enabled Data Quality Engine (DQE) platform that lets users assess and interact with their data with respect to their business rules, and that surfaces overlooked insights in the data through machine learning and statistical analysis.

Source data from multiple systems is processed using multiple AI-enabled rules to clean the data, and is then stored in a data lake repository.


Anomaly Detection:
Common anomalies in a given data stream include:

Manual Entry Error: Misspellings, typos, and omissions in spelling, naming, or formatting.

Missing Data: Gaps or empty values within the data.

Inconsistent Data: Data that violates the defined integrity constraints.
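
The categories above can be illustrated with a minimal Python sketch. The field names, the `CONSTRAINTS` table, and the `find_anomalies` helper are hypothetical and chosen only for illustration; they are not taken from the DQE platform.

```python
# Hypothetical schema: field name -> (expected type, validity check).
CONSTRAINTS = {
    "order_id": (int, lambda v: v > 0),
    "quantity": (int, lambda v: v >= 0),
    "country": (str, lambda v: len(v) == 2 and v.isupper()),
}

def find_anomalies(record: dict) -> list:
    """Return a readable list of problems found in one record."""
    problems = []
    for field, (expected_type, is_valid) in CONSTRAINTS.items():
        value = record.get(field)
        if value is None or value == "":
            # Missing data: a gap or an empty value.
            problems.append(f"missing: {field}")
        elif not isinstance(value, expected_type):
            # Inconsistent data: wrong type for the field.
            problems.append(f"inconsistent type: {field}")
        elif not is_valid(value):
            # Inconsistent data or entry error: breaks a defined constraint.
            problems.append(f"constraint violated: {field}")
    return problems

records = [
    {"order_id": 1, "quantity": 3, "country": "US"},
    {"order_id": 2, "quantity": None, "country": "us"},
    {"order_id": 3, "quantity": -5, "country": "DE"},
]
report = {r["order_id"]: find_anomalies(r) for r in records}
```

Here the second record is flagged for a missing quantity and a badly formatted country code, and the third for a negative quantity.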


From a business-specific standpoint, anomalies can be targeted with rules or statistics-based thresholds that encode logic defined by business standards.

For example, if an online retail business defines fewer than 90 orders per day for a particular product as anomalous, that rule is incorporated into the platform, and products that violate it are displayed for the user to view. Additionally, if a corrected record arrives in the data stream for something previously classified as an anomaly, the platform will mark and display the record as ‘resolved’.
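
As a rough sketch of the retail example, the rule and the ‘resolved’ status could look like the Python below. The `update_status` function and the status store are hypothetical, and treating a later in-threshold count as the "corrected record" is an assumption made for illustration.

```python
MIN_DAILY_ORDERS = 90  # threshold from the retail example

def update_status(previous: dict, daily_orders: dict) -> dict:
    """Classify each product as 'anomaly', 'resolved', or 'ok'.

    previous: product -> status from the last run (hypothetical store)
    daily_orders: product -> order count for the current day
    """
    status = {}
    for product, count in daily_orders.items():
        if count < MIN_DAILY_ORDERS:
            status[product] = "anomaly"
        elif previous.get(product) == "anomaly":
            # Previously flagged, now within the rule: mark as resolved.
            status[product] = "resolved"
        else:
            status[product] = "ok"
    return status

day1 = update_status({}, {"mug": 40, "lamp": 120})
day2 = update_status(day1, {"mug": 95, "lamp": 85})
```

On the first day the mug is flagged; on the second day it is marked resolved while the lamp becomes the new anomaly.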


High Volume of Streamed Data:
A major blocker for companies with large amounts of data is the difficulty of efficiently searching that data for inconsistencies. Our cloud-based solution can analyze millions of records in less than an hour, work that would take an average company months, or perhaps years, to do manually with basic data display tools.

A major component of our cloud backend is the recalculation module, which ingests real-time data and recomputes its statistics by comparing the new data with historical data. The recalculation module supplies much of the live data shown on the platform and is integral to adjusting the statistical models per variable (such as client, location, etc.).
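
One common way to keep per-variable statistics current as records stream in is an incremental mean and variance (Welford's algorithm). The sketch below is an assumption about how such a recalculation could work, not a description of the actual module; `RunningStats`, `stats_by_key`, and `ingest` are hypothetical names.

```python
import math
from collections import defaultdict

class RunningStats:
    """Incrementally updated mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; 0.0 until two values have been seen.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

# One set of statistics per variable value, e.g. per client or location.
stats_by_key = defaultdict(RunningStats)

def ingest(key: str, value: float) -> float:
    """Score a new value against history, then fold it into the history."""
    stats = stats_by_key[key]
    z = (value - stats.mean) / stats.std if stats.std > 0 else 0.0
    stats.update(value)
    return z

for v in [10.0, 12.0, 11.0, 9.0, 13.0]:
    ingest("client_a", v)
score = ingest("client_a", 30.0)  # far from the historical mean of 11.0
```

Because each update touches only three numbers per key, the history never has to be rescanned when new data arrives.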


Ranking Anomalies:


Our platform is designed to show users the most significant anomalies in their data through historical distribution analysis and the integration of user-specific business rules.


Users can rank anomalies by how statistically unusual the data points are, in addition to using basic sorting options.
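
A simple way to express "most statistically unusual" is the absolute z-score of each value against its historical distribution. The Python sketch below assumes that scoring method for illustration; the platform's actual ranking is not specified here, and `rank_by_unusualness` is a hypothetical helper.

```python
from statistics import mean, stdev

def rank_by_unusualness(history: list, candidates: dict) -> list:
    """Order candidates by absolute z-score against history, highest first."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        raise ValueError("history has no variation; z-scores are undefined")
    scored = [(name, abs(value - mu) / sigma) for name, value in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

history = [100, 102, 98, 101, 99]
candidates = {"rec_1": 103, "rec_2": 60, "rec_3": 120}
ranking = rank_by_unusualness(history, candidates)
```

With this history, `rec_2` ranks first, followed by `rec_3` and then `rec_1`.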


Atypical Records:


Detecting abnormalities in data can be a tedious task, especially when no specific rule determines whether a given data point is abnormal. When users provide a data dictionary, which explains how the variables in the data relate to one another, our system adapts to that logic over time and brings potential abnormalities, or atypicals as we like to call them, to the user's attention. This component acts as a background statistical monitor for the data, and it is refined further through user feedback.



Human Feedback Interface:


Before an AI system can develop its intelligence, users must first set the system in motion and steer it, which on our platform is done through feedback. From re-ranking anomalies to confirming atypical behavior, users can provide feedback that makes the system's intelligence more concrete and better tuned to the true distinction between normal and anomalous data.
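
To make the feedback idea concrete, here is a deliberately simple Python sketch in which confirmations and dismissals nudge a z-score cutoff. The update rule, the `FeedbackTunedThreshold` class, and all parameter values are assumptions for illustration only; the platform's actual learning mechanism is not described in this post.

```python
class FeedbackTunedThreshold:
    """Z-score cutoff nudged by user feedback (illustrative update rule)."""

    def __init__(self, threshold: float = 3.0, step: float = 0.1,
                 lower: float = 1.0, upper: float = 6.0):
        self.threshold = threshold
        self.step = step
        self.lower = lower
        self.upper = upper

    def is_atypical(self, z_score: float) -> bool:
        return abs(z_score) >= self.threshold

    def record_feedback(self, confirmed: bool) -> None:
        # Confirmed flags make the monitor slightly more sensitive;
        # dismissed flags make it slightly less sensitive.
        change = -self.step if confirmed else self.step
        self.threshold = min(self.upper, max(self.lower, self.threshold + change))

monitor = FeedbackTunedThreshold()
flagged_before = monitor.is_atypical(3.2)
for _ in range(3):
    monitor.record_feedback(confirmed=False)  # user dismisses three flags
flagged_after = monitor.is_atypical(3.2)
```

After three dismissals the cutoff rises from 3.0 to about 3.3, so a value with a z-score of 3.2 is no longer flagged.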


Creating a Rule:
One of the most promising tools of our platform is the “Create a rule” component. If users are curious about what anomalies they might find using specific logic, the platform lets them create a brand new rule that incorporates that logic. They can preview the rule and judge its value to the overall system, thus tuning the system even further.
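
The create-and-preview workflow can be sketched in Python as a rule object plus a dry run over existing records. `Rule`, `preview`, and the sample records are hypothetical and are not the platform's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    is_anomalous: Callable[[dict], bool]

def preview(rule: Rule, records: list) -> dict:
    """Report what a rule would flag without changing any stored state."""
    hits = [r for r in records if rule.is_anomalous(r)]
    return {
        "rule": rule.name,
        "flagged": len(hits),
        "total": len(records),
        "examples": hits[:3],  # a few sample hits for the user to inspect
    }

records = [
    {"sku": "A", "price": 19.99, "discount": 0.10},
    {"sku": "B", "price": 5.00, "discount": 0.95},
    {"sku": "C", "price": 0.00, "discount": 0.00},
]
steep_discount = Rule("discount above 90%", lambda r: r["discount"] > 0.9)
summary = preview(steep_discount, records)
```

A preview like this lets a user see how many records a candidate rule would flag before deciding whether to keep it.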

With the above features, the Data Quality Engine (DQE) is a comprehensive tool that helps you find anomalies and abnormalities in your data, and that learns over time to help you keep your data clean and actionable.

