the cognistx blog

Benefits and Challenges of Using AI to Digitize Standards

June 12, 2023
By Uxue Zurutuza, Prachi Dhiman

Standards Development Organizations (SDOs) have become increasingly interested in digitizing their published standards in recent years. This post first explains why SDOs want to digitize and how AI/ML supports that process. It then walks through our technical approach and strategy. Finally, it reviews some of the challenges we encountered and how we overcame them.

Benefits of Digitization

There are many benefits to digitizing standards using artificial intelligence (AI) and machine learning (ML), including the following:

  1. Increased accuracy: AI/ML algorithms can accurately recognize and extract information from scanned documents, reducing errors and improving the accuracy of digitized standards.
  2. Improved speed: The process of digitization can be completed more quickly than traditional methods, saving time and resources.
  3. Enhanced searchability: Utilizing AI/ML can improve the searchability of digital standards by automatically tagging and categorizing content, making it easier to find specific information quickly.
  4. Greater efficiency: Using AI/ML can reduce the need for manual labor, saving time and resources and improving overall efficiency.

Digitizing Strategies and Technical Approach

At Cognistx, we have developed various AI solutions that automatically extract useful information from large unstructured data sources. One of our core industries and areas of expertise is technical standards (aerospace, automotive, medical, construction, etc.). We create digital standards using natural language processing (NLP) techniques, which help a computer understand and interpret human language.

Standards are traditionally distributed and consumed as PDFs. However, this format has many limitations: it is hard to find information across thousands of standards, to manage changes, and to track relationships between standards and the other entities referenced inside them.

Cognistx has created a modular approach for transitioning from PDFs to digital versions of the standards, supported by a flexible data model. The data model is composed of multiple types of granular information elements extracted from the standards, such as sections, requirements, and numerical properties.
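To make the idea of granular information elements concrete, here is a minimal sketch of what such a data model could look like. The class names, field names, and the example standard are hypothetical and chosen only for illustration; the post names the element types (sections, requirements, numerical properties) but does not describe the actual schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class NumericalProperty:
    # A measurable value mentioned in a requirement (hypothetical shape).
    name: str
    value: float
    unit: Optional[str] = None


@dataclass
class Requirement:
    # One requirement sentence plus any values extracted from it.
    text: str
    properties: List[NumericalProperty] = field(default_factory=list)


@dataclass
class Section:
    # A numbered section; subsections allow arbitrary nesting.
    number: str
    title: str
    requirements: List[Requirement] = field(default_factory=list)
    subsections: List["Section"] = field(default_factory=list)


@dataclass
class Standard:
    designation: str
    title: str
    sections: List[Section] = field(default_factory=list)


# A made-up standard used purely as sample data.
example = Standard(
    designation="EX-0001",
    title="Example Fastener Specification",
    sections=[
        Section(
            number="3.1",
            title="Torque",
            requirements=[
                Requirement(
                    text="The bolt shall be tightened to 25 N·m.",
                    properties=[NumericalProperty("torque", 25.0, "N·m")],
                )
            ],
        )
    ],
)

# asdict() turns the nested dataclasses into plain dicts and lists.
as_document = asdict(example)
```

Keeping each element type as its own small record is one way to let new element types be added later without reshaping the existing ones.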

To start with, the documents have to be in a machine-readable format. Traditionally, we have worked with XML inputs that already contain metadata and tags that help speed up the extraction process. Advanced optical character recognition (OCR) tools can also be applied directly to PDFs if XML files are not available.
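As an illustration of why tagged XML input speeds things up, the sketch below reads a small XML document and splits it into sections using Python's standard library. The tag names (`standard`, `section`, `title`, `p`), the `number` attribute, and the sample content are assumptions made for this example; real SDO XML schemas differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample input; the tag and attribute names are assumptions.
SAMPLE_XML = """
<standard id="EX-0001">
  <section number="1">
    <title>Scope</title>
    <p>This document covers example fasteners.</p>
  </section>
  <section number="3.1">
    <title>Torque</title>
    <p>The bolt shall be tightened to 25 N·m.</p>
    <p>Lubricant may be applied before assembly.</p>
  </section>
</standard>
"""


def split_sections(xml_text):
    """Return one dict per <section> with its number, title, and paragraphs."""
    root = ET.fromstring(xml_text.strip())
    sections = []
    for node in root.iter("section"):
        sections.append(
            {
                "number": node.get("number"),
                "title": node.findtext("title", default="").strip(),
                "paragraphs": [p.text.strip() for p in node.findall("p") if p.text],
            }
        )
    return sections


sections = split_sections(SAMPLE_XML)
for section in sections:
    print(section["number"], section["title"], len(section["paragraphs"]))
```

Because the section boundaries and titles are already tagged, no layout analysis is needed here; with a PDF, an OCR or layout step would have to recover that structure first.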

Once the data is in a format ingestible by our extraction pipeline, the pipeline starts by identifying each of the granular elements of information that has to be extracted from each document.

Once the information has been split into the right granular elements, various machine learning (ML) and natural language processing (NLP) techniques are applied to process these elements and to extract and standardize the required information. These include, but are not limited to, paragraph, sentence, and property classification models, topic understanding, table identification, and pattern recognition.
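The sketch below shows, in a deliberately simplified form, what sentence classification and pattern recognition can mean for a standard. It is a rule-based stand-in, not the trained models described above: sentences are labeled by modal verb, and numeric values are pulled out with a regular expression. The label names, the unit list, and the sample sentences are assumptions for illustration.

```python
import re

# Modal verbs commonly signal the normative weight of a sentence.
# Order matters: the first matching modal decides the label.
MODAL_LABELS = [
    ("shall", "requirement"),
    ("should", "recommendation"),
    ("may", "permission"),
]

# A small, hypothetical unit list; a real pipeline would need far more.
VALUE_PATTERN = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>N·m|mm|kg|°C|%)")


def classify_sentence(sentence):
    """Label a sentence by the first modal verb it contains."""
    lowered = sentence.lower()
    for modal, label in MODAL_LABELS:
        if re.search(rf"\b{modal}\b", lowered):
            return label
    return "informative"


def extract_values(sentence):
    """Return (value, unit) pairs for every number followed by a known unit."""
    return [
        (float(match.group("value")), match.group("unit"))
        for match in VALUE_PATTERN.finditer(sentence)
    ]


sentence = "The bolt shall be tightened to 25 N·m."
label = classify_sentence(sentence)
values = extract_values(sentence)
print(label, values)
```

Rules like these break down quickly on the varied writing styles found in real standards, which is one reason learned classifiers are used instead.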

To obtain training and evaluation data, we run several rounds of annotation and include a feedback loop that incorporates this information into the models, which learn continuously. This allows us to experiment with different approaches and select the best-performing combination of ML models.
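Selecting the best-performing model from annotated evaluation data can be sketched as follows. The example scores two candidate models against a small set of human labels using the F1 score and keeps the better one. The metric choice, the model names, and all labels are assumptions for illustration; the post does not state which metrics were used.

```python
def f1_score(gold, predicted, positive="requirement"):
    """F1 for one class, computed from paired gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Human annotations for five sentences (made-up evaluation set).
gold = ["requirement", "informative", "requirement", "informative", "requirement"]

# Predictions from two hypothetical candidate models.
candidates = {
    "model_a": ["requirement", "informative", "informative", "informative", "requirement"],
    "model_b": ["requirement", "requirement", "requirement", "informative", "requirement"],
}

scores = {name: f1_score(gold, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Each new annotation round extends `gold`, so the comparison can be rerun as feedback accumulates.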

Once the extraction is completed, all the data is stored in a NoSQL database. This provides flexibility when storing the semi-structured extracted data.
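The flexibility comes from the fact that document-style records do not have to share one fixed set of fields. The sketch below uses a plain Python list as an in-memory stand-in for a document collection; it is not a real database, and the post does not say which NoSQL product is used. The field names and records are made up for illustration.

```python
import json

# In-memory stand-in for a document collection (not a real database).
collection = []

# A requirement record carries extracted properties...
collection.append(
    {
        "type": "requirement",
        "standard": "EX-0001",
        "section": "3.1",
        "text": "The bolt shall be tightened to 25 N·m.",
        "properties": [{"name": "torque", "value": 25.0, "unit": "N·m"}],
    }
)

# ...while a table record has a completely different shape.
collection.append(
    {
        "type": "table",
        "standard": "EX-0001",
        "section": "4.2",
        "columns": ["Size", "Max load (kg)"],
        "rows": [["M6", 120], ["M8", 210]],
    }
)


def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


requirements = find(collection, type="requirement")

# Both shapes serialize to JSON without any schema change.
serialized = json.dumps(collection, ensure_ascii=False)
```

In a relational schema, the two records above would typically need separate tables or many nullable columns; here they simply sit side by side.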

Technical Challenges

One of the main challenges throughout this process has been human inconsistency in how standards are written.

Standards have evolved over the years, and different committees and content authors are involved in the process. As a result, the extraction models need to handle many different styles of writing and structuring content. This variability requires additional effort to adapt the pipeline, models, and output to edge cases in order to ensure extraction quality.

To encourage greater consistency and standardization, SAE and Cognistx have introduced the concept of “Digital Ready Standards” (white paper, webinar, and podcast), which provides guidelines and tips on how to avoid introducing challenging structures and content into standards.

Additionally, machines try to mimic human behavior by learning from inputs and labels that humans provide. When things are straightforward and consistent, the learning process is faster and more efficient, leading to better metrics. The difficulty with textual information such as standards is that it is open to interpretation and depends on human understanding.

What we saw during this process is that not all experts agreed on the meaning of certain sentences or on how information should be structured and interpreted. This ambiguity makes it harder for machines to learn and extract content. We overcome this by providing more examples and by incorporating feedback tools and output flexibility to adapt to and fix potential machine errors.
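One common way to put a number on expert disagreement is Cohen's kappa, which compares the agreement actually observed between two annotators with the agreement expected by chance. The post does not say how disagreement was measured, so the sketch below is only an illustration, and the labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Two hypothetical experts labeling the same four sentences.
expert_1 = ["requirement", "requirement", "informative", "informative"]
expert_2 = ["requirement", "informative", "informative", "informative"]

kappa = cohens_kappa(expert_1, expert_2)
print(kappa)
```

A kappa near 1 indicates strong agreement, while values near 0 mean the annotators agree about as often as chance would predict. Low values suggest the labeling guidelines are ambiguous, which limits what a model trained on those labels can learn.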

Conclusion

The digitization of standards by Standards Development Organizations using AI/ML offers many benefits over traditional digitization methods, including increased accuracy, improved speed, enhanced searchability, and greater efficiency. Digitization is accomplished by combining ML and NLP techniques and models to extract, standardize, and restructure textual PDF content into flexible data models. However, there are challenges, including inconsistency and ambiguity in human writing, which make standards difficult for a machine to digitize. See the white paper, webinar, and podcast for more details on how to overcome those challenges.
