Original post | LinkedIn profile
In the previous article, I introduced the Machine Reading Comprehension model using SQuAD Dataset, Language Models, and Transfer Learning. In this article, I will discuss how to scale the Machine Reading Comprehension model for long documents.
The biggest limitation to the Machine Reading Comprehension model is the input size of the model. Models cannot take the entire document as input because the input size is limited to certain characters. For example, the input size of the Machine Reading Comprehension model using the Google BERT Language Model is limited to 512 characters. Hence the Machine Reading Comprehension model cannot be directly applied to long documents with more than 512 characters.
In this article, I will introduce the reader and retriever architecture which helps to overcome the limitation mentioned above for the Machine Reading Comprehension model. This architecture was first introduced in the Reading Wikipedia to Answer Open-Domain Questions paper. The documents are broken down into chunks of text. A chunk of text doesn’t have to be a paragraph or a sentence, it is a subset of a document. In the rest of the article, I will address these chunks of text as passages. The retriever’s job is to predict top passages that are relevant to the given question. The reader’s job is to run the Machine Reading Comprehension model on the relevant passages predicted by the retriever. In the previous article, we saw how the Reader (BERT) works. In this article, we will discuss different Retriever algorithms.
Search Engines often use Retriever Algorithms to rank web pages based on their relevance to a user search query. In our case, we use retriever algorithms to rank passages from all documents for a given question. Retriever Algorithms take a question and a passage as input and compute relevance score.
Retriever algorithms can be divided into two categories:
Classical or Sparse Retrieval Algorithms:
Classical Retrieval Algorithms go with an assumption that the higher the frequency of a word in a piece of text, the higher the likelihood of the piece of text to be about that word. Hence, if a piece of text contains words from the question with high frequency, the algorithm assumes that it could be relevant to the question. These algorithms are simple to understand and fast to compute. However, they don’t capture word order, word meaning, and grammatical structure while predicting relevance scores.
TF-IDF is a classic and famous informational retrieval algorithm. TF-IDF score is calculated by multiplying TF (Term Frequency) and IDF (Inverse Document Frequency).
Term Frequency is the number of times a word from the question occurs in a passage. Inverse Document Frequency penalizes common or stops words like “the, and, in” while increasing the weight for actual keywords. In our case, the passage from a document containing words from the question with a high TF-IDF score is the relevant passage for the given question.
TF = (Number of repetitions of a word from a question in a passage) / (Number of words in a passage)
IDF =log((Total number of passages) / (Number of passages containing the word)])
Notice that if the word is very common and has appeared in all passages, then the total number of passages will be equal to the number of passages containing the word. Hence IDF = log 1 = 0 making TF-IDF for that word to be zero. (TF-IDF score is often 0 for highly common words like and, the, a, etc.)
BM25 is an optimized version of TF-IDF. Let us consider a passage from a document that has the word pizza 10 times and a passage from another document that has the word pizza 20 times. If pizza is a keyword, then TFIDF gives twice the relevance score to the second passage compared to the first passage. In most applications, it is unfair to give twice or a significant relevance score just because the relevant term has twice the frequency. Moreover, longer passages have an advantage over shorter passages as longer passages will have more words than shorter passages, and the probability of repeating the relevant keyword is high.
BM25 modifies the TF formula such that it saturates TF after a certain number of occurrences of the word in a passage. It also modifies the IDF formula such that passages with different word lengths are normalized, and shorter passages have more preference over longer passages when they have the same frequency of words from the question.
In the above plot, we can see that the BM25 relevance curve increases a lot quicker than the TF-IDF curve but it later gets saturated. For example, the BM25’s relevance score for a word with frequency 20 is slightly higher than frequency 10 but TF-IDF’s relevance score for frequency 20 is twice higher than frequency 10.
More about BM25 and the formula
Neural or Dense Retrieval Algorithms:
Neural or Dense Retrieval Algorithms compute the relevance score using dense representations. The main advantages of Dense Retrievers over Sparse Retrievers are capturing semantic similarity (Leader, Captain, and Head, etc. are similar words) and differentiating ambitious words (river bank and money bank). However, Dense Retrievers are slower when compared to Spare Retrievers like TF-IDF and BM25.
SBERT generates relevance scores by performing the dot product or cosine similarity between the question vector and the vectors of all passages (smaller the angle between the vectors -> higher the dot product -> higher the relevance score). SBERT uses the BERT encoder to generate the question vector for the given question and passage vectors for all passages.