In the first article (part1 of the series), I introduced the Machine Reading Comprehension task. Machine Reading Comprehension is the task of building a system that understands the passage to answer questions related to it.
In the second article (part2 of the series), I introduced the Retriever and Reader architecture to scale the Machine Reading Comprehension task on larger documents. In this article, I will introduce the evaluation metrics for these models.
The performance of the Question Answering Model highly depends on the performance of both the Retriever and Reader models. Retriever and Reader are loosely coupled models and can be independently evaluated. Evaluation helps identify which model is not performing well so that it can be replaced or fine-tuned. To evaluate models, we need gold-standard labels, which humans manually annotate. Evaluating Reader models is trickier than evaluating Retriever models. In this article, we discuss answer quality evaluation for the Reader model.
The outputs from a reader are the relevant answers (start index and end index) from passages across multiple documents for a user query. In the below example, the outputs for the reader are 0 (start index) and 10 (end index) for provided context/passage.
We must compare the gold standard answer/label (Tony Stark) with the predicted answer to evaluate the reader. I will introduce some of the widely used metrics below. Each metric score is for a single prediction. The average of individual prediction scores can be taken as the final score.
The Reader evaluation metrics can be divided into two categories:
Lexical or Keyword-based Evaluation Metrics rely entirely on string comparisons. The exact tokens or words in the predicted answer and the annotated labels are compared to calculate similarity. These metrics are simple to calculate and easy to understand. However, exact keyword matching is ineffective when a question has multiple correct answers and annotators have not labeled every possible correct answer for that question.
1.1 Exact Match (EM):
Exact Match is a strict evaluation metric that only gives two scores (0 or 1). EM score is 1 if the answer provided by the annotator is precisely the same as the predicted answer; else, it gives 0 (EM score for “the US” and “the United States” is 0). EM might be a good metric for very short or concise factual answers, but it isn't suitable for long answers.
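As a minimal sketch, EM could be computed as below. The helper names (`normalize`, `exact_match`) and the light normalization (lowercasing and collapsing whitespace) are assumptions made for illustration; they are not prescribed by any particular library.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, label: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(label))


print(exact_match("the US", "the United States"))  # 0
print(exact_match("Tony Stark", "tony  stark"))    # 1
```

Whether to normalize at all is a design choice: without it, even a difference in casing would turn a correct answer into a score of 0.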
F1-Score is a looser metric than Exact Match; it measures the word overlap between the answer provided by the annotator and the predicted answer. Before calculating the score, we may need to perform some preprocessing on the answers, such as converting words to lowercase, stemming, or lemmatization.
precision = TP/(TP+FP)
recall = TP/(TP+FN)
F1 = 2 * precision * recall / (precision + recall)
True Positive (TP) is the number of words that overlap between the annotated label and the predicted answer.
False Positive (FP) is the number of words present in the predicted answer but missing in the annotated label.
False Negative (FN) is the number of words present in the annotated label but missing in the predicted answer.
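The definitions above can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: the function name `token_f1`, the whitespace tokenization, and the lowercasing step are assumptions, and F1 is taken as the harmonic mean of precision and recall.

```python
from collections import Counter


def token_f1(prediction: str, label: str) -> float:
    """Word-level F1 between a predicted answer and a gold label."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()

    # TP: words shared by both answers (counted with multiplicity).
    tp = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if tp == 0:
        # Precision and recall are both 0, so F1 is undefined; report 0.
        return 0.0

    fp = len(pred_tokens) - tp   # in the prediction, missing in the label
    fn = len(label_tokens) - tp  # in the label, missing in the prediction
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# TP = 2 (united, states), FP = 1 (the), FN = 2 (of, america)
print(round(token_f1("the United States", "United States of America"), 2))  # 0.57
```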
Let's consider the below example,
Annotated Answer / Gold Standard Label: Anthony Edward Stark
Predicted Answer by model: Tony Stark
Exact Match Score: 0
F1-Score: 0.4 [TP = 1(Stark), FP = 1(Tony), FN = 2(Anthony, Edward) -> precision = 0.5, recall = 0.33]
In the provided passage, “Anthony Edward Stark” is commonly known as “Tony Stark.” Hence, F1-Score can be a better metric compared to EM.
There are also n-gram-based lexical matching algorithms that are widely used in language translation applications. These algorithms rely on n-gram overlap between answers and can also be used for evaluating Question Answering models. However, these algorithms are out of scope for this article.
While lexical metrics like EM and F1 scores are keyword-based and focus on keyword matches, neural-based metrics focus on the semantics or meaning of answers. These metrics are handy when a question has multiple correct answers and the annotator missed labeling some of them: the keywords might not match, but the answers have the same meaning. These metrics are slower than keyword-based metrics because they usually rely on a deep learning model, but they are closer to human judgment.
BertScore is based on the BERT language model. It computes contextual embeddings for each token in the answer label and in the predicted answer. Cosine similarity is then used to calculate the contextual similarity between each token in the label and each token in the prediction. Each token is matched to its most similar token in the other answer, and these highest cosine similarities make up the BertScore. Vanilla BertScore uses the “bert-base-uncased” pre-trained model to calculate embeddings. Alternatively, the underlying model can be trained on a relevant dataset before calculating embeddings. For Vanilla BertScore, embeddings are extracted from initial layers, and for trained BertScore, embeddings are extracted from the final layers of the BERT model.
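One common way to turn these token-level similarities into a single score is greedy matching: each token is paired with its most similar token in the other answer, and the maxima are averaged into a precision, a recall, and an F1. The sketch below shows only that matching step. The 2-d vectors are made up for illustration and the function names are hypothetical; a real run would take the embeddings from a BERT model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(label_vecs, pred_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Recall: each label token is matched to its most similar predicted token.
    recall = sum(
        max(cosine(l, p) for p in pred_vecs) for l in label_vecs
    ) / len(label_vecs)
    # Precision: each predicted token is matched to its most similar label token.
    precision = sum(
        max(cosine(p, l) for l in label_vecs) for p in pred_vecs
    ) / len(pred_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy "embeddings" (assumed values): two label tokens, one predicted token.
label_vecs = [[1.0, 0.0], [0.0, 1.0]]
pred_vecs = [[1.0, 0.0]]
precision, recall, f1 = greedy_match_score(label_vecs, pred_vecs)
print(precision, recall, round(f1, 3))  # 1.0 0.5 0.667
```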
2.2 Bi-Encoder Score
Bi-Encoder Score is based on the sentence transformers architecture. It uses two language models to separately calculate embeddings for predicted answers and answer labels. Cosine similarity is then used to calculate the score between the contextual embeddings. Before calculating embeddings, the two language models are trained on the multi-lingual paraphrase dataset and the STS benchmark dataset. One of the advantages of the bi-encoder architecture is that the embeddings of the two text inputs (predicted answers and answer labels) are calculated separately. As a result, the embeddings of the answer labels can be pre-computed and reused when comparing the predictions of several different reader models. According to the original paper, this pre-computation can almost halve the time needed to run the evaluation.
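# Run the code below to calculate the Bi-Encoder Score
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer, util
# Load the model as a bi-encoder: each answer is embedded on its own
model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
embeddings = model.encode(["atleast 1000", "four thousand"], convert_to_tensor=True)
# Cosine similarity between the two answer embeddings
scores = util.cos_sim(embeddings[0], embeddings[1])
print(scores)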
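The pre-computation advantage described above can be illustrated without downloading a model. In the sketch below, `toy_embed` is a hypothetical stand-in for a sentence encoder (it just counts characters), and the reader names and predictions are made up; only the pattern matters: the labels are encoded once and reused for every reader.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: character counts."""
    return Counter(text.lower().replace(" ", ""))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


labels = ["Tony Stark", "four thousand"]
# Encode the labels once ...
label_embeddings = [toy_embed(label) for label in labels]

# ... and reuse them for every reader model being compared.
reader_predictions = {
    "reader_a": ["Tony Stark", "4000"],
    "reader_b": ["Stark", "four thousand"],
}
results = {}
for name, predictions in reader_predictions.items():
    results[name] = [
        cosine(toy_embed(p), e) for p, e in zip(predictions, label_embeddings)
    ]
    print(name, [round(s, 3) for s in results[name]])
```

With a real bi-encoder, `toy_embed` would be replaced by the model's encoding call, and only the predictions would need to be encoded for each new reader.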
2.3 Semantic Similarity
Semantic Similarity or Semantic Answer Similarity (SAS) uses the “cross-encoder/stsb-roberta-large” language model, which has been trained on the STS benchmark dataset. Unlike the Bi-Encoder, where two separate models are used, SAS uses a cross-encoder architecture: the predicted answer and the label are joined with a special separator token and passed through a single model together to calculate the score. Among the neural-based metrics discussed here, cross-encoder metrics have the strongest correlation with human judgment.
#run the below code to calculate SAS
!pip install sentence_transformers
from sentence_transformers.cross_encoder import CrossEncoder
model = CrossEncoder('cross-encoder/stsb-roberta-large')
scores = model.predict([["Thirty Bucks", "30 $"]])
print(scores)  # Answer: array([0.29968598], dtype=float32)
Let’s consider the below examples,
Annotated Answer / Gold Standard Label: I love you 3000
Predicted Answer by model: likes and adores very much
Exact Match Score: 0
F1-Score: undefined or generally considered as 0 [TP = 0, FP = 5, FN = 4-> precision = 0, recall = 0]
Bi-Encoder Score: 0.151
Semantic Answer Similarity: 0.482
Annotated Answer / Gold Standard Label: Thirty Bucks
Predicted Answer by model: 30 $
Exact Match Score: 0
F1-Score: undefined or generally considered as 0 [TP = 0, FP = 2, FN = 2-> precision = 0, recall = 0]
Bi-Encoder Score: 0.869
Semantic Answer Similarity: 0.493
In the United States, “Thirty Bucks” and “30$” mean the same thing. Hence, neural-based evaluation metrics like the SAS score make more sense than EM and F1 scores for this example. The Bi-Encoder Score and the SAS score are more relevant to the “Reader and Retriever architecture” because the models in that architecture go beyond lexical/keyword-based matching.
Top-N Accuracy is more of an extension to EM, F1-Score, or SAS score. In general, annotators provide multiple gold standard annotations/labels for a question, and the model predicts several answers. Top-N accuracy takes the first N predictions from the model (sorted in descending order of answer confidence) and checks whether any of them matches any of the gold standard annotations/labels according to the chosen metric (for example, lexical overlap).
Let’s consider the below example (answers are ranked based on descending order of answer confidence)
Annotated Answers / Gold Standard Labels: Anthony Edward Stark, Tony Stark, Stark
Predicted Answer by model: Elon Musk, Tony, Stark
Top-1 Accuracy with F1-Score as Metric: 0 -> Top answer predicted by the model is “Elon Musk.” However, the F1-Score for the “Elon Musk” answer is 0 for all the labels provided by annotators.
Top-2 Accuracy with F1-Score as Metric: 1 (for F1-score threshold = 0.5) -> Top two answers predicted by the model are “Elon Musk” and “Tony.” We know the F1-Score for the “Elon Musk” answer is 0 for all the labels provided by annotators. However, the F1-Score for “Tony” with the label “Tony Stark” is 0.67 (precision = 1, recall = 0.5); with the rest of the labels [“Anthony Edward Stark,” “Stark”] it is 0, as there is no word match. If we consider 0.5 as the threshold, we can consider this example accurate and give Top-2 accuracy as 1.
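The example above can be reproduced with a short sketch. The function names are illustrative, the word-level F1 uses simple whitespace tokenization and lowercasing (an assumption), and a prediction counts as a hit when its best F1 against any label reaches the threshold.

```python
from collections import Counter


def token_f1(prediction, label):
    """Word-level F1 between a predicted answer and a gold label."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()
    tp = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


def top_n_accuracy(predictions, labels, n, threshold=0.5):
    """1 if any of the first n predictions reaches the threshold on any label."""
    best = max(
        (token_f1(p, l) for p in predictions[:n] for l in labels),
        default=0.0,
    )
    return int(best >= threshold)


labels = ["Anthony Edward Stark", "Tony Stark", "Stark"]
predictions = ["Elon Musk", "Tony", "Stark"]  # sorted by descending confidence
print(top_n_accuracy(predictions, labels, n=1))  # 0
print(top_n_accuracy(predictions, labels, n=2))  # 1
```

Swapping `token_f1` for EM or a SAS scorer would give the corresponding Top-N variant with the same structure.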
Kudos, you completed Evaluating Answer Quality for Reader Models for Machine Reading Comprehension Task.
The metrics introduced above only focus on whether the answers predicted by the model are correct or incorrect when compared with the labels provided by annotators. However, they don’t capture the ranking quality of predicted answers. In future articles, I will discuss ranking metrics for the Reader and metrics to evaluate the Retriever.
Stay tuned for more articles in the Open Domain Question Answering Series!