BERT Transformer Text Similarity in Python


Image Created by Author

Introduction

Sentence similarity involves determining how closely related two sentences are in meaning. That relatedness is quantified as a similarity score between two or more sentences.

We often want to measure text similarity in various NLP tasks. Some of those tasks include information retrieval, text summarization, and question answering. One way of achieving that is by using a BERT Transformer model.

So, what is a BERT transformer model? BERT is a pre-trained language model developed by Google. It is pre-trained on large text corpora and can then be fine-tuned for a specific NLP task. It generates high-quality representations of words and sentences, allowing it to understand the context and meaning of text at a deep level. This is achieved using a technique called attention, which allows BERT to focus on the most important parts of a sentence while ignoring irrelevant information.

A simplified explanation of the sentence-transformer BERT architecture is that it processes two sentences in parallel through the same network. Because the parameters are shared, we can view it as a single model applied multiple times. A BERT model serves as the base of the architecture and is followed by a pooling layer, which turns the variable-length token outputs into a fixed-size representation for input sentences using an aggregation such as MEAN pooling.

Image Created by Author

This blog aims to introduce readers to the concept of sentence similarity and how it can be achieved using a BERT Transformer model called sentence-transformers/all-MiniLM-L6-v2. You can read more about it here.

The blog will provide an overview of the BERT Transformer and its capabilities by measuring text similarity between mass shooting case summaries in the United States with Python code.

Files

In this example, we will use United States mass shooting data from Mother Jones. You can find a link to my files on my GitHub repository: GitHub SolisAnalytics. The repository contains the complete code.

The Process

We will start the text similarity process between mass shooting cases by importing the necessary packages.
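Below is a minimal sketch of the setup, assuming the Hugging Face transformers library along with torch, pandas, and scikit-learn. The model name is the one mentioned above; the rest mirrors a typical setup.

```python
from itertools import combinations

import pandas as pd
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained MiniLM sentence-transformer checkpoint.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```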

The data has more than twenty columns. For our purposes, we only care about two: “case” and “summary.” The case column tells us the name of the case, while the summary column gives a description of it. We will compare the text summaries of all cases with each other.
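A sketch of that step; the CSV file name below is a placeholder, and the actual file in the repository may be named differently.

```python
# Load the Mother Jones export (file name is hypothetical) and keep only
# the two columns we need.
df = pd.read_csv("mother_jones_mass_shootings.csv")
df = df[["case", "summary"]]
```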

Using the pre-trained model, we need to tokenize the text in the summary column. We can accomplish that by storing the case summary text as a list, looping through it, running the pre-trained tokenizer, storing the input IDs and attention masks in lists within a dictionary, and then reformatting the output into single tensors. I have set the max length parameter to 100 and truncation to true, which can change depending on the text context.
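A sketch of that tokenization loop with the parameters mentioned above; padding to the max length is an added assumption so the per-sentence tensors can be stacked into one.

```python
summaries = df["summary"].tolist()

tokens = {"input_ids": [], "attention_mask": []}
for summary in summaries:
    encoded = tokenizer(
        summary,
        max_length=100,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    tokens["input_ids"].append(encoded["input_ids"][0])
    tokens["attention_mask"].append(encoded["attention_mask"][0])

# Reformat the lists of per-sentence tensors into single (n_cases, 100) tensors.
tokens["input_ids"] = torch.stack(tokens["input_ids"])
tokens["attention_mask"] = torch.stack(tokens["attention_mask"])
```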

We then run the tokens through the pre-trained model. The model output has an attribute called “last hidden state” that we can use to access the dense token vectors, allowing us to perform a mean pooling operation and create a sentence embedding.
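Roughly, that looks like this:

```python
# Run the token tensors through the model; no gradients are needed for inference.
with torch.no_grad():
    outputs = model(**tokens)

# Dense token-level vectors, shape (n_cases, 100, hidden_size).
embeddings = outputs.last_hidden_state
```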

We multiply each value in our text embedding tensors by its attention mask value, which zeroes out the non-real (padding) tokens. We then sum the text embeddings along the token axis, count the number of values that should be given attention in each position of the tensor, and calculate the mean as the sum of the text embedding activations divided by that count.
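Here is one way that mean pooling operation could look in PyTorch; the small clamp value is a common safeguard against dividing by zero and is my own addition.

```python
# Expand the attention mask so it lines up with the embedding dimensions.
mask = tokens["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()

# Zero out padding tokens, sum along the token axis, and divide by the
# number of real tokens contributing to each position.
masked_embeddings = embeddings * mask
summed = masked_embeddings.sum(dim=1)
counts = torch.clamp(mask.sum(dim=1), min=1e-9)
mean_pooled = summed / counts
```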

Let’s convert the mean pooled output to a NumPy array, and store the case names as a list to reference later.
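In code, that is simply:

```python
# Convert the sentence embeddings to a NumPy array and keep the case names.
mean_pooled = mean_pooled.detach().numpy()
cases = df["case"].tolist()
```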

We need to calculate cosine similarity between all possible combinations of case summaries. One way to achieve that is through the combinations function in itertools. Let’s create a function that takes the case names and mean pooled arrays and returns lists of all case and mean pooled combinations. They will be stored as variables which we will reference later.
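A sketch of such a function; the helper name make_combinations is my own choice.

```python
def make_combinations(cases, mean_pooled):
    """Pair every case (and its embedding) with every other case exactly once."""
    case_pairs = list(combinations(cases, 2))
    embedding_pairs = list(combinations(mean_pooled, 2))
    return case_pairs, embedding_pairs

case_pairs, embedding_pairs = make_combinations(cases, mean_pooled)
```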

The end goal is to compute the similarity score between all case summaries and then extract the top results for comparison. We can create another function that builds lists of the case pairs and their similarity scores, then assembles them into a Pandas DataFrame, which we can rank later based on the highest cosine similarity score.
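One way to write that function, using scikit-learn's cosine_similarity (the original code may compute the score differently); the function and column names are my own choices.

```python
def score_pairs(case_pairs, embedding_pairs):
    """Build a DataFrame of case pairings and their cosine similarity scores."""
    first_cases, second_cases, scores = [], [], []
    for (case_a, case_b), (emb_a, emb_b) in zip(case_pairs, embedding_pairs):
        score = cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1))[0][0]
        first_cases.append(case_a)
        second_cases.append(case_b)
        scores.append(score)
    return pd.DataFrame(
        {"case_1": first_cases, "case_2": second_cases, "cosine_similarity": scores}
    )

similarity_df = score_pairs(case_pairs, embedding_pairs)
```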

It is simple to retrieve the cases with the highest cosine similarity between pairings. All we have to do is use the “nlargest” method on the DataFrame that contains our text similarity scores. I will set n=10.
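Assuming the DataFrame and column name from the sketch above:

```python
# The ten case pairings with the highest cosine similarity scores.
top_matches = similarity_df.nlargest(10, "cosine_similarity")
print(top_matches)
```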

The results are listed below.

The Florida manufacturing shooting and the Accent Signage Systems shooting have the highest cosine similarity score, at about 0.72. Of course, this comes from a pre-trained mini BERT model; other models can produce different results. Let’s examine the summary text of those two cases to see if they have components in common.

The summary descriptions show that both cases involved a suspect who was fired from their job before committing the murders. The pre-trained model did a great job of capturing the meaning of both cases.

Conclusion

This post discussed implementing a pre-trained BERT transformer model on a US mass shooting dataset. We first explained the purpose and high-level logic behind a BERT transformer model. We then reviewed the main parts of the code, from creating our text embeddings to calculating the cosine similarity scores for each case pairing in the dataset. Lastly, we looked at the cases that were most similar based on the cosine similarity scores and found that the most similar cases involved individuals who had recently been let go from work. I hope you enjoyed this post! I will go over other NLP examples in the future.
