Visual question answering with multimodal transformers

PyTorch implementation of VQA models using text and image transformers from Hugging Face

Recent years have seen significant advancements not only in the respective domains of Natural Language Processing (NLP) and Computer Vision (CV) but also in tasks involving multiple modalities (text + image features) such as image captioning, visual question answering (VQA), cross-modal retrieval, visual common-sense reasoning, and more. Among these, VQA has particularly drawn the interest of several researchers.

VQA is a multimodal task wherein, given an image and a natural language question about the image, the objective is to produce a correct natural language answer as output.

Multimodal models can take various forms to capture information from the text and image modalities, along with some cross-modal interaction. In fusion models, the information from the text and image encoders is fused into a combined representation to perform the downstream task.

A typical fusion model for a VQA system involves the following steps: extracting features from the question text, extracting features from the image, fusing the two feature sets into a joint representation, and predicting the answer from this fused representation.

Following are some methods used to perform the individual feature extraction and feature fusion steps:

Types of multimodal data fusion. Image created by the author.

In this article, I explore the idea of late fusion by fine-tuning pretrained text and image transformer models, as they are simpler to train.

We need to create a virtual environment and install the required packages:
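A minimal setup might look like the following; the exact package list is an assumption on my part and may differ from the original environment:

```bash
# Create and activate a virtual environment (Python 3.8+ assumed)
python -m venv vqa-env
source vqa-env/bin/activate

# Core libraries used throughout this article
pip install torch torchvision transformers datasets pandas scikit-learn nltk pillow
```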

To set up the environment for training our multimodal VQA model, we need to import the required modules and set the appropriate device for PyTorch.
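A sketch of the imports and device setup this involves is shown below (the exact import list is an assumption, collected from the steps covered later in the article):

```python
import os
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoFeatureExtractor,
    TrainingArguments,
    Trainer,
)

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```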

The raw dataset contains the actual images separately in the images/ directory. All the question-answer pairs are present on consecutive lines in a .txt file, as shown below:
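The raw file interleaves questions and answers on consecutive lines, with the image ID embedded in each question. The example below is made up purely to illustrate the format and is not taken from the dataset:

```
what is on the table in the image1 ?
lamp
how many chairs are there in the image2 ?
3
```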

We run the following script to pre-process these question-answer pairs. It normalizes the questions by removing the image IDs embedded in them. The questions and answers, along with the image IDs extracted during normalization, are stored in tabular (CSV) format. Moreover, because the original DAQUAR dataset provides only about 54 percent of the question-answer pairs for training (roughly 6,700 samples, which is too little for training), we produce a custom split (80 percent training and 20 percent evaluation) from the overall data.
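Here is a sketch of such a pre-processing script; the file names, the regular expression, and the split logic are assumptions for illustration rather than the exact original code:

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess_qa_pairs(qa_file: str, out_dir: str = "dataset"):
    """Convert the raw question-answer text file into train/eval CSVs."""
    with open(qa_file) as f:
        lines = [line.strip() for line in f if line.strip()]

    records = []
    # Questions and answers appear on consecutive lines
    for question, answer in zip(lines[0::2], lines[1::2]):
        # The image ID (e.g. "image1423") is embedded in the question
        match = re.search(r"image(\d+)", question)
        image_id = f"image{match.group(1)}" if match else None
        # Normalize the question by removing the image reference
        question = re.sub(r"\s*(in|on|of)?\s*the\s*image\d+\s*\??", "", question).strip()
        records.append({"question": question, "answer": answer, "image_id": image_id})

    df = pd.DataFrame(records)
    # Custom 80/20 train-evaluation split over the full data
    train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)
    train_df.to_csv(f"{out_dir}/data_train.csv", index=False)
    eval_df.to_csv(f"{out_dir}/data_eval.csv", index=False)
    return train_df, eval_df
```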

Now we are set to load this processed dataset. For this, we use the datasets library from Hugging Face. Since we model this task as a multiclass classification task, we assign a label to every answer. These labels are derived from the indices of the answers in the answer space.
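One possible way to do this with the datasets library is sketched below; the CSV column names and the way the answer space is built are assumptions:

```python
from datasets import load_dataset

# Load the processed CSV splits produced by the pre-processing step
dataset = load_dataset(
    "csv",
    data_files={"train": "dataset/data_train.csv", "test": "dataset/data_eval.csv"},
)

# Build the answer space from all answers seen in the data
answer_space = sorted(
    set(dataset["train"]["answer"]) | set(dataset["test"]["answer"])
)

# Assign each answer a label equal to its index in the answer space
dataset = dataset.map(
    lambda example: {"label": answer_space.index(example["answer"])}
)
```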

We can also inspect entries present in our training or evaluation dataset (specific or random) using a Jupyter notebook:
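For example, one might peek at a random training sample like this (a trivial sketch):

```python
import numpy as np

# Look at a random entry from the training split
idx = np.random.randint(len(dataset["train"]))
dataset["train"][idx]
```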

A random entry from the training dataset after loading and creating labels from the answer-space.

Up to this point, we have just loaded the questions, answers, and corresponding image IDs, along with the labels. To feed the question text and the actual images into our multimodal model in batches, we need to define a data collator.

This collator will process the question (text) and the image and return the tokenized text (with attention masks) along with the featurized image (basically, the pixel values). These will be fed into our multimodal transformer model for question answering.

We use AutoTokenizer and AutoFeatureExtractor from Hugging Face transformers to convert the raw images and questions into inputs for featurization using the respective image and text transformers.
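Below is a rough sketch of what such a collator could look like; the class name, the image file naming pattern, and the location of the images/ directory are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Dict, List

import torch
from PIL import Image
from transformers import AutoFeatureExtractor, AutoTokenizer

@dataclass
class MultimodalCollator:
    tokenizer: AutoTokenizer
    feature_extractor: AutoFeatureExtractor
    image_dir: str = "dataset/images"

    def __call__(self, batch: List[Dict]) -> Dict[str, torch.Tensor]:
        # Tokenize the questions (returns input_ids, attention_mask, ...)
        text_inputs = self.tokenizer(
            [item["question"] for item in batch],
            padding="longest",
            truncation=True,
            return_tensors="pt",
        )
        # Featurize the images (returns pixel_values)
        images = [
            Image.open(os.path.join(self.image_dir, f"{item['image_id']}.png")).convert("RGB")
            for item in batch
        ]
        image_inputs = self.feature_extractor(images=images, return_tensors="pt")
        return {
            **text_inputs,
            **image_inputs,
            "labels": torch.tensor([item["label"] for item in batch], dtype=torch.int64),
        }
```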

As mentioned previously, we use the idea of late fusion to define our multimodal model, comprising a pretrained text transformer to encode the question, a pretrained image transformer to encode the image, a fusion module that combines the two sets of features into a joint representation, and a classifier that predicts the answer over the answer space.

We model VQA as a multiclass classification task. Thus, cross-entropy loss becomes a natural choice for the loss function to be minimized.
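A minimal late-fusion model along these lines might look like the following sketch, where the fusion is a simple concatenation of the pooled text and image features followed by a small classifier; the hidden sizes and the concatenation-based fusion are assumptions for illustration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalVQAModel(nn.Module):
    def __init__(self, text_model: str, image_model: str,
                 num_labels: int, intermediate_dim: int = 512):
        super().__init__()
        # Pretrained unimodal encoders
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.image_encoder = AutoModel.from_pretrained(image_model)
        # Fusion: concatenate pooled features and project them
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)
        self.fusion = nn.Sequential(
            nn.Linear(fused_dim, intermediate_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(intermediate_dim, num_labels)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, pixel_values, labels=None, **kwargs):
        # Encode the question and take the [CLS] representation
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_feat = text_out.last_hidden_state[:, 0, :]
        # Encode the image and take its [CLS] token representation
        image_out = self.image_encoder(pixel_values=pixel_values)
        image_feat = image_out.last_hidden_state[:, 0, :]
        # Late fusion + classification over the answer space
        logits = self.classifier(self.fusion(torch.cat([text_feat, image_feat], dim=-1)))
        out = {"logits": logits}
        if labels is not None:
            out["loss"] = self.criterion(logits, labels)
        return out
```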

Besides training a particular VQA model with multimodal transformers, we intend to experiment with various pre-trained model combinations and evaluate their performance on the DAQUAR dataset.

Pretrained text transformers used in our experiments to provide textual features.
Pretrained image transformers used in our experiments to provide visual features.

Because we aim to experiment with multiple combinations of text and image transformers, it is reasonable to implement a function that creates the corresponding collator and model for a given pair of pretrained checkpoints.

For demonstration in this article, we will create the collator and model using the tokenizer, feature extractor, and models from pretrained BERT and ViT.
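Building on the collator and model sketched above, such a factory function could look roughly like this; the exact signature of the original createMultimodalVQACollatorAndModel function is an assumption, while the BERT and ViT checkpoint names are standard Hugging Face identifiers:

```python
from transformers import AutoFeatureExtractor, AutoTokenizer

def createMultimodalVQACollatorAndModel(text_model: str = "bert-base-uncased",
                                        image_model: str = "google/vit-base-patch16-224-in21k"):
    # Tokenizer and feature extractor matching the chosen checkpoints
    tokenizer = AutoTokenizer.from_pretrained(text_model)
    feature_extractor = AutoFeatureExtractor.from_pretrained(image_model)

    collator = MultimodalCollator(tokenizer=tokenizer, feature_extractor=feature_extractor)
    model = MultimodalVQAModel(text_model, image_model,
                               num_labels=len(answer_space)).to(device)
    return collator, model

# BERT + ViT combination used for the demonstration in this article
collator, model = createMultimodalVQACollatorAndModel(
    "bert-base-uncased", "google/vit-base-patch16-224-in21k"
)
```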

We approach the VQA task as a multiclass classification problem in this article. Hence, accuracy and macro F1 score are straightforward choices as metrics for evaluating the performance of our model. However, because these metrics may often be too restrictive, penalizing almost correct answers (‘tree’ versus ‘plant’) as heavily as incorrect answers (‘tree’ versus ‘table’), we select a metric like WUPS as our primary evaluation metric. Such a metric considers the semantic similarity between the predicted answer and the ground truth.

One option to evaluate open-ended natural language answers is to perform exact string matching. However, it is too stringent and cannot capture the semantic relatedness between the predicted answer and the ground truth. This prompts the use of other metrics that capture the semantic similarity of strings effectively. One such commonly used metric is the Wu and Palmer Similarity (WUPS) Score.

WUPS computes the semantic similarity between two words or phrases based on the depth of their least common subsumer (most specific common ancestor) in a taxonomy tree such as WordNet. This score works well for single-word answers (hence, we use it for our task), but may not work as well for phrases or sentences.
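As an illustration, the pairwise Wu-Palmer similarity over WordNet can be computed with NLTK as below. This is a simplified sketch that compares only the first WordNet sense of each word; the full WUPS evaluation metric typically maximizes over senses, applies a threshold, and aggregates over all predictions:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wup_score(word_a: str, word_b: str) -> float:
    """Wu-Palmer similarity between the first WordNet senses of two words."""
    synsets_a, synsets_b = wn.synsets(word_a), wn.synsets(word_b)
    if not synsets_a or not synsets_b:
        return 0.0
    return synsets_a[0].wup_similarity(synsets_b[0]) or 0.0

print(wup_score("tree", "plant"))   # semantically close, high score
print(wup_score("tree", "table"))   # semantically distant, lower score
```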

We finally come to the part where we use the previously defined functions to initialize our multimodal model and train it using the Trainer from Hugging Face, which abstracts away most of the code required for setting up a PyTorch training loop. Hyperparameters such as the number of training epochs, batch size, and so on are passed to the Trainer by setting the corresponding values in the TrainingArguments.

For this article, we use the following hyperparameters:

Hyperparameters used for training the multimodal model.

The model checkpoints are saved periodically in the indicated output directory based on the information provided in the TrainingArguments.
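A training setup along these lines might look like the sketch below; the specific hyperparameter values and the output directory are placeholders for illustration, not the exact values used in the experiments:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints/bert_vit",   # checkpoints are saved here periodically
    num_train_epochs=5,                  # placeholder value
    per_device_train_batch_size=32,      # placeholder value
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=100,
    logging_steps=100,
    # Keep the raw "question"/"image_id" columns so the collator can see them
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```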

To use any of the saved model checkpoints for inference, the question must be tokenized and the image features extracted appropriately (as done in the collator). These serve as input to the model, with weights loaded from the trained checkpoint. The label predicted by the model is then mapped back to the actual answer via its index in the answer space.
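Inference with a saved checkpoint could then look roughly like the sketch below, reusing the collator's tokenizer and feature extractor from earlier. The checkpoint path and the state-dict file name are assumptions (older transformers versions save the weights as pytorch_model.bin inside each checkpoint directory):

```python
import os
import torch
from PIL import Image

def answer_question(image_path: str, question: str, checkpoint_dir: str) -> str:
    # Load weights from the trained checkpoint into the model defined earlier
    model.load_state_dict(
        torch.load(os.path.join(checkpoint_dir, "pytorch_model.bin"), map_location=device)
    )
    model.eval()

    # Prepare inputs exactly as the collator does during training
    text_inputs = collator.tokenizer(question, return_tensors="pt").to(device)
    image_inputs = collator.feature_extractor(
        images=Image.open(image_path).convert("RGB"), return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        logits = model(**text_inputs, **image_inputs)["logits"]

    # Map the predicted label back to the actual answer in the answer space
    return answer_space[logits.argmax(dim=-1).item()]
```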

A similar approach is followed to train VQA models with various combinations of text and image transformers by changing the text and image arguments while calling the createMultimodalVQACollatorAndModel(...) function.

In summary, we successfully implemented, trained, and evaluated a late fusion type of multimodal transformer model in PyTorch for visual question answering using the DAQUAR dataset. We also learned how to use the model weights from a trained checkpoint to answer questions related to an image. Finally, we compared the performance of several models that use different text and image transformers to featurize the question and image before performing fusion.

I hope this article has given you a good overview of some of the concepts involved in visual question answering and helped you understand the nuances of training multimodal transformer models in PyTorch for such a task. Feel free to check out the References section below for more details regarding concepts and terms that I might have breezed through in this article. Please leave any feedback or suggestions in the comments section below.
