Multimodal Vision-language Models with Named Entities

Undergraduate Research


Vision-language models can perform tasks such as image captioning and visual question answering (VQA). Trained on large datasets, current deep-learning-based vision-language models achieve high performance on these tasks. However, these models often fail to integrate named entities into generated text. In contrast, humans frequently use named entities. In this work, we propose an approach for zero-shot integration of named entities into captions.

Figure 1: A caption generated with our model compared to the baseline model.


To build a model that can integrate named entities into image captions, we use auxiliary classifiers to identify non-generic terms. Specifically, we use the ArcFace face recognition model and the Google Cloud OCR API. The primary challenge this work addresses is developing a method by which tokens discovered by the auxiliary classifiers can be naturally integrated into text. We extend prior work to include a special token modality, which accepts tokens from the auxiliary classifiers. Input modalities are passed to a transformer architecture and combined in a learned way such that each modality is aware of the others. A latent representation is decoded with a BERT-style module, then text tokens are auto-regressively selected from either 1) the generic model vocabulary or 2) the newly discovered special tokens.
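As a rough illustration of the selection step described above, the sketch below picks, at a single decoding step, the highest-scoring entry across the generic vocabulary and the special tokens. The greedy argmax rule, the function name `select_token`, and the toy tokens and scores are all assumptions made for illustration; they are not taken from our implementation.

```python
def select_token(vocab_scores, special_scores, vocab, special_tokens):
    """Pick the best-scoring token across the vocabulary and the special tokens."""
    # Concatenate both score lists so that one argmax covers both sources.
    all_scores = list(vocab_scores) + list(special_scores)
    best = max(range(len(all_scores)), key=all_scores.__getitem__)
    if best < len(vocab_scores):
        return vocab[best], "vocab"
    # Indices past the vocabulary point into the special-token list.
    return special_tokens[best - len(vocab_scores)], "special"


# Hypothetical toy inputs: three generic words and two tokens
# returned by the auxiliary classifiers (a face name and an OCR string).
vocab = ["a", "man", "speaking"]
special_tokens = ["Jane Doe", "STOP"]

step_1 = select_token([0.2, 1.5, 0.1], [0.3, 0.4], vocab, special_tokens)
step_2 = select_token([0.1, 0.9, 0.2], [2.1, 0.4], vocab, special_tokens)
print(step_1, step_2)
```

In this toy run the first step selects a generic word and the second step selects a special token, which is the switching behavior the decoder needs.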

To score each special token \(n \in 1..N\) at each auto-regressive step \(t \in 1..T\), we use a bilinear layer with a special-token-specific weight matrix \(W^{st}\) and a general decoding weight matrix \(W^{dec}\): \begin{equation} y_{n,t}^{st} = (W^{st}z_n^{st}+b^{st})^T(W^{dec}z_t^{dec}+b^{dec}) \end{equation} which produces the output logits \(y_{n,t}^{st}\). An overview of our architecture can be seen in Figure 2.

Figure 2: The architecture of our proposed approach. Our architecture design is inspired by the M4C method [1].
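The bilinear scoring equation can be sketched in a few lines of plain Python. This is a minimal sketch, not our training code: the helper names `affine` and `special_token_scores`, the 2-dimensional toy vectors, and the identity weights with zero biases are assumptions chosen so the arithmetic is easy to follow.

```python
def affine(W, z, b):
    """Return W z + b for a matrix W (list of rows), a vector z, and a bias b."""
    return [sum(w * x for w, x in zip(row, z)) + b_i for row, b_i in zip(W, b)]


def special_token_scores(z_st, z_dec, W_st, b_st, W_dec, b_dec):
    """Score every special token n against the decoder state at one step t."""
    query = affine(W_dec, z_dec, b_dec)        # W^dec z_t^dec + b^dec
    scores = []
    for z_n in z_st:
        key = affine(W_st, z_n, b_st)          # W^st z_n^st + b^st
        # The logit y_{n,t}^{st} is the dot product of the two projections.
        scores.append(sum(k * q for k, q in zip(key, query)))
    return scores


# Toy example (assumed values): identity projections and zero biases.
W_id = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
z_st = [[1.0, 0.0], [0.0, 2.0]]   # two special-token representations
z_dec = [3.0, 4.0]                # decoder state at the current step

scores = special_token_scores(z_st, z_dec, W_id, zero, W_id, zero)
print(scores)  # [3.0, 8.0]
```

With identity projections each logit reduces to a plain dot product between the token representation and the decoder state, so the second token scores higher here.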

Rich Representations

For the model to learn useful representations of special tokens, each token must be encoded with rich semantic information. To create a rich encoding, we use spatial information (bounding box), a source feature (one-hot encoding), visual information (Faster-RCNN feature), and textual information (PHOC, fasttext). We combine these in the following way: \begin{equation} x_i^{spec}=LN(W_1([x_i^{fr};x_i^{ft};x_i^{p}])) + LN(W_2x_i^b) + LN(W_3x_i^s) \end{equation} where \(W_{1..3}\) are learned weight matrices and \(LN\) is layer normalization. A diagram of our token encoding method is shown in Figure 3.

Figure 3: Various input information is used to create rich semantic encodings of each special token.
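The token encoding equation can be sketched as follows. Several things here are assumptions for illustration: we read \(x^{fr}\), \(x^{ft}\), \(x^{p}\), \(x^{b}\), and \(x^{s}\) as the Faster-RCNN, fasttext, PHOC, bounding-box, and source features respectively; the layer normalization omits any learned gain and bias; and the feature sizes, random weights, and function names are hypothetical.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encode_special_token(x_fr, x_ft, x_p, x_b, x_s, W1, W2, W3):
    """Sum of three layer-normalized projections, one per feature group."""
    content = x_fr + x_ft + x_p                # [x^fr; x^ft; x^p] concatenation
    parts = [
        layer_norm(matvec(W1, content)),       # visual and textual features
        layer_norm(matvec(W2, x_b)),           # bounding box
        layer_norm(matvec(W3, x_s)),           # one-hot source indicator
    ]
    return [sum(vals) for vals in zip(*parts)]


# Toy sizes and random weights (assumed): output dimension d = 4.
rng = random.Random(0)
d = 4


def random_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


x_fr, x_ft, x_p = [0.5, -1.2, 0.3], [0.1, 0.9], [1.0, 0.0]
x_b = [0.1, 0.2, 0.6, 0.8]   # normalized box coordinates
x_s = [0.0, 1.0]             # e.g. OCR vs. face recognition source

x_spec = encode_special_token(
    x_fr, x_ft, x_p, x_b, x_s,
    random_matrix(d, 7), random_matrix(d, 4), random_matrix(d, 2),
)
print(x_spec)
```

Normalizing each group before summing keeps any single feature type from dominating the combined encoding purely because of its scale.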

Training Loss

The loss for our model is the decoding binary cross entropy \(\mathcal{L}_{dbce}\), which is the average of the per-step loss over decoding steps \(1..T_{end}\): \begin{equation} \mathcal{L}_{dbce} = \sum_{t=1}^{T_{end}} \frac{\mathcal{L}_{bce}(t)}{T_{end}} \end{equation} where \(\mathcal{L}_{bce}\) is sigmoid binary cross entropy, with \(g_n\) the ground-truth target and \(y_n\) the predicted logit: \begin{equation} \mathcal{L}_{bce} = -\left[g_n\log(\sigma(y_n))+(1-g_n)\log(1-\sigma(y_n))\right]. \end{equation}
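A small worked sketch of this loss is given below. It uses the conventional sign for binary cross entropy (the negative log-likelihood), written in an algebraically equivalent form that avoids overflow. Summing the per-candidate terms within a step, the function names, and the toy logits and targets are assumptions for illustration.

```python
import math


def bce_with_logits(y, g):
    """Sigmoid binary cross entropy for one logit y and one target g in {0, 1}.

    Equals -(g*log(sigmoid(y)) + (1-g)*log(1-sigmoid(y))), rearranged
    so that large |y| does not overflow or take log(0).
    """
    return max(y, 0.0) - y * g + math.log1p(math.exp(-abs(y)))


def step_loss(logits, targets):
    """Loss at one decoding step, summed over candidate tokens (an assumption)."""
    return sum(bce_with_logits(y, g) for y, g in zip(logits, targets))


def decoding_bce(step_logits, step_targets):
    """Average the per-step loss over the T_end decoding steps."""
    t_end = len(step_logits)
    total = sum(step_loss(l, g) for l, g in zip(step_logits, step_targets))
    return total / t_end


# Toy example: 2 decoding steps, 3 candidate tokens, all logits zero.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = decoding_bce(logits, targets)
print(loss)
```

With all logits at zero every candidate contributes \(\log 2\), so each step costs \(3\log 2\) and the average over the two steps is also \(3\log 2\).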

Training Data

Several components are pretrained prior to the training of the captioning model. The upstream Faster-RCNN object detector is trained on ImageNet; the BERT-style decoder is pretrained on a large text corpus; and the face recognition model is trained on MS1Mv2. After pretraining, the image captioning architecture is trained on the TextCaps dataset, which contains captions with in-scene text [2]. While TextCaps provides a strong signal for integrating OCR tokens, it rarely includes named entities. Thus, the captioning model receives no signal from which to learn to integrate tokens from the face recognition module.

To overcome the lack of named entities in common image-captioning datasets, we collect the Politicians and Athletes in Captions (PAC) dataset to supplement training. The PAC dataset contains 1,600 images with three captions each. The images were collected by scraping the Creative Commons database for well-known athletes and political figures from around the world. The images were captioned using the Amazon Mechanical Turk labeling service. Several samples from PAC can be seen in Figure 4.

Figure 4: Samples from our image-captioning dataset containing well-known persons.


After training on TextCaps and PAC, the model is able to naturally integrate both types of special tokens into image captions. The samples in Figure 5 show that the model effectively switches between OCR tokens, face recognition tokens, and the general model vocabulary.

Figure 5: Qualitative results on images from the PAC test-set.

Additionally, we visualize the learned embeddings of the separate token types in 2D space. Figure 6 shows that OCR token embeddings and face recognition token embeddings occupy separate subspaces of the latent space. This result is consistent with the observation that our model is capable of using each token type appropriately, as shown in Figure 5.

Figure 6: t-SNE projection of special token feature vectors (output of Figure 3). The model learns discernibly different latent feature representations for OCR and face recognition tokens.

Additional discussion, quantitative results, and details can be found in our paper.


This project was completed with my collaborator Zanyar Zohourianshahzadi and project advisor Jugal Kalita.

The work reported in this paper is supported by the National Science Foundation under Grant No. 2050919.

This work was published at the International Conference on Natural Language Processing and the AAAI Undergraduate Consortium.


  1. Hu, R., Singh, A., Darrell, T., & Rohrbach, M. (2020). Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9992-10002).
  2. Sidorov, O., Hu, R., Rohrbach, M., & Singh, A. (2020). TextCaps: A dataset for image captioning with reading comprehension. In European Conference on Computer Vision (pp. 742-758). Springer, Cham.