Guy Yariv¹, Idan Schwartz², Yossi Adi¹*, Sagie Benaim¹*
¹The Hebrew University of Jerusalem, Israel   ²Bar-Ilan University
*Equal advising.

TL;DR

Can we enhance the visual commonsense of language models? Yes: we generate multiple images from the input text and mix the predictions conditioned on each image into a pre-trained language model's decision-making process. This method achieves superior performance on visual commonsense reasoning tasks.

Abstract

Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing large language models (LLMs) are primarily trained using textual data only, limiting their ability to incorporate essential visual information. In contrast, Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge: the integration of robust visual understanding with foundational text-based language reasoning. To this end, we introduce a method aimed at enhancing LLMs' visual commonsense. Specifically, our method generates multiple images based on the input text prompt and integrates these into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge as well as text only when this is required. We evaluate our approach using several visual commonsense reasoning tasks together with traditional NLP tasks, including commonsense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual commonsense but also in traditional NLP benchmarks.

Method

This section outlines the architecture of our method, followed by detailed descriptions of the training and inference processes and an example illustrating these concepts in action.

Illustration of the proposed method for both training and inference
The architecture of our method: orange indicates trainable components and blue indicates frozen components.

The Training Process

Our model utilizes two types of input data: (i) real images paired with their corresponding text descriptions, and (ii) text prompts with synthetically generated images. The core components of our training architecture are a pre-trained Large Language Model (LLM) and a Vision Encoder, both of which remain frozen to preserve their pre-trained knowledge. Visual features extracted by the Vision Encoder are transformed into a pseudo-textual format by the Visual Token Projector (VTP), allowing for multimodal integration. These visual tokens are then fused with the LLM's textual tokens through the Late Fusion Attention Layer (LFAL), which uses an attention mechanism to attend to the visual information; the trainable components (VTP and LFAL) are optimized by minimizing the negative log-likelihood of the next textual token.
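The following is a minimal PyTorch sketch of this design, not the released implementation: the module names (VisualTokenProjector, LateFusionAttention), hidden sizes, and tensor shapes are illustrative assumptions, and the frozen LLM and Vision Encoder are represented only through their outputs.

import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space.
    (Illustrative module; dimensions are assumptions, not the paper's values.)"""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features):      # (B, num_patches, vision_dim)
        return self.proj(vision_features)    # (B, num_patches, llm_dim)

class LateFusionAttention(nn.Module):
    """Trainable late-fusion layer: LLM hidden states attend to visual tokens."""
    def __init__(self, llm_dim=4096, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, text_hidden, visual_tokens):
        # Cross-attention: queries come from the frozen LLM's output,
        # keys/values from the projected visual tokens.
        fused, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        # Residual connection keeps the text-only behavior reachable.
        return self.norm(text_hidden + fused)

def lm_loss(lm_head, fused_hidden, labels):
    """Next-token negative log-likelihood on the fused representation.
    `lm_head` is the frozen LLM output head; `labels` are the input ids."""
    logits = lm_head(fused_hidden)            # (B, T, vocab)
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )

The residual connection in the fusion layer is one way to keep the text-only prediction reachable when the visual tokens carry little useful signal, matching the late-fusion behavior described above.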


Inference

Inference involves generating multiple images based on the input text, processing these images through a pre-trained language model, and combining their results using a late fusion layer to enhance decision-making.

Image Generation: For each input text, the model generates images conditioned on the k most recent sentences, repeating sentences as needed to produce a sufficient number of images. Generation uses the SDXL-Turbo model, chosen for its efficiency in producing diverse and relevant visual content from textual descriptions.
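As a concrete illustration, this generation step can be sketched with the diffusers library; the prompt construction from the k most recent sentences is simplified here and may differ from the paper's exact procedure.

import torch
from diffusers import AutoPipelineForText2Image

# Load SDXL-Turbo (a distilled model designed for single-step generation).
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

def generate_images(sentences, k=3):
    """Generate one image per sentence, repeating sentences if fewer than k exist.
    (Simplified illustration of the prompting strategy.)"""
    prompts = [sentences[i % len(sentences)] for i in range(k)]
    return [
        pipe(prompt=p, num_inference_steps=1, guidance_scale=0.0).images[0]
        for p in prompts
    ]

A single denoising step with guidance disabled is the documented fast-inference setting for SDXL-Turbo, which keeps the per-image cost of this step low.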

Probability Aggregation: For each generated image \(v_i\), the model computes the next-token probabilities, which are then aggregated across the k images: \(\sum_{i=1}^{k} P_{\theta}(x_{t} \mid x_{1}, \ldots, x_{t-1}, v_i)\), where \(P_{\theta}\) denotes the model's predictive distribution, \(x_t\) the target token, \(x_{1}, \ldots, x_{t-1}\) the preceding text, and \(v_i\) the i-th generated image.
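A minimal sketch of this aggregation, assuming a hypothetical fused_model callable that returns the next-token probability vector \(P_{\theta}(x_t \mid x_{1}, \ldots, x_{t-1}, v_i)\) for a given image (a stand-in for the fused LLM above):

import torch

def aggregate_over_images(fused_model, text_ids, images):
    """Sum the next-token distributions obtained from each generated image.
    `fused_model(text_ids, image)` is assumed to return a (vocab,) probability vector."""
    probs = torch.stack([fused_model(text_ids, img) for img in images])  # (k, vocab)
    return probs.sum(dim=0)  # divide by k if a normalized distribution is needed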

Alignment Score Calculation: To determine the relevance of each image to the input text, an alignment score is calculated using a CLIP model, and the per-image predictions are mixed with the text-only prediction accordingly: \(\sum_{i=1}^{k} f(\bar{x}_i, v_i) P(x_{t} \mid x_{1}, \ldots, x_{t-1}, v_i) + (1 - f(\bar{x}_i, v_i))P(x_{t} \mid x_{1}, \ldots, x_{t-1})\). Here, \(f(\bar{x}_i, v_i)\) quantifies the alignment between the text \(\bar{x}_i\) and its corresponding image \(v_i\), so each image influences the final decision in proportion to its relevance.
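A hedged sketch of this alignment-weighted mixture, using the Hugging Face CLIP model to score text-image relevance; how the raw CLIP similarity is squashed into a weight \(f \in [0, 1]\) is an assumption here (a scaled sigmoid), not necessarily the paper's exact choice, and fused_model is the same hypothetical callable as above.

import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(text, image):
    """f(x̄_i, v_i): CLIP text-image similarity squashed to [0, 1].
    (The squashing function is an illustrative assumption.)"""
    inputs = clip_proc(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = clip(**inputs).logits_per_image[0, 0]  # scaled cosine similarity
    return torch.sigmoid(sim / 10.0)                 # assumed mapping to [0, 1]

def mix_predictions(fused_model, text_ids, texts, images):
    """Weight each image-conditioned prediction by its alignment score and give
    the remaining weight to the text-only prediction."""
    text_only = fused_model(text_ids, None)          # P(x_t | x_<t)
    mixed = torch.zeros_like(text_only)
    for t, img in zip(texts, images):
        f = alignment_score(t, img)
        mixed += f * fused_model(text_ids, img) + (1 - f) * text_only
    return mixed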

Inference Example

Example of inference using visual commonsense
An illustrative example of our method at inference. On the left, we consider a visual commonsense task: while Llama3's answer is wrong, our method generates three images and, for the correct class, places most of the weight (score) on the second and third images, thus answering correctly. On the right, for text generation, our method generates three images for different parts of the sentence, each of which informs the generated answer. Llama3's answer, in contrast, is less visually cohesive.

Results

This section presents the evaluation of our method against relevant baselines across various tasks focused on visual commonsense and textual understanding.

Visual Commonsense Performance

Table 1 reports the performance of our method on zero-shot object commonsense tasks, including Memory Color, Color Terms, Object Shape, and Relative Size. The results highlight the effectiveness of our approach compared to baselines based on BERT and GPT-2, showing significant improvements across all tasks.

Table 1: Object Commonsense Results
Table 1: Comparison against relevant baselines on a benchmark about visual commonsense.

Extended Evaluation on Various Tasks

Table 2 details the results on a broader set of tasks covering visual commonsense, commonsense reasoning, and reading comprehension, providing insight into how our method integrates and reasons over both multimodal and text-only data.

Table 2: Comprehensive Evaluation Results
Table 2: Performance on tasks testing visual commonsense, commonsense reasoning, and reading comprehension, showcasing the versatility and robustness of our approach.

Further Information

For a more comprehensive understanding of our work, read the full paper on arXiv. The code for this project is available on GitHub.

For inquiries or further collaboration, feel free to reach out via email at guy.yariv@mail.huji.ac.il.

Citation

Please cite our work using the following BibTeX entry:

@misc{yariv2024improving,
    title={Improving Visual Commonsense in Language Models via Multiple Image Generation},
    author={Guy Yariv and Idan Schwartz and Yossi Adi and Sagie Benaim},
    year={2024},
    eprint={2406.13621},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}