Interpretability Methods for Vietnamese Visual Question Answering Models

  • Tiến Huy Nguyễn

Abstract

Interpretability in Visual Question Answering (VQA) models is a topic of interest because model transparency helps users trust a model and understand its strengths and limitations. Multimodal models such as VQA use both image and text data, which poses challenges for interpretability techniques typically designed for a single modality. In this work, we propose combining the Grad-CAM image explanation method with a Transformer-based text explanation method to analyze model behavior. The results obtained on a Vietnamese VQA dataset lead us to the following observations: i) the visual Transformer extracts global features, while ResNet extracts local features; ii) the model struggles with images containing many distracting elements; iii) the text modality has less impact than the image modality; iv) the PhoBERT module shows bias toward certain types of questions.
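The Grad-CAM side of the proposed combination weights each convolutional feature map by the global-average-pooled gradient of the target class score, then applies a ReLU to keep only positive evidence. A minimal NumPy sketch of that computation follows; the feature maps and gradients here are hypothetical random arrays, not outputs of the paper's actual model:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from feature maps and their gradients.

    activations, gradients: arrays of shape (K, H, W), i.e. K channel
    maps and the gradients of the target class score w.r.t. them.
    """
    # alpha_k: global average pooling of gradients over spatial dims
    weights = gradients.mean(axis=(1, 2))
    # weighted sum over channels: sum_k alpha_k * A_k
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps features with a positive influence on the class
    cam = np.maximum(cam, 0)
    if cam.max() > 0:
        cam = cam / cam.max()  # normalize to [0, 1] for visualization
    return cam

# Toy example with random feature maps (illustrative only)
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 7, 7))  # 8 channels, 7x7 spatial grid
G = rng.standard_normal((8, 7, 7))  # hypothetical gradients
heatmap = grad_cam(A, G)
print(heatmap.shape)  # (7, 7)
```

In practice the heatmap is upsampled to the input image resolution and overlaid on the image to show which regions drove the VQA answer.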

Published
2025-05-20