Phương pháp diễn giải cho mô hình hỏi đáp hình ảnh tiếng Việt

An XAI Method for Vietnamese Visual Question Answering Models

  • Nguyễn Thanh Tân
  • Trần Thị Phương Linh
  • Lê Thanh Tùng
  • Nguyễn Tiến Huy
Keywords: XAI, visual question answering

Abstract

Interpretability in Visual Question Answering (VQA) models is a topic of interest because model transparency helps users trust a model and understand its strengths and limitations. Multimodal models such as those for VQA consume both image and text data, which poses challenges for interpretability techniques that are typically designed for a single modality. In this work, we propose combining the Grad-CAM image explanation method with a Transformer-based text explanation method to analyze the behavior of the model. The results obtained on a Vietnamese VQA dataset lead us to the following observations: i) the Vision Transformer extracts global features, while ResNet extracts local features; ii) the model struggles with images containing many distracting elements; iii) the text modality has less impact than the image modality; and iv) the PhoBERT module shows bias toward certain types of questions.
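For readers who want to probe the image modality in a similar way, the sketch below shows the core of Grad-CAM (Selvaraju et al., 2017) on a generic PyTorch classifier. It is a minimal illustration under stated assumptions, not the authors' pipeline: an ImageNet ResNet-50 stands in for the VQA image encoder, the hooked layer (layer4) and input shape are assumptions, and a real VQA model would backpropagate from the chosen answer logit rather than an ImageNet class.

import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical setup: ResNet-50 as a stand-in for the VQA image encoder.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def hook(module, inputs, output):
    # Cache the feature maps and register a hook to catch their gradients.
    activations["feat"] = output
    output.register_hook(lambda g: gradients.update(feat=g))

model.layer4.register_forward_hook(hook)  # last convolutional stage

def grad_cam(x, target=None):
    """Return an (H, W) heatmap in [0, 1] for the target (or predicted) class."""
    logits = model(x)
    if target is None:
        target = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, target].backward()
    acts, grads = activations["feat"], gradients["feat"]    # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)          # GAP of gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True)) # (1, 1, h, w)
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)[0, 0]
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).detach()

heatmap = grad_cam(torch.randn(1, 3, 224, 224))  # dummy input for illustration

The text side of the approach would instead apply a Transformer relevance method in the spirit of Chefer et al. (2021) over the attention layers of the language module (PhoBERT, in the paper's setting).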

Author Biographies

Nguyễn Thanh Tân

Nguyen Thanh Tan received his B.S. degree in Software Technology in 2024 from the University of Science, Ho Chi Minh City, Vietnam (VNU-HCM). He is currently working as a software engineer in industry. His research interests center on Deep Learning, with a particular focus on strategies for optimizing the cost and computational efficiency of Large Language Models (LLMs) for enterprise applications.

Trần Thị Phương Linh

Tran Thi Phuong Linh received her B.S. degree in Computer Science from the University of Science, Ho Chi Minh City, Vietnam (VNU-HCM) in 2024. Her research interests center on making the predictions of deep learning models more understandable and trustworthy.

Lê Thanh Tùng

Le Thanh Tung earned his B.S. degree in Computer Science from the Honour Program at the University of Science, Vietnam National University, Ho Chi Minh City, in 2012, and his M.S. degree in Information Science from the Japan Advanced Institute of Science and Technology (JAIST) in 2018. He completed his Ph.D. in Information Science at JAIST in 2021. He is currently a lecturer in Information Technology at the University of Science, Ho Chi Minh City, Vietnam (HCMUS). His research interests include information retrieval, natural language processing, visual question answering, and multi-modal systems.

Nguyễn Tiến Huy

Nguyen Tien Huy received his B.S. degree in Software Technology in 2010 and his M.S. degree in Computer Science in 2015, both from the University of Science, Ho Chi Minh City, Vietnam (VNU-HCM). He earned his Ph.D. in Information Science from the Japan Advanced Institute of Science and Technology (JAIST) in 2019. He is currently a lecturer in the Department of Computer Science, Faculty of Information Technology, VNU-HCM. His research interests include Natural Language Processing and Deep Learning, with a particular focus on intelligent systems and multimodal data analysis.

References

A. Weller, Transparency: Motivations and Challenges. Cham: Springer International Publishing, 2019.

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626, 2017.

H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability beyond attention visualization,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 782–791, 2021.

T. Le, K. Pho, T. Bui, H. T. Nguyen, and M. L. Nguyen, “Object-less vision-language model on visual question classification for blind people,” in Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART) - Volume 3, pp. 180–187, SciTePress, Feb. 2022.

A. D. Nguyen, T. Le, and H. T. Nguyen, “Combining multivision embedding in contextual attention for Vietnamese visual question answering,” in Image and Video Technology, (Cham), pp. 172–185, Springer International Publishing, 2023.

Y. Lyu, P. P. Liang, Z. Deng, R. Salakhutdinov, and L.-P. Morency, “DIME: Fine-grained interpretations of multimodal models via disentangled local explanations,” arXiv preprint, 2022.

D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach, “Multimodal explanations: Justifying decisions and pointing to the evidence,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8779–8788, 2018.

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929, 2016.

S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 4190–4197, Association for Computational Linguistics, July 2020.

S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLoS ONE, vol. 10, 2015.

K. Q. Tran, A. T. Nguyen, A. T.-H. Le, and K. V. Nguyen, “ViVQA: Vietnamese visual question answering,” in Pacific Asia Conference on Language, Information and Computation, 2021.

Published
2025-05-20