Phương pháp diễn giải cho mô hình hỏi đáp hình ảnh tiếng Việt
An XAI Method for Vietnamese Visual Question Answering Models
Abstract
Interpretability in Visual Question Answering (VQA) models is a topic of interest because model transparency can help users trust a model and understand its strengths and limitations. Multimodal models such as VQA models consume both image and text data, which poses a challenge for interpretability techniques that are typically designed for a single modality. In this work, we combine the Grad-CAM image explanation method with a Transformer-based text explanation method to analyze model behavior. The results obtained on a Vietnamese VQA dataset lead us to the following observations: i) the visual Transformer extracts global features, while ResNet extracts local features; ii) the model struggles with images containing many distracting elements; iii) the text modality has less impact than the image modality; iv) the PhoBERT module shows bias toward certain types of questions.
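The two explanation components the abstract combines can be illustrated with a short sketch. The code below is not the authors' implementation; it assumes a hypothetical vqa_model with a ResNet image branch (visual_encoder) and a PhoBERT-style Transformer text branch, and it shows (a) standard Grad-CAM over the last convolutional block and (b) a simplified gradient-weighted attention rollout in the spirit of the cited Transformer explanation method.

# Minimal sketch (not the authors' code). Names such as `vqa_model`,
# `vqa_model.visual_encoder.layer4`, and the attention tensors are
# hypothetical stand-ins for a VQA model with a ResNet image branch
# and a PhoBERT text branch.
import torch
import torch.nn.functional as F

def grad_cam(vqa_model, image, question_ids, target_answer):
    """Grad-CAM heatmap over the last convolutional block of the image branch."""
    activations, gradients = [], []
    layer = vqa_model.visual_encoder.layer4                        # hypothetical ResNet block
    h1 = layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = vqa_model(image, question_ids)                        # (1, num_answers)
    logits[0, target_answer].backward()                            # gradient of the answer score
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]                     # (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)                 # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1))                      # (1, H, W)
    return cam / (cam.max() + 1e-8)                                # normalize to [0, 1]

def text_relevance(attn_maps, attn_grads):
    """Gradient-weighted attention rollout for the text branch.

    `attn_maps` / `attn_grads` are lists of (heads, T, T) tensors collected
    from the Transformer text encoder during the same backward pass.
    """
    T = attn_maps[0].size(-1)
    R = torch.eye(T)                                               # identity: each token explains itself
    for A, G in zip(attn_maps, attn_grads):
        A_bar = F.relu(A * G).mean(dim=0)                          # gradient-weighted heads, averaged
        A_bar = A_bar + torch.eye(T)                               # account for the residual connection
        A_bar = A_bar / A_bar.sum(dim=-1, keepdim=True)            # row-normalize
        R = A_bar @ R                                              # propagate relevance layer by layer
    return R[0]                                                    # relevance of each token w.r.t. [CLS]

In practice the image relevance map would be upsampled to the input resolution and overlaid on the image, while the token relevance scores would be displayed over the question text, so that both modalities can be inspected for the same predicted answer.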
References
A. Weller, "Transparency: Motivations and Challenges," in Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Cham: Springer International Publishing, 2019.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626, 2017.
H. Chefer, S. Gur, and L. Wolf, "Transformer interpretability beyond attention visualization," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 782–791, 2021.
T. Le, K. Pho, T. Bui, H. T. Nguyen, and M. L. Nguyen, "Object-less vision-language model on visual question classification for blind people," in Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART), Volume 3, pp. 180–187, SciTePress, Feb. 2022.
A. D. Nguyen, T. Le, and H. T. Nguyen, "Combining multi-vision embedding in contextual attention for Vietnamese visual question answering," in Image and Video Technology, (Cham), pp. 172–185, Springer International Publishing, 2023.
Y. Lyu, P. P. Liang, Z. Deng, R. Salakhutdinov, and L.-P. Morency, "DIME: Fine-grained interpretations of multimodal models via disentangled local explanations," arXiv preprint, 2022.
D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach, “Multimodal explanations: Justifying decisions and pointing to the evidence,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8779–8788, 2018.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929, 2016.
S. Abnar and W. Zuidema, “Quantifying attention flow in transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 4190–4197, Association for Computational Linguistics, July 2020.
S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, 2015.
K. Q. Tran, A. T. Nguyen, A. T.-H. Le, and K. V. Nguyen, "ViVQA: Vietnamese visual question answering," in Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation (PACLIC 35), 2021.
