Distillation-Centric Approaches in Visual Question Answering with Mixture of Experts
Abstract
Recent advances in computer vision and natural language processing have been applied to the Visual Question Answering (VQA) task. Nonetheless, most models that achieve high accuracy have large architectures, which hinders bringing the technology to practical applications such as assistive devices for blind and visually impaired users and other related fields. Our research focuses on compressing a Visual Question Answering model on a Vietnamese dataset using knowledge distillation. Furthermore, to enhance accuracy, we develop a Mixture of ViVQA Experts system that adapts to each question type, improving accuracy while adding only a few parameters and avoiding retraining the entire system from scratch. With a total of 204M parameters, this approach reduces model size by 24.51% compared to the original model while reducing accuracy by only 6.59% on the overall test set. More specifically, we improve accuracy on individual question types: “number” increases by 1.35% and “color” by 0.48% compared to our distillation model. The code and pretrained models are available at: https://github.com/huynhhoanghuy/Distillation ViVQA.
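To make the two ideas summarized above concrete, the following is a minimal, hypothetical PyTorch sketch of (1) response-based knowledge distillation from a large teacher VQA model into a smaller student, and (2) a lightweight gating network that mixes several answer "experts" (e.g., specialized per question type such as "number" or "color"). Module names, feature dimensions, the answer-vocabulary size, and the soft-routing scheme are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: distillation loss + per-question-type mixture of expert heads.
# All sizes (768-d fused features, 3 experts, 1000 candidate answers) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-label KL distillation (teacher -> student) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


class MixtureOfVQAExperts(nn.Module):
    """Route a fused question-image feature through a small gate to expert answer heads."""

    def __init__(self, feat_dim=768, num_experts=3, num_answers=1000):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_experts)  # lightweight router over experts
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_answers) for _ in range(num_experts)]
        )

    def forward(self, fused_feat):
        weights = F.softmax(self.gate(fused_feat), dim=-1)          # (B, num_experts)
        expert_logits = torch.stack(
            [expert(fused_feat) for expert in self.experts], dim=1  # (B, num_experts, num_answers)
        )
        # Weighted combination of expert predictions (soft routing).
        return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)


# Example usage: one distillation step on dummy fused features and frozen teacher logits.
student_head = MixtureOfVQAExperts()
fused = torch.randn(4, 768)                    # fused multimodal features from the student backbone
teacher_logits = torch.randn(4, 1000)          # outputs of the (frozen) teacher model
labels = torch.randint(0, 1000, (4,))
loss = distillation_loss(student_head(fused), teacher_logits, labels)
loss.backward()
```

Because only the gate and the expert heads are new, this kind of mixture can be attached to an already-distilled backbone without retraining the whole system, which is the design motivation stated in the abstract; a hard routing variant (e.g., picking one expert from a predicted question type) would be a straightforward substitution.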