A Novel Hybrid Architecture for Image Captioning
Abstract
Image captioning is the task of automatically generating natural-language descriptions of the visual content of an image. It combines computer vision techniques for extracting image features with natural language processing (NLP) methods for generating text. In this work, we introduce SPNERSAN, a novel hybrid architecture that combines the Enhanced Relational Self-Attention Network (ER-SAN), which follows a Transformer-like structure, with our newly developed Subgraph Proposal Network (SPN). The primary objective of this combination is to improve image captioning by leveraging the complementary strengths of both components. Our experiments on the MS COCO dataset demonstrate that the SPNERSAN model significantly improves the quality of generated captions compared with related methods. Our proposed model performs better on five standard metrics: BLEU, METEOR, ROUGE, CIDEr, and SPICE. These results indicate its effectiveness in producing coherent and contextually appropriate captions.
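For concreteness, the sketch below illustrates one way the pipeline summarized above could be wired together: a subgraph-proposal stage that selects a subset of detected region features, followed by a Transformer-style self-attention encoder and a word-scoring head. The abstract does not specify the internals of SPN or ER-SAN, so every module name, dimension, and the top-k selection heuristic here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an SPN + ER-SAN-style pipeline, assuming PyTorch.
# All module names, shapes, and the top-k subgraph heuristic are assumptions;
# the real SPNERSAN internals are not given in the abstract.
import torch
import torch.nn as nn


class SubgraphProposalNetwork(nn.Module):
    """Placeholder SPN: scores regions and keeps the top-k as the proposed subgraph."""

    def __init__(self, feat_dim: int, k: int = 18):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, feat_dim)
        scores = self.scorer(region_feats).squeeze(-1)            # (batch, num_regions)
        topk = scores.topk(self.k, dim=1).indices                 # proposed subgraph indices
        idx = topk.unsqueeze(-1).expand(-1, -1, region_feats.size(-1))
        return region_feats.gather(1, idx)                        # (batch, k, feat_dim)


class RelationalEncoder(nn.Module):
    """Stands in for the Transformer-like ER-SAN encoder: standard
    multi-head self-attention layers over the proposed subgraph features."""

    def __init__(self, feat_dim: int, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class HybridCaptioner(nn.Module):
    """Hybrid sketch: SPN selects a region subgraph, the relational encoder
    contextualizes it, and a linear head scores the caption vocabulary.
    A full captioner would replace the head with an autoregressive decoder."""

    def __init__(self, feat_dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.spn = SubgraphProposalNetwork(feat_dim)
        self.encoder = RelationalEncoder(feat_dim)
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        subgraph = self.spn(region_feats)
        encoded = self.encoder(subgraph)
        return self.word_head(encoded.mean(dim=1))                # (batch, vocab_size)


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)   # e.g. 36 detected region features per image
    logits = HybridCaptioner()(feats)
    print(logits.shape)               # torch.Size([2, 10000])
```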