A Novel Hybrid Architecture for Image Captioning
Abstract
Image captioning is the task of automatically generating natural-language descriptions of the visual content of an image. It combines computer vision techniques for extracting image features with natural language processing (NLP) methods for generating text. In this work, we introduce SPNERSAN, a novel hybrid architecture that combines the Enhanced Relational Self-Attention Network (ER-SAN), which follows a Transformer-like structure, with our newly developed Subgraph Proposal Network (SPN). The primary objective of this combination is to improve image captioning by leveraging the complementary strengths of both components. Our experiments on the MS COCO dataset demonstrate that the SPNERSAN model significantly improves the quality of generated captions compared with related methods. Our proposed model performs better on five standard metrics: BLEU, METEOR, ROUGE, CIDEr, and SPICE. These results indicate its effectiveness in producing coherent and contextually appropriate captions.
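For concreteness, the sketch below illustrates one way the pipeline summarized above could be wired together: a subgraph-proposal stage that selects a subset of detected region features, followed by a Transformer-style self-attention encoder and a word-scoring head. The abstract does not specify the internals of SPN or ER-SAN, so every module name, dimension, and the top-k selection heuristic here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an SPN + ER-SAN-style pipeline, assuming PyTorch.
# All module names, shapes, and the top-k subgraph heuristic are assumptions;
# the real SPNERSAN internals are not given in the abstract.
import torch
import torch.nn as nn


class SubgraphProposalNetwork(nn.Module):
    """Placeholder SPN: scores regions and keeps the top-k as the proposed subgraph."""

    def __init__(self, feat_dim: int, k: int = 18):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, feat_dim)
        scores = self.scorer(region_feats).squeeze(-1)            # (batch, num_regions)
        topk = scores.topk(self.k, dim=1).indices                 # proposed subgraph indices
        idx = topk.unsqueeze(-1).expand(-1, -1, region_feats.size(-1))
        return region_feats.gather(1, idx)                        # (batch, k, feat_dim)


class RelationalEncoder(nn.Module):
    """Stands in for the Transformer-like ER-SAN encoder: standard
    multi-head self-attention layers over the proposed subgraph features."""

    def __init__(self, feat_dim: int, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


class HybridCaptioner(nn.Module):
    """Hybrid sketch: SPN selects a region subgraph, the relational encoder
    contextualizes it, and a linear head scores the caption vocabulary.
    A full captioner would replace the head with an autoregressive decoder."""

    def __init__(self, feat_dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.spn = SubgraphProposalNetwork(feat_dim)
        self.encoder = RelationalEncoder(feat_dim)
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        subgraph = self.spn(region_feats)
        encoded = self.encoder(subgraph)
        return self.word_head(encoded.mean(dim=1))                # (batch, vocab_size)


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)   # e.g. 36 detected region features per image
    logits = HybridCaptioner()(feats)
    print(logits.shape)               # torch.Size([2, 10000])
```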