Hierarchical Vision-Language Alignment for Video Captioning

Cited by: 16
Authors
Zhang, Junchao [1 ]
Peng, Yuxin [1 ]
Affiliations
[1] Peking Univ, Inst Comp Sci & Technol, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Hierarchical vision-language alignment; Multi-granularity;
DOI
10.1007/978-3-030-05710-7_4
CLC Number
TP [Automation Technology; Computer Technology];
Discipline Code
0812 ;
Abstract
We have witnessed promising advances in video captioning in recent years. It remains a challenging task, since it is hard to capture the semantic correspondences between visual content and language descriptions. Different granularities of language components (e.g. words, phrases, and sentences) correspond to different granularities of visual elements (e.g. objects, visual relations, and regions of interest). These correspondences provide multi-level alignments and complementary information for transforming visual content into language descriptions. Therefore, we propose an Attention Guided Hierarchical Alignment (AGHA) approach for video captioning. In the proposed approach, hierarchical vision-language alignments, including object-word, relation-phrase, and region-sentence alignments, are extracted from a well-learned model suited to multiple vision-and-language tasks, and are then embedded into parallel encoder-decoder streams to provide multi-level semantic guidance and rich complementarity for description generation. In addition, multi-granularity visual features are exploited to obtain a coarse-to-fine understanding of complex video content, where an attention mechanism extracts comprehensive visual discrimination to enhance video captioning. Experimental results on the widely used MSVD dataset demonstrate that AGHA achieves promising improvements on popular evaluation metrics.
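The abstract describes attending over multiple granularities of visual features (objects, relations, regions) to build a fused context for caption generation. The sketch below illustrates that general idea with dot-product attention; all names, shapes, and the scoring function are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def attention_fuse(decoder_state, granularity_feats):
    """Attend over multi-granularity visual features and fuse them.

    decoder_state: (d,) current decoder hidden state (hypothetical).
    granularity_feats: list of (d,) vectors, e.g. object-, relation-,
        and region-level features. Dot-product scoring is an assumption;
        the paper's attention may be parameterized differently.
    """
    scores = np.array([float(decoder_state @ f) for f in granularity_feats])
    weights = softmax(scores)
    context = sum(w * f for w, f in zip(weights, granularity_feats))
    return context, weights

# Toy usage: a 4-dim decoder state attending over three granularities.
rng = np.random.default_rng(0)
h = rng.normal(size=4)
feats = [rng.normal(size=4) for _ in range(3)]  # object / relation / region
ctx, w = attention_fuse(h, feats)
```

The attention weights sum to one, so the fused context stays on the same scale as the individual granularity features regardless of how many streams are combined.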
Pages: 42-54
Page count: 13