Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning

Cited by: 0
Authors
Zhong, Xian [1]
Li, Zipeng [1]
Chen, Shuqin [2]
Jiang, Kui [3]
Chen, Chen [4]
Ye, Mang [3]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China
[2] Hubei Univ Educ, Coll Comp, Wuhan, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
[4] Univ Cent Florida, Ctr Res Comp Vis, Orlando, FL USA
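Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract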
Video captioning aims to generate natural language sentences that describe the given video accurately. Existing methods obtain favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability. However, the long-tailed problem hinders these attempts on low-frequency tokens, which rarely occur but carry critical semantics and play a vital role in detailed generation. In this paper, we introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of the infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is proposed to comprehend the semantics of low-frequency tokens to break through generation limitations. In this way, the caption is refined by promoting the absorption of tokens with insufficient occurrence. Based on FAD, we design a Divergent Semantic Supervisor (DSS) module to compensate for the information loss of high-frequency tokens brought by the diffusion process, where the semantics of low-frequency tokens are further emphasized to alleviate the long-tailed problem. Extensive experiments indicate that RSFD outperforms the state-of-the-art methods on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrating that enhancing the semantics of low-frequency tokens can obtain a competitive generation effect. Code is available at https://github.com/lzp870/RSFD.
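The long-tailed problem mentioned in the abstract can be illustrated with a minimal sketch that counts token occurrences in a caption corpus and separates rare tokens from frequent ones. This is not the RSFD, FAD, or DSS implementation; the function name `split_by_frequency`, the `min_count` threshold, and the toy captions are assumptions made purely for illustration.

```python
from collections import Counter


def split_by_frequency(captions, min_count=2):
    """Split a caption vocabulary into high- and low-frequency tokens.

    `min_count` is a hypothetical threshold: tokens seen fewer than
    `min_count` times are treated as low-frequency.
    """
    # Whitespace tokenization is a simplification for this sketch.
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    low = {tok for tok, n in counts.items() if n < min_count}
    high = set(counts) - low
    return counts, high, low


# Toy captions (invented for illustration, not taken from MSR-VTT or MSVD).
captions = [
    "a man is playing a guitar",
    "a man is cooking",
    "a woman is playing a violin",
]
counts, high, low = split_by_frequency(captions)
print(sorted(low))
```

In this toy corpus the rare tokens are the content words that distinguish one caption from another, which is the kind of token the abstract describes as carrying critical semantics.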
Pages: 3724-3732
Page count: 9
Related papers
50 in total
  • [21] Semantic Tag Augmented XlanV Model for Video Captioning
    Huang, Yiqing
    Xue, Hongwei
    Chen, Jiansheng
    Ma, Huimin
    Ma, Hongbing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4818 - 4822
  • [22] Richer Semantic Visual and Language Representation for Video Captioning
    Tang, Pengjie
    Wang, Hanli
    Wang, Hanzhang
    Xu, Kaisheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1871 - 1876
  • [23] Semantic-Conditional Diffusion Networks for Image Captioning
    Luo, Jianjie
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Feng, Jianlin
    Chao, Hongyang
    Mei, Tao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23359 - 23368
  • [24] Towards Few-shot Image Captioning with Cycle-based Compositional Semantic Enhancement Framework
    Zhang, Peng
    Bai, Yang
    Su, Jie
    Huang, Yan
    Long, Yang
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [25] A Video Captioning Method by Semantic Topic-Guided Generation
    Ye, Ou
    Wei, Xinli
    Yu, Zhenhua
    Fu, Yan
    Yang, Ying
    CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 78 (01): : 1071 - 1093
  • [26] Video Captioning Based on Channel Soft Attention and Semantic Reconstructor
    Lei, Zhou
    Huang, Yiyong
    FUTURE INTERNET, 2021, 13 (02) : 1 - 18
  • [27] Video Captioning With Attention-Based LSTM and Semantic Consistency
    Gao, Lianli
    Guo, Zhao
    Zhang, Hanwang
    Xu, Xing
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) : 2045 - 2055
  • [28] Video captioning algorithm based on mixed training and semantic association
    Chen, Shuqin
    Zhong, Xian
    Huang, Wenxin
    Lu, Yansheng
    Huazhong Keji Daxue Xuebao (Ziran Kexue Ban)/Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, 51 (11): : 67 - 74
  • [29] Semantic Enhanced Video Captioning with Multi-feature Fusion
    Niu, Tian-Zi
    Dong, Shan-Shan
    Chen, Zhen-Duo
    Luo, Xin
    Guo, Shanqing
    Huang, Zi
    Xu, Xin-Shun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (06)
  • [30] Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
    Lu, Yifan
    Zhang, Ziqi
    Yuan, Chunfeng
    Li, Peng
    Wang, Yan
    Li, Bing
    Hu, Weiming
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3909 - 3917