Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning

Cited by: 0
Authors
Zhong, Xian [1 ]
Li, Zipeng [1 ]
Chen, Shuqin [2 ]
Jiang, Kui [3 ]
Chen, Chen [4 ]
Ye, Mang [3 ]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China
[2] Hubei Univ Educ, Coll Comp, Wuhan, Peoples R China
[3] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
[4] Univ Cent Florida, Ctr Res Comp Vis, Orlando, FL USA
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning aims to generate natural language sentences that describe the given video accurately. Existing methods achieve favorable generation by exploring richer visual representations in the encoding phase or by improving decoding ability. However, the long-tailed problem hinders these attempts on low-frequency tokens, which rarely occur but carry critical semantics and play a vital role in detailed generation. In this paper, we introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is proposed to comprehend the semantics of low-frequency tokens and break through generation limitations. In this way, the caption is refined by promoting the absorption of tokens with insufficient occurrence. Based on FAD, we design a Divergent Semantic Supervisor (DSS) module to compensate for the information loss of high-frequency tokens brought by the diffusion process, where the semantics of low-frequency tokens are further emphasized to alleviate the long-tailed problem. Extensive experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, indicate that RSFD outperforms state-of-the-art methods, demonstrating that enhancing the semantics of low-frequency tokens yields competitive generation. Code is available at https://github.com/lzp870/RSFD.
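For intuition, the following is a minimal, illustrative sketch (PyTorch-style Python) of the general idea behind frequency-aware diffusion: embeddings of rare tokens receive a stronger Gaussian perturbation than embeddings of common ones. This is not the authors' RSFD implementation; all names here (frequency_aware_diffuse, token_counts, max_noise) are hypothetical, and the actual FAD and DSS modules are defined in the paper and the linked repository.

    # Illustrative sketch only, not the RSFD code: rare (low-frequency) tokens
    # are "diffused" with stronger Gaussian noise than common ones, so the model
    # is encouraged to re-absorb their semantics.
    import torch

    def frequency_aware_diffuse(token_ids: torch.Tensor,
                                embeddings: torch.Tensor,
                                token_counts: torch.Tensor,
                                max_noise: float = 0.1) -> torch.Tensor:
        """Add noise whose scale shrinks with corpus frequency of each token.

        token_ids:    (batch, seq_len) integer ids of caption tokens
        embeddings:   (batch, seq_len, dim) their embedding vectors
        token_counts: (vocab_size,) corpus occurrence count of every vocabulary token
        """
        # Normalized frequency in [0, 1]; rare tokens map to values near 0.
        freq = token_counts[token_ids].float() / token_counts.max()
        # Noise scale grows as frequency shrinks, so rare tokens are perturbed more.
        noise_scale = max_noise * (1.0 - freq)                  # (batch, seq_len)
        noise = torch.randn_like(embeddings) * noise_scale.unsqueeze(-1)
        return embeddings + noise

    if __name__ == "__main__":
        # Toy usage: vocabulary of 10 tokens, batch of 2 captions of length 5.
        counts = torch.tensor([1000, 800, 600, 400, 200, 100, 50, 20, 5, 1])
        ids = torch.randint(0, 10, (2, 5))
        embs = torch.randn(2, 5, 16)
        out = frequency_aware_diffuse(ids, embs, counts)
        print(out.shape)  # torch.Size([2, 5, 16])

In this toy formulation the noise scale is simply proportional to one minus the normalized corpus frequency; it only conveys the notion of frequency-dependent perturbation, not the specific diffusion schedule or supervision used by RSFD.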
Pages: 3724-3732
Number of pages: 9