Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Cited by: 0
Authors
Cui, Chenhao [1]
Liang, Xinnian [1]
Wu, Shuangzhi [2]
Li, Zhoujun [1]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Bytedance, Beijing, Peoples R China
DOI
10.1109/IJCNN54540.2023.10191104
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders; their cross-modal interactions are not fine-grained enough, and the semantics are not fully aligned. In this paper, we address vision-language alignment via joint modeling and contrastive learning. We propose a unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns the representations of both and captures the inter-relations between video and text. A predictor then uses these representations to locate the grounding video span. In addition, we train our model with contrastive learning to enhance the vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and that adopting unified representation and contrastive learning can improve vision-language semantic alignment.
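The abstract does not specify the form of the contrastive objective. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss over paired video and text embeddings, which is one common way to pull matched pairs together and push mismatched pairs apart. The function names, the temperature value, and the assumption of one pooled, nonzero embedding per video and per query are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    This is a generic sketch, not the loss used by UniNet.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # sim[i][j] is the cosine similarity of video i and text j,
    # divided by the temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(logits):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, `aligned` should come out much smaller than `shuffled`, since the loss rewards batches in which each video is most similar to its own query.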
Pages: 8
Related papers
50 in total
  • [1] HierVL: Learning Hierarchical Video-Language Embeddings
    Ashutosh, Kumar
    Girdhar, Rohit
    Torresani, Lorenzo
    Grauman, Kristen
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23066 - 23078
  • [2] Exploring Temporal Concurrency for Video-Language Representation Learning
    Zhang, Heng
    Liu, Daqing
    Lv, Zezhong
    Su, Bing
    Tao, Dacheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15522 - 15532
  • [3] TEVL: Trilinear Encoder for Video-language Representation Learning
    Man, Xin
    Shao, Jie
    Chen, Feiyu
    Zhang, Mingxing
    Shen, Heng Tao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)
  • [4] Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
    Sun, Yuchong
    Xue, Hongwei
    Song, Ruihua
    Liu, Bei
    Yang, Huan
    Fu, Jianlong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [5] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
    Jin, Peng
    Huang, Jinfa
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Song, Guoli
    Clifton, David A.
    Chen, Jie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [6] Learning Trajectory-Word Alignments for Video-Language Tasks
    Yang, Xu
    Li, Zhangzikang
    Xu, Haiyang
    Zhang, Hanwang
    Ye, Qinghao
    Li, Chenliang
    Yan, Ming
    Zhang, Yu
    Huang, Fei
    Huang, Songfang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2504 - 2514
  • [7] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
    Li, Linjie
    Gan, Zhe
    Lin, Kevin
    Lin, Chung-Ching
    Liu, Zicheng
    Liu, Ce
    Wang, Lijuan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23119 - 23129
  • [8] Depth-Aware Sparse Transformer for Video-Language Learning
    Zhang, Haonan
    Gao, Lianli
    Zeng, Pengpeng
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4778 - 4787
  • [9] UniVTG: Towards Unified Video-Language Temporal Grounding
    Lin, Kevin Qinghong
    Zhang, Pengchuan
    Chen, Joya
    Pramanick, Shraman
    Gao, Difei
    Wang, Alex Jinpeng
    Yan, Rui
    Shou, Mike Zheng
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2782 - 2792
  • [10] Probabilistic Representations for Video Contrastive Learning
    Park, Jungin
    Lee, Jiyoung
    Kim, Ig-Jae
    Sohn, Kwanghoon
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 14691 - 14701