Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Cited by: 0

Authors:

Cui, Chenhao [1]

Liang, Xinnian [1]

Wu, Shuangzhi [2]

Li, Zhoujun [1]

Affiliations:

[1] Beihang Univ, Beijing, Peoples R China

[2] Bytedance, Beijing, Peoples R China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Keywords:

DOI:

10.1109/IJCNN54540.2023.10191104

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders, so cross-modal interactions are not fine-grained enough and the semantics are not fully aligned. In this paper, we address vision-language alignment via joint modeling and contrastive learning. We propose a Unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns the representations of both and captures the inter-relations between video and text. The predictor then uses these representations to locate the grounding video span. In addition, we train the model with contrastive learning to enhance its vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods, and that adopting unified representations and contrastive learning can improve vision-language semantic alignment.
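The abstract does not state which contrastive objective is used. As an illustration only, the sketch below shows one common choice, a symmetric InfoNCE loss over a batch of paired video and text embeddings. The function names, the temperature value, and the use of pooled per-sample embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss (an assumed objective).

    video_embs[i] and text_embs[i] form a positive pair; every other
    pairing in the batch is treated as a negative.
    """
    assert len(video_embs) == len(text_embs), "batches must be paired"
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity matrix, scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: the positive for row i is column i.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy check: correctly paired embeddings should give a lower loss
# than the same embeddings with the text side shuffled.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is near zero when each video embedding points in the same direction as its paired text embedding, and it grows when the pairs are mismatched, which is the behavior a contrastive term is meant to encourage.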


Pages: 8

