Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Cited by: 9
Authors
Zhao, Wangbo [1, 2, 3]
Wang, Kai [1]
Chu, Xiangxiang [2]
Xue, Fuzhao [1]
Wang, Xinchao [1]
You, Yang [1]
Affiliations
[1] National University of Singapore, Singapore, Singapore
[2] Meituan Inc, Beijing, People's Republic of China
[3] Northwestern Polytechnical University, Xi'an, People's Republic of China
Funding
National Research Foundation, Singapore
Keywords
DOI
10.1109/CVPR52688.2022.01144
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer, which can fuse and aggregate multi-modal and temporal features between frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features in each feature level with guidance from linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and the generalization ability of our method compared to the state-of-the-art methods.
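The abstract names a language-guided feature fusion module but gives no equations for it. As a rough illustration of the general idea only, the sketch below blends an appearance feature vector and a motion feature vector with per-channel gates derived from a linguistic feature vector. The function name `language_guided_fusion`, the sigmoid gating, and the flat equal-length vectors are all assumptions made for this example; this is not the paper's actual module.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def language_guided_fusion(appearance, motion, language):
    """Blend appearance and motion features channel by channel.

    Hypothetical illustration: each linguistic feature value is squashed
    into a gate in (0, 1); the gate weights the appearance channel and
    (1 - gate) weights the motion channel. This is an assumed gating
    scheme, not the module described in the paper.
    """
    if not (len(appearance) == len(motion) == len(language)):
        raise ValueError("all feature vectors must have the same length")
    gates = [sigmoid(w) for w in language]
    return [g * a + (1.0 - g) * m for g, a, m in zip(gates, appearance, motion)]


# A zero language vector gives gates of 0.5, i.e. a plain average.
fused = language_guided_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
print(fused)
```

In this toy form, a strongly positive language value pushes the output toward the appearance feature and a strongly negative one pushes it toward the motion feature. A real model would learn the gates and operate on multi-level spatial feature maps rather than flat vectors.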
Pages: 11727-11736
Page count: 10