VIDEO QUESTION ANSWERING USING CLIP-GUIDED VISUAL-TEXT ATTENTION

Cited by: 1
Authors
Ye, Shuhong [1]
Kong, Weikai [1]
Yao, Chenglin [1]
Ren, Jianfeng [1,2]
Jiang, Xudong [3]
Affiliations
[1] Univ Nottingham Ningbo China, Sch Comp Sci, Ningbo, Peoples R China
[2] Univ Nottingham Ningbo China, Nottingham Ningbo China Beacons Excellence Res &, Ningbo, Peoples R China
[3] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Video Question Answering; CLIP; Cross-modal Learning; Cross-domain Learning;
DOI
10.1109/ICIP49359.2023.10222286
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large corpus of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features with TimeSformer and text features with BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose a cross-domain learning module to extract attention information between visual and linguistic features across the target and general domains. The resulting set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
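The attention-and-fusion step described in the abstract can be pictured with a minimal PyTorch sketch, assuming a generic cross-attention in which target-domain TimeSformer/BERT tokens query general-domain CLIP tokens before answer classification. Every module name, dimension, and the concatenation-based fusion below are assumptions of this sketch, not the authors' implementation.

# Minimal sketch of CLIP-guided cross-domain attention for VideoQA.
# Module names, dimensions and the fusion scheme are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn


class CrossDomainAttention(nn.Module):
    """Attend target-domain features (TimeSformer video, BERT text) to
    general-domain CLIP features, then fuse them to predict an answer."""

    def __init__(self, dim=512, num_heads=8, num_answers=1000):
        super().__init__()
        # Queries come from the target domain, keys/values from CLIP.
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, video_feat, text_feat, clip_vis, clip_txt):
        # video_feat: (B, Tv, D) TimeSformer tokens; text_feat: (B, Tt, D) BERT tokens
        # clip_vis: (B, Nv, D) CLIP visual tokens;   clip_txt: (B, Nt, D) CLIP text tokens
        vis_guided, _ = self.vis_attn(video_feat, clip_vis, clip_vis)  # CLIP-guided visual features
        txt_guided, _ = self.txt_attn(text_feat, clip_txt, clip_txt)   # CLIP-guided text features
        # Pool each stream over its sequence dimension and concatenate all four feature sets.
        fused = torch.cat([video_feat.mean(1), text_feat.mean(1),
                           vis_guided.mean(1), txt_guided.mean(1)], dim=-1)
        return self.classifier(fused)  # (B, num_answers) answer logits


# Toy usage with random tensors standing in for pre-extracted features.
model = CrossDomainAttention()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 20, 512),
               torch.randn(2, 16, 512), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 1000])

Treating the CLIP features as keys and values lets general-domain knowledge re-weight the target-domain tokens, which is one plausible reading of the "CLIP-guided" attention described in the abstract.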
Pages: 81-85
Number of pages: 5