Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Cited by: 0
Authors
Geng, Shijie [1 ]
Gao, Peng [2 ]
Chatterjee, Moitreya [3 ]
Hori, Chiori [4 ]
Le Roux, Jonathan [4 ]
Zhang, Yongfeng [1 ]
Li, Hongsheng [2 ]
Cherian, Anoop [4 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Univ Illinois, Urbana, IL USA
[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics.
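The abstract's central architectural idea, randomly shuffling the per-head outputs of multi-head attention during training as a regularizer, can be illustrated with a minimal PyTorch sketch. This is an assumed reading of the description, not the authors' released implementation: the class name, the choice to permute heads only in training mode, and the placement of the shuffle before the output projection are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are randomly permuted
    across the head dimension during training (a hedged sketch of the
    'shuffled Transformer' idea, not the paper's exact code)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (batch, heads, time, head_dim).
        def split(z: torch.Tensor) -> torch.Tensor:
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1
        )
        heads = attn @ v  # (batch, heads, time, head_dim)

        if self.training:
            # Shuffle the head order so each head's output is routed to a
            # random slice of the shared output projection (assumed to be
            # the regularization mechanism the abstract refers to).
            perm = torch.randperm(self.num_heads, device=heads.device)
            heads = heads[:, perm]

        heads = heads.transpose(1, 2).reshape(b, t, d)
        return self.out(heads)
```

Because the shuffle is gated on `self.training`, calling `.eval()` makes the module behave as ordinary multi-head self-attention at inference time.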
Pages: 1415-1423
Page count: 9
Related Papers
50 records in total
  • [21] MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention
    Wang, Xixi
    Wang, Xiao
    Jiang, Bo
    Tang, Jin
    Luo, Bin
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (09) : 3867 - 3888
  • [22] Multi-Modal Dynamic Graph Transformer for Visual Grounding
    Chen, Sijia
    Li, Baochun
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15513 - 15522
  • [23] Conversational multi-modal browser: An integrated multi-modal browser and dialog manager
    Tiwari, A
    Hosn, RA
    Maes, SH
    2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET, PROCEEDINGS, 2003, : 348 - 351
  • [24] OCR-Aware Scene Graph Generation Via Multi-modal Object Representation Enhancement and Logical Bias Learning
    Zhou, Xinyu
    Ji, Zihan
    Zhu, Anna
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT VII, 2025, 15037 : 201 - 215
  • [25] Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
    Zeng, Yawen
    Cao, Da
    Wei, Xiaochi
    Liu, Meng
    Zhao, Zhou
    Qin, Zheng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 2215 - 2224
  • [26] Multi-modal graph reasoning for structured video text extraction
    Shi, Weitao
    Wang, Han
    Lou, Xin
    COMPUTERS & ELECTRICAL ENGINEERING, 2023, 107
  • [27] CMGNet: Collaborative multi-modal graph network for video captioning
    Rao, Qi
    Yu, Xin
    Li, Guang
    Zhu, Linchao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
  • [28] Hierarchical multi-modal video summarization with dynamic sampling
    Yu, Lingjian
    Zhao, Xing
    Xie, Liang
    Liang, Haoran
    Liang, Ronghua
    IET IMAGE PROCESSING, 2024, 18 (14) : 4577 - 4588
  • [29] Multi-modal Graph and Sequence Fusion Learning for Recommendation
    Wang, Zejun
    Wu, Xinglong
    Yang, Hongwei
    He, Hui
    Tai, Yu
    Zhang, Weizhe
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 357 - 369
  • [30] Fast Multi-Modal Unified Sparse Representation Learning
    Verma, Mridula
    Shukla, Kaushal Kumar
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 448 - 452