Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Cited: 0
Authors:
Geng, Shijie [1 ]
Gao, Peng [2 ]
Chatterjee, Moitreya [3 ]
Hori, Chiori [4 ]
Le Roux, Jonathan [4 ]
Zhang, Yongfeng [1 ]
Li, Hongsheng [2 ]
Cherian, Anoop [4 ]
Affiliations:
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Univ Illinois, Urbana, IL USA
[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
Keywords: (none listed)
DOI: not available
CLC Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. The task thus poses a challenging multi-modal representation learning and reasoning scenario, and advances in it could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which provides better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline consisting of an intra-frame reasoning layer that produces spatio-semantic graph representations for every frame, and an inter-frame aggregation module that captures temporal cues. The entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
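The "shuffling scheme on multi-head outputs" mentioned in the abstract can be illustrated with a minimal sketch. The snippet below is a hypothetical, framework-free illustration of the general idea (randomly permuting the order of per-head outputs before concatenation during training, as a regularizer); the function name, data layout, and permutation granularity are assumptions for illustration and are not taken from the paper.

```python
import random

def shuffled_head_concat(head_outputs, training=True, rng=None):
    """Concatenate multi-head attention outputs into one feature vector,
    randomly permuting the head order during training.

    head_outputs: list of per-head output vectors (lists of floats).
    rng: optional random.Random instance for reproducibility.
    This is an illustrative sketch, not the paper's implementation.
    """
    heads = list(head_outputs)  # copy so the caller's list is untouched
    if training:
        (rng or random).shuffle(heads)  # permute head order in place
    # flatten the (possibly permuted) heads into a single vector
    return [x for h in heads for x in h]
```

At inference time (`training=False`) the head order is left intact, so the layer behaves like an ordinary multi-head concatenation.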
Pages: 1415-1423 (9 pages)
Related Papers (50 total)
  • [41] A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multi-Modal
    Liang, Ke; Meng, Lingyuan; Liu, Meng; Liu, Yue; Tu, Wenxuan; Wang, Siwei; Zhou, Sihang; Liu, Xinwang; Sun, Fuchun; He, Kunlun
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12): 9456-9478
  • [42] Multi-modal Video Summarization
    Huang, Jia-Hong
    ICMR 2024 - Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024: 1214-1218
  • [44] Multi-modal knowledge graphs representation learning via multi-headed self-attention
    Wang, Enqiang; Yu, Qing; Chen, Yelin; Slamu, Wushouer; Luo, Xukang
    INFORMATION FUSION, 2022, 88: 78-85
  • [45] Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering
    Kim, Junyeong; Ma, Minuk; Kim, Kyungsu; Kim, Sungjin; Yoo, Chang D.
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019
  • [46] M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers
    Fu, Tsu-Jui; Wang, Xin Eric; Grafton, Scott T.; Eckstein, Miguel P.; Wang, William Yang
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022: 10503-10512
  • [47] Multi-modal Graph Learning over UMLS Knowledge Graphs
    Burger, Manuel; Ratsch, Gunnar; Kuznetsova, Rita
    MACHINE LEARNING FOR HEALTH, ML4H, VOL 225, 2023, 225: 52-81
  • [48] Constrained Bipartite Graph Learning for Imbalanced Multi-Modal Retrieval
    Zhang, Han; Li, Yiding; Li, Xuelong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 4502-4514
  • [49] Heterogeneous Graph Learning for Multi-Modal Medical Data Analysis
    Kim, Sein; Lee, Namkyeong; Lee, Junseok; Hyun, Dongmin; Park, Chanyoung
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 4, 2023: 5141-5150
  • [50] Collaborative denoised graph contrastive learning for multi-modal recommendation
    Xu, Fuyong; Zhu, Zhenfang; Fu, Yixin; Wang, Ru; Liu, Peiyu
    INFORMATION SCIENCES, 2024, 679