Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Cited by: 0
Authors
Geng, Shijie [1 ]
Gao, Peng [2 ]
Chatterjee, Moitreya [3 ]
Hori, Chiori [4 ]
Le Roux, Jonathan [4 ]
Zhang, Yongfeng [1 ]
Li, Hongsheng [2 ]
Cherian, Anoop [4 ]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08854 USA
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Univ Illinois, Urbana, IL USA
[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA
Keywords: none listed
DOI: not available
Chinese Library Classification: TP18 [Theory of Artificial Intelligence]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. The task thus poses a challenging multi-modal representation learning and reasoning scenario, and advances in it could benefit several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which provides better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline consisting of an intra-frame reasoning layer that produces spatio-semantic graph representations for every frame and an inter-frame aggregation module that captures temporal cues. The entire pipeline is trained end-to-end. We report experiments on the benchmark AVSD dataset for both the answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
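The abstract's central mechanism, shuffling the multi-head outputs of each Transformer module as a form of regularization, can be illustrated with a minimal PyTorch sketch. The sketch below is only an assumption about how such shuffling could be realized, not the authors' released code: the class name ShuffledMultiHeadAttention is hypothetical, and the choices to permute along the head axis and to do so only during training are illustrative.

import torch
import torch.nn as nn

class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are randomly permuted during training (illustrative sketch)."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, seq_len, head_dim).
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v  # (batch, num_heads, seq_len, head_dim)
        if self.training:
            # Shuffle the head order before the output projection so the projection
            # cannot latch onto a fixed head ordering (the assumed regularization effect).
            perm = torch.randperm(self.num_heads, device=x.device)
            heads = heads[:, perm]
        heads = heads.transpose(1, 2).reshape(b, n, self.num_heads * self.head_dim)
        return self.out_proj(heads)

At inference time the permutation is skipped, so the layer reduces to ordinary multi-head self-attention; a drop-in layer of this kind could sit inside each modality-specific Transformer module of the framework described above.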
Pages: 1415-1423 (9 pages)
Related Papers (50 in total)
  • [1] Contrastive Multi-Modal Knowledge Graph Representation Learning. Fang, Quan; Zhang, Xiaowei; Hu, Jun; Wu, Xian; Xu, Changsheng. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(09): 8983-8996.
  • [2] Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering. Xia, Wei; Wang, Tianxiu; Gao, Quanxue; Yang, Ming; Gao, Xinbo. IEEE Transactions on Image Processing, 2023, 32: 1170-1183.
  • [3] Multi-Modal Representation Learning for Short Video Understanding and Recommendation. Guo, Daya; Hong, Jiangshui; Luo, Binli; Yan, Qirui; Niu, Zhangming. 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2019: 687-690.
  • [4] Multi-modal Representation Learning for Video Advertisement Content Structuring. Guo, Daya; Zeng, Zhaoyang. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 4770-4774.
  • [5] Multi-modal Video Dialog State Tracking in the Wild. Abdessaied, Adnen; Shi, Lei; Bulling, Andreas. Computer Vision - ECCV 2024, Pt. LVII, 2025, 15115: 348-365.
  • [6] Multi-modal Graph Contrastive Learning for Micro-video Recommendation. Yi, Zixuan; Wang, Xi; Ounis, Iadh; Macdonald, Craig. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), 2022: 1807-1811.
  • [7] Towards Multi-modal Transformers in Federated Learning. Sun, Guangyu; Mendieta, Matias; Dutta, Aritra; Li, Xin; Chen, Chen. Computer Vision - ECCV 2024, Pt. XV, 2025, 15073: 229-246.
  • [8] Multi-modal Network Representation Learning. Zhang, Chuxu; Jiang, Meng; Zhang, Xiangliang; Ye, Yanfang; Chawla, Nitesh V. KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020: 3557-3558.
  • [9] MMKRL: A robust embedding approach for multi-modal knowledge graph representation learning. Lu, Xinyu; Wang, Lifang; Jiang, Zejun; He, Shichang; Liu, Shizhong. Applied Intelligence, 2022, 52(07): 7480-7497.