Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Citations: 0

Authors:

Geng, Shijie [1]

Gao, Peng [2]

Chatterjee, Moitreya [3]

Hori, Chiori [4]

Le Roux, Jonathan [4]

Zhang, Yongfeng [1]

Li, Hongsheng [2]

Cherian, Anoop [4]

Affiliations:

[1] Rutgers State Univ, Piscataway, NJ 08854 USA

[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[3] Univ Illinois, Urbana, IL USA

[4] Mitsubishi Elect Res Labs MERL, Cambridge, MA USA

Source:

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2021, Vol. 35

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which yields better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
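The abstract mentions a shuffling scheme on multi-head outputs but does not give its details. The following is only a minimal sketch of one plausible reading: the per-head output vectors are permuted at the head level before being concatenated. The function name `shuffle_and_concat_heads`, the choice to shuffle only during training, and the use of plain Python lists are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional

Vector = List[float]


def shuffle_and_concat_heads(
    head_outputs: List[Vector],
    rng: Optional[random.Random] = None,
    training: bool = True,
) -> Vector:
    """Concatenate per-head output vectors into one vector.

    Assumption: during training the order of the heads is randomly
    permuted before concatenation; at inference the original order
    is kept. Values inside each head are never reordered.
    """
    order = list(range(len(head_outputs)))
    if training:
        # Permute head indices only; each head's vector stays intact.
        (rng or random).shuffle(order)
    merged: Vector = []
    for idx in order:
        merged.extend(head_outputs[idx])
    return merged


if __name__ == "__main__":
    # Three hypothetical heads, each producing a 2-dimensional output.
    heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(shuffle_and_concat_heads(heads, rng=random.Random(0)))
    print(shuffle_and_concat_heads(heads, training=False))
```

In this sketch the downstream layer sees the same head outputs in a varying order, so it cannot rely on a fixed head position, which is one way such a scheme could act as a regularizer.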


Pages: 1415-1423

Page count: 9
