Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Cited by: 7
Authors
Zhang, Beibei [1 ]
Yu, Fan [1 ,2 ]
Gao, Yanxin [1 ]
Ren, Tongwei [1 ,2 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (US);
Keywords
Deep video understanding; relationship analysis; interaction analysis; multimodal feature fusion;
DOI
10.1145/3474085.3479214
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task was proposed to recognize interactions at the scene level and relationships at the movie level, and to answer questions on these two levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which utilize text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the DVU video-analysis questions according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
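The architecture described above — per-scene text, visual, and audio features fused into one representation that feeds three jointly trained task heads — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the feature dimensions, the concatenation-based fusion, and the linear heads are all assumptions made for the example.

```python
# Sketch (assumed, not the paper's code): multimodal feature fusion
# shared by three jointly learned prediction heads.
import numpy as np

rng = np.random.default_rng(0)

def fuse(text, visual, audio):
    """Fuse per-scene text, visual, and audio features by concatenation
    (one simple fusion choice; the paper may use a different scheme)."""
    return np.concatenate([text, visual, audio], axis=-1)

class JointHeads:
    """One shared fused input, three task-specific linear heads:
    scene sentiment, scene interaction, super-scene relationship."""
    def __init__(self, in_dim, n_sentiment, n_interaction, n_relationship):
        self.W_sent = rng.normal(size=(in_dim, n_sentiment))
        self.W_inter = rng.normal(size=(in_dim, n_interaction))
        self.W_rel = rng.normal(size=(in_dim, n_relationship))

    def forward(self, x):
        # Joint learning would sum the three task losses and backpropagate
        # through the shared fused representation.
        return x @ self.W_sent, x @ self.W_inter, x @ self.W_rel

# Toy per-scene features with assumed dimensions (300 / 512 / 128).
text = rng.normal(size=300)
visual = rng.normal(size=512)
audio = rng.normal(size=128)

fused = fuse(text, visual, audio)           # shape (940,)
heads = JointHeads(fused.shape[0], 3, 10, 15)
sent, inter, rel = heads.forward(fused)     # shapes (3,), (10,), (15,)
```

Because all three heads read the same fused representation, gradients from each task shape the shared features, which is the intuition behind training the sub-tasks jointly rather than separately.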
Pages: 4848-4852 (5 pages)