Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Cited by: 7
Authors
Zhang, Beibei [1 ]
Yu, Fan [1 ,2 ]
Gao, Yanxin [1 ]
Ren, Tongwei [1 ,2 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
US National Science Foundation;
Keywords
Deep video understanding; relationship analysis; interaction analysis; multimodal feature fusion;
DOI
10.1145/3474085.3479214
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task was proposed: it requires recognizing interactions at the scene level and relationships at the movie level, and answering questions at both levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which exploit text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the video-analysis questions in DVU according to the results of the three sub-tasks. Experiments on the HLVU dataset demonstrate the effectiveness of our method.
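The abstract does not specify the fusion or head architectures. As a minimal sketch of the overall idea — concatenating text, visual, and audio features into a shared representation that feeds three task heads (sentiment, interaction, relationship) — the following toy example uses plain NumPy with linear layers; all dimensions, class counts, and weight names are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_predict(text_feat, visual_feat, audio_feat, weights):
    """Concatenate per-modality features, map them to a shared
    representation, and apply one linear head per sub-task."""
    fused = np.concatenate([text_feat, visual_feat, audio_feat])
    shared = np.tanh(weights["shared"] @ fused)  # shared semantic representation
    return {
        "sentiment": weights["sentiment"] @ shared,
        "interaction": weights["interaction"] @ shared,
        "relationship": weights["relationship"] @ shared,
    }

# Hypothetical feature sizes: 300-d text, 512-d visual, 128-d audio.
dims = {"text": 300, "visual": 512, "audio": 128}
hidden = 256
weights = {
    "shared": rng.standard_normal((hidden, sum(dims.values()))) * 0.01,
    "sentiment": rng.standard_normal((3, hidden)) * 0.01,     # assumed 3 sentiment classes
    "interaction": rng.standard_normal((10, hidden)) * 0.01,  # assumed 10 interaction types
    "relationship": rng.standard_normal((8, hidden)) * 0.01,  # assumed 8 relationship types
}

preds = fuse_and_predict(
    rng.standard_normal(dims["text"]),
    rng.standard_normal(dims["visual"]),
    rng.standard_normal(dims["audio"]),
    weights,
)
print({k: v.shape for k, v in preds.items()})
```

In a joint-learning setup the three heads would be trained together against a summed loss, so the shared encoder benefits from the correlations among sentiment, interaction, and relationship labels that the abstract points out.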
Pages: 4848-4852 (5 pages)
Related Papers
50 records in total
  • [1] Deep Relationship Analysis in Video with Multimodal Feature Fusion
    Yu, Fan
    Wang, DanDan
    Zhang, Beibei
    Ren, Tongwei
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4640 - 4644
  • [2] Special video classification based on multitask learning and multimodal feature fusion
    Wu X.-Y.
    Gu C.-N.
    Wang S.-J.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2020, 28 (05): : 1177 - 1186
  • [3] Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning
    Chen, Hongjun
    Li, Hengyi
    Wu, Xueqin
    2020 THE 4TH INTERNATIONAL CONFERENCE ON MANAGEMENT ENGINEERING, SOFTWARE ENGINEERING AND SERVICE SCIENCES (ICMSS 2020), 2020, : 73 - 76
  • [4] A short video sentiment analysis model based on multimodal feature fusion
    Shi, Hongyu
    SYSTEMS AND SOFT COMPUTING, 2024, 6
  • [5] Multimodal Feature Learning for Video Captioning
    Lee, Sujin
    Kim, Incheol
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [6] Multimodal Feature Fusion Video Description Model Integrating Attention Mechanisms and Contrastive Learning
    Wang Zhihao
    Che Zhanbin
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (04) : 388 - 395
  • [7] Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion
    Wang, Jingru
    Wen, Shipeng
    Liu, Wenjie
    Meng, Xianglian
    Jiao, Zhuqing
    BIODATA MINING, 2024, 17 (01):
  • [8] Salient feature multimodal image fusion with a joint sparse model and multiscale dictionary learning
    Zhang, Chengfang
    Feng, Ziliang
    Gao, Zhisheng
    Jin, Xin
    Yan, Dan
    Yi, Liangzhong
    OPTICAL ENGINEERING, 2020, 59 (05)
  • [9] Adaptive Learning for Multimodal Fusion in Video Search
    Lee, Wen-Yu
    Wu, Po-Tun
    Hsu, Winston
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2009, 2009, 5879 : 659 - 670
  • [10] Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
    Abdessaied, Adnen
    Sood, Ekta
    Bulling, Andreas
    PROCEEDINGS OF THE 7TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP, 2022, : 143 - 155