Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Cited by: 7
Authors
Zhang, Beibei [1 ]
Yu, Fan [1 ,2 ]
Gao, Yanxin [1 ]
Ren, Tongwei [1 ,2 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
US National Science Foundation;
Keywords
Deep video understanding; relationship analysis; interaction analysis; multimodal feature fusion;
DOI
10.1145/3474085.3479214
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task was proposed: it requires recognizing interactions at the scene level and relationships at the movie level, and answering questions at both levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which exploit text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the video-analysis questions in DVU according to the results of the three sub-tasks. Experiments on the HLVU dataset demonstrate the effectiveness of our method.
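The abstract does not specify the fusion or head architectures. As a minimal sketch of the overall idea — concatenating text, visual, and audio features into a shared representation that feeds three task heads (sentiment, interaction, relationship) — the following toy example uses plain NumPy with linear layers; all dimensions, class counts, and weight names are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_predict(text_feat, visual_feat, audio_feat, weights):
    """Concatenate per-modality features, map them to a shared
    representation, and apply one linear head per sub-task."""
    fused = np.concatenate([text_feat, visual_feat, audio_feat])
    shared = np.tanh(weights["shared"] @ fused)  # shared semantic representation
    return {
        "sentiment": weights["sentiment"] @ shared,
        "interaction": weights["interaction"] @ shared,
        "relationship": weights["relationship"] @ shared,
    }

# Hypothetical feature sizes: 300-d text, 512-d visual, 128-d audio.
dims = {"text": 300, "visual": 512, "audio": 128}
hidden = 256
weights = {
    "shared": rng.standard_normal((hidden, sum(dims.values()))) * 0.01,
    "sentiment": rng.standard_normal((3, hidden)) * 0.01,     # assumed 3 sentiment classes
    "interaction": rng.standard_normal((10, hidden)) * 0.01,  # assumed 10 interaction types
    "relationship": rng.standard_normal((8, hidden)) * 0.01,  # assumed 8 relationship types
}

preds = fuse_and_predict(
    rng.standard_normal(dims["text"]),
    rng.standard_normal(dims["visual"]),
    rng.standard_normal(dims["audio"]),
    weights,
)
print({k: v.shape for k, v in preds.items()})
```

In a joint-learning setup the three heads would be trained together against a summed loss, so the shared encoder benefits from the correlations among sentiment, interaction, and relationship labels that the abstract points out.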
Pages: 4848-4852 (5 pages)
Related Papers
50 records in total
  • [1] Deep Relationship Analysis in Video with Multimodal Feature Fusion
    Yu, Fan
    Wang, DanDan
    Zhang, Beibei
    Ren, Tongwei
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4640 - 4644
  • [2] Special video classification based on multitask learning and multimodal feature fusion
    Wu X.-Y.
    Gu C.-N.
    Wang S.-J.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2020, 28 (05): : 1177 - 1186
  • [3] Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning
    Chen, Hongjun
    Li, Hengyi
    Wu, Xueqin
    2020 THE 4TH INTERNATIONAL CONFERENCE ON MANAGEMENT ENGINEERING, SOFTWARE ENGINEERING AND SERVICE SCIENCES (ICMSS 2020), 2020, : 73 - 76
  • [4] A short video sentiment analysis model based on multimodal feature fusion
    Shi, Hongyu
    SYSTEMS AND SOFT COMPUTING, 2024, 6
  • [5] Multimodal Feature Learning for Video Captioning
    Lee, Sujin
    Kim, Incheol
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [6] Multimodal Feature Fusion Video Description Model Integrating Attention Mechanisms and Contrastive Learning
    Wang Zhihao
    Che Zhanbin
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (04) : 388 - 395
  • [7] Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion
    Wang, Jingru
    Wen, Shipeng
    Liu, Wenjie
    Meng, Xianglian
    Jiao, Zhuqing
    BIODATA MINING, 2024, 17 (01):
  • [8] Salient feature multimodal image fusion with a joint sparse model and multiscale dictionary learning
    Zhang, Chengfang
    Feng, Ziliang
    Gao, Zhisheng
    Jin, Xin
    Yan, Dan
    Yi, Liangzhong
    OPTICAL ENGINEERING, 2020, 59 (05)
  • [9] Adaptive Learning for Multimodal Fusion in Video Search
    Lee, Wen-Yu
    Wu, Po-Tun
    Hsu, Winston
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2009, 2009, 5879 : 659 - 670
  • [10] Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
    Abdessaied, Adnen
    Sood, Ekta
    Bulling, Andreas
    PROCEEDINGS OF THE 7TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP, 2022, : 143 - 155