Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Cited by: 7
Authors
Zhang, Beibei [1]
Yu, Fan [1,2]
Gao, Yanxin [1]
Ren, Tongwei [1,2]
Wu, Gangshan [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (US);
Keywords
Deep video understanding; relationship analysis; interaction analysis; multimodal feature fusion;
DOI
10.1145/3474085.3479214
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task requires recognizing interactions at the scene level and relationships at the movie level, and answering questions at both levels. In this paper, we propose a solution to the DVU task that applies joint learning of interaction and relationship prediction together with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which exploit text, visual, and audio features and predict representations in a shared semantic space. Since sentiment, interaction, and relationship are related to one another, we train a unified framework with joint learning. We then answer the video-analysis questions in DVU according to the results of the three sub-tasks. Experiments on the HLVU dataset demonstrate the effectiveness of our method.
Pages: 4848-4852
Page count: 5
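
The abstract describes a joint-learning architecture: per-modality features are projected into a shared semantic space, fused, and supervised by three task heads whose losses are trained together. Below is a minimal illustrative sketch of such a design in PyTorch. It is not the authors' released code; all dimensions, layer choices, label counts, and the uniform loss weighting are assumptions for illustration only.

import torch
import torch.nn as nn

class JointDVUModel(nn.Module):
    """Sketch of multimodal fusion with three jointly trained heads.
    Feature dimensions and class counts are hypothetical placeholders."""

    def __init__(self, text_dim=768, visual_dim=2048, audio_dim=128,
                 fused_dim=512, num_sentiments=5, num_interactions=10,
                 num_relationships=20):
        super().__init__()
        # Project each modality into a common semantic space, then fuse.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.fusion = nn.Sequential(
            nn.Linear(3 * fused_dim, fused_dim), nn.ReLU(), nn.Dropout(0.1))
        # One classification head per sub-task named in the abstract.
        self.sentiment_head = nn.Linear(fused_dim, num_sentiments)
        self.interaction_head = nn.Linear(fused_dim, num_interactions)
        self.relationship_head = nn.Linear(fused_dim, num_relationships)

    def forward(self, text_feat, visual_feat, audio_feat):
        fused = self.fusion(torch.cat([
            self.text_proj(text_feat),
            self.visual_proj(visual_feat),
            self.audio_proj(audio_feat)], dim=-1))
        return (self.sentiment_head(fused),
                self.interaction_head(fused),
                self.relationship_head(fused))

# Joint training step on a dummy batch of 4 pre-extracted feature vectors.
# Summing the three losses (uniform weights are an assumption) sends
# gradients from every sub-task through the shared fusion layers.
model = JointDVUModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

text = torch.randn(4, 768)
visual = torch.randn(4, 2048)
audio = torch.randn(4, 128)
sent_y = torch.randint(0, 5, (4,))
inter_y = torch.randint(0, 10, (4,))
rel_y = torch.randint(0, 20, (4,))

sent_logits, inter_logits, rel_logits = model(text, visual, audio)
loss = (criterion(sent_logits, sent_y)
        + criterion(inter_logits, inter_y)
        + criterion(rel_logits, rel_y))
optimizer.zero_grad()
loss.backward()
optimizer.step()

Sharing the fusion layers across the three heads is what lets correlated signals (sentiment, interaction, relationship) reinforce one another, which is the stated motivation for joint learning in the abstract.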