Joint Learning for Relationship and Interaction Analysis in Video with Multimodal Feature Fusion

Cited by: 7
Authors
Zhang, Beibei [1 ]
Yu, Fan [1 ,2 ]
Gao, Yanxin [1 ]
Ren, Tongwei [1 ,2 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Shenzhen Res Inst, Shenzhen, Peoples R China
Funding
National Science Foundation (US);
Keywords
Deep video understanding; relationship analysis; interaction analysis; multimodal feature fusion;
DOI
10.1145/3474085.3479214
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
To comprehend long-duration videos, the deep video understanding (DVU) task was proposed to recognize interactions at the scene level and relationships at the movie level, and to answer questions on these two levels. In this paper, we propose a solution to the DVU task that combines joint learning of interaction and relationship prediction with multimodal feature fusion. Our solution decomposes the DVU task into three jointly learned sub-tasks: scene sentiment classification, scene interaction recognition, and super-scene video relationship recognition, all of which utilize text, visual, and audio features and predict representations in a semantic space. Since sentiment, interaction, and relationship are related to each other, we train a unified framework with joint learning. We then answer the DVU video-analysis questions according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.
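The architecture described above — per-scene text, visual, and audio features fused into one representation that feeds three jointly trained task heads — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the feature dimensions, the concatenation-based fusion, and the linear heads are all assumptions made for the example.

```python
# Sketch (assumed, not the paper's code): multimodal feature fusion
# shared by three jointly learned prediction heads.
import numpy as np

rng = np.random.default_rng(0)

def fuse(text, visual, audio):
    """Fuse per-scene text, visual, and audio features by concatenation
    (one simple fusion choice; the paper may use a different scheme)."""
    return np.concatenate([text, visual, audio], axis=-1)

class JointHeads:
    """One shared fused input, three task-specific linear heads:
    scene sentiment, scene interaction, super-scene relationship."""
    def __init__(self, in_dim, n_sentiment, n_interaction, n_relationship):
        self.W_sent = rng.normal(size=(in_dim, n_sentiment))
        self.W_inter = rng.normal(size=(in_dim, n_interaction))
        self.W_rel = rng.normal(size=(in_dim, n_relationship))

    def forward(self, x):
        # Joint learning would sum the three task losses and backpropagate
        # through the shared fused representation.
        return x @ self.W_sent, x @ self.W_inter, x @ self.W_rel

# Toy per-scene features with assumed dimensions (300 / 512 / 128).
text = rng.normal(size=300)
visual = rng.normal(size=512)
audio = rng.normal(size=128)

fused = fuse(text, visual, audio)           # shape (940,)
heads = JointHeads(fused.shape[0], 3, 10, 15)
sent, inter, rel = heads.forward(fused)     # shapes (3,), (10,), (15,)
```

Because all three heads read the same fused representation, gradients from each task shape the shared features, which is the intuition behind training the sub-tasks jointly rather than separately.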
Pages: 4848-4852 (5 pages)