Multimodal deep representation learning for video classification

Cited by: 43
Authors
Tian, Haiman [1]
Tao, Yudong [2]
Pouyanfar, Samira [1]
Chen, Shu-Ching [1]
Shyu, Mei-Ling [2]
Affiliations
[1] Florida International University, School of Computing and Information Sciences, Miami, FL 33199 USA
[2] University of Miami, Department of Electrical and Computer Engineering, Coral Gables, FL 33124 USA
Keywords
Multimodal deep learning; Transfer learning; Multi-stage fusion; Disaster management system; Event detection
DOI
10.1007/s11280-018-0548-3
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Real-world applications usually encounter data with various modalities, each containing valuable information. To enhance these applications, it is essential to analyze effectively all the information extracted from the different data modalities, yet most existing learning models ignore some data types and focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos that leverages recent advances in deep neural networks. First, several deep learning models are used to extract useful information from multiple modalities. Among these are pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates the different data representations at two levels, namely the frame level and the video level. Unlike existing multimodal learning algorithms, the proposed framework can reason about a missing data type using the other available data modalities. The proposed framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate the effectiveness of the proposed framework compared to several single-modality deep learning models as well as conventional fusion techniques. Specifically, the final accuracy improves by more than 16% and 7% over the best single-modality and fusion results, respectively.
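The abstract names two fusion levels but does not specify the operators used at either one. As a rough illustration only, the sketch below assumes concatenation for frame-level fusion, mean pooling plus concatenation for video-level fusion, and a zero vector as a placeholder for a missing modality. All function names, dimensions, and values are hypothetical and are not taken from the paper; in particular, the zero placeholder is a simple stand-in, not the paper's mechanism for reasoning about a missing data type.

```python
from statistics import fmean
from typing import List, Optional, Sequence

Vector = List[float]


def fuse_frame(visual: Sequence[float],
               audio: Optional[Sequence[float]],
               audio_dim: int) -> Vector:
    """Frame-level fusion (assumed): concatenate visual and audio features."""
    if audio is None:
        # Assumption: a missing audio feature is replaced by zeros.
        audio = [0.0] * audio_dim
    return list(visual) + list(audio)


def fuse_video(frames: Sequence[Vector],
               text: Optional[Sequence[float]],
               text_dim: int) -> Vector:
    """Video-level fusion (assumed): mean-pool frames, then append text."""
    # zip(*frames) iterates over feature columns across all frames.
    pooled = [fmean(column) for column in zip(*frames)]
    if text is None:
        # Assumption: a missing text embedding is replaced by zeros.
        text = [0.0] * text_dim
    return pooled + list(text)


# Toy inputs: two frames, 2-D visual features, 1-D audio features,
# and a 2-D text embedding. The second frame has no audio.
visual_frames = [[1.0, 3.0], [3.0, 5.0]]
audio_frames = [[2.0], None]
text_embedding = [0.5, 0.5]

fused_frames = [fuse_frame(v, a, audio_dim=1)
                for v, a in zip(visual_frames, audio_frames)]
video_vector = fuse_video(fused_frames, text_embedding, text_dim=2)
print(video_vector)
```

In a real pipeline the per-frame vectors would come from pre-trained CNNs and the text vector from a word embedding model, and the fused video vector would feed a classifier over the event classes.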
Pages: 1325-1341
Number of pages: 17