Deep Fusion of Multiple Semantic Cues for Complex Event Recognition

Cited by: 50
Authors
Zhang, Xishan [1, 2]
Zhang, Hanwang [3]
Zhang, Yongdong [1]
Yang, Yang [4]
Wang, Meng [5]
Luan, Huanbo [6]
Li, Jintao [1]
Chua, Tat-Seng [3]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Natl Univ Singapore, Sch Comp, Singapore 117417, Singapore
[4] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[5] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[6] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Funding
National High Technology Research and Development Program of China (863 Program)
Keywords
Multimedia event recognition; deep learning; fusion; CLASSIFICATION; SCALE;
DOI
10.1109/TIP.2015.2511585
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a deep learning strategy to fuse multiple semantic cues for complex event recognition. In particular, we tackle the recognition task by answering how to jointly analyze human actions (who is doing what), objects (what), and scenes (where). First, each type of semantic feature (e.g., human action trajectories) is fed into a corresponding multi-layer feature abstraction pathway, and a fusion layer then connects all the pathways. Second, the correlations among the semantic cues, that is, how they interact with each other, are learned without supervision in the manner of a cross-modality autoencoder. Finally, by fine-tuning with a large-margin objective deployed on this deep architecture, we are able to answer how the semantic cues of who, what, and where compose a complex event. Compared with traditional feature fusion methods (e.g., various early or late fusion strategies), our method jointly learns the essential higher-level features that are most effective for fusion and recognition. We perform extensive experiments on two real-world complex event video benchmarks, MED'11 and CCV, and demonstrate that our method outperforms the best published results by 21% and 11%, respectively, on an event recognition task.
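The architecture described in the abstract (one abstraction pathway per cue, a fusion layer joining them, and a large-margin objective on top) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the tanh nonlinearity, the random initialization, the Crammer-Singer form of the hinge loss, and all function names are assumptions chosen for illustration, and the unsupervised cross-modality autoencoder pre-training step is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(x, weights, biases):
    """Plain linear map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def dense(x, weights, biases):
    """Fully connected layer followed by tanh (nonlinearity is an assumption)."""
    return [math.tanh(v) for v in affine(x, weights, biases)]

def init_pathway(sizes, rng):
    """Stack of layers, e.g. sizes [8, 5, 3] gives layers 8->5 and 5->3."""
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def run_pathway(x, layers):
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

def fuse(cues, pathways, fusion_layer):
    """Abstract each cue in its own pathway, concatenate, apply the fusion layer."""
    joined = []
    for name in sorted(cues):  # fixed order so the concatenation is stable
        joined.extend(run_pathway(cues[name], pathways[name]))
    return dense(joined, *fusion_layer)

def multiclass_hinge(scores, label, margin=1.0):
    """Large-margin loss: a wrong class scoring within `margin` of the true class is penalized."""
    worst = max(margin + s - scores[label]
                for k, s in enumerate(scores) if k != label)
    return max(0.0, worst)

# Hypothetical feature sizes for the three cues named in the abstract.
rng = random.Random(0)
dims = {"action": 8, "object": 6, "scene": 4}
pathways = {name: init_pathway([d, 5, 3], rng) for name, d in dims.items()}
fusion_layer = init_layer(3 * len(dims), 4, rng)  # 3 pathway outputs of size 3
classifier = init_layer(4, 3, rng)                # 3 hypothetical event classes

cues = {name: [rng.random() for _ in range(d)] for name, d in dims.items()}
fused = fuse(cues, pathways, fusion_layer)
scores = affine(fused, *classifier)
loss = multiclass_hinge(scores, label=1)
```

In a full system, the pathway and fusion weights would first be pre-trained and then fine-tuned by minimizing the hinge loss over labeled videos. The sketch only shows how the three cues flow into a single fused representation that the margin-based objective scores.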
Pages: 1033-1046
Number of pages: 14