Multimodal Score Fusion with Sparse Low-rank Bilinear Pooling for Egocentric Hand Action Recognition

Cited by: 0
Author
Roy, Kankana [1, 2]
Affiliations
[1] Karolinska Inst, Dept Oncol Pathol, S-17177 Stockholm, Sweden
[2] Indian Inst Technol Kharagpur, Dept Comp Sci & Engn, Kharagpur 721302, West Bengal, India
Keywords
Bilinear score pooling; egocentric hand action recognition; RGB-D videos; sparse; low rank; CNN; RNN; neural networks
DOI
10.1145/3656044
Chinese Library Classification (CLC)
TP [Automation and computer technology]
Discipline code
0812
Abstract
With the advent of egocentric cameras, new challenges arise that traditional computer vision methods cannot adequately handle. Moreover, egocentric cameras often provide multiple modalities that must be modeled jointly to exploit complementary information. In this article, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN that encodes RGB and depth information to produce classification probabilities; a novel bilinear score pooling block that generates a score matrix; a sparse low-rank matrix recovery block that reduces the redundant features common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. We propose to fuse classification probabilities rather than traditional CNN features from the RGB and depth modalities, using an effective yet simple sparse low-rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. The proposed method outperforms state-of-the-art methods, achieving accuracies of 78.55% and 96.87% on THU-READ in the cross-subject and cross-group settings, respectively, and accuracies of 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.
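The core fusion idea in the abstract, pooling the per-class probability vectors of the two modalities into a score matrix and then recovering a low-rank version of it, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and truncated SVD stands in here for the paper's sparse low-rank matrix recovery step.

```python
import numpy as np

def bilinear_score_pooling(rgb_probs, depth_probs):
    # Outer product of the two C-dimensional class-probability
    # vectors yields a C x C fused RGB-D score matrix.
    return np.outer(rgb_probs, depth_probs)

def low_rank_approx(score_matrix, rank=1):
    # Truncated SVD as a simplified stand-in for sparse low-rank
    # matrix recovery: keep only the top `rank` singular components.
    U, s, Vt = np.linalg.svd(score_matrix, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

# Toy example: softmax outputs over 4 classes from each modality.
rgb = np.array([0.7, 0.1, 0.1, 0.1])
depth = np.array([0.6, 0.2, 0.1, 0.1])
M = bilinear_score_pooling(rgb, depth)      # 4 x 4 score matrix
M_lr = low_rank_approx(M, rank=1)           # low-rank recovery
```

Because the outer product of two vectors has rank one, the rank-1 approximation here recovers `M` exactly; in the paper's setting the recovery acts on noisier, redundant fused scores before the frame-level classifier.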
Pages: 22