A deep co-attentive hand-based video question answering framework using multi-view skeleton

Authors
Razieh Rastgoo
Kourosh Kiani
Sergio Escalera
Affiliations
[1] Semnan University, Department of Electrical and Computer Engineering
[2] University of Barcelona and Computer Vision Center
Source
Multimedia Tools and Applications, 2023, 82 (01)
Keywords
Video question answering (video-QA); Dynamic hand gesture recognition; BERT; Co-attention; RGB video
DOI
Not available
Abstract
In this paper, we present a novel hand-based Video Question Answering (Video-QA) framework, entitled Multi-View Video Question Answering (MV-VQA), which takes RGB videos as input and employs the Single Shot Detector (SSD), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and a co-attention mechanism. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation that extracts efficient multi-view features from the hand using a combination of five 3DCNNs and one LSTM network. In the language block, we use the BERT model to obtain the question embedding. Finally, we apply a co-attention mechanism to the vision and language features to predict the final answer. For the first time, we propose a hand-based Video-QA framework that combines multi-view hand skeleton features with the question embedding and a co-attention mechanism. Our framework can process an arbitrary number of questions in the dataset annotations. The framework has several possible application domains; here, we apply it to dynamic hand gesture recognition, also for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we perform a step-by-step analysis of the impact of hand detection and the multi-view hand skeleton on model performance. Evaluation results on five datasets (two for Video-QA, two for dynamic hand gesture recognition, and one for hand action recognition) show that MV-VQA outperforms state-of-the-art alternatives.
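The attention block described in the abstract can be illustrated with a small sketch. The abstract does not give the exact co-attention formulation, so the code below is only a simplified, parameter-free dot-product variant and not the authors' implementation. All names (`co_attention`, `vision`, `language`) and the toy vectors are hypothetical; in the paper the vision features would come from the 3DCNN and LSTM block and the language features from BERT.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, rows):
    """Weighted sum of row vectors, returning a single vector."""
    dim = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(dim)]


def co_attention(vision, language):
    """Attend over vision and language features jointly.

    vision:   list of T feature vectors (assumed: one per video segment)
    language: list of N feature vectors (assumed: one per question token)
    Both must share the same feature dimension.
    """
    # Affinity between every video step and every question token.
    affinity = [[dot(v, q) for q in language] for v in vision]

    # Score each video step by its best-matching token, and each
    # token by its best-matching video step (max-pooled affinity).
    vision_scores = [max(row) for row in affinity]
    language_scores = [
        max(affinity[t][n] for t in range(len(vision)))
        for n in range(len(language))
    ]

    vision_weights = softmax(vision_scores)
    language_weights = softmax(language_scores)

    # Attended summaries, concatenated into one fused vector that an
    # answer classifier could consume.
    fused = weighted_sum(vision_weights, vision) + weighted_sum(language_weights, language)
    return fused, vision_weights, language_weights


# Toy usage: two video steps, one question token, feature dimension 2.
fused, v_w, l_w = co_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy call the first video step matches the question token more closely, so it receives the larger vision weight. A trained model would replace the raw dot product with learned projections.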
Pages: 1401 - 1429
Number of pages: 28
Related papers
50 in total
  • [1] A deep co-attentive hand-based video question answering framework using multi-view skeleton
    Rastgoo, Razieh
    Kiani, Kourosh
    Escalera, Sergio
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (01) : 1401 - 1429
  • [2] Hand sign language recognition using multi-view hand skeleton
    Rastgoo, Razieh
    Kiani, Kourosh
    Escalera, Sergio
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2020, 150
  • [3] Explainable automated seizure detection using attentive deep multi-view networks
    Einizade, Aref
    Nasiri, Samaneh
    Mozafari, Mohsen
    Sardouie, Sepideh Hajipour
    Clifford, Gari D.
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 79
  • [4] A framework for multi-view video coding using layered depth images
    Yoon, SU
    Lee, EK
    Kim, SY
    Ho, YS
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2005, PT 1, 2005, 3767 : 431 - 442
  • [5] Video semantic segmentation using deep multi-view representation learning
    Sellami, Akrem
    Tabbone, Salvatore
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021 : 8133 - 8139
  • [6] MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition
    Li, Shaochen
    Liu, Zhenyu
    Duan, Guifang
    Tan, Jianrong
    [J]. SIGNAL IMAGE AND VIDEO PROCESSING, 2023, 17 (05) : 2521 - 2529
  • [7] Interpretable Multimodal Sentiment Classification Using Deep Multi-View Attentive Network of Image and Text Data
    Al-Tameemi, Israa Khalaf Salman
    Feizi-Derakhshi, Mohammad-Reza
    Pashazadeh, Saeid
    Asadpour, Mohammad
    [J]. IEEE ACCESS, 2023, 11 : 91060 - 91081
  • [8] A framework for representation and processing of multi-view video using the concept of layered depth image
    Yoon, Seung-Uk
    Lee, Eun-Kyung
    Kim, Sung-Yeol
    Ho, Yo-Sung
    [J]. JOURNAL OF VLSI SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2007, 46 (2-3) : 87 - 102