Continuous Sign Language Recognition Based on 3D Hand Skeleton Data

Cited: 0
|
Authors
Wang Z. [1 ]
Zhang J. [1 ]
Affiliations
[1] School of Computer Engineering and Science, Shanghai University, Shanghai
Keywords
Attention mechanism; Residual network; Sign language recognition; Skeleton data
DOI
10.3724/SP.J.1089.2021.18816
Abstract
In sign language recognition, visual interference from factors such as background and lighting must be eliminated. Therefore, an end-to-end continuous sign language recognition model is designed using low-redundancy skeleton data. First, shape and trajectory features are extracted intra-frame and inter-frame respectively, which reduces the discreteness of the original samples. Second, a series of parallel two-stream residual networks is constructed to fuse the shape and trajectory features and generate the spatial-temporal feature sequence. Finally, an attention-based encoder-decoder network maps the fused feature sequence to the translated text. In addition, a new skeleton-based sign language dataset, named LMSLR, is collected with Leap Motion. Experimental results on the LMSLR dataset and the public CSL dataset show that the proposed model achieves higher accuracy and lower computational complexity than most video-based models. © 2021, Beijing China Science Journal Publishing Co. Ltd. All rights reserved.
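The abstract's first stage, extracting shape features within a frame and trajectory features between frames, can be illustrated with a minimal sketch. This is an assumption about plausible feature definitions (pairwise joint distances for shape, per-joint displacements for trajectory), not the paper's exact formulation; the joint count and toy data are illustrative only.

```python
import math

def shape_features(frame):
    """Intra-frame shape: pairwise Euclidean distances between 3D joints."""
    feats = []
    for i in range(len(frame)):
        for j in range(i + 1, len(frame)):
            feats.append(math.dist(frame[i], frame[j]))
    return feats

def trajectory_features(prev_frame, frame):
    """Inter-frame trajectory: per-joint displacement vectors."""
    return [tuple(c2 - c1 for c1, c2 in zip(p, q))
            for p, q in zip(prev_frame, frame)]

# Toy sequence: 2 frames, 3 joints each (a real hand skeleton has ~21 joints).
seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
]
shape = shape_features(seq[0])              # 3 pairwise distances
traj = trajectory_features(seq[0], seq[1])  # 3 displacement vectors
```

In the model described above, sequences of such intra-frame and inter-frame features would feed the two parallel streams of the residual network before fusion.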
Pages: 1899-1907 (8 pages)