HMTV: hierarchical multimodal transformer for video highlight query on baseball

Cited by: 0
Authors
Zhang, Qiaoyun [1]
Chang, Chih-Yung [2]
Su, Ming-Yang [3]
Chang, Hsiang-Chuan [4]
Roy, Diptendu Sinha [5]
Affiliations
[1] Chuzhou Univ, Sch Comp & Informat Engn, Chuzhou 239000, Peoples R China
[2] Tamkang Univ, Dept Comp Sci & Informat Engn, New Taipei 25137, Taiwan
[3] Ming Chuan Univ, Dept Comp Sci & Informat Engn, Taoyuan 333, Taiwan
[4] Tamkang Univ, Dept Transportat Management, New Taipei 25137, Taiwan
[5] Natl Inst Technol, Dept Comp Sci & Engn, Shillong 793003, India
Keywords
Hierarchical multimodal Transformer; BERT; Highlight query; NETWORK
DOI
10.1007/s00530-024-01479-6
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the growing popularity of baseball videos, fans increasingly want to watch just the highlights. However, extracting highlights from lengthy baseball videos is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism called the Hierarchical Multimodal Transformer for Video query (HMTV). HMTV works in two phases: Coarse-Grained clipping of candidate videos and Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model extracts relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model capture relationship features between frames of candidate videos and between words of users' questions, respectively. These relationship features are then fed into the Video Query (VideoQ) model, which is implemented with Text Video Attention (TVA). The VideoQ model identifies the start and end positions, within the candidate videos, of the highlights mentioned in the query. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
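The two-phase flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`coarse_clip`, `text_video_attention`, `best_span`), the fixed clip length, the hand-made feature vectors, and the threshold are all hypothetical. In particular, the learned pitching detector, the Transformer encoder, BERT, and the VideoQ span predictor are replaced here by simple stand-ins (binary pitch flags, dot-product attention, and a maximum-sum contiguous span search).

```python
import math


def coarse_clip(pitch_flags, clip_len):
    """Coarse phase (stand-in): open a candidate clip at each frame flagged
    as a pitch, skipping flags that fall inside the previous clip.
    Returns half-open (start, end) frame ranges."""
    clips = []
    for i, flag in enumerate(pitch_flags):
        if flag and (not clips or i >= clips[-1][1]):
            clips.append((i, min(i + clip_len, len(pitch_flags))))
    return clips


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_video_attention(frames, words):
    """Fine phase (stand-in): each frame attends over the query words;
    the frame's relevance is its dot product with the attended context."""
    scores = []
    for f in frames:
        weights = softmax([dot(f, w) for w in words])
        context = [sum(wt * w[d] for wt, w in zip(weights, words))
                   for d in range(len(f))]
        scores.append(dot(f, context))
    return scores


def best_span(scores, threshold):
    """Assumed span selection: the contiguous run of frames maximizing the
    sum of (score - threshold). Returns inclusive (start, end) indices."""
    best_sum, best = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = 0.0, i  # restart the run at this frame
        cur_sum += s - threshold
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
    return best
```

In use, `coarse_clip` would first narrow a full game down to candidate clips, and `best_span(text_video_attention(frames, words), threshold)` would then be run on each candidate to locate the queried highlight. A trained model would predict the start and end positions directly rather than thresholding relevance scores.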
Pages: 18
Related papers
50 in total
  • [31] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Weichen, Sun
    Su, Fei
    Wang, Leiquan
    NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [32] Multimodal-enhanced hierarchical attention network for video captioning
    Maosheng Zhong
    Youde Chen
    Hao Zhang
    Hao Xiong
    Zhixiang Wang
    Multimedia Systems, 2023, 29 : 2469 - 2482
  • [33] Memory-enhanced hierarchical transformer for video paragraph captioning
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    NEUROCOMPUTING, 2025, 615
  • [34] Hierarchical Conditional Relation Networks for Multimodal Video Question Answering
    Thao Minh Le
    Vuong Le
    Svetha Venkatesh
    Truyen Tran
    International Journal of Computer Vision, 2021, 129 : 3027 - 3050
  • [35] Convolutional Hierarchical Attention Network for Query-Focused Video Summarization
    Xiao, Shuwen
    Zhao, Zhou
    Zhang, Zijian
    Yan, Xiaohui
    Yang, Min
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 12426 - 12433
  • [36] Video Referring Expression Comprehension via Transformer with Content-conditioned Query
    Jiang, Ji
    Cao, Meng
    Song, Tengtao
    Chen, Long
    Wang, Yi
    Zou, Yuexian
    PROCEEDINGS OF THE 1ST INTERNATIONAL WORKSHOP ON DEEP MULTIMODAL LEARNING FOR INFORMATION RETRIEVAL, MMIR 2023, 2023, : 39 - 48
  • [37] I-Brow: Hierarchical and Multimodal Transformer Model for Eyebrows Animation Synthesis
    Fares, Mireille
    Pelachaud, Catherine
    Obin, Nicolas
    ARTIFICIAL INTELLIGENCE IN HCI, AI-HCI 2023, PT II, 2023, 14051 : 435 - 452
  • [38] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
    Liu, Song
    Fan, Haoqi
    Qian, Shengsheng
    Chen, Yiru
    Ding, Wenkui
    Wang, Zhongyuan
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11895 - 11905
  • [39] Hierarchical Time-Aware Summarization with an Adaptive Transformer for Video Captioning
    Cardoso, Leonardo Vilela
    Guimaraes, Silvio Jamil Ferzoli
    do Patrocinio Jr, Zenilton Kleber Goncalves
    INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2023, 17 (04) : 569 - 592
  • [40] Video Joint Modelling Based on Hierarchical Transformer for Co-Summarization
    Li, Haopeng
    Ke, Qiuhong
    Gong, Mingming
    Zhang, Rui
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (03) : 3904 - 3917