HMTV: hierarchical multimodal transformer for video highlight query on baseball

Cited by: 0
Authors
Zhang, Qiaoyun [1]
Chang, Chih-Yung [2]
Su, Ming-Yang [3]
Chang, Hsiang-Chuan [4]
Roy, Diptendu Sinha [5]
Affiliations
[1] Chuzhou Univ, Sch Comp & Informat Engn, Chuzhou 239000, Peoples R China
[2] Tamkang Univ, Dept Comp Sci & Informat Engn, New Taipei 25137, Taiwan
[3] Ming Chuan Univ, Dept Comp Sci & Informat Engn, Taoyuan 333, Taiwan
[4] Tamkang Univ, Dept Transportat Management, New Taipei 25137, Taiwan
[5] Natl Inst Technol, Dept Comp Sci & Engn, Shillong 793003, India
Keywords
Hierarchical multimodal Transformer; BERT; Highlight query; NETWORK;
DOI
10.1007/s00530-024-01479-6
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
With the increasing popularity of watching baseball videos, there is a growing desire among fans to enjoy the highlights of these videos. However, extracting highlights from lengthy baseball videos is challenging because it is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism, called Hierarchical Multimodal Transformer for Video query (HMTV). The proposed HMTV follows a two-phase process: Coarse-Grained clipping of candidate videos and Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model is employed to extract relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder and pre-trained Bidirectional Encoder Representations from Transformers (BERT) are utilized to capture relationship features between frames of candidate videos and between words from users' questions, respectively. These relationship features are then fed into the Video Query (VideoQ) model, implemented by the Text Video Attention (TVA). The VideoQ model identifies the start and end positions of the highlights mentioned in the query within the candidate videos. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
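The abstract states that the VideoQ model identifies the start and end positions of a highlight within a candidate video, but it does not say how those positions are decoded. The sketch below illustrates one common approach borrowed from extractive question answering, and it is an assumption rather than the authors' method: given one start score and one end score per frame, pick the pair that maximizes their sum with the start not after the end. The function name `best_span`, the `max_len` limit, and the example scores are all hypothetical.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int) -> Tuple[int, int]:
    """Return (start, end) frame indices that maximize
    start_scores[start] + end_scores[end], subject to
    start <= end < start + max_len.

    Assumption: a model has already produced one start score and one
    end score per frame of the candidate video.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end frames within max_len frames of the start.
        last = min(len(end_scores), s + max_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical per-frame scores for a five-frame candidate video.
start = [0.1, 2.0, 0.3, 0.2, 0.1]
end = [0.0, 0.1, 0.4, 1.5, 0.2]
print(best_span(start, end, max_len=4))  # (1, 3)
```

The exhaustive search costs O(n * max_len) for n frames, which is usually acceptable for short candidate clips. Bounding the span length keeps the decoder from returning an implausibly long highlight.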
Pages: 18
Related papers
50 in total
  • [21] Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
    Wang, Ping
    Zhang, Yulun
    Wang, Lishun
    Yuan, Xin
    COMPUTER VISION - ECCV 2024, PT LXXXI, 2025, 15139 : 104 - 122
  • [22] Real-time highlight detection in baseball video for TVs with time-shift function
    Kim, Hyoung-Gook
    Jeong, Jinguk
    Kim, Jang-Heon
    Kim, Jin Young
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2008, 54 (02) : 831 - 838
  • [23] Hierarchical & multimodal video captioning: Discovering and transferring multimodal knowledge for vision to language
    Liu, An-An
    Xu, Ning
    Wong, Yongkang
    Li, Junnan
    Su, Yu-Ting
    Kankanhalli, Mohan
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2017, 163 : 113 - 125
  • [24] Topic-aware video summarization using multimodal transformer
    Zhu, Yubo
    Zhao, Wentian
    Hua, Rui
    Wu, Xinxiao
    PATTERN RECOGNITION, 2023, 140
  • [25] Multimodal Interaction Fusion Network Based on Transformer for Video Captioning
    Xu, Hui
    Zeng, Pengpeng
    Khan, Abdullah Aman
    ARTIFICIAL INTELLIGENCE AND ROBOTICS, ISAIR 2022, PT I, 2022, 1700 : 21 - 36
  • [26] Skim-and-scan transformer: A new transformer-inspired architecture for video-query based video moment retrieval
    Huo, Shuwei
    Zhou, Yuan
    Chen, Keran
    Xiang, Wei
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 270
  • [27] MCT-VHD: Multi-modal contrastive transformer for video highlight detection
    Jiang, Yinhui
    Luo, Sihui
    Guo, Lijun
    Zhang, Rong
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2024, 101
  • [28] Dual-Stream Multimodal Learning for Topic-Adaptive Video Highlight Detection
    Xiong, Ziwei
    Wang, Han
    PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023 : 272 - 279
  • [29] Hierarchical Conditional Relation Networks for Multimodal Video Question Answering
    Le, Thao Minh
    Le, Vuong
    Venkatesh, Svetha
    Tran, Truyen
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129 (11) : 3027 - 3050
  • [30] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482