HMTV: hierarchical multimodal transformer for video highlight query on baseball

Cited by: 0
Authors
Zhang, Qiaoyun [1]
Chang, Chih-Yung [2]
Su, Ming-Yang [3]
Chang, Hsiang-Chuan [4]
Roy, Diptendu Sinha [5]
Affiliations
[1] Chuzhou Univ, Sch Comp & Informat Engn, Chuzhou 239000, Peoples R China
[2] Tamkang Univ, Dept Comp Sci & Informat Engn, New Taipei 25137, Taiwan
[3] Ming Chuan Univ, Dept Comp Sci & Informat Engn, Taoyuan 333, Taiwan
[4] Tamkang Univ, Dept Transportat Management, New Taipei 25137, Taiwan
[5] Natl Inst Technol, Dept Comp Sci & Engn, Shillong 793003, India
Keywords
Hierarchical multimodal Transformer; BERT; Highlight query; NETWORK;
DOI
10.1007/s00530-024-01479-6
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
As baseball videos grow in popularity, fans increasingly want to watch just the highlights. However, extracting highlights from lengthy baseball videos is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism called the Hierarchical Multimodal Transformer for Video query (HMTV). HMTV works in two phases: Coarse-Grained clipping of candidate videos and Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model extracts relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model capture relationship features among the frames of candidate videos and among the words of users' questions, respectively. These relationship features are then fed into the Video Query (VideoQ) model, which is implemented with Text Video Attention (TVA). The VideoQ model identifies the start and end positions, within the candidate videos, of the highlights mentioned in the query. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
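The Fine-Grained step described in the abstract (attend between frame features and query-word features, then pick a start and end position) can be illustrated with a toy sketch. This is not the authors' VideoQ or TVA implementation: it has no learned weights, the function names (`frame_relevance`, `best_span`, `locate_highlight`) are hypothetical, the 2-dimensional feature vectors are made up, and the rise/fall heuristic for start and end scores is an assumption standing in for whatever prediction heads the paper uses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def frame_relevance(frames, words):
    """Score each frame against the query with scaled dot-product attention.

    Hypothetical stand-in for text-video attention: each frame attends over
    the query words, and its score is its similarity to the attended context.
    """
    scale = math.sqrt(len(words[0]))
    relevance = []
    for frame in frames:
        weights = softmax([dot(frame, w) / scale for w in words])
        # Context vector: attention-weighted sum of the word vectors.
        context = [
            sum(wt * w[i] for wt, w in zip(weights, words))
            for i in range(len(frame))
        ]
        relevance.append(dot(frame, context))
    return relevance


def best_span(start_scores, end_scores):
    """Return (start, end) with start <= end maximizing the summed scores."""
    best = (0, 0)
    best_val = -math.inf
    best_start = 0
    for end in range(len(end_scores)):
        # Track the best start position seen so far (index <= end).
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        val = start_scores[best_start] + end_scores[end]
        if val > best_val:
            best_val = val
            best = (best_start, end)
    return best


def locate_highlight(frames, words):
    """Assumed heuristic: a start is a rise in relevance, an end is a fall."""
    rel = frame_relevance(frames, words)
    n = len(rel)
    start_scores = [rel[t] - (rel[t - 1] if t > 0 else 0.0) for t in range(n)]
    end_scores = [rel[t] - (rel[t + 1] if t < n - 1 else 0.0) for t in range(n)]
    return best_span(start_scores, end_scores)


# Made-up features: frames 2..4 resemble the query, the others do not.
frames = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.9, 0.1]]
span = locate_highlight(frames, words)
print(span)
```

In this toy input the returned span covers the three frames whose features match the query. A trained model would replace the hand-made vectors with Transformer-encoder and BERT outputs and the heuristic with learned start/end predictors.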
Pages: 18
Related papers
50 items in total
  • [1] Hierarchical video summarization based on video structure and highlight
    Geng, Yuliang
    Xu, De
    Feng, Songhe
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, PROCEEDINGS, 2006, 4109 : 226 - 234
  • [2] Hierarchical multimodal transformer to summarize videos
    Zhao, Bin
    Gong, Maoguo
    Li, Xuelong
    NEUROCOMPUTING, 2022, 468 : 360 - 369
  • [3] Incremental Multimodal Query Construction for Video Search
    Xu, Shicheng
    Li, Huan
    Chang, Xiaojun
    Yu, Shoou-I
    Du, Xingzhong
    Li, Xuanchong
    Jiang, Lu
    Mao, Zexi
    Lan, Zhenzhong
    Burger, Susanne
    Hauptmann, Alexander
    ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2015, : 675 - 678
  • [4] Multimodal Query Suggestion and Searching for Video Search
    Li, Lvsong
    Li, Jing
    PROCEEDINGS OF THE 20TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATION, 2009, : 274 - 278
  • [5] Hierarchical Transformer-based Query by Multiple Documents
    Huang, Zhiqi
    Naseri, Shahrzad
    Bonab, Hamed
    Sarwar, Sheikh Muhammad
    Allan, James
    PROCEEDINGS OF THE 2023 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2023, 2023, : 105 - 115
  • [6] Enhancing Classification with Hierarchical Scalable Query on Fusion Transformer
    Sahoo, Sudeep Kumar
    Chalasani, Sathish
    Joshi, Abhishek
    Iyer, Kiran Nanjunda
    2023 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, ICCE, 2023,
  • [7] Multimodal Analysis for Deep Video Understanding with Video Language Transformer
    Zhang, Beibei
    Fang, Yaqun
    Ren, Tongwei
    Wu, Gangshan
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 7165 - 7169
  • [8] Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
    Moon, WonJun
    Hyun, Sangeek
    Park, SangUk
    Park, Dongchan
    Heo, Jae-Pil
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23023 - 23033
  • [9] MDMMT: Multidomain Multimodal Transformer for Video Retrieval
    Dzabraev, Maksim
    Kalashnikov, Maksim
    Komkov, Stepan
    Petiushko, Aleksandr
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 3349 - 3358
  • [10] MQSS: multimodal query suggestion and searching for video search
    Li, Lusong
    Li, Jing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2011, 54 (01) : 55 - 68