HMTV: hierarchical multimodal transformer for video highlight query on baseball

Cited by: 0

Authors:

Zhang, Qiaoyun [1]

Chang, Chih-Yung [2]

Su, Ming-Yang [3]

Chang, Hsiang-Chuan [4]

Roy, Diptendu Sinha [5]

Affiliations:

[1] Chuzhou Univ, Sch Comp & Informat Engn, Chuzhou 239000, Peoples R China

[2] Tamkang Univ, Dept Comp Sci & Informat Engn, New Taipei 25137, Taiwan

[3] Ming Chuan Univ, Dept Comp Sci & Informat Engn, Taoyuan 333, Taiwan

[4] Tamkang Univ, Dept Transportat Management, New Taipei 25137, Taiwan

[5] Natl Inst Technol, Dept Comp Sci & Engn, Shillong 793003, India

Source:

MULTIMEDIA SYSTEMS | 2024 / Vol. 30 / Issue 05

Keywords:

Hierarchical multimodal Transformer; BERT; Highlight query; NETWORK;

DOI:

10.1007/s00530-024-01479-6

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

With the growing popularity of baseball videos, fans increasingly want to watch just the highlights. However, extracting highlights from lengthy baseball videos is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism called Hierarchical Multimodal Transformer for Video query (HMTV). HMTV works in two phases: Coarse-Grained clipping of candidate videos and Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model extracts relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder captures relationship features among the frames of the candidate videos, and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model captures relationship features among the words of users' questions. These relationship features are then fed into the Video Query (VideoQ) model, which is implemented with Text Video Attention (TVA). The VideoQ model identifies the start and end positions, within the candidate videos, of the highlights mentioned in the query. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
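The abstract describes the final step only at a high level: a model outputs start and end positions for a highlight, and the result is scored by precision, recall, and F1-score. The following is a minimal sketch of how such a step could look, not the authors' implementation. It assumes the model yields one start score and one end score per frame (as in BERT-style span extraction) and that evaluation is done at the frame level over inclusive spans; the function names `best_span` and `span_prf` are hypothetical.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float]) -> Tuple[int, int]:
    """Return (start, end) frame indices with start <= end that maximize
    start_scores[start] + end_scores[end].

    Assumption: per-frame start/end scores; the abstract does not specify
    how HMTV decodes its start and end positions.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best = (0, 0)
    best_score = float("-inf")
    best_start = 0  # index of the highest start score seen so far
    for end in range(len(end_scores)):
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        score = start_scores[best_start] + end_scores[end]
        if score > best_score:
            best_score = score
            best = (best_start, end)
    return best


def span_prf(pred: Tuple[int, int], truth: Tuple[int, int]) -> Tuple[float, float, float]:
    """Frame-level precision, recall, and F1 for two inclusive frame spans.

    Assumption: frame-level overlap; the paper's exact protocol may differ.
    """
    overlap = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1)
    pred_len = pred[1] - pred[0] + 1
    truth_len = truth[1] - truth[0] + 1
    precision = overlap / pred_len
    recall = overlap / truth_len
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a predicted span of frames 1 to 3 against a ground-truth span of frames 2 to 5 shares two frames, giving precision 2/3, recall 1/2, and F1 4/7.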


Pages: 18
