HMTV: hierarchical multimodal transformer for video highlight query on baseball

Cited by: 0

Authors:

Zhang, Qiaoyun [1]

Chang, Chih-Yung [2]

Su, Ming-Yang [3]

Chang, Hsiang-Chuan [4]

Roy, Diptendu Sinha [5]

Affiliations:

[1] Chuzhou Univ, Sch Comp & Informat Engn, Chuzhou 239000, Peoples R China

[2] Tamkang Univ, Dept Comp Sci & Informat Engn, New Taipei 25137, Taiwan

[3] Ming Chuan Univ, Dept Comp Sci & Informat Engn, Taoyuan 333, Taiwan

[4] Tamkang Univ, Dept Transportat Management, New Taipei 25137, Taiwan

[5] Natl Inst Technol, Dept Comp Sci & Engn, Shillong 793003, India

Source:

MULTIMEDIA SYSTEMS | 2024 / Vol. 30 / Issue 05

Keywords:

Hierarchical multimodal Transformer; BERT; Highlight query; NETWORK;

DOI:

10.1007/s00530-024-01479-6

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

As watching baseball videos grows in popularity, more fans want to enjoy just the highlights. However, extracting highlights from lengthy baseball videos is time-consuming and labor-intensive. To address this challenge, this paper proposes a novel mechanism called the Hierarchical Multimodal Transformer for Video query (HMTV). HMTV works in two phases: Coarse-Grained clipping of candidate videos and Fine-Grained identification of highlights. In the Coarse-Grained phase, a pitching detection model extracts relevant candidate videos from baseball videos, encompassing the features of pitch deliveries and pitching. In the Fine-Grained phase, a Transformer encoder captures relationship features between the frames of the candidate videos, and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model captures relationship features between the words of users' questions. These relationship features are then fed into the Video Query (VideoQ) model, which is implemented with Text Video Attention (TVA). The VideoQ model identifies the start and end positions, within the candidate videos, of the highlights mentioned in the query. Simulation results demonstrate that the proposed HMTV significantly improves the accuracy of highlight identification in terms of precision, recall, and F1-score.
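The abstract reports precision, recall, and F1-score but does not state how they are computed. As a minimal sketch, assuming the metrics are taken at frame level from the overlap between one predicted highlight span and one ground-truth span, they could be calculated as follows. The function name span_prf and the inclusive (start, end) frame convention are illustrative assumptions, not taken from the paper.

```python
def span_prf(pred, truth):
    """Frame-level precision, recall, and F1 for two highlight spans.

    Assumption (not from the paper): each span is an inclusive
    (start_frame, end_frame) pair, and a frame counts as correct
    when it lies in both the predicted and the ground-truth span.
    """
    pred_frames = set(range(pred[0], pred[1] + 1))
    true_frames = set(range(truth[0], truth[1] + 1))
    overlap = len(pred_frames & true_frames)

    # Guard against empty spans to avoid division by zero.
    precision = overlap / len(pred_frames) if pred_frames else 0.0
    recall = overlap / len(true_frames) if true_frames else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Predicted frames 10..19 against ground-truth frames 15..24:
    # 5 of 10 predicted frames are correct, 5 of 10 true frames are found.
    print(span_prf((10, 19), (15, 24)))  # (0.5, 0.5, 0.5)
```

Under this convention, a prediction that starts too early lowers precision, while one that ends too early lowers recall; the paper's actual evaluation protocol may differ.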


Pages: 18

