Multimodal Video Retrieval and Multimodal Language Modelling

Cited by: 0

Authors:

Wang, Hui [1]

Kittler, Josef [2]

Gales, Mark [3]

Cooper, Rob [4]

Mulvenna, Maurice [5]

Ng, Wing [6]

Hua, Yang [1]

Gault, Richard [1]

Haider, Abbas [1]

Wu, Guanfeng [7]

Affiliations:

[1] Queens Univ Belfast, Belfast, North Ireland

[2] Univ Surrey, London, England

[3] Univ Cambridge, Cambridge, England

[4] BBC, London, England

[5] Univ Ulster, Belfast, North Ireland

[6] South China Univ Technol China, Guangzhou, Peoples R China

[7] Southwest Jiaotong Univ China, Chengdu, Peoples R China

Source:

PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024 | 2024

Keywords:

Information Retrieval; Deep Learning; Large Language Models; Multimodal data retrieval; Multimodal data understanding and interaction;

DOI:

10.1145/3652583.3660001

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

As video content continues to proliferate and many video archives lack suitable metadata, video retrieval, particularly through example-based search, has become increasingly crucial. Existing metadata often fails to meet the needs of specific types of searches, especially when videos contain elements from different modalities, such as visual and audio. Consequently, developing video retrieval methods that can handle multi-modal content is essential. In designing our novel video retrieval framework, Multi-modal Video Search by Examples (MVSE), we focused on accuracy (precision and recall), efficiency (retrieval time in seconds), interactivity, and extensibility. Its key components include advanced data processing and a user-friendly interface aimed at enhancing search effectiveness and user experience. With the advent of Large Language Models (LLMs), the interaction between multimodal data, including image and audio, has been transformed, marking a significant leap towards the larger goal of artificial general intelligence. This workshop aims to bring together experts from diverse domains to explore novel approaches to multimodal data search, understanding, and interaction.


Pages: 1345-1355

Number of pages: 11

50 records in total

[21] Joint embeddings with multimodal cues for video-text retrieval
Niluthpol C. Mithun
Juncheng Li
Florian Metze
Amit K. Roy-Chowdhury
International Journal of Multimedia Information Retrieval, 2019, 8 : 3 - 18
[22] Joint embeddings with multimodal cues for video-text retrieval
Mithun, Niluthpol C.
Li, Juncheng
Metze, Florian
Roy-Chowdhury, Amit K.
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
[23] Multimodal Video Annotation for Retrieval and Discovery of Newsworthy Video in a News Verification Scenario
Nixon, Lyndon
Apostolidis, Evlampios
Markatopoulou, Foteini
Patras, Ioannis
Mezaris, Vasileios
MULTIMEDIA MODELING (MMM 2019), PT I, 2019, 11295 : 143 - 155
[24] M3: Multimodal Memory Modelling for Video Captioning
Wang, Junbo
Wang, Wei
Huang, Yan
Wang, Liang
Tan, Tieniu
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7512 - 7520
[25] Multimodal Transformer for Unaligned Multimodal Language Sequences
Tsai, Yao-Hung Hubert
Bai, Shaojie
Liang, Paul Pu
Kolter, J. Zico
Morency, Louis-Philippe
Salakhutdinov, Ruslan
57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 6558 - 6569
[26] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval
Ma, Wentao
Chen, Qingchao
Zhou, Tongqing
Zhao, Shan
Cai, Zhiping
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (10) : 5486 - 5497
[27] A sports video browsing and retrieval system based on multimodal analysis: SportSBR
Liu, HY
Zhang, H
PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 5077 - 5081
[28] A multimodal and multilevel ranking scheme for large-scale video retrieval
Hoi, Steven C. H.
Lyu, Michael R.
IEEE TRANSACTIONS ON MULTIMEDIA, 2008, 10 (04) : 607 - 619
[29] Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism
Tai, Tianyang
Zeng, Fanfeng
PROCEEDINGS OF 2023 7TH INTERNATIONAL CONFERENCE ON ELECTRONIC INFORMATION TECHNOLOGY AND COMPUTER ENGINEERING, EITCE 2023, 2023, : 470 - 474
[30] A multimodal and multilevel ranking framework for content-based video retrieval
Hoi, Steven C. H.
Lyu, Michael R.
2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 1225 - +
