Extended Reality (XR), encompassing augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interactions. An essential requirement is the automated understanding of a user's current scene, for instance, to provide information via visual overlays, to interact with a user through conversational interfaces, to give visual cues on directions, to explain the current scene, or even to use the current scene or parts thereof in automated queries. Key to scene understanding, and thus to all these user interactions, is high-quality object detection based on multimodal content such as images, videos, and audio. Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content. They are therefore an excellent basis for novel XR-based user interactions, provided that they deliver the necessary detection quality. This paper presents a two-stage analysis: In the first stage, the detection quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. In the second stage, Fooocus, a free and open-source AI image generator based on Stable Diffusion, is used to generate images from the scene descriptions derived in the first stage, thereby evaluating the quality of these descriptions. The evaluation results show that LLaVA, KOSMOS-2, and YOLO each outperform the others depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 achieves the highest precision, and YOLO runs much faster and leads with the best F1 score. Fooocus manages to create images containing the specified objects while still taking the liberty of omitting or adding other objects. Our study therefore confirms our hypothesis that LMMs can be integrated into XR-based systems to advance research on novel XR-based user interactions.