Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Cited by: 0

Authors:

Arnold, Rahel [1]

Schuldt, Heiko [1]

Affiliations:

[1] Univ Basel, Basel, Switzerland

Source:

PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM(CUBE)A 2024 | 2024

Keywords:

Object Detection; LLM; LMM; Multimedia Retrieval; Extended Reality

DOI:

10.1145/3688866.3689126

Chinese Library Classification (CLC):

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Extended Reality (XR), which encompasses augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance to provide information via visual overlays, to interact with the user through conversational interfaces, to provide visual cues for directions, to explain the current scene, or even to use the scene or parts of it in automated queries. Key to scene understanding, and thus to all of these interactions, is high-quality object detection based on multimodal content (images, videos, audio, etc.). Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content. They are therefore an excellent basis for novel XR-based user interactions, provided that they deliver the necessary detection quality. This paper presents a two-stage analysis. In the first stage, the quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. The second stage uses Fooocus, a free and open-source AI image generator based on Stable Diffusion, to generate images from the descriptions derived in the first stage, and thereby evaluates the quality of those scene descriptions. The evaluation results show that LLaVA, KOSMOS-2, and YOLO each outperform the other approaches depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 achieves the best precision, and YOLO runs much faster and achieves the best F1 score. Fooocus manages to create images containing the specified objects while still taking the liberty of omitting or adding objects. The study therefore confirms the hypothesis that LMMs can be integrated into XR-based systems to further research on novel XR-based user interactions.
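The abstract compares the three systems by recall, precision, and F1 score. As a minimal sketch of how these metrics relate, the Python snippet below scores a list of predicted object labels against a list of ground-truth labels for one image. The function name `detection_scores`, the label-only matching (no bounding boxes or IoU thresholds), and the sample labels are all assumptions made for illustration. The record does not state how the paper matched detections to ground truth.

```python
from collections import Counter


def detection_scores(predicted, ground_truth):
    """Return (precision, recall, f1) for two lists of object labels.

    Hypothetical helper: labels are matched as multisets, so a label that
    appears twice in the ground truth must be predicted twice to count fully.
    Bounding boxes are ignored in this simplified sketch.
    """
    pred = Counter(predicted)
    truth = Counter(ground_truth)

    # True positives: labels present in both multisets (minimum of the counts).
    tp = sum((pred & truth).values())
    fp = sum(pred.values()) - tp   # predicted but not in the ground truth
    fn = sum(truth.values()) - tp  # in the ground truth but not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up example labels, not data from the paper.
p, r, f1 = detection_scores(
    predicted=["chair", "table", "lamp", "dog"],
    ground_truth=["chair", "table", "laptop"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A system that lists many objects tends to raise recall at the cost of precision, while a conservative system does the opposite. F1 is the harmonic mean of the two, which is why different systems can lead on different metrics.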

Pages: 26-35

Page count: 10
