Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Cited by: 0
Authors
Arnold, Rahel [1]
Schuldt, Heiko [1]
Affiliations
[1] Univ Basel, Basel, Switzerland
Keywords
Object Detection; LLM; LMM; Multimedia Retrieval; Extended Reality
DOI
10.1145/3688866.3689126
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Extended Reality (XR), which encompasses augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance to provide information via visual overlays, to interact with the user through conversational interfaces, to provide visual cues on directions, to explain the current scene, or even to use the scene or parts of it in automated queries. Key to scene understanding, and thus to all these interactions, is high-quality object detection based on multimodal content (images, videos, audio, etc.). Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content. They are therefore an excellent basis for novel XR-based user interactions, provided that they deliver the necessary detection quality. This paper presents a two-stage analysis. In the first stage, the quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. The second stage evaluates the quality of the scene descriptions obtained in the first stage: it uses Fooocus, a free and open-source AI image generator based on Stable Diffusion, to generate images from those descriptions. The evaluation results show that LLaVA, KOSMOS-2, and YOLO can each outperform the other approaches, depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 the best precision, and YOLO runs much faster and achieves the best F1 score. Fooocus manages to create images containing specific objects, while still taking the liberty of omitting or adding objects. Our study therefore confirms our hypothesis that LMMs can be integrated into XR-based systems to further research on novel XR-based user interactions.
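The abstract compares the three systems by recall, precision, and F1 score. As a rough illustration of how these metrics relate, the following sketch scores one image by comparing predicted object labels against ground-truth labels. It is not the authors' evaluation code: the record does not say how detections were matched, so label-level matching without bounding boxes is an assumption here, and the function name `detection_scores` and the example labels are hypothetical.

```python
from collections import Counter


def detection_scores(predicted, ground_truth):
    """Return (precision, recall, f1) for one image.

    Assumption: labels are compared as multisets, so two predicted
    "chair" labels can match at most two ground-truth "chair" labels.
    Bounding boxes are ignored in this simplified sketch.
    """
    pred = Counter(predicted)
    truth = Counter(ground_truth)
    # Multiset intersection counts the correctly detected labels.
    true_positives = sum((pred & truth).values())
    n_pred = sum(pred.values())
    n_truth = sum(truth.values())
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_truth if n_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for a single scene.
truth_labels = ["chair", "chair", "table", "laptop"]
predicted_labels = ["chair", "table", "cup"]
precision, recall, f1 = detection_scores(predicted_labels, truth_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this simplification, a model that lists many objects tends to gain recall at the cost of precision, while a conservative one does the opposite. F1 is the harmonic mean of the two, so a system can lead on F1 without leading on either individual metric.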
Pages: 26-35
Number of pages: 10
Related papers
50 in total
  • [31] Multimodal Object Detection by Channel Switching and Spatial Attention
    Cao, Yue
    Bin, Junchi
    Hamari, Jozsef
    Blasch, Erik
    Liu, Zheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW, 2023, : 403 - 411
  • [32] Improving Multimodal Object Detection with Individual Sensor Monitoring
    Kuhn, Christopher B.
    Hofbauer, Markus
    Bowen, Ma
    Petrovic, Goran
    Steinbach, Eckehard
    2022 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2022, : 97 - 104
  • [33] MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy
    Chen, Yuhao
    Gunraj, Hayden
    Zeng, E. Zhixuan
    Meyers, Robbie
    Gilles, Maximilian
    Wong, Alexander
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW, 2023, : 68 - 77
  • [34] Object Detection via Multimodal Adaptive Feature Fusion
    Gao Xiaoqiang
    Chang Kan
    Ling Mingyang
    Yin Mengyu
    LASER & OPTOELECTRONICS PROGRESS, 2023, 60 (24)
  • [35] Multimodal Fusion Object Detection System for Autonomous Vehicles
    Person, Michael
    Jensen, Mathew
    Smith, Anthony O.
    Gutierrez, Hector
    JOURNAL OF DYNAMIC SYSTEMS MEASUREMENT AND CONTROL-TRANSACTIONS OF THE ASME, 2019, 141 (07):
  • [36] Multimodal Object Query Initialization for 3D Object Detection
    van Geerenstein, Mathijs R.
    Ruppel, Felicia
    Dietmayer, Klaus
    Gavrila, Dariu M.
    2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2024), 2024, : 12484 - 12491
  • [37] YOLOrs: Object Detection in Multimodal Remote Sensing Imagery
    Sharma, Manish
    Dhanaraj, Mayur
    Karnam, Srivallabha
    Chachlakis, Dimitris G.
    Ptucha, Raymond
    Markopoulos, Panos P.
    Saber, Eli
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 : 1497 - 1508
  • [38] Informative Data Selection With Uncertainty for Multimodal Object Detection
    Zhang, Xinyu
    Li, Zhiwei
    Zou, Zhenhong
    Gao, Xin
    Xiong, Yijin
    Jin, Dafeng
    Li, Jun
    Liu, Huaping
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (10) : 13561 - 13573
  • [39] Weakly Aligned Feature Fusion for Multimodal Object Detection
    Zhang, Lu
    Liu, Zhiyong
    Zhu, Xiangyu
    Song, Zhan
    Yang, Xu
    Lei, Zhen
    Qiao, Hong
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021,
  • [40] Network Models in Neuroimaging: A Survey of Multimodal Applications
    Mancini, Matteo
    Cercignani, Mara
    FUNDAMENTA INFORMATICAE, 2018, 163 (01) : 63 - 91