Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Cited by: 0
Authors
Arnold, Rahel [1 ]
Schuldt, Heiko [1 ]
Affiliations
[1] Univ Basel, Basel, Switzerland
Keywords
Object Detection; LLM; LMM; Multimedia Retrieval; Extended Reality;
DOI
10.1145/3688866.3689126
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Extended Reality (XR), encompassing augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance, in order to provide information via visual overlays, to interact with the user through conversational interfaces, to give visual cues for directions, to explain the current scene, or even to use the current scene or parts thereof in automated queries. Key to scene understanding, and thus to all these user interactions, is high-quality object detection based on multimodal content such as images, videos, and audio. Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content and are therefore an excellent basis for novel XR-based user interactions, provided they deliver the necessary detection quality. This paper presents a two-stage analysis: in the first stage, the detection quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO; in the second stage, Fooocus, a free and open-source AI image generator based on Stable Diffusion, generates images from the scene descriptions derived in the first stage, thereby evaluating the quality of those descriptions. The results show that LLaVA, KOSMOS-2, and YOLO can each outperform the others depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 performs best in precision, and YOLO runs much faster and leads with the best F1 score. Fooocus manages to create images containing specific objects while still taking the liberty of omitting or adding others. Our study thus confirms our hypothesis that LMMs can be integrated into XR-based systems to support further research into novel XR-based user interactions.
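The stage-one comparison reduces to scoring each model's detected objects against ground-truth annotations with precision, recall, and F1, as the abstract reports. The following minimal Python sketch illustrates one way such a comparison could be computed, assuming simple set-based matching of lowercased object labels; the helper score_detections, the matching scheme, and the example label sets are illustrative assumptions, not the authors' actual evaluation protocol.

# Illustrative sketch, not the authors' evaluation code: score a model's
# detected object labels against ground truth with precision, recall, and F1.
from typing import Iterable, Set, Tuple

def score_detections(predicted: Iterable[str],
                     ground_truth: Iterable[str]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over two sets of object labels."""
    pred: Set[str] = {label.lower() for label in predicted}
    gold: Set[str] = {label.lower() for label in ground_truth}
    tp = len(pred & gold)  # labels present in both the prediction and the gold set
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Hypothetical outputs: YOLO emits class labels directly, whereas object
    # names would first have to be extracted from an LMM's free-text description.
    gold = ["person", "bicycle", "dog", "bench"]
    outputs = {
        "YOLO": ["person", "bicycle", "dog"],
        "LLaVA": ["person", "bicycle", "dog", "tree", "bench"],
    }
    for name, labels in outputs.items():
        p, r, f1 = score_detections(labels, gold)
        print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

On these made-up label sets, the broader LLaVA-style output trades precision for recall, which mirrors the direction of the trade-off reported in the abstract.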
Pages: 26-35 (10 pages)