Multimodal Understanding: Investigating the Capabilities of Large Multimodal Models for Object Detection in XR Applications

Cited by: 0
Authors
Arnold, Rahel [1 ]
Schuldt, Heiko [1 ]
Affiliations
[1] Univ Basel, Basel, Switzerland
Keywords
Object Detection; LLM; LMM; Multimedia Retrieval; Extended Reality;
DOI
10.1145/3688866.3689126
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Extended Reality (XR), encompassing augmented, virtual, and mixed reality, has the potential to offer unprecedented types of user interaction. An essential requirement is the automated understanding of a user's current scene, for instance, to provide information via visual overlays, to interact with a user through conversational interfaces, to give visual cues on directions, to explain the current scene, or even to use the current scene or parts thereof in automated queries. Key to scene understanding, and thus to all these user interactions, is high-quality object detection based on multimodal content such as images, videos, and audio. Large Multimodal Models (LMMs) seamlessly process text in conjunction with such multimodal content; they are therefore an excellent basis for novel XR-based user interactions, provided they deliver the necessary detection quality. This paper presents a two-stage analysis. In the first stage, the detection quality of two of the most prominent LMMs (LLaVA and KOSMOS-2) is compared with that of the object detector YOLO. The second stage uses Fooocus, a free and open-source AI image generator based on Stable Diffusion, to generate images from the scene descriptions derived in the first stage and thereby assess the quality of those descriptions. The evaluation results show that each of LLaVA, KOSMOS-2, and YOLO can outperform the others depending on the specific research focus: LLaVA achieves the highest recall, KOSMOS-2 yields the best precision, and YOLO runs much faster and leads with the best F1 score. Fooocus manages to create images containing specific objects while still taking the liberty to omit or add other objects. Our study thus confirms our hypothesis that LMMs can be integrated into XR-based systems to further research on novel XR-based user interactions.
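As a rough illustration of the first-stage comparison, the following is a minimal sketch (not the authors' implementation) of how per-image object labels, e.g. parsed from LLaVA or KOSMOS-2 scene descriptions or taken from YOLO detections, could be scored against ground-truth labels with precision, recall, and F1. All data and variable names below are hypothetical.

# Minimal sketch (not the authors' code): score per-image object-label sets
# against ground truth with precision, recall, and F1. The label sets for
# "LLaVA" and "YOLO" below are purely hypothetical placeholders.

from typing import Dict, Set, Tuple


def prf1(predicted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over the label sets of a single image."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(predictions: Dict[str, Set[str]],
                  ground_truth: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Average the per-image scores over all ground-truth images."""
    scores = [prf1(predictions.get(img, set()), labels)
              for img, labels in ground_truth.items()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


if __name__ == "__main__":
    ground_truth = {"img1": {"person", "dog", "bench"},
                    "img2": {"car", "traffic light"}}
    llava = {"img1": {"person", "dog", "tree", "bench"},   # hypothetical output
             "img2": {"car"}}
    yolo = {"img1": {"person", "dog"},                     # hypothetical output
            "img2": {"car", "traffic light"}}

    for name, preds in [("LLaVA", llava), ("YOLO", yolo)]:
        p, r, f = macro_average(preds, ground_truth)
        print(f"{name}: precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")

Macro-averaging over images is only one possible aggregation choice; the paper's actual evaluation protocol may differ.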
Pages: 26-35
Number of pages: 10