Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Cited by: 1
Authors
Zahid, Idrees A. [1]
Joudar, Shahad Sabbar [1]
Albahri, A. S. [2]
Albahri, O. S. [3, 4]
Alamoodi, A. H. [5, 6]
Santamaria, Jose [7]
Alzubaidi, Laith [8, 9]
Affiliations
[1] Univ Technol Baghdad, Baghdad, Iraq
[2] Imam Jaafar Al Sadiq Univ, Tech Coll, Baghdad, Iraq
[3] Australian Tech & Management Coll, Melbourne, Australia
[4] Mazaya Univ Coll, Comp Tech Engn Dept, Nasiriyah, Iraq
[5] Appl Sci Private Univ, Appl Sci Res Ctr, Amman, Jordan
[6] Middle East Univ, MEU Res Unit, Amman, Jordan
[7] Univ Jaen, Dept Comp Sci, Jaen 23071, Spain
[8] Queensland Univ Technol, Sch Mech Med & Proc Engn, Brisbane, Qld 4000, Australia
[9] Queensland Univ Technol, Ctr Data Sci, Brisbane, Qld 4000, Australia
Source
Funding
Australian Research Council;
Keywords
OpenAI GPT-4; Google AI; Instruction-based analysis; Sarcasm detection; Deception avoidance; Transformers; ARTIFICIAL-INTELLIGENCE; CHALLENGES;
DOI
10.1016/j.iswa.2024.200431
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, which is based on the GPT architecture, with Google's artificial intelligence (Google AI), which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified to evaluate these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when they contained verbal flaws, and tolerated grammatical errors. Moreover, GPT-4 was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 outperformed Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperformed GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a panel of human judges assessed and evaluated the model comparisons. Statistical analysis of the judges' ratings revealed more robust results based on examining the specific uses, limitations, and expectations of both GPT-4 and Google AI-based approaches. Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.
Pages: 18