Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Cited by: 1
Authors
Zahid, Idrees A. [1]
Joudar, Shahad Sabbar [1]
Albahri, A. S. [2]
Albahri, O. S. [3,4]
Alamoodi, A. H. [5,6]
Santamaria, Jose [7]
Alzubaidi, Laith [8,9]
Affiliations
[1] Univ Technol Baghdad, Baghdad, Iraq
[2] Imam Jaafar Al Sadiq Univ, Tech Coll, Baghdad, Iraq
[3] Australian Tech & Management Coll, Melbourne, Australia
[4] Mazaya Univ Coll, Comp Tech Engn Dept, Nasiriyah, Iraq
[5] Appl Sci Private Univ, Appl Sci Res Ctr, Amman, Jordan
[6] Middle East Univ, MEU Res Unit, Amman, Jordan
[7] Univ Jaen, Dept Comp Sci, Jaen 23071, Spain
[8] Queensland Univ Technol, Sch Mech Med & Proc Engn, Brisbane, Qld 4000, Australia
[9] Queensland Univ Technol, Ctr Data Sci, Brisbane, Qld 4000, Australia
Source
Funding
Australian Research Council
Keywords
OpenAI GPT-4; Google AI; Instruction-based analysis; Sarcasm detection; Deception avoidance; Transformers; ARTIFICIAL-INTELLIGENCE; CHALLENGES;
DOI
10.1016/j.iswa.2024.200431
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have become a major topic in AI because of their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, which is based on the GPT architecture, with Google AI, which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified for evaluating these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when they contained verbal flaws, and tolerated grammatical errors. Moreover, GPT-4 was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 beat Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperformed GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a panel of human judges assessed the model comparisons. Statistical analysis of the judges' ratings revealed more robust results based on examining the specific uses, limitations, and expectations of both GPT-4 and Google AI-based approaches. Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.
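The abstract reports success rates but not the total number of prompts. As a rough consistency check, the sketch below assumes that each of the eight capabilities was tested with ten prompts and that every prompt counts once, giving 80 prompts in total. That total, and the helper names `success_rate` and `implied_successes`, are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction

# Capability names as listed in the abstract.
CAPABILITIES = [
    "translation accuracy",
    "text generation",
    "factuality",
    "creativity",
    "intellect",
    "deception avoidance",
    "sentiment classification",
    "sarcasm detection",
]
PROMPTS_PER_CAPABILITY = 10  # "ten research questions per category"

# Assumption (not stated in the abstract): every prompt counts once,
# so the total is 8 * 10 = 80.
TOTAL_PROMPTS = len(CAPABILITIES) * PROMPTS_PER_CAPABILITY


def success_rate(successes: int, total: int = TOTAL_PROMPTS) -> float:
    """Return the percentage of prompts answered successfully."""
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    return float(Fraction(successes, total) * 100)


def implied_successes(percent: float, total: int = TOTAL_PROMPTS) -> int:
    """Return the whole number of successes closest to a reported percentage."""
    return round(percent * total / 100)


if __name__ == "__main__":
    for model, reported in [("GPT-4", 85.0), ("Google AI", 68.7)]:
        count = implied_successes(reported)
        print(f"{model}: {count}/{TOTAL_PROMPTS} = {success_rate(count)}%")
```

Under this assumption, the reported 85% corresponds to 68 of 80 prompts and the reported 68.7% to 55 of 80 prompts. The latter is exactly 68.75%, which would appear as 68.7 if the figure was truncated rather than rounded. The paper itself remains the authority on the actual counts.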
Pages: 18