Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Cited by: 1
Authors
Zahid, Idrees A. [1]
Joudar, Shahad Sabbar [1]
Albahri, A. S. [2]
Albahri, O. S. [3, 4]
Alamoodi, A. H. [5, 6]
Santamaria, Jose [7]
Alzubaidi, Laith [8, 9]
Affiliations
[1] Univ Technol Baghdad, Baghdad, Iraq
[2] Imam Jaafar Al Sadiq Univ, Tech Coll, Baghdad, Iraq
[3] Australian Tech & Management Coll, Melbourne, Australia
[4] Mazaya Univ Coll, Comp Tech Engn Dept, Nasiriyah, Iraq
[5] Appl Sci Private Univ, Appl Sci Res Ctr, Amman, Jordan
[6] Middle East Univ, MEU Res Unit, Amman, Jordan
[7] Univ Jaen, Dept Comp Sci, Jaen 23071, Spain
[8] Queensland Univ Technol, Sch Mech Med & Proc Engn, Brisbane, Qld 4000, Australia
[9] Queensland Univ Technol, Ctr Data Sci, Brisbane, Qld 4000, Australia
Source
Funding
Australian Research Council
Keywords
OpenAI GPT-4; Google AI; Instruction-based analysis; Sarcasm detection; Deception avoidance; Transformers; ARTIFICIAL-INTELLIGENCE; CHALLENGES;
DOI
10.1016/j.iswa.2024.200431
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, which is based on the GPT architecture, with Google's artificial intelligence (AI), which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified for evaluating these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when the prompts contained verbal flaws, and it tolerated grammatical errors. Moreover, the GPT-4-based approach was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 beats Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperforms GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a human panel of judges was used to assess and evaluate the model comparisons. Statistical analysis of the judges' ratings revealed more robust results based on examining the specific uses, limitations, and expectations of both the GPT-4 and Google AI-based approaches.
Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.
Pages: 18