Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Cited by: 1

Authors:

Zahid, Idrees A. ^{[1]}

Joudar, Shahad Sabbar ^{[1]}

Albahri, A. S. ^{[2]}

Albahri, O. S. ^{[3,4]}

Alamoodi, A. H. ^{[5,6]}

Santamaria, Jose ^{[7]}

Alzubaidi, Laith ^{[8,9]}

Affiliations:

[1] Univ Technol Baghdad, Baghdad, Iraq

[2] Imam Jaafar Al Sadiq Univ, Tech Coll, Baghdad, Iraq

[3] Australian Tech & Management Coll, Melbourne, Australia

[4] Mazaya Univ Coll, Comp Tech Engn Dept, Nasiriyah, Iraq

[5] Appl Sci Private Univ, Appl Sci Res Ctr, Amman, Jordan

[6] Middle East Univ, MEU Res Unit, Amman, Jordan

[7] Univ Jaen, Dept Comp Sci, Jaen 23071, Spain

[8] Queensland Univ Technol, Sch Mech Med & Proc Engn, Brisbane, Qld 4000, Australia

[9] Queensland Univ Technol, Ctr Data Sci, Brisbane, Qld 4000, Australia

Source:

INTELLIGENT SYSTEMS WITH APPLICATIONS | 2024 / Vol. 23

Funding:

Australian Research Council;

Keywords:

OpenAI GPT-4; Google AI; Instruction-based analysis; Sarcasm detection; Deception avoidance; Transformers; ARTIFICIAL-INTELLIGENCE; CHALLENGES;

DOI:

10.1016/j.iswa.2024.200431

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, based on the GPT architecture, with Google AI, which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified to evaluate these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when the prompts contained verbal flaws, and tolerated grammatical errors. Moreover, GPT-4 was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 beat Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperformed GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a panel of human judges assessed the model comparisons. Statistical analysis of the judges' ratings revealed more robust results based on examining the specific uses, limitations, and expectations of both GPT-4 and Google AI-based approaches. Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.


Pages: 18
