Distance Matters: Euclidean Embedding Distances for Improved Language Model Generalization and Adaptability

Cited by: 0
Authors
Alshamrani, Sultan [1 ]
Affiliations
[1] Saudi Elect Univ, Dept Comp Sci, Riyadh 11673, Saudi Arabia
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Training; Data models; Adaptation models; Analytical models; Task analysis; Reliability; Euclidean distance; Language models; natural language processing; embeddings; model generalization; model robustness; model performance; data diversity; evaluation; data curation
DOI
10.1109/ACCESS.2024.3434612
CLC classification
TP [Automation Technology; Computer Technology]
Discipline code
0812
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), enabling machines to process, understand, and generate human-like text with high accuracy. However, current practices in training and evaluating these models often overlook the relationship between the embeddings of training and testing samples, leading to potential overfitting and limited generalization capabilities. This paper introduces a new approach to enhancing the performance, reliability, and generalization of LLMs by curating training and testing samples based on the Euclidean distances between their embeddings. The central hypothesis is that training models on samples with high Euclidean distances between training and testing embeddings, coupled with evaluations spanning diverse distances, will improve the models' robustness and adaptability to inputs diverging from the training data distribution. A comprehensive evaluation across multiple datasets and model architectures shows that models trained on samples with high Euclidean distances from the testing samples generally exhibit superior generalization and robustness compared to those trained on low-distance samples. The proposed evaluation methodology, assessing performance across a range of distances, provides a more reliable measure of a model's true adaptability. This study provides insights into the relationship between training data diversity and model reliability, paving the way for more robust and generalizable LLMs.
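To make the curation idea concrete, below is a minimal sketch of how training samples might be ranked and selected by their Euclidean distance to a test set, assuming embeddings have already been computed (e.g., by a sentence encoder). The mean-distance aggregation, the function names, and the random placeholder data are illustrative assumptions, not the paper's actual code.

```python
# Sketch of distance-based training-sample curation (illustrative, not the paper's code).
# Assumes precomputed embedding matrices; random data stands in for real embeddings.
import numpy as np

def mean_distance_to_test(train_emb: np.ndarray, test_emb: np.ndarray) -> np.ndarray:
    """For each training embedding, return its mean Euclidean distance to the test set."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2
    sq = (
        (train_emb ** 2).sum(axis=1, keepdims=True)   # (n_train, 1)
        - 2.0 * train_emb @ test_emb.T                # (n_train, n_test)
        + (test_emb ** 2).sum(axis=1)                 # (n_test,) broadcast
    )
    return np.sqrt(np.maximum(sq, 0.0)).mean(axis=1)  # clamp tiny negatives from rounding

def select_high_distance(train_emb: np.ndarray, test_emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k training samples farthest, on average, from the test set."""
    distances = mean_distance_to_test(train_emb, test_emb)
    return np.argsort(distances)[-k:]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 384))  # e.g., 384-dim sentence embeddings
    test = rng.normal(size=(200, 384))
    idx = select_high_distance(train, test, k=100)
    print(f"selected {len(idx)} high-distance training samples")
```

The choice of aggregation is a design decision: averaging over the test set favors globally distant samples, while taking the minimum per-pair distance instead would favor samples far from even their nearest test neighbor. The same distance scores could also be binned to evaluate performance across a range of distances, as the abstract proposes.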
Pages: 103583-103593
Page count: 11