Distance Matters: Euclidean Embedding Distances for Improved Language Model Generalization and Adaptability

Cited by: 0
Authors
Alshamrani, Sultan [1 ]
Affiliations
[1] Saudi Elect Univ, Dept Comp Sci, Riyadh 11673, Saudi Arabia
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Training; Data models; Adaptation models; Analytical models; Task analysis; Reliability; Euclidean distance; Language models; natural language processing; embeddings; model generalization; model robustness; model performance; data diversity; evaluation; data curation
DOI
10.1109/ACCESS.2024.3434612
CLC classification
TP [Automation Technology; Computer Technology]
Discipline code
0812
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), enabling machines to process, understand, and generate human-like text with high accuracy. However, current practices in training and evaluating these models often overlook the relationship between the embeddings of training and testing samples, leading to potential overfitting and limited generalization capabilities. This paper introduces a new approach to enhancing the performance, reliability, and generalization of LLMs by curating training and testing samples based on the Euclidean distances between their embeddings. The central hypothesis is that training models on samples with high Euclidean distances between training and testing embeddings, coupled with evaluations spanning diverse distances, will improve the models' robustness and adaptability to inputs diverging from the training data distribution. A comprehensive evaluation across multiple datasets and model architectures shows that models trained on samples with high Euclidean distances from the testing samples generally exhibit superior generalization and robustness compared to those trained on low-distance samples. The proposed evaluation methodology, assessing performance across a range of distances, provides a more reliable measure of a model's true adaptability. This study provides insights into the relationship between training data diversity and model reliability, paving the way for more robust and generalizable LLMs.
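To make the curation idea concrete, below is a minimal sketch of how training samples might be ranked and selected by their Euclidean distance to a test set, assuming embeddings have already been computed (e.g., by a sentence encoder). The mean-distance aggregation, the function names, and the random placeholder data are illustrative assumptions, not the paper's actual code.

```python
# Sketch of distance-based training-sample curation (illustrative, not the paper's code).
# Assumes precomputed embedding matrices; random data stands in for real embeddings.
import numpy as np

def mean_distance_to_test(train_emb: np.ndarray, test_emb: np.ndarray) -> np.ndarray:
    """For each training embedding, return its mean Euclidean distance to the test set."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2
    sq = (
        (train_emb ** 2).sum(axis=1, keepdims=True)   # (n_train, 1)
        - 2.0 * train_emb @ test_emb.T                # (n_train, n_test)
        + (test_emb ** 2).sum(axis=1)                 # (n_test,) broadcast
    )
    return np.sqrt(np.maximum(sq, 0.0)).mean(axis=1)  # clamp tiny negatives from rounding

def select_high_distance(train_emb: np.ndarray, test_emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k training samples farthest, on average, from the test set."""
    distances = mean_distance_to_test(train_emb, test_emb)
    return np.argsort(distances)[-k:]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 384))  # e.g., 384-dim sentence embeddings
    test = rng.normal(size=(200, 384))
    idx = select_high_distance(train, test, k=100)
    print(f"selected {len(idx)} high-distance training samples")
```

The choice of aggregation is a design decision: averaging over the test set favors globally distant samples, while taking the minimum per-pair distance instead would favor samples far from even their nearest test neighbor. The same distance scores could also be binned to evaluate performance across a range of distances, as the abstract proposes.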
Pages: 103583-103593
Page count: 11