The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models

Cited by: 1
Authors
Zhou, Xin [1]
Kim, Kisub [1]
Xu, Bowen [1, 2]
Liu, Jiakun [1]
Han, DongGyun [3]
Lo, David [1]
Affiliations
[1] Singapore Management Univ, Singapore, Singapore
[2] North Carolina State Univ, Raleigh, NC USA
[3] Royal Holloway Univ London, London, England
Funding
National Research Foundation, Singapore
DOI
10.1109/ASE56229.2023.00157
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.
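The head-versus-tail comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`split_head_tail`, `head_tail_gap`), the `head_share=0.5` cutoff, the use of accuracy as the metric, and the relative-gap formula `(head - tail) / tail` are all assumptions made for illustration; the abstract does not specify how the split or the percentages were computed.

```python
from collections import Counter


def split_head_tail(labels, head_share=0.5):
    """Split labels into frequent (head) and infrequent (tail) sets.

    Assumption: labels are taken in descending frequency until they cover
    at least `head_share` of all samples; the remaining labels form the tail.
    """
    counts = Counter(labels)
    total = len(labels)
    head, covered = set(), 0
    for label, n in counts.most_common():
        if covered >= head_share * total:
            break
        head.add(label)
        covered += n
    tail = set(counts) - head
    return head, tail


def accuracy(pairs):
    """Fraction of (true, predicted) pairs that match; NaN if there are none."""
    pairs = list(pairs)
    if not pairs:
        return float("nan")
    return sum(t == p for t, p in pairs) / len(pairs)


def head_tail_gap(y_true, y_pred, head_share=0.5):
    """Return (head accuracy, tail accuracy, relative gap).

    Assumption: the relative gap is (head - tail) / tail, i.e. how much
    better the model does on frequent labels relative to infrequent ones.
    """
    head, _ = split_head_tail(y_true, head_share)
    head_acc = accuracy((t, p) for t, p in zip(y_true, y_pred) if t in head)
    tail_acc = accuracy((t, p) for t, p in zip(y_true, y_pred) if t not in head)
    gap = (head_acc - tail_acc) / tail_acc if tail_acc else float("inf")
    return head_acc, tail_acc, gap
```

Under this formula the gap can exceed 100%, which happens whenever head accuracy is more than twice tail accuracy. That is one way a figure such as 254.0% could arise, though the abstract does not confirm this definition.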
Pages: 40-52
Page count: 13