An Empirical Study on Data Balancing in Machine Learning Based Software Traceability Methods

Authors
Wang, Bangchao [1 ,2 ]
Wang, Zihan [3 ]
Wan, Hongyan [1 ,2 ]
Li, Xingfu [1 ]
Deng, Yang [1 ]
Affiliations
[1] Wuhan Text Univ, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China
[2] Wuhan Text Univ, Engn Res Ctr Hubei Prov Clothing Informat, Wuhan, Peoples R China
[3] Wuhan Text Univ, Sch Math & Phys Sci, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Machine learning; Data balancing; Software traceability; Software engineering;
DOI
10.1109/IJCNN54540.2023.10191386
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine learning (ML) has been widely used in trace link recovery (TLR) to reduce the cost to developers of manually maintaining trace links. However, the imbalanced distribution of valid and invalid links seriously degrades classifier performance. Although a few studies have applied data balancing techniques (DBTs) to ML-based TLR, none has systematically analyzed which combinations of ML methods and DBTs are more effective. We therefore conduct an empirical study comprising three groups of controlled experiments to explore how combining different ML methods with and without DBTs affects TLR efficiency. We first compare the performance of supervised and unsupervised ML-based TLR, each with and without DBTs. We then analyze the performance of ensemble learning models (EMs) with DBTs on TLR. Experimental results on seven imbalanced datasets from CoEST indicate that DBTs have a positive effect on ML-based TLR. Specifically, the recall of the LR model increased by 0.5517 after being combined with most DBTs on EasyClinic (ID-TC), while Tomek-link significantly improved the precision of K-Nearest Neighbor (KNN), Decision Tree (DT), LR, and Support Vector Machine (SVM). The precision of LR increased from 0.5036 to 1.0. BalanceRF was the best at increasing recall, reaching 1.0 on four datasets. Moreover, the degree of improvement that DBTs bring to ML-based TLR varies with dataset size and the proportion of valid links.
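To make the Tomek-link technique and the precision and recall metrics named in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the choice of Euclidean distance, and the toy data are all assumptions made for illustration. A Tomek link is taken here to be a pair of samples with different labels that are each other's nearest neighbor; the undersampling step drops the majority-class member of each such pair.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels, majority=0):
    """Undersample: drop the majority-class sample of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority}
    kept = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in kept], [labels[k] for k in kept]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (valid-link) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented): label 0 = invalid link, label 1 = valid link.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels, majority=0)
```

In this toy set, the only cross-class mutual nearest neighbors are the samples at indices 2 and 3, so the label-0 sample at index 2 is removed. In practice a maintained library implementation would normally be preferred over a hand-rolled quadratic-time search like this one.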
Pages: 8