An Empirical Study on Data Balancing in Machine Learning Based Software Traceability Methods

Cited by: 1

Authors:

Wang, Bangchao [1,2]

Wang, Zihan [3]

Wan, Hongyan [1,2]

Li, Xingfu [1]

Deng, Yang [1]

Affiliations:

[1] Wuhan Text Univ, Sch Comp Sci & Artificial Intelligence, Wuhan, Peoples R China

[2] Wuhan Text Univ, Engn Res Ctr Hubei Prov Clothing Informat, Wuhan, Peoples R China

[3] Wuhan Text Univ, Sch Math & Phys Sci, Wuhan, Peoples R China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Machine learning; Data balancing; Software traceability; Software engineering;

DOI:

10.1109/IJCNN54540.2023.10191386

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Machine learning (ML) has been widely used in trace link recovery (TLR) to reduce the cost to developers of manually maintaining trace links. However, the imbalanced distribution of valid and invalid links seriously degrades classifier performance. Although a few studies have applied data balancing techniques (DBT) to ML-based TLR, none has systematically analyzed which combinations are more effective. Therefore, we perform an empirical study with three groups of controlled experiments to explore how combining different ML methods with and without DBT affects TLR efficiency. We compare the performance of supervised and unsupervised ML-based TLR, each with and without DBT. Then, we analyze the performance of the ensemble learning model (EM) with DBT on TLR. The experimental results on the 7 imbalanced datasets of CoEST indicate that DBT has a positive effect on ML-based TLR. Specifically, the recall of the LR model increased by 0.5517 after combining with most DBTs on EasyClinic(ID-TC), while Tomek-link significantly improves the precision of K-Nearest Neighbor (KNN), Decision Tree (DT), LR, and Support Vector Machine (SVM). The precision of LR increased from 0.5036 to 1.0. BalanceRF is best at increasing recall, reaching 1.0 on 4 datasets. Moreover, the degree of improvement of ML-based TLR with DBT varies with the size of the dataset and the proportion of valid links.
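
To make the Tomek-link idea named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (the record does not say which library or settings were used). The toy feature vectors, the label convention (1 = valid link, 0 = invalid link), and the helper names `tomek_links`, `remove_majority_in_links`, and `precision_recall` are all assumptions made for illustration. Two samples form a Tomek link when they have different labels and each is the other's nearest neighbour; the sketch drops the majority-class member of every such pair.

```python
from math import dist  # Euclidean distance, Python 3.8+


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        # Index of the closest other sample to sample i.
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class sample of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority_label}
    kept = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in kept], [labels[k] for k in kept]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (valid-link) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical 2-D feature vectors for candidate trace links.
points = [(0.0, 0.0), (0.0, 1.0), (1.5, 0.0),   # invalid links (0)
          (5.0, 5.0),                           # valid link (1)
          (5.0, 5.4),                           # invalid link next to a valid one
          (9.0, 9.0)]                           # valid link (1)
labels = [0, 0, 0, 1, 0, 1]

clean_points, clean_labels = remove_majority_in_links(points, labels, majority_label=0)
print(tomek_links(points, labels))  # pairs of mutually nearest, differently labelled samples
print(clean_labels)
```

In this toy set, the invalid sample at (5.0, 5.4) and the valid sample at (5.0, 5.0) are mutual nearest neighbours, so the invalid one is removed. Cleaning such borderline majority samples is one plausible reason a classifier's precision on valid links could rise, although the abstract itself does not give the mechanism.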


Pages: 8

