Fast and simple dataset selection for machine learning

Cited by: 5
Authors
Peter, Timm J. [1]
Nelles, Oliver [1]
Affiliations
[1] Univ Siegen, Inst Mechan & Regelungstech Mechatron, Dept Maschinenbau, Paul Bonatz Str 9-11, D-57068 Siegen, Germany
Keywords
machine learning; dataset selection; design of experiments; space-filling design; domain adaptation
DOI
10.1515/auto-2019-0010
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline classification code
0812
Abstract
The task of data reduction is discussed, and a novel selection approach is proposed that makes it possible to control the optimal point distribution of the selected data subset. The proposed approach is based on the estimation of probability density functions (pdfs). Due to its structure, the new method can select a subset either by approximating the pdf of the original dataset or by approximating an arbitrary, desired target pdf. The new strategy evaluates the estimated pdfs solely at the selected data points, resulting in a simple and efficient algorithm with low computational and memory demands. The performance of the new approach is investigated in two different scenarios. For representative subset selection from a dataset, the new approach is compared to a recently proposed, more complex method and shows comparable results. To demonstrate the capability of matching a target pdf, a uniform distribution is chosen as an example. Here the new method is compared to strategies for space-filling design of experiments and shows convincing results.
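The idea of selecting a subset whose estimated pdf approaches a target pdf can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It is a simple greedy heuristic under stated assumptions: a Gaussian kernel density estimate with a fixed, user-chosen bandwidth, and target pdf values supplied at the data points. The function names `gaussian_kernel` and `greedy_pdf_subset` and the parameter `bandwidth` are hypothetical.

```python
import numpy as np


def gaussian_kernel(points, center, bandwidth):
    """Normalized isotropic Gaussian kernel centered at `center`, evaluated at `points`."""
    dim = points.shape[1]
    sq_dist = np.sum((points - center) ** 2, axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth ** 2) / norm


def greedy_pdf_subset(data, n_select, target_pdf, bandwidth=0.1):
    """Greedily pick `n_select` row indices of `data` (assumed heuristic).

    At each step, the point where the target pdf most exceeds the kernel
    density estimate of the already selected subset is added. Densities are
    evaluated only at the data points themselves.
    """
    data = np.asarray(data, dtype=float)
    target_pdf = np.asarray(target_pdf, dtype=float)
    n_points = data.shape[0]
    kernel_sum = np.zeros(n_points)          # running sum of kernels of selected points
    available = np.ones(n_points, dtype=bool)
    selected = []
    for _ in range(n_select):
        # Density estimate of the current subset (zero while the subset is empty).
        subset_pdf = kernel_sum / len(selected) if selected else np.zeros(n_points)
        deficit = target_pdf - subset_pdf
        deficit[~available] = -np.inf        # never pick the same point twice
        idx = int(np.argmax(deficit))
        selected.append(idx)
        available[idx] = False
        kernel_sum += gaussian_kernel(data, data[idx], bandwidth)
    return np.array(selected)
```

With a constant `target_pdf` (the uniform case mentioned in the abstract), each step of this sketch picks the point where the current subset density is lowest, which tends to spread the selection across the data. To approximate the pdf of the original dataset instead, `target_pdf` could be a density estimate of the full dataset evaluated at its own points.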
Pages: 833 - 842
Number of pages: 10