Nearest Centroid: A Bridge between Statistics and Machine Learning

Cited by: 3
Author
Thulasidas, Manoj [1]
Affiliation
[1] Singapore Management Univ, Sch Informat Syst, Singapore, Singapore
Keywords
statistical thinking; applied statistics; machine learning; nearest centroid; k-means clustering; k nearest neighbor
DOI
10.1109/TALE48869.2020.9368396
Chinese Library Classification number
G40 [Education]
Discipline classification code
040101; 120403
Abstract
To guide our students of machine learning in their statistical thinking, we need conceptually simple and mathematically defensible algorithms. In this paper, we present the Nearest Centroid (NC) algorithm as a pedagogical tool, combining the key concepts behind two foundational algorithms: K-Means clustering and K-Nearest Neighbors (kNN). In NC, we compute the centroid (as defined in the K-Means algorithm) of the observations belonging to each class in our training data set and use its distance from a new observation (similar to kNN) for class prediction. Using this natural extension, we illustrate how the concepts of probability and statistics are applied in machine learning algorithms. Furthermore, we describe how the practical aspects of validation and performance measurement are carried out. The algorithm and the work presented here can easily be converted into labs and reading assignments to cement the students' understanding of applied statistics and its connection to machine learning algorithms, as described toward the end of this paper.
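The prediction rule summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own code: the function names (`fit_centroids`, `predict`, `accuracy`), the choice of Euclidean distance, and the toy data are all assumptions made here for the example.

```python
import math
from collections import defaultdict


def fit_centroids(X, y):
    """Return {label: centroid}; each centroid is the per-feature mean of one class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance, assumed here)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def accuracy(centroids, X, y):
    """Fraction of observations whose predicted class matches the true label."""
    hits = sum(predict(centroids, x) == label for x, label in zip(X, y))
    return hits / len(y)


# Hypothetical toy data: two classes in two dimensions.
X_train = [[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]]
y_train = ["a", "a", "b", "b"]
centroids = fit_centroids(X_train, y_train)

# Held-out observations used for a simple validation check.
X_test = [[2.0, 2.0], [5.0, 4.0]]
y_test = ["a", "b"]
test_accuracy = accuracy(centroids, X_test, y_test)
```

Training reduces to one mean per class, and prediction to one distance per class, which is what makes the method easy to follow by hand. Evaluating `accuracy` on observations not used for fitting is one simple form of the validation step the abstract mentions.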
Pages: 9-16
Page count: 8