Machines Learn Better with Better Data Ontology: Lessons from Philosophy of Induction and Machine Learning Practice

Cited by: 3
Authors
Li, Dan [1]
Affiliations
[1] CUNY, Baruch Coll, Philosophy Dept, New York, NY 10031 USA
Keywords
Induction; Machine learning; Data ontology; No Free Lunch theorem; Goodman's riddle of induction; CLIMATE; MODELS;
DOI
10.1007/s11023-023-09639-9
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As scientists start to adopt machine learning (ML) as one research tool, the security of ML and the knowledge generated become a concern. In this paper, I explain how supervised ML can be improved with better data ontology, or the way we make categories and turn information into data. More specifically, we should design data ontology in a way that is consistent with the knowledge that we have about the target phenomenon so that such ontology can help us make the inductive leap. I do so by thinking through a thought experiment, Goodman's New Riddle of Induction (Fact, fiction, and forecast, Harvard University Press, 1955). Goodman's riddle helps flesh out three problems of induction: (1) the problem of equal goodies, that there are often too many equally good inductive results given the same data; (2) the problem of diverging performance, that these equally good results can give opposite predictions in the future; and (3) the problem of mediocrity, that when averaged across all equally possible datasets and tasks, no inductive algorithm outperforms any other. I show that all three of these problems are manifested as real obstacles in ML practice, namely, the Rashomon effect (Breiman in Stat Sci 16(3):199-231, 2001), the problem of underspecification (D'Amour et al. in J Mach Learn Res, 2020, https://doi.org/10.48550/arXiv.2011.03395), and the No Free Lunch theorem (Wolpert in Neural Comput 8(7):1341-90, 1996, https://doi.org/10.1162/neco.1996.8.7.1341). Lastly, I argue that proper data ontology can help mitigate these problems, and I demonstrate how, using concrete examples from climate science. This research highlights the links between philosophers' discussions of induction and implications in ML practice.
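The first two problems named in the abstract (equal goodies and diverging performance) can be illustrated with a toy version of Goodman's riddle. The sketch below is not taken from the paper; the cutoff time `T_CUTOFF`, the color labels, and the function names are all hypothetical choices made for illustration. It shows two hypotheses that fit the same observations perfectly yet disagree on every later case.

```python
# Toy illustration of Goodman's riddle (hypothetical setup, not from the paper).
# Emeralds are observed at times 0..9 and all are recorded as "green".
T_CUTOFF = 10  # assumed switch time used by the "grue" hypothesis

observations = [(t, "green") for t in range(T_CUTOFF)]


def predict_green(t):
    """Hypothesis 1: emeralds are green at every time."""
    return "green"


def predict_grue(t):
    """Hypothesis 2: emeralds are green before T_CUTOFF and blue afterwards."""
    return "green" if t < T_CUTOFF else "blue"


def accuracy(predict, data):
    """Fraction of (time, label) pairs that the hypothesis gets right."""
    return sum(predict(t) == label for t, label in data) / len(data)


# Equal goodies: both hypotheses fit the observed data equally well.
acc_green = accuracy(predict_green, observations)
acc_grue = accuracy(predict_grue, observations)

# Diverging performance: they disagree on every unobserved future time.
future_times = range(T_CUTOFF, T_CUTOFF + 5)
disagreements = sum(predict_green(t) != predict_grue(t) for t in future_times)

print(acc_green, acc_grue, disagreements)
```

Nothing in the observed data separates the two hypotheses; only background knowledge about which categories are appropriate (here, preferring "green" over "grue") does, which is the role the abstract assigns to data ontology.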
Pages: 429-450
Page count: 22
Related papers
50 in total
  • [21] Data mining and machine learning in retail business: developing efficiencies for better customer retention
    Kumar, M. Rajesh
    Venkatesh, J.
    Rahman, A. M. J. Md Zubair
    JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING, 2021,
  • [22] Machine Learning for Automatic Encoding of French Electronic Medical Records: Is More Data Better ?
    Gobeill, Julien
    Ruch, Patrick
    Meyer, Rodolphe
    DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 312 - 316
  • [23] A Re-Identification Strategy Using Machine Learning that Exploits Better Side Data
    Hashimoto, Eina
    Ichino, Masatsugu
    Yoshiura, Hiroshi
    2019 IEEE 10TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE AND TECHNOLOGY (ICAST 2019), 2019, : 221 - 228
  • [24] INTRAOPERATIVE DATA ENABLES MACHINE LEARNING CLASSIFIERS TO BETTER PREDICT POSTOPERATIVE ACUTE KIDNE
    Hobson, Charles
    Baslanti, Tezcan Ozrazgat
    Thottakara, Paul
    Momcilovic, Petar
    Rashidi, Parisa
    Bihorac, Azra
    CRITICAL CARE MEDICINE, 2015, 43 (12)
  • [25] Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    Gebru, Timnit
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 3609 - 3609
  • [26] Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    Jo, Eun Seo
    Gebru, Timnit
    FAT* '20: PROCEEDINGS OF THE 2020 CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, 2020, : 306 - 316
  • [27] Education, Access to Better Quality Work and Gender: Lessons from the Kagera Panel Data Set
    Kamanzi, Adalbertus
    McKay, Andy
    Newell, Andy
    Rienzo, Cinzia
    Tafesse, Wiktoria
    JOURNAL OF AFRICAN ECONOMIES, 2021, 30 (01) : 103 - 127
  • [28] From Principle to Practice: Vertical Data Minimization for Machine Learning
    Staab, Robin
    Jovanovic, Nikola
    Balunovic, Mislav
    Vechev, Martin
    45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024, 2024, : 4733 - 4752
  • [29] Utilizing public and private sector data to build better machine learning models for the prediction of pharmacokinetic parameters
    Kuroda, Masataka
    Watanabe, Reiko
    Esaki, Tsuyoshi
    Kawashima, Hitoshi
    Ohashi, Rikiya
    Sato, Tomohiro
    Honma, Teruki
    Komura, Hiroshi
    Mizuguchi, Kenji
    DRUG DISCOVERY TODAY, 2022, 27 (11)
  • [30] Are more data always better? - Machine learning forecasting of algae based on long-term observations
    Beckmann, D. Atton
    Werther, M.
    Mackay, E. B.
    Spyrakos, E.
    Hunter, P.
    Jones, I. D.
    JOURNAL OF ENVIRONMENTAL MANAGEMENT, 2025, 373