Enabling Collaborative Data Science Development with the Ballet Framework

Authors
Smith M.J. [1]
Cito J. [2]
Lu K. [1]
Veeramachaneni K. [1]
Affiliations
[1] Massachusetts Institute of Technology, Cambridge, MA
[2] TU Wien, Vienna
Funding
U.S. National Science Foundation
Keywords
collaborative framework; data science; feature definition; feature engineering; feature validation; machine learning; mutual information; streaming feature selection
DOI
10.1145/3479575
Abstract
While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, the first lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to software and ML performance validation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects. © 2021 Owner/Author.