Wrangling Categorical Data in R

Authors
McNamara, Amelia [1]
Horton, Nicholas J. [2]
Affiliations
[1] Smith Coll, Program Stat & Data Sci, 215 Burton Hall, Northampton, MA 01063 USA
[2] Amherst Coll, Dept Math & Stat, Amherst, MA 01002 USA
Source
AMERICAN STATISTICIAN | 2018 / Vol. 72 / Issue 01
Keywords
Data derivation; Data management; Data science; Statistical computing;
DOI
10.1080/00031305.2017.1356375
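Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract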
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This article discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the "tidyverse." We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis. Supplementary materials for this article are available online.
Pages: 97 - 104
Page count: 8
Related papers
50 in total
  • [21] Medical Data Wrangling With Sequential Variational Autoencoders
    Barrejon, Daniel
    Olmos, Pablo M.
    Artes-Rodriguez, Antonio
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (06) : 2737 - 2745
  • [22] Capturing and Visualizing Provenance From Data Wrangling
    Bors, Christian
    Gschwandtner, Theresia
    Miksch, Silvia
    IEEE COMPUTER GRAPHICS AND APPLICATIONS, 2019, 39 (06) : 61 - 75
  • [23] Towards Automatic Data Format Transformations: Data Wrangling at Scale
    Bogatu, Alex
    Paton, Norman W.
    Fernandes, Alvaro A. A.
    DATA ANALYTICS, 2017, 10365 : 36 - 48
  • [24] Can language models automate data wrangling?
    Jaimovitch-Lopez, Gonzalo
    Ferri, Cesar
    Hernandez-Orallo, Jose
    Martinez-Plumed, Fernando
    Ramirez-Quintana, Maria Jose
    MACHINE LEARNING, 2023, 112 (06) : 2053 - 2082
  • [25] Revealing the Semantics of Data Wrangling Scripts with Comantics
    Xiong K.
    Luo Z.
    Fu S.
    Wang Y.
    Xu M.
    Wu Y.
    IEEE Transactions on Visualization and Computer Graphics, 2023, 29 (01) : 117 - 127
  • [26] Can language models automate data wrangling?
    Gonzalo Jaimovitch-López
    Cèsar Ferri
    José Hernández-Orallo
    Fernando Martínez-Plumed
    María José Ramírez-Quintana
    Machine Learning, 2023, 112 : 2053 - 2082
  • [27] Data wrangling for virtual attendance: A conceptual model
    Mpofu, Nkosinathi
    Kaondera, Charles
    Sidume, Freedmore
    Tamukate, Rabson
    Verma, Rachna
    INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND ENERGY TECHNOLOGIES (ICECET 2021), 2021, : 1954 - 1959
  • [28] Investigating the Utility of Graphics in Teaching Data Wrangling
    Sundin, Lovisa
    PROCEEDINGS OF THE 2020 ACM CONFERENCE ON INTERNATIONAL COMPUTING EDUCATION RESEARCH, ICER 2020, 2020, : 342 - 343
  • [29] Using Machine Learning to accelerate Data Wrangling
    Ahuja, Shilpi
    Roth, Mary
    Gangadharaiah, Rashmi
    Schwarz, Peter
    Bastidas, Rafael
    2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2016, : 343 - 349
  • [30] Towards Automatic Data Format Transformations: Data Wrangling at Scale
    Bogatu, Alex
    Paton, Norman W.
    Fernandes, Alvaro A. A.
    Koehler, Martin
    COMPUTER JOURNAL, 2019, 62 (07) : 1044 - 1060