Wrangling Categorical Data in R

Authors
McNamara, Amelia [1]
Horton, Nicholas J. [2]
Affiliations
[1] Smith Coll, Program Stat & Data Sci, 215 Burton Hall, Northampton, MA 01063 USA
[2] Amherst Coll, Dept Math & Stat, Amherst, MA 01002 USA
Source
AMERICAN STATISTICIAN | 2018 / Vol. 72 / Issue 01
Keywords
Data derivation; Data management; Data science; Statistical computing;
DOI
10.1080/00031305.2017.1356375
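Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification code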
020208 ; 070103 ; 0714 ;
Abstract
Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically-updated dynamic data. This article discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the "tidyverse." We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis. Supplementary materials for this article are available online.
Pages: 97-104
Page count: 8