Models for Analyzing Zero-Inflated and Overdispersed Count Data: An Application to Cigarette and Marijuana Use

Cited by: 36
Authors
Pittman, Brian [1]
Buta, Eugenia [2]
Krishnan-Sarin, Suchitra [1]
O'Malley, Stephanie S. [1]
Liss, Thomas [1]
Gueorguieva, Ralitza [1, 2]
Affiliations
[1] Yale Sch Med, Dept Psychiat, New Haven, CT USA
[2] Yale Sch Publ Hlth, Dept Biostat, New Haven, CT USA
Keywords
HURDLE MODELS; REGRESSION; SMOKING; TOBACCO
DOI
10.1093/ntr/nty072
Chinese Library Classification
R194 [Health standards, health inspection, pharmaceutical administration]
Abstract
Introduction: This article describes different methods for analyzing counts and illustrates their use on cigarette and marijuana smoking data.

Methods: The Poisson, zero-inflated Poisson (ZIP), hurdle Poisson (HUP), negative binomial (NB), zero-inflated negative binomial (ZINB), and hurdle negative binomial (HUNB) regression models are considered. The approaches are evaluated on their ability to account for zero-inflation (extra zeros) and overdispersion (variance larger than expected) in count outcomes, with emphasis on model fit, interpretation, and choosing an appropriate model given the nature of the data. The illustrative example uses cigarette and marijuana smoking reports from a study of smoking habits among youth e-cigarette users, with gender, age, and e-cigarette use included as predictors.

Results: Of the 69 subjects available for analysis, 36% and 64% reported smoking no cigarettes and no marijuana, respectively, suggesting that both outcomes might be zero-inflated. Both outcomes were also overdispersed with large positive skew. The ZINB and HUNB models fit the cigarette counts best. According to goodness-of-fit statistics, the NB, HUNB, and ZINB models fit the marijuana data well, but the ZINB model offered the clearer interpretation.

Conclusion: In the absence of zero-inflation, the NB model fits smoking data, which are typically overdispersed, well. In the presence of zero-inflation, the ZINB or HUNB model is recommended to account for additional heterogeneity. In addition to model fit and interpretability, the choice between a zero-inflated and a hurdle model should ultimately depend on the assumptions regarding the zeros, the study design, and the research question being asked.

Implications: Count outcomes are frequent in tobacco research and often have many zeros, large variance, and skew. Analyzing such data with methods that require a normally distributed outcome is inappropriate and will likely produce spurious results. This study compares and contrasts appropriate methods for analyzing count data, specifically those with an overabundance of zeros, and illustrates their use on cigarette and marijuana smoking data. Recommendations are provided.
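The difference between the zero-inflated and hurdle formulations named in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' analysis: it uses the standard textbook definitions of the ZIP and hurdle Poisson probability mass functions, omits covariates, and all parameter values (`lam`, `pi`, `p0`) are hypothetical choices made only for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero; otherwise it is drawn from Poisson(lam), which can also yield zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p0):
    """Hurdle Poisson: every zero comes from one binary process with
    P(Y = 0) = p0; positive counts follow a zero-truncated Poisson(lam)."""
    if k == 0:
        return p0
    return (1.0 - p0) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


def mean_and_variance(pmf, upto=100):
    """Mean and variance of a count distribution, truncating the sum at `upto`."""
    mean = sum(k * pmf(k) for k in range(upto))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(upto))
    return mean, var


# Hypothetical parameters, not estimates from the study.
lam, pi, p0 = 3.0, 0.3, 0.36

zip_mean, zip_var = mean_and_variance(lambda k: zip_pmf(k, lam, pi))
hurdle_mean, hurdle_var = mean_and_variance(lambda k: hurdle_pmf(k, lam, p0))

print(f"ZIP:    P(0)={zip_pmf(0, lam, pi):.3f} mean={zip_mean:.3f} var={zip_var:.3f}")
print(f"Hurdle: P(0)={hurdle_pmf(0, lam, p0):.3f} mean={hurdle_mean:.3f} var={hurdle_var:.3f}")
```

In the ZIP case the variance, (1 - pi) * lam * (1 + pi * lam), exceeds the mean, (1 - pi) * lam, whenever pi > 0, so extra zeros alone already produce overdispersion relative to a plain Poisson model. The two formulations differ in how they treat zeros: the ZIP model mixes structural zeros with zeros from the count process, while the hurdle model attributes all zeros to a single process.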
Pages: 1390-1398
Page count: 9