Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets

Cited by: 20
Authors
Marston, Louise [1 ]
Peacock, Janet L. [6 ]
Yu, Keming [3 ]
Brocklehurst, Peter [7 ]
Calvert, Sandra A. [4 ]
Greenough, Anne [5 ]
Marlow, Neil [2 ]
Affiliations
[1] Brunel Univ, Dept Primary Care & Populat Hlth, Uxbridge UB8 3PH, Middx, England
[2] Brunel Univ, Inst Womens Hlth, UCL, Uxbridge UB8 3PH, Middx, England
[3] Brunel Univ, Sch Informat Syst Comp & Math, Uxbridge UB8 3PH, Middx, England
[4] Univ London, Dept Child Hlth, London WC1E 7HU, England
[5] Kings Coll London, Div Asthma Allergy & Lung Biol, Sch Med, London WC2R 2LS, England
[6] Univ Southampton, Dept Publ Hlth Sci & Med Stat, Southampton, Hants, England
[7] Univ Oxford, Natl Perinatal Epidemiol Unit, Oxford, England
Keywords
multiple births; statistical methodology; multilevel model; generalised estimating equations; multiple linear regression; cluster; LONGITUDINAL DATA-ANALYSIS; RANDOMIZED-TRIALS; REGRESSION-MODELS; BINARY DATA; QUADRATURE; EXAMPLE; TWIN;
DOI
10.1111/j.1365-3016.2009.01046.x
Chinese Library Classification number
R1 [Preventive medicine, hygiene];
Subject classification code
1004 ; 120402 ;
Abstract
Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods that can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets that differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except Gauss-Hermite quadrature multilevel modelling (ML GH 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, we recommend accounting for the non-independence in analyses, using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples), there appears to be less need to adjust for clustering.
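The effect of ignoring small clusters on standard errors can be illustrated in the simplest possible setting. The following is a minimal sketch, assuming an intercept-only model (estimating a mean) and a sandwich-type cluster adjustment with a G/(G-1) small-sample factor. The function names `naive_se` and `cluster_robust_se` are hypothetical; this is not the authors' Stata code, and it does not reproduce any of the models compared in the paper.

```python
import math


def naive_se(values):
    """Standard error of the mean, treating every observation as independent."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


def cluster_robust_se(values, cluster_ids):
    """Cluster-adjusted (sandwich) standard error of the mean.

    Residuals are summed within each cluster before squaring, so that
    correlated siblings do not count as independent information.
    The G/(G-1) factor is an assumed small-sample correction.
    """
    n = len(values)
    mean = sum(values) / n
    residual_totals = {}
    for value, cluster in zip(values, cluster_ids):
        residual_totals[cluster] = residual_totals.get(cluster, 0.0) + (value - mean)
    n_clusters = len(residual_totals)
    meat = sum(total ** 2 for total in residual_totals.values())
    return math.sqrt(n_clusters / (n_clusters - 1) * meat) / n


# Illustrative (made-up) data: three sets of twins with identical outcomes.
twin_values = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
twin_ids = [0, 0, 1, 1, 2, 2]

# The same outcomes, but every infant is a singleton (clusters of size 1).
singleton_ids = [0, 1, 2, 3, 4, 5]

print("naive SE:", naive_se(twin_values))
print("cluster-adjusted SE (twins):", cluster_robust_se(twin_values, twin_ids))
print("cluster-adjusted SE (singletons):", cluster_robust_se(twin_values, singleton_ids))
```

With perfectly correlated twins the adjusted standard error is larger than the naive one, whereas with singletons only the two coincide. This is consistent with the abstract's observation that adjustment matters less when few clusters are larger than size 1.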
Pages: 380-392
Page count: 13
Related papers
50 in total
  • [21] Simulations to assess the performance of different rarefaction methods in estimating population size using small datasets
    Alain C. Frantz
    Timothy J. Roper
    Conservation Genetics, 2006, 7 : 315 - 318
  • [22] Soft Sensor design for a Topping process in the case of small datasets
    Napoli, G.
    Xibilia, M. G.
    COMPUTERS & CHEMICAL ENGINEERING, 2011, 35 (11) : 2447 - 2456
  • [23] Extracting partitional clusters from heterogeneous datasets using mutual entropy
    Hossain, Mahmood
    Bridges, Susan
    Wang, Yong
    Hodges, Julia
    IRI 2007: PROCEEDINGS OF THE 2007 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION, 2007, : 447 - +
  • [24] Using Clustering Ensembles and Heuristic Search to Estimate the Number of Clusters in Datasets
    Odebode, Afees Adegoke
    Arzoky, Mahir
    Tucker, Allan
    Mann, Ashley
    Maramazi, Faisal
    Swift, Stephen
    INTELLIGENT SYSTEMS AND APPLICATIONS, VOL 3, INTELLISYS 2023, 2024, 824 : 334 - 353
  • [25] Comparing Different Oversampling Methods in Predicting Multi-Class Educational Datasets Using Machine Learning Techniques
    Tariq, Muhammad Arham
    Sargano, Allah Bux
    Iftikhar, Muhammad Aksam
    Habib, Zulfiqar
    CYBERNETICS AND INFORMATION TECHNOLOGIES, 2023, 23 (04) : 199 - 212
  • [26] Case Study Comparing Multiple Irrigated Land Datasets in Arizona and Colorado, USA
    Shi, Hua
    Auch, Roger F.
    Vogelmann, James E.
    Feng, Min
    Rigge, Matthew
    Senay, Gabriel
    Verdin, James P.
    JOURNAL OF THE AMERICAN WATER RESOURCES ASSOCIATION, 2018, 54 (02) : 505 - 526
  • [27] Managing large multidimensional hydrologic datasets: A case study comparing NetCDF and SciDB
    Liu, Haicheng
    van Oosterom, Peter
    Tijssen, Theo
    Commandeur, Tom
    Wang, Wen
    JOURNAL OF HYDROINFORMATICS, 2018, 20 (05) : 1058 - 1070
  • [28] Deep Learning on Small Datasets using Online Image Search
    Kolar, Martin
    Hradis, Michal
    Zemcik, Pavel
    32ND SPRING CONFERENCE ON COMPUTER GRAPHICS (SCCG 2016), 2016, : 87 - 93
  • [29] Comparing different methods for the replacement of missing values in longitudinal magnetic resonance imaging datasets
    Graml, A
    Held, U
    Toutenburg, H
    Kappos, L
    Daumer, M
    JOURNAL OF NEUROLOGY, 2004, 251 : 180 - 180
  • [30] Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods
    Kou, Gang
    Yang, Pei
    Peng, Yi
    Xiao, Feng
    Chen, Yang
    Alsaadi, Fawaz E.
    APPLIED SOFT COMPUTING, 2020, 86