Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets

Cited by: 20
Authors
Marston, Louise [1]
Peacock, Janet L. [6]
Yu, Keming [3]
Brocklehurst, Peter [7]
Calvert, Sandra A. [4]
Greenough, Anne [5]
Marlow, Neil [2]
Affiliations
[1] Brunel Univ, Dept Primary Care & Populat Hlth, Uxbridge UB8 3PH, Middx, England
[2] Brunel Univ, Inst Womens Hlth, UCL, Uxbridge UB8 3PH, Middx, England
[3] Brunel Univ, Sch Informat Syst Comp & Math, Uxbridge UB8 3PH, Middx, England
[4] Univ London, Dept Child Hlth, London WC1E 7HU, England
[5] Kings Coll London, Div Asthma Allergy & Lung Biol, Sch Med, London WC2R 2LS, England
[6] Univ Southampton, Dept Publ Hlth Sci & Med Stat, Southampton, Hants, England
[7] Univ Oxford, Natl Perinatal Epidemiol Unit, Oxford, England
Keywords
multiple births; statistical methodology; multilevel model; generalised estimating equations; multiple linear regression; cluster; LONGITUDINAL DATA-ANALYSIS; RANDOMIZED-TRIALS; REGRESSION-MODELS; BINARY DATA; QUADRATURE; EXAMPLE; TWIN
DOI
10.1111/j.1365-3016.2009.01046.x
Chinese Library Classification number
R1 [Preventive Medicine, Hygiene]
Discipline classification code
1004; 120402
Abstract
Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters of size greater than 1 (e.g. a population dataset of children where there are few multiples), there appears to be less need to adjust for clustering.
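The closing point of the abstract, that clustering matters less when few clusters exceed size 1, can be illustrated with the standard design-effect approximation for unequal cluster sizes, DEFF = 1 + (Σm²/Σm − 1) × ICC. The sketch below is not the authors' analysis. It assumes that every multiple birth is a twin pair, that the reported "multiple" percentage refers to the share of infants who come from a multiple birth, and that the intracluster correlation (ICC) is 0.5. All three are hypothetical choices made only for illustration, and the function names are invented here.

```python
def design_effect(cluster_sizes, icc):
    """Approximate variance inflation from clustering with unequal cluster sizes.

    Uses DEFF = 1 + (sum(m^2) / sum(m) - 1) * ICC, a standard approximation.
    """
    total = sum(cluster_sizes)
    size_weighted_mean = sum(m * m for m in cluster_sizes) / total
    return 1.0 + (size_weighted_mean - 1.0) * icc


def twin_singleton_sizes(n_infants, pct_multiple):
    """Build a hypothetical list of cluster sizes containing only singletons and twins.

    ASSUMPTION: pct_multiple is the percentage of infants from multiple births,
    and all multiples are twins (the source also mentions clusters of size 3).
    """
    n_twin_infants = round(n_infants * pct_multiple / 100)
    n_twin_infants -= n_twin_infants % 2  # twins come in pairs
    n_pairs = n_twin_infants // 2
    n_singletons = n_infants - 2 * n_pairs
    return [2] * n_pairs + [1] * n_singletons


# Sample sizes and percentages are taken from the abstract; the ICC is hypothetical.
HYPOTHETICAL_ICC = 0.5
datasets = [(254, 18), (176, 9), (10098, 3), (1585, 8)]

for n, pct in datasets:
    sizes = twin_singleton_sizes(n, pct)
    deff = design_effect(sizes, HYPOTHETICAL_ICC)
    print(f"n = {n:>5}, multiple {pct:>2}%: design effect ~ {deff:.3f}")
```

Under these assumptions the design effect reduces to 1 plus the fraction of infants in twin pairs times the ICC, so a dataset with 3% multiples sits much closer to 1 (no inflation) than one with 18% multiples.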
Pages: 380-392
Number of pages: 13