Efficient semiparametric estimation in two-sample comparison via semisupervised learning

Authors
Tan, Tao [1, 2, 3]
Zhang, Shuyi [1, 2, 3]
Zhou, Yong [1, 2, 3]
Affiliations
[1] East China Normal Univ, MoE, Key Lab Adv Theory & Applicat Stat & Data Sci, Shanghai, Peoples R China
[2] East China Normal Univ, Sch Stat, Shanghai, Peoples R China
[3] East China Normal Univ, Acad Stat & Interdisciplinary Sci, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Shanghai
Keywords
Adaptiveness; semiparametric efficiency; semisupervised inference; two-sample comparison; CAUSAL INFERENCE; VARIABLES; MODEL;
DOI
10.1002/cjs.11813
Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
We develop a general semisupervised framework for statistical inference in the two-sample comparison setting. Although the supervised Mann-Whitney statistic outperforms many estimators in the two-sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two-step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow-up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived, with variance comparison through a phase diagram. Efficiency theory shows that our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two-step perturbation resampling procedure. To gauge the finite-sample performance, we conducted extensive simulation studies, which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.
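For orientation, the supervised Mann-Whitney statistic that the abstract uses as its baseline can be sketched as an estimate of P(X < Y) + 0.5 P(X = Y) computed from labelled responses only. The sketch below is not the authors' semisupervised estimator: the function names `mann_whitney_index` and `perturbation_se` are hypothetical, and the resampling step is a simplified single-step perturbation with i.i.d. Exp(1) weights, assumed here only to illustrate the general idea behind perturbation resampling rather than the two-step procedure of the paper.

```python
import numpy as np


def mann_whitney_index(x, y, wx=None, wy=None):
    """Weighted estimate of P(X < Y) + 0.5 * P(X = Y) from two samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    wx = np.ones_like(x) if wx is None else np.asarray(wx, dtype=float)
    wy = np.ones_like(y) if wy is None else np.asarray(wy, dtype=float)
    # Pairwise kernel: 1 if x_i < y_j, 0.5 if tied, 0 otherwise.
    kernel = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    weights = wx[:, None] * wy[None, :]
    return float((weights * kernel).sum() / weights.sum())


def perturbation_se(x, y, n_rep=500, seed=0):
    """Standard error via perturbation resampling (illustrative assumption).

    Each replicate reweights the observations with i.i.d. Exp(1) weights
    and recomputes the statistic; the spread of the replicates
    approximates the sampling variability of the estimate.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(n_rep)
    for b in range(n_rep):
        wx = rng.exponential(1.0, size=len(x))
        wy = rng.exponential(1.0, size=len(y))
        draws[b] = mann_whitney_index(x, y, wx, wy)
    return float(draws.std(ddof=1))
```

With unit weights the function reduces to the usual Mann-Whitney U statistic divided by the number of pairs. A semisupervised version in the spirit of the abstract would additionally impute the pairwise kernel for unlabelled observations from covariates, which this sketch does not attempt.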
Pages: 25
Related papers
50 in total
  • [21] Minimum Hellinger distance estimation for a two-sample semiparametric cure rate model with censored survival data
    Yayuan Zhu
    Jingjing Wu
    Xuewen Lu
    Computational Statistics, 2013, 28 : 2495 - 2518
  • [22] Semiparametric maximum likelihood estimation for a two-sample density ratio model with right-censored data
    Wei, Wenhua
    Zhou, Yong
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2016, 44 (01): : 58 - 81
  • [23] Analysis of two-sample censored data using a semiparametric mixture model
    Li, Gang
    Lin, Chien-tai
    ACTA MATHEMATICAE APPLICATAE SINICA-ENGLISH SERIES, 2009, 25 (03): : 389 - 398
  • [24] Analysis of two-sample censored data using a semiparametric mixture model
    Gang Li
    Chien-tai Lin
    Acta Mathematicae Applicatae Sinica, English Series, 2009, 25 : 389 - 398
  • [25] Efficient semiparametric scoring estimation of sample selection models
    Chen, SN
    Lee, LF
    ECONOMETRIC THEORY, 1998, 14 (04) : 423 - 462
  • [26] Empirical likelihood tests for two-sample problems via nonparametric density estimation
    Cao, R
    Van Keilegom, I
    CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE, 2006, 34 (01): : 61 - 77
  • [27] f-Divergence Estimation and Two-Sample Homogeneity Test Under Semiparametric Density-Ratio Models
    Kanamori, Takafumi
    Suzuki, Taiji
    Sugiyama, Masashi
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2012, 58 (02) : 708 - 720
  • [28] Efficient semiparametric estimation via moment restrictions
    Newey, WK
    ECONOMETRICA, 2004, 72 (06) : 1877 - 1897
  • [29] A label-efficient two-sample test
    Li, Weizhi
    Dasarathy, Gautam
    Ramamurthy, Karthikeyan Natesan
    Berisha, Visar
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, VOL 180, 2022, 180 : 1168 - 1177
  • [30] Semiparametric two-sample changepoint model with application to human immunodeficiency virus studies
    Hu, Zonghui
    Qin, Jing
    Follmann, Dean
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 2008, 57 : 589 - 607