Estimating deep web data source size by capture-recapture method

Cited by: 22
Authors
Lu, Jianguo [1, 2]
Li, Dingding [3]
Affiliations
[1] Univ Windsor, Sch Comp Sci, Windsor, ON N9B 3P4, Canada
[2] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
[3] Univ Windsor, Dept Econ, Windsor, ON N9B 3P4, Canada
Source
INFORMATION RETRIEVAL | 2010 / Vol. 13 / No. 01
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Deep web; Estimators; Capture-recapture;
DOI
10.1007/s10791-009-9107-y
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First, we derive an equation between the overlapping rate and the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that traditional methods based on random queries underestimate the size, i.e., those estimators have negative bias. Starting from the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and standard deviation.
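As background for the capture-recapture idea mentioned in the abstract, the following is a minimal sketch of the classical two-sample (Lincoln-Petersen) estimator, N ≈ n1 · n2 / m, where m is the number of items seen in both samples. It is not the estimator proposed in the paper: it assumes documents are drawn uniformly at random, which the abstract notes does not hold for random queries. The function names `lincoln_petersen` and `simulate` and all parameter values are hypothetical, chosen only for illustration.

```python
import random


def lincoln_petersen(first_capture, second_capture):
    """Classical two-sample estimate N ~ n1 * n2 / m (m = overlap size).

    Assumes both samples are uniform random draws from the same population.
    """
    first, second = set(first_capture), set(second_capture)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("no overlap between samples; size cannot be estimated")
    return len(first) * len(second) / overlap


def simulate(true_size=100_000, sample_size=5_000, seed=0):
    """Draw two uniform samples of document ids and estimate the total size."""
    rng = random.Random(seed)
    ids = range(true_size)
    first = rng.sample(ids, sample_size)
    second = rng.sample(ids, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f} (true size: 100000)")
```

With uniform samples the estimate lands close to the true size. When documents are reached through queries instead, some documents are far more likely to be captured than others, which inflates the overlap and pushes this kind of estimate downward, the negative bias the abstract describes.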
Pages: 70 - 95
Number of pages: 26