On the construction of a large scale Chinese Web Test collection

Authors
Yan, Hongfei [1]
Chen, Chong [1]
Peng, Bo [1]
Li, Xiaoming [1]
Affiliations
[1] Peking Univ, Sch Elect Engn & Comp Sci, Beijing 100871, Peoples R China
Source
INFORMATION RETRIEVAL TECHNOLOGY | 2008 / Vol. 4993
Keywords
test collection; documents; Zipf-like law;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection, composed of millions of Chinese web pages and known as the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing and has been used by several dozen research groups, in addition to being adopted in the evaluations of the SEWM-2004 Chinese Web Track [1] and HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law instead of the power law proposed by Adamic and Huberman [3, 4]; and 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host alias filtering method proposed in the paper will help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track.
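As an illustration of the first finding, the following is a minimal sketch of how one might check a rank-size relationship for pages per site. It assumes that "Zipf-like" means the page count of the site at rank r is roughly proportional to r^(-alpha); the abstract does not give the exact form used in the paper. The function name `fit_zipf_exponent` and the synthetic data are hypothetical and are not taken from the paper or from CWT100g.

```python
import math


def fit_zipf_exponent(page_counts):
    """Estimate alpha in count ~ C * rank**(-alpha).

    Uses an ordinary least-squares fit of log(count) against log(rank).
    This is an illustrative assumption, not the paper's fitting procedure.
    """
    # Rank sites from largest to smallest, ignoring empty sites.
    sizes = sorted((c for c in page_counts if c > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two sites with pages")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(count) for count in sizes]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is negative for a decaying distribution,
    # so alpha is reported as its negation.
    return -(sxy / sxx)


if __name__ == "__main__":
    # Hypothetical data: 200 sites whose sizes follow 1000 / rank exactly.
    synthetic_counts = [1000.0 / rank for rank in range(1, 201)]
    print(f"estimated alpha: {fit_zipf_exponent(synthetic_counts):.3f}")
```

On real crawl data the log-log points would not lie exactly on a line, so the fitted exponent is only a rough summary. A fuller analysis would also compare this fit against alternative distributions.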
Pages: 117 / +
Page count: 3