On the construction of a large scale Chinese Web Test collection

Cited by: 0
Authors
Yan, Hongfei [1]
Chen, Chong [1]
Peng, Bo [1]
Li, Xiaoming [1]
Affiliation
[1] Peking Univ, Sch Elect Engn & Comp Sci, Beijing 100871, Peoples R China
Source
INFORMATION RETRIEVAL TECHNOLOGY | 2008 / Vol. 4993
Keywords
test collection; documents; Zipf-like law
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). As of this writing it is the largest Chinese web test collection; it has been used by several dozen research groups and was adopted in the evaluations of the SEWM-2004 Chinese Web Track [1] and HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection such as CWT100g. We further found that 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4], and 2) an appropriate method of filtering host aliases saves about 25% of resources while crawling pages. The Zipf-like law and the host-alias filtering method proposed in the paper will help both in modeling the Web and in improving search engines. Finally, we report the results of the SEWM-2004 Chinese Web Track.
Pages: 117 / +
Page count: 3
Related papers
50 in total
  • [21] WTR: A Test Collection for Web Table Retrieval
    Chen, Zhiyu
    Zhang, Shuo
    Davison, Brian D.
    SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 2514 - 2520
  • [22] CONSTRUCTION OF LARGE-SCALE GLOBAL MINIMUM CONCAVE QUADRATIC TEST PROBLEMS
    KALANTARI, B
    ROSEN, JB
    JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS, 1986, 48 (02) : 303 - 313
  • [23] Sentiment Analysis by Exploring Large Scale Web-based Chinese Short Text
    Liu, Ziyu
    Qi, Yonggang
    Ma, Zhanyu
    Yang, Jie
    INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND APPLICATION ENGINEERING (CSAE), 2017, 190 : 930 - 939
  • [24] Large scale MTConnect data collection
    Cui, Yesheng
    Kara, Sami
    Chan, Ka C.
    PROCEEDINGS OF THE IEEE 2019 9TH INTERNATIONAL CONFERENCE ON CYBERNETICS AND INTELLIGENT SYSTEMS (CIS) ROBOTICS, AUTOMATION AND MECHATRONICS (RAM) (CIS & RAM 2019), 2019, : 77 - 82
  • [25] Construction and Application of a Large-Scale Chinese Abstractness Lexicon Based on Word Similarity
    Xu, Huidan
    Yang, Lijiao
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT II, 2022, 13552 : 122 - 130
  • [26] A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction
    Wang, Chao
    Liu, Jingping
    Zhuang, Tianyi
    Li, Jiahang
    Liu, Juntao
    Xiao, Yanghua
    Wang, Wei
    Xie, Rui
    WSDM'22: PROCEEDINGS OF THE FIFTEENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2022, : 1063 - 1071
  • [27] The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization
    Hu, Yan
    Wu, Wei
    Miao, Miao
    IEEC 2009: FIRST INTERNATIONAL SYMPOSIUM ON INFORMATION ENGINEERING AND ELECTRONIC COMMERCE, PROCEEDINGS, 2009, : 640 - 645
  • [28] Sleeping and feeding in the first 6 months: test of a large-scale data collection technique
    Ball, H
    JOURNAL OF REPRODUCTIVE AND INFANT PSYCHOLOGY, 2004, 22 (03) : 231 - 231
  • [29] Technique of Large-scale Image Set Construction Based on Web Image Searching Engine
    Li, Ran
    Xu, Weiguang
    Lu, Jianjiang
    Zhang, Yafei
    Lu, Zining
    PROCEEDINGS OF THE 8TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE, 2009, : 622 - 626
  • [30] Humkinar: Construction of a Large Scale Web Repository and Information System for Low Resource Urdu Language
    Amir Mehmood, Muhammad
    Tahir, Bilal
    IEEE ACCESS, 2024, 12 : 128404 - 128423