Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information

Cited by: 0
Authors
Yu, Chi-Hsin [1]
Tang, Yi-jie [1]
Chen, Hsin-Hsi [1]
Affiliations
[1] National Taiwan University, Department of Computer Science and Information Engineering, Taipei 10617, Taiwan
Keywords
ClueWeb09; encoding detection; part-of-speech n-grams
DOI
Not available
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
The Web provides a large-scale corpus for researchers studying real-world language usage. Developing a web-scale corpus requires not only substantial computational resources but also considerable effort to handle the wide variation in web text, such as the character encodings encountered when processing Chinese web pages. In this paper, we develop a web-scale Chinese word N-gram corpus with part-of-speech information, called the NTU PN-Gram corpus, from the ClueWeb09 dataset. We focus on character encoding and several Chinese-specific issues. Statistics on the dataset are reported. We will make the resulting corpus publicly available to support Chinese language processing.
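To make the idea of a word N-gram corpus with part-of-speech information concrete, the following is a minimal sketch of how such counts could be collected from text that has already been segmented and tagged. It is not the authors' pipeline: the function names, the tag labels, and the two sample sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A token is a (word, POS tag) pair. The tags used below are illustrative
# placeholders, not the tagset of the NTU PN-Gram corpus.
Token = Tuple[str, str]


def pos_ngrams(tokens: List[Token], n: int) -> Iterable[Tuple[Token, ...]]:
    """Yield every contiguous run of n (word, tag) pairs in one sentence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[List[Token]], max_n: int = 3) -> Counter:
    """Count all n-grams of order 1..max_n over pre-tagged sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        for n in range(1, max_n + 1):
            counts.update(pos_ngrams(sentence, n))
    return counts


# Hypothetical, already segmented and tagged input.
sentences = [
    [("我", "PN"), ("喜欢", "VV"), ("语言", "NN")],
    [("我", "PN"), ("喜欢", "VV"), ("音乐", "NN")],
]
counts = count_ngrams(sentences, max_n=2)
```

Keying each N-gram on (word, tag) pairs rather than on words alone keeps occurrences of the same word under different parts of speech apart. A web-scale build would need to distribute this counting across machines; the in-memory `Counter` here only illustrates the data shape.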
Pages: 320-324
Page count: 5
Related papers
31 in total
  • [1] Web-Scale N-gram Models for Lexical Disambiguation
    Bergsma, Shane
    Lin, Dekang
    Goebel, Randy
    21ST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-09), PROCEEDINGS, 2009, : 1507 - 1512
  • [2] Creating Robust Supervised Classifiers via Web-Scale N-gram Data
    Bergsma, Shane
    Pitler, Emily
    Lin, Dekang
    ACL 2010: 48TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2010, : 865 - 874
  • [3] Web as a Corpus: Going Beyond the n-gram
    Nakov, Preslav
    INFORMATION RETRIEVAL, RUSSIR 2014, 2015, 505 : 185 - 228
  • [4] Oxymoron generation using an association word corpus and a large-scale N-gram corpus
    Yamane, Hiroaki
    Hagiwara, Masafumi
    SOFT COMPUTING, 2015, 19 (04) : 919 - 927
  • [5] Oxymoron generation using an association word corpus and a large-scale N-gram corpus
    Hiroaki Yamane
    Masafumi Hagiwara
    Soft Computing, 2015, 19 : 919 - 927
  • [6] Turkish word N-gram analyzing algorithms for a large scale Turkish corpus - TurCo
    Çebi, Y
    Dalkiliç, G
    ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 2, PROCEEDINGS, 2004, : 236 - 240
  • [7] The Combinatorial Analysis of n-Gram Dictionaries, Coverage and Information Entropy based on the Web Corpus of English
    Malashina, Anastasia
    BALTIC JOURNAL OF MODERN COMPUTING, 2021, 9 (03) : 363 - 376
  • [8] Speech Corpus Generation Based on N-gram Confidence Measure Classification
    Koctur, Tomas
    Ondas, Stanislav
    Juhar, Jozef
    PROCEEDINGS OF 2017 INTERNATIONAL SYMPOSIUM ELMAR, 2017, : 149 - 152
  • [9] cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
    Cao, Shaosheng
    Lu, Wei
    Zhou, Jun
    Li, Xiaolong
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 5053 - 5061
  • [10] Chinese new word identification using N-gram and PPM Models
    Li, Dun
    Tu, Wei
    Shi, Lei
    EMERGING SYSTEMS FOR MATERIALS, MECHANICS AND MANUFACTURING, 2012, 109 : 612 - 616