Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information

Cited by: 0

Authors:

Yu, Chi-Hsin [1]

Tang, Yi-jie [1]

Chen, Hsin-Hsi [1]

Affiliations:

[1] Natl Taiwan Univ, Dept Comp Sci & Informat Engn, Taipei 10617, Taiwan

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

ClueWeb09; encoding detection; part-of-speech n-grams

DOI:

Not available

Chinese Library Classification (CLC) number:

H0 [Linguistics]

Discipline classification codes:

030303; 0501; 050102

Abstract:

The Web provides a large-scale corpus for researchers to study language usage in the real world. Developing a web-scale corpus requires not only substantial computational resources but also great effort to handle the large variations in web texts, such as character encoding when processing Chinese web texts. In this paper, we aim to develop a web-scale Chinese word N-gram corpus with part-of-speech information, called the NTU PN-Gram corpus, using the ClueWeb09 dataset. We focus on character encoding and some Chinese-specific issues. Statistics about the dataset are reported. We will make the resulting corpus a publicly available resource to boost Chinese language processing.

Pages: 320-324

Page count: 5

31 items in total

[1] Web-Scale N-gram Models for Lexical Disambiguation
Bergsma, Shane
Lin, Dekang
Goebel, Randy
21ST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-09), PROCEEDINGS, 2009, : 1507 - 1512
[2] Creating Robust Supervised Classifiers via Web-Scale N-gram Data
Bergsma, Shane
Pitler, Emily
Lin, Dekang
ACL 2010: 48TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2010, : 865 - 874
[3] Web as a Corpus: Going Beyond the n-gram
Nakov, Preslav
INFORMATION RETRIEVAL, RUSSIR 2014, 2015, 505 : 185 - 228
[4] Oxymoron generation using an association word corpus and a large-scale N-gram corpus
Yamane, Hiroaki
Hagiwara, Masafumi
SOFT COMPUTING, 2015, 19 (04) : 919 - 927
[5] Oxymoron generation using an association word corpus and a large-scale N-gram corpus
Hiroaki Yamane
Masafumi Hagiwara
Soft Computing, 2015, 19 : 919 - 927
[6] Turkish word N-gram analyzing algorithms for a large scale Turkish corpus: TurCo
Çebi, Y
Dalkiliç, G
ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 2, PROCEEDINGS, 2004, : 236 - 240
[7] The Combinatorial Analysis of n-Gram Dictionaries, Coverage and Information Entropy based on the Web Corpus of English
Malashina, Anastasia
BALTIC JOURNAL OF MODERN COMPUTING, 2021, 9 (03) : 363 - 376
[8] Speech Corpus Generation Based on N-gram Confidence Measure Classification
Koctur, Tomas
Ondas, Stanislav
Juhar, Jozef
PROCEEDINGS OF 2017 INTERNATIONAL SYMPOSIUM ELMAR, 2017, : 149 - 152
[9] cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Cao, Shaosheng
Lu, Wei
Zhou, Jun
Li, Xiaolong
THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 5053 - 5061
[10] Chinese new word identification using N-gram and PPM Models
Li, Dun
Tu, Wei
Shi, Lei
EMERGING SYSTEMS FOR MATERIALS, MECHANICS AND MANUFACTURING, 2012, 109 : 612 - 616
