Creating Robust Supervised Classifiers via Web-Scale N-gram Data

Cited by: 0
Authors
Bergsma, Shane [1]
Pitler, Emily [2]
Lin, Dekang [3]
Affiliations
[1] Univ Alberta, Edmonton, AB T6G 2M7, Canada
[2] Univ Penn, Philadelphia, PA 19104 USA
[3] Google Inc, Mountain View, CA USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.
Pages: 865-874
Page count: 10