Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context

Cited by: 0
Authors
Ge, Huibin [1 ]
Sun, Chenxi [1 ]
Xiong, Deyi [1 ]
Liu, Qun [2 ]
Institutions
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Huawei Noah's Ark Lab, Hong Kong, Peoples R China
Keywords
DOI
N/A
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a Chinese dataset for evaluating pretrained language models on Word Prediction given Long-term Context (Chinese WPLC). We propose both automatic and manual selection strategies tailored to Chinese to guarantee that target words in passages collected from over 69K novels can only be predicted with long-term context beyond the scope of the sentences containing the target words. Dataset analysis reveals that the types of target words range from common nouns to Chinese 4-character idioms. We also observe that linguistic relations between target words and long-range context exhibit diversity, including lexical match, synonymy, summary, and reasoning. Experimental results show that the Chinese pretrained language model PanGu-alpha (Zeng et al., 2021) trails human performance by 45 points in top-1 word prediction accuracy, indicating that Chinese WPLC is a challenging dataset. The dataset is publicly available at https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC.
Pages: 3770-3778
Page count: 9
Related papers
2 items
  • [1] CHAPTERBREAK: A Challenge Dataset for Long-Range Language Models
    Sun, Simeng
    Thai, Katherine
    Iyyer, Mohit
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3704 - 3714
  • [2] Do Long-Range Language Models Actually Use Long-Range Context?
    Sun, Simeng
    Krishna, Kalpesh
    Mattarella-Micke, Andrew
    Iyyer, Mohit
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 807 - 822