Collection and evaluation of lexical complexity data for Russian language using crowdsourcing

Cited by: 2
Authors
Abramov, Aleksei V. [1]
Ivanov, Vladimir V. [1]
Affiliations
[1] Kazan Fed Univ, Kazan, Russia
Source
RUSSIAN JOURNAL OF LINGUISTICS | 2022 / Vol. 26 / No. 02
Funding
Russian Science Foundation
Keywords
Lexical complexity; Russian language; annotation; corpora; Bible
DOI
10.22363/2687-0088-30118
Chinese Library Classification (CLC) number
H [Language, Writing]
Discipline classification code
05
Abstract
Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. This task is commonly referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: hand-crafted features correlated with word complexity, feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, and the Named Entity tag, and evaluations with carefully selected target audiences. More recent works have investigated the use of transformer-based models, which can also extract features from the surrounding context. However, the majority of papers have been devoted to pipelines for the English language, and few have adapted them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity on handcrafted features and on fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words that are used in 3,364 different contexts.
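The annotation and baseline steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes that 5-point Likert ratings are mapped linearly to [0, 1] and averaged per word (the abstract does not state the aggregation rule), and it uses word length as a single stand-in for the handcrafted features. The function names, sample words, and ratings are all hypothetical.

```python
from statistics import mean


def likert_to_score(ratings):
    """Map 1..5 Likert ratings to [0, 1] and average them (assumed rule)."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        if r not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating out of range: {r}")
    return mean((r - 1) / 4 for r in ratings)


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has zero variance")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


# Hypothetical crowd annotations: target word -> list of Likert ratings.
annotations = {
    "дом": [1, 1, 2],
    "скиния": [4, 5, 4],
    "благоволение": [3, 4, 3],
}

# One continuous complexity score per word.
scores = {word: likert_to_score(r) for word, r in annotations.items()}

# Single handcrafted feature (word length) as a stand-in for a richer set.
lengths = [len(word) for word in scores]
slope, intercept = fit_line(lengths, list(scores.values()))

for word, score in scores.items():
    predicted = slope * len(word) + intercept
    print(f"{word}: annotated={score:.3f} predicted={predicted:.3f}")
```

A realistic baseline would use several features or embedding vectors and a multivariate regression; the single-feature closed form is used here only to keep the sketch free of dependencies.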
Pages: 409-425
Number of pages: 17