Collection and evaluation of lexical complexity data for Russian language using crowdsourcing

Cited by: 2
Authors
Abramov, Aleksei V. [1]
Ivanov, Vladimir V. [1]
Affiliations
[1] Kazan Fed Univ, Kazan, Russia
Source
RUSSIAN JOURNAL OF LINGUISTICS | 2022 / Vol. 26 / No. 02
Funding
Russian Science Foundation
Keywords
Lexical complexity; Russian language; annotation; corpora; Bible;
DOI
10.22363/2687-0088-30118
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. This task is commonly referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies for lexical complexity estimation with several restrictions: they relied on hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, and the Named Entity tag, and ran evaluations with carefully selected target audiences. More recent works have investigated transformer-based models, which can also extract features from the surrounding context. However, the majority of papers have been devoted to pipelines for the English language, and only a few have transferred them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity from handcrafted features and from fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words that are used in 3,364 different contexts.
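The abstract mentions 5-point Likert annotations and a linear regression baseline on handcrafted features. The following is a minimal sketch of that general idea, not the authors' pipeline: it assumes ratings 1..5 are mapped linearly to [0, 1] and averaged per word, and it uses a single hypothetical feature (word length in characters). The function names and the toy annotations are invented for illustration.

```python
from statistics import mean


def likert_to_score(ratings):
    """Map 1..5 Likert ratings to [0, 1] and average them (assumed scheme)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return mean((r - 1) / 4 for r in ratings)


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Invented toy annotations (word -> ratings from three annotators).
toy_annotations = {
    "дом": [1, 1, 2],
    "завет": [2, 3, 3],
    "скиния": [4, 5, 4],
}

# Hypothetical handcrafted feature: word length in characters.
lengths = [len(word) for word in toy_annotations]
scores = [likert_to_score(r) for r in toy_annotations.values()]

slope, intercept = fit_line(lengths, scores)
predict = lambda word: slope * len(word) + intercept
```

A real baseline would use many features (or embedding vectors) and a multivariate regression; the one-feature case is shown only because it can be solved in closed form without extra libraries.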
Pages: 409-425
Number of pages: 17