Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation

Cited by: 1
Authors
Liu, Zoey [1]
Prud'hommeaux, Emily [1]
Affiliations
[1] Boston College, Department of Computer Science, Chestnut Hill, MA 02467, USA
Funding
National Science Foundation (USA)
DOI
10.1162/tacl_a_00467
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets with the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size. Among the characteristics that we studied, the ratio of morpheme overlap and that of the average number of morphemes per word between the training and test sets are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets with different sizes in order to make more responsible claims about model evaluation.
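The evaluation design described in the abstract can be illustrated with a minimal sketch: draw several random training sets of the same size, then compare the training and test sets on the two characteristics the abstract highlights. Everything in this sketch is an assumption made for illustration, including the function names, the toy segmented words, and the exact definitions of the two ratios (overlap is taken here as the share of distinct test morphemes also seen in training); the paper's own definitions may differ.

```python
import random


def sample_split(words, train_size, seed):
    """Randomly draw a fixed-size training set; the remainder is the test set."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def morpheme_overlap(train, test):
    """Share of distinct test morphemes that also occur in the training set."""
    train_morphs = {m for segmentation in train for m in segmentation}
    test_morphs = {m for segmentation in test for m in segmentation}
    if not test_morphs:
        return 0.0
    return len(test_morphs & train_morphs) / len(test_morphs)


def avg_morphemes_per_word(data):
    """Mean number of morphemes per segmented word."""
    return sum(len(segmentation) for segmentation in data) / len(data)


def morphemes_per_word_ratio(train, test):
    """Ratio of average morphemes per word, training set over test set."""
    return avg_morphemes_per_word(train) / avg_morphemes_per_word(test)


# Hypothetical toy data: each word is a tuple of its morphemes.
DATA = [
    ("un", "kind", "ly"), ("kind", "ness"), ("walk", "ed"), ("walk", "ing"),
    ("un", "do"), ("do", "ing"), ("cat", "s"), ("dog", "s"),
]

if __name__ == "__main__":
    # Several random samples of the same size, one per seed.
    for seed in range(3):
        train, test = sample_split(DATA, train_size=4, seed=seed)
        print(
            seed,
            round(morpheme_overlap(train, test), 3),
            round(morphemes_per_word_ratio(train, test), 3),
        )
```

Printing both statistics for each seed shows how much two same-sized samples of one small data set can differ, which is the kind of variation the abstract argues a single fixed split hides.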
Pages: 393-413
Page count: 21