Creating a large-scale diachronic corpus resource: Automated parsing in the Greek papyri (and beyond)

Cited by: 0

Authors:

Keersmaekers, Alek [1]

Van Hal, Toon [1]

Affiliations:

[1] Katholieke Univ Leuven, Dept Linguist, Leuven, Belgium

Source:

NATURAL LANGUAGE ENGINEERING | 2023

Keywords:

Parsing; Ancient Greek; Corpus applications; LANGUAGE;

DOI:

10.1017/S1351324923000384

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper explores how to syntactically parse Ancient Greek texts automatically and maps ways of fruitfully employing the results of such an automated analysis. Special attention is given to documentary papyrus texts, a large diachronic corpus of non-literary Greek, which presents a unique set of challenges to tackle. By making use of the Stanford Graph-Based Neural Dependency Parser, we show that through careful curation of the parsing data and several manipulation strategies, it is possible to achieve a Labeled Attachment Score of about 0.85 for this corpus. We also explain how the data can be converted back to its original (Ancient Greek Dependency Treebanks) format. We describe the results of several tests we have carried out to improve parsing results, with special attention paid to the impact of the annotation format on parser achievements. In addition, we offer a detailed qualitative analysis of the remaining errors, including possible ways to solve them. Moreover, the paper gives an overview of the valorisation possibilities of an automatically annotated corpus of Ancient Greek texts in the fields of linguistics, language education and humanities studies in general. The concluding section critically analyses the remaining difficulties and outlines avenues to further improve the parsing quality and the ensuing practical applications.

Pages: 30
