Comparing "parallel passages" in digital archives

Cited by: 1
Authors
Harris, Martyn [1]
Levene, Mark [1]
Zhang, Dell [1]
Levene, Dan [2]
Affiliations
[1] Birkbeck Univ London, Dept Comp Sci & Informat Syst, London, England
[2] Southampton Univ, Dept Hist, Southampton, Hants, England
Funding
Wellcome Trust (UK)
Keywords
Digital libraries; Computer applications; Archives; Linguistics; Probabilistic analysis; Language and literature
DOI
10.1108/JD-10-2018-0175
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: This paper presents a language-agnostic approach to facilitate the discovery of "parallel passages" stored in historic and cultural heritage digital archives.
Design/methodology/approach: The authors explore a novel and relatively simple approach that combines a character-based statistical language model with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents.
Findings: The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents arising from differences in dialect and authorship, language change over time, and errors introduced during digitisation by inaccurate transcription and optical character recognition.
Originality/value: The approach is novel and addresses a need among humanities researchers for tools that can identify similar documents, and local similarities represented by shared text sequences, in a potentially vast archive of documents. As far as the authors are aware, no existing tools provide the same level of tolerance to the language of the documents.
Pages: 271-289
Number of pages: 19