Using character n-grams to match a list of publications to references in bibliographic databases

Authors
Mehmet Ali Abdulhayoglu
Bart Thijs
Wouter Jeuris
Affiliation
[1] KU Leuven, ECOOM, Center for R&D Monitoring, FEB
Source
Scientometrics | 2016 / Vol. 109
Keywords
String matching; Character n-gram; Salton cosine; Kondrak’s Levenshtein distance; Information retrieval
DOI
Not available
Abstract
For research evaluation, publication lists need to be matched to entries in large bibliographic databases, such as Thomson Reuters Web of Science. This matching process is often done manually, making it very time-consuming. This paper presents the use of character n-grams as an automated indicator to inform and ease the manual matching process. The similarity of two references is identified by calculating Salton’s cosine for their common character n-grams. As a complementary and confirmatory measure, Kondrak’s Levenshtein distance score, based on the character n-grams, is used to re-measure the similarity of the top matches resulting from Salton’s cosine. These automated matches are compared to results from completely manual matching. Incorrect matches are examined in depth and possible solutions are suggested. The method is applied to two independent datasets to validate the results and the inferences drawn. For both datasets, Salton’s score based on character n-grams proves to be a useful indicator for distinguishing between correct and incorrect matches. The suggested method is compared with a baseline based on word unigrams. The accuracies of the character-based and word-based systems are 96.0% and 94.7%, respectively. Despite the small difference in accuracy, we observed that the character-based system provides more correct matches when the data contain abbreviations, mathematical expressions or erroneous text.
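The first step described in the abstract, scoring two references by Salton's cosine over their character n-grams, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`char_ngrams`, `word_unigrams`, `salton_cosine`, `rank_candidates`), the choice of n = 3, and the lowercasing and whitespace normalisation are all assumptions made here. The Kondrak re-scoring of the top matches is not included.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n = 3 is an assumed default)."""
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def word_unigrams(text: str) -> Counter:
    """Count whitespace-separated words, as in a word-unigram baseline."""
    return Counter(text.lower().split())


def salton_cosine(a: Counter, b: Counter) -> float:
    """Salton's cosine between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no defined direction
    return dot / (norm_a * norm_b)


def rank_candidates(query: str, candidates: list[str], n: int = 3):
    """Return (score, candidate) pairs, best character n-gram match first."""
    q = char_ngrams(query, n)
    scored = [(salton_cosine(q, char_ngrams(c, n)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical publication-list entry with an abbreviated title.
query = "Abdulhayoglu MA, Thijs B. Using char. n-grams to match a list of publications"
candidates = [
    "Using character n-grams to match a list of publications to references in bibliographic databases",
    "Spam detection using character N-grams",
]
for score, title in rank_candidates(query, candidates):
    print(f"{score:.3f}  {title}")
```

Passing `word_unigrams` output to `salton_cosine` instead gives a word-level score for comparison. An abbreviation such as "char." shares no whole word with "character" but still shares several character trigrams with it, which is one way to see why character n-grams can tolerate abbreviated or erroneous text.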
Pages: 1525-1546
Page count: 21
Related papers
  • [1] Using character n-grams to match a list of publications to references in bibliographic databases
    Abdulhayoglu, Mehmet Ali
    Thijs, Bart
    Jeuris, Wouter
    [J]. SCIENTOMETRICS, 2016, 109 (03) : 1525 - 1546
  • [2] MATCHING BIBLIOGRAPHIC DATA FROM PUBLICATION LISTS WITH LARGE DATABASES USING N-GRAMS (RIP)
    Abdulhayoglu, Mehmet Ali
    Thijs, Bart
    [J]. 14TH INTERNATIONAL SOCIETY OF SCIENTOMETRICS AND INFORMETRICS CONFERENCE (ISSI), 2013, : 1151 - 1158
  • [3] Spam detection using character N-grams
    Kanaris, Ioannis
    Kanaris, Konstantinos
    Stamatatos, Efstathios
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 3955 : 95 - 104
  • [4] Authorship Attribution in Portuguese Using Character N-grams
    Markov, Ilia
    Baptista, Jorge
    Pichardo-Lagunas, Obdulia
    [J]. ACTA POLYTECHNICA HUNGARICA, 2017, 14 (03) : 59 - 78
  • [5] A first approach to CLIR using character n-grams alignment
    Vilares, Jesus
    Oakes, Michael P.
    Tait, John I.
    [J]. EVALUATION OF MULTILINGUAL AND MULTI-MODAL INFORMATION RETRIEVAL, 2007, 4730 : 111 - +
  • [6] Detection of Opinion Spam with Character n-grams
    Hernandez Fusilier, Donato
    Montes-y-Gomez, Manuel
    Rosso, Paolo
    Guzman Cabrera, Rafael
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 285 - 294
  • [7] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
    Lecluze, Charlotte
    Rigouste, Lois
    Giguet, Emmanuel
    Lucas, Nadine
    [J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
  • [8] Predicting Political Donations Using Twitter Hashtags and Character N-Grams
    Conrad, Colin
    Keselj, Vlado
    [J]. 2016 IEEE 18TH CONFERENCE ON BUSINESS INFORMATICS (CBI), VOL. 2, 2016, : 1 - 7
  • [9] Author Assertion of Furtive Write Print Using Character N-Grams
    Hassan, Feryal H.
    Chaurasia, Mousmi A.
    [J]. FUTURE INFORMATION TECHNOLOGY, 2011, 13 : 274 - 278
  • [10] Unconstrained Offline Handwriting Recognition using Connectionist Character N-grams
    Zamora-Martinez, F.
    Castro-Bleda, M. J.
    Espana-Boquera, S.
    Gorbe-Moya, J.
    [J]. 2010 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS IJCNN 2010, 2010,