Text Similarity Calculations Using Text and Syntactical Structures

Cited by: 0
Authors
Elhadi, Mohamed T. [1 ]
Affiliation
[1] Univ Zawia, Dept Comp Sci, Zawia, Libya
Keywords
Syntactical structures; document similarity; Longest Common Subsequence;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
This paper reports on experiments investigating the use of the syntactical structures of sentences as the basis for calculating the similarity between two text documents. The sentences of each document are converted into ordered sequences of Part of Speech (POS) tags, which are then fed to a Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing the documents sentence by sentence. In the first stage, the syntactical features of the text are used as a structural representation of the document's text. This representation also serves as a text reduction that improves the efficiency of the LCS comparison. In the second stage, documents that score well in the first stage, as measured by an accumulative score that is a function of the number of LCSs, are subjected to further comparison using the actual sentences (content words), sentence by sentence, to produce a final measure of similarity based on the common words (accumulated over the whole file) and the total number of LCSs from the first stage. Experiments on two different corpora showed the utility of the proposed procedure in calculating similarities between written documents.
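The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, the tagger, or the thresholds, so everything specific here is an assumption made for illustration: `TOY_LEXICON` is a hypothetical stand-in for a real POS tagger, `stage_score` normalizes each LCS length by the longer sentence and averages the best match per sentence, and `two_stage_similarity` uses an arbitrary `threshold` of 0.5 and a plain average of the two stage scores. It is not the author's implementation.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN", "rug": "NN",
    "sat": "VBD", "slept": "VBD",
    "on": "IN",
}


def pos_tags(sentence: str) -> List[str]:
    """Map each word to a POS tag; unknown words get the tag 'UNK'."""
    return [TOY_LEXICON.get(w, "UNK") for w in sentence.lower().split()]


def words(sentence: str) -> List[str]:
    """Lower-cased word sequence used for the second (lexical) stage."""
    return sentence.lower().split()


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def split_sentences(doc: str) -> List[str]:
    """Naive sentence splitter on full stops (assumption, for illustration only)."""
    return [s.strip() for s in doc.split(".") if s.strip()]


def stage_score(doc_a: str, doc_b: str,
                featurize: Callable[[str], List[str]]) -> float:
    """Average, over sentences of doc_a, of the best normalized LCS in doc_b."""
    sents_a = [featurize(s) for s in split_sentences(doc_a)]
    sents_b = [featurize(s) for s in split_sentences(doc_b)]
    if not sents_a or not sents_b:
        return 0.0
    total = 0.0
    for sa in sents_a:
        total += max(lcs_length(sa, sb) / max(len(sa), len(sb)) for sb in sents_b)
    return total / len(sents_a)


def two_stage_similarity(doc_a: str, doc_b: str, threshold: float = 0.5) -> float:
    """Stage 1 filters on POS-tag structure; stage 2 compares the actual words."""
    structural = stage_score(doc_a, doc_b, pos_tags)
    if structural < threshold:
        return 0.0  # structurally dissimilar: skip the word-level comparison
    lexical = stage_score(doc_a, doc_b, words)
    return (structural + lexical) / 2  # assumed combination rule
```

Because the POS tag set is far smaller than the vocabulary, the first stage compares short, low-variety sequences and can cheaply screen out document pairs before the word-level LCS is run, which is the efficiency argument the abstract makes.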
Pages: 715-719
Number of pages: 5