Text Similarity Calculations Using Text and Syntactical Structures

Cited by: 0
Author
Elhadi, Mohamed T. [1 ]
Affiliation
[1] Univ Zawia, Dept Comp Sci, Zawia, Libya
Keywords
Syntactical structures; document similarity; Longest Common Subsequence
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
This paper reports on experiments investigating the use of the syntactical structures of sentences as the basis for calculating the similarity between two text documents. The sentences of each document are converted into ordered sequences of Part of Speech (POS) tags, which are then fed to a Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when the documents are compared sentence by sentence. In the first stage, the syntactical features of the text are used as a structural representation of the document's text. This representation also serves as a form of text reduction that improves the efficiency of the LCS comparison. In the second stage, documents that score well in the first stage, as measured by a cumulative score that is a function of the number of LCSs, are subjected to further comparison using the actual sentences (content words), sentence by sentence, to produce a final measure of similarity based on the common words (accumulated over the whole file) and the total number of LCSs from the first stage. Experiments on two different corpora showed the utility of the proposed procedure for calculating similarities between written documents.
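The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not give the scoring function, so the `min_ratio` threshold, the function names `stage_one` and `stage_two`, and the use of an in-order word LCS to count common words are all hypothetical choices. Sentences are assumed to arrive already POS-tagged as lists of (word, tag) pairs.

```python
from typing import List, Sequence, Tuple

# A sentence is assumed to be pre-tagged: a list of (word, POS tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of a and b (standard DP)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def stage_one(doc_a: List[TaggedSentence], doc_b: List[TaggedSentence],
              min_ratio: float = 0.8) -> List[Tuple[int, int]]:
    """Stage 1 (assumed scoring): sentence pairs whose POS-tag LCS covers
    at least min_ratio of the shorter sentence."""
    matches = []
    for i, sent_a in enumerate(doc_a):
        tags_a = [tag for _, tag in sent_a]
        for j, sent_b in enumerate(doc_b):
            tags_b = [tag for _, tag in sent_b]
            shorter = min(len(tags_a), len(tags_b))
            if shorter and len(lcs(tags_a, tags_b)) / shorter >= min_ratio:
                matches.append((i, j))
    return matches


def stage_two(doc_a: List[TaggedSentence], doc_b: List[TaggedSentence],
              matches: List[Tuple[int, int]]) -> int:
    """Stage 2 (assumed scoring): words shared in order, summed over the
    sentence pairs that passed stage 1."""
    total = 0
    for i, j in matches:
        words_a = [word.lower() for word, _ in doc_a[i]]
        words_b = [word.lower() for word, _ in doc_b[j]]
        total += len(lcs(words_a, words_b))
    return total


# Toy documents with hand-assigned Penn Treebank style tags.
doc_a = [[("The", "DT"), ("cat", "NN"), ("sat", "VBD")]]
doc_b = [[("A", "DT"), ("dog", "NN"), ("sat", "VBD")], [("Run", "VB")]]

pairs = stage_one(doc_a, doc_b)
print(len(pairs), stage_two(doc_a, doc_b, pairs))
```

Comparing tag sequences first keeps the alphabet small and filters out structurally dissimilar sentence pairs before any word-level comparison is attempted, which is the efficiency argument the abstract makes for the first stage.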
Pages: 715-719
Page count: 5