Text Similarity Calculations Using Text and Syntactical Structures

Cited by: 0
Authors
Elhadi, Mohamed T. [1]
Affiliations
[1] Univ Zawia, Dept Comp Sci, Zawia, Libya
Keywords
Syntactical structures; document similarity; Longest Common Subsequence;
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
This paper reports on experiments that investigate the use of the syntactical structure of sentences as the basis for calculating the similarity between two text documents. The sentences of each document are converted into ordered sequences of part-of-speech (POS) tags, which are then fed to a Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when the documents are compared sentence by sentence. In the first stage, the syntactical features of the text serve as a structural representation of the document. This representation also acts as a form of text reduction that improves the efficiency of the LCS comparison. In the second stage, documents that score well in the first stage, as measured by an accumulated score that is a function of the number of LCSs, are subjected to further comparison using the actual sentences (content words), sentence by sentence. This produces a final similarity measure based on the common words (accumulated over the whole file) and the total number of LCSs from the first stage. Experiments on two different corpora showed the utility of the proposed procedure for calculating similarities between written documents.
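The two-stage procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the tagger, the scoring function, or any thresholds, so everything named here is an assumption made for illustration. `TOY_LEXICON` is a hypothetical stand-in for a real POS tagger, `min_ratio` and `min_matches` are hypothetical cut-offs, and stage 2 compares all tokens rather than filtering for content words.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical toy lexicon standing in for a trained POS tagger (assumption).
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN", "rug": "NOUN",
    "sat": "VERB", "slept": "VERB",
    "on": "ADP",
}


def tokens(sentence: str) -> List[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return sentence.lower().split()


def pos_tags(sentence: str) -> List[str]:
    """Map each token to a tag from the toy lexicon ('X' if unknown)."""
    return [TOY_LEXICON.get(word, "X") for word in tokens(sentence)]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def stage1(doc_a: List[str], doc_b: List[str],
           min_ratio: float = 0.8) -> Tuple[int, int]:
    """Compare POS-tag sequences of every sentence pair.

    Returns (count, total_length) over the pairs whose tag-level LCS covers
    at least min_ratio of the longer sentence (assumed acceptance rule).
    """
    count, total = 0, 0
    for sent_a in doc_a:
        tags_a = pos_tags(sent_a)
        for sent_b in doc_b:
            tags_b = pos_tags(sent_b)
            longest = max(len(tags_a), len(tags_b))
            size = lcs_length(tags_a, tags_b)
            if longest and size / longest >= min_ratio:
                count += 1
                total += size
    return count, total


def stage2(doc_a: List[str], doc_b: List[str]) -> int:
    """Accumulate common words: best word-level LCS for each sentence of doc_a."""
    common = 0
    for sent_a in doc_a:
        words_a = tokens(sent_a)
        common += max((lcs_length(words_a, tokens(s)) for s in doc_b), default=0)
    return common


def compare(doc_a: List[str], doc_b: List[str],
            min_matches: int = 1) -> Optional[Tuple[int, int]]:
    """Run stage 2 only if stage 1 finds enough structural matches.

    Returns (lcs_count, common_words), or None if stage 1 rejects the pair.
    """
    lcs_count, _ = stage1(doc_a, doc_b)
    if lcs_count < min_matches:
        return None
    return lcs_count, stage2(doc_a, doc_b)


doc_a = ["The cat sat on the mat", "The dog slept"]
doc_b = ["A dog sat on a rug", "The cat slept"]
result = compare(doc_a, doc_b)
```

Comparing tag sequences first keeps the alphabet small (a handful of tags instead of a full vocabulary), which is the "text reduction" role the abstract attributes to the first stage. The sketch returns the stage-1 LCS count and the stage-2 common-word total separately rather than combining them, because the abstract does not state how the final measure weights the two.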
Pages: 715-719
Number of pages: 5