Visualizing Document Similarity Using N-Grams and Latent Semantic Analysis

Cited by: 0
Author
Hussein, Ashraf S. [1]
Affiliation
[1] Arab Open Univ, Fac Comp & Informat Technol, Kuwait, Kuwait
Keywords
document visualization; text-reuse; text mining; similarity estimation; plagiarism check; natural language processing; Latent Semantic Analysis; Singular Value Decomposition
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
As the number of information resources and the volume of documents grow explosively, efficient tools with intuitive visualization capabilities are badly needed to help users carry out document similarity analysis and plagiarism detection by discovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The proposed method models the relationship between documents and their n-gram phrases, which are generated from the normalized text by exploiting morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words within the examined documents. Text indexing and stop-word removal are performed using a new technique that handles multiple long documents efficiently. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that takes lexical and syntactic changes into account. Then, the hidden associations between the documents and their unique n-gram phrases are investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. As Arabic is one of the most morphologically rich and complex languages, this paper emphasizes similarity analysis and visualization of Arabic documents. Various experiments were carried out, revealing the strong capabilities of the proposed method in analyzing and visualizing literal similarity and some types of intelligent similarity.
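The general pipeline named in the abstract (n-gram phrases, a TF-IDF model, then LSA via SVD and pairwise similarity) can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes plain whitespace tokenization instead of Arabic morphological analysis and tagging, a standard smoothed TF-IDF weighting instead of the heuristic pairwise matching algorithm, and cosine similarity in the reduced space. The function names `word_ngrams`, `tfidf_matrix`, and `lsa_similarity` and the sample documents are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Split on whitespace (an assumption) and return word n-gram phrases."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build an (n-grams x documents) matrix with smoothed TF-IDF weights."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each n-gram
    num_docs = len(docs)
    X = np.zeros((len(vocab), num_docs))
    for j, c in enumerate(counts):
        for i, g in enumerate(vocab):
            if c[g]:
                idf = math.log((1 + num_docs) / (1 + doc_freq[g])) + 1
                X[i, j] = c[g] * idf
    return X, vocab


def lsa_similarity(X, k=2):
    """Project documents onto the top-k singular directions and compare by cosine."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    doc_coords = Vt[:k].T * s[:k]          # one row per document in the latent space
    norms = np.linalg.norm(doc_coords, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                # avoid division by zero for empty documents
    doc_coords = doc_coords / norms
    return doc_coords @ doc_coords.T


# Hypothetical sample documents: the first two share most of their bigrams.
docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices fell sharply today",
]
X, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(X, k=2)
print(np.round(sim, 3))
```

The resulting matrix `sim` is the kind of pairwise structure that a heat map or a 2D scatter plot of the latent coordinates could visualize; the choice of `k` controls how much of the fine-grained n-gram variation is smoothed away.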
Pages: 269-279
Page count: 11
Related papers
50 in total
  • [21] Using N-grams for arabic text searching
    Mustafa, SH
    Al-Radaideh, QA
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2004, 55 (11) : 1002 - 1007
  • [22] GAUGING SIMILARITY WITH N-GRAMS - LANGUAGE-INDEPENDENT CATEGORIZATION OF TEXT
    DAMASHEK, M
    [J]. SCIENCE, 1995, 267 (5199) : 843 - 848
  • [23] Protein classification using modified n-grams and skip-grams
    Islam, S. M. Ashiqul
    Heil, Benjamin J.
    Kearney, Christopher Michel
    Baker, Erich J.
    [J]. BIOINFORMATICS, 2018, 34 (09) : 1481 - 1487
  • [24] A visual framework for sequence analysis using n-grams and spectral rearrangement
    Maetschke, Stefan R.
    Kassahn, Karin S.
    Dunn, Jasmyn A.
    Han, Siew-Ping
    Curley, Eva Z.
    Stacey, Katryn J.
    Ragan, Mark A.
    [J]. BIOINFORMATICS, 2010, 26 (06) : 737 - 744
  • [25] Relative N-Gram Signatures: Document Visualization at the Level of Character N-Grams
    Jankowska, Magdalena
    Keselj, Vlado
    Milios, Evangelos
    [J]. 2012 IEEE CONFERENCE ON VISUAL ANALYTICS SCIENCE AND TECHNOLOGY (VAST), 2012 : 103 - 112
  • [26] Identifying Similar Sentences by Using N-Grams of Characters
    Sultana, Saima
    Biskri, Ismail
    [J]. RECENT TRENDS AND FUTURE TECHNOLOGY IN APPLIED INTELLIGENCE, IEA/AIE 2018, 2018, 10868 : 833 - 843
  • [27] Clone Detection for Ecore Metamodels using N-grams
    Babur, Onder
    [J]. PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON MODEL-DRIVEN ENGINEERING AND SOFTWARE DEVELOPMENT, 2018 : 411 - 419
  • [28] USING N-GRAMS TO IDENTIFY EDIT WARS ON WIKIPEDIA
    Ghosh, Arjun
    [J]. 2019 IEEE FIFTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM 2019), 2019 : 398 - 403
  • [29] Using n-grams of spatial densities to construct maps
    Maffei, Renan
    Jorge, Vitor A. M.
    Rey, Vitor E.
    Franco, Guilherme S.
    Giambastiani, Mariane
    Barbosa, Jessica
    Kolberg, Mariana
    Prestes, Edson
    [J]. 2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015 : 3850 - 3855
  • [30] A Pseudo-document-based Topical N-grams model for short texts
    Lin, Hao
    Zuo, Yuan
    Liu, Guannan
    Li, Hong
    Wu, Junjie
    Wu, Zhiang
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2020, 23 (06) : 3001 - 3023