Visualizing Document Similarity Using N-Grams and Latent Semantic Analysis

Author
Hussein, Ashraf S. [1]
Affiliation
[1] Arab Open Univ, Fac Comp & Informat Technol, Kuwait, Kuwait
Keywords
document visualization; text reuse; text mining; similarity estimation; plagiarism checking; natural language processing; Latent Semantic Analysis; Singular Value Decomposition
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
As the number of information resources and the volume of documents grow rapidly, efficient tools with intuitive visualization capabilities are urgently needed to assist users in document similarity analysis and plagiarism detection by uncovering hidden relations among documents. This paper proposes a content-based method for document similarity analysis and visualization. The proposed method models the relationship between documents and their n-gram phrases, which are generated from the normalized text using morphological analysis and lexical lookup. Possible morphological ambiguities are resolved by tagging the words in the examined documents. Text indexing and stop-word removal are performed with a new technique that handles multiple long documents efficiently. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. The hidden associations between the documents and their unique n-gram phrases are then investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. Different visualization techniques are then applied to the SVD results to expose the hidden relations among the documents under consideration. Because Arabic is one of the most morphologically rich and complex languages, this paper emphasizes similarity analysis and visualization of Arabic documents. Various experiments were carried out, demonstrating the capability of the proposed method to analyze and visualize literal similarity and some types of intelligent similarity.
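The core of the pipeline described in the abstract (n-gram phrases, a TF-IDF model, then SVD-based similarity) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes plain whitespace tokenization in place of the morphological analysis, tagging, stop-word removal, and heuristic pairwise matching, and the smoothed IDF formula, the function names (`word_ngrams`, `tfidf_matrix`, `lsa_similarity`), and the sample documents are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Return word n-grams from whitespace-tokenized, lower-cased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build an (n-grams x documents) TF-IDF matrix.

    The smoothed IDF used here is an assumption, not the paper's weighting.
    """
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1.0 for g in vocab}
    A = np.array([[c[g] * idf[g] for c in counts] for g in vocab], dtype=float)
    return A, vocab


def lsa_similarity(A, k=2):
    """Cosine similarity between documents in a rank-k latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_vecs / norms
    return unit @ unit.T


# Hypothetical sample documents: two near-duplicates and one unrelated text.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox leaps over the lazy dog",
    "latent semantic analysis uses singular value decomposition",
]
A, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(A, k=2)
print(np.round(sim, 3))
```

In this toy setting the two near-duplicate documents receive a similarity close to 1 and the unrelated document a similarity close to 0. The resulting matrix is the kind of pairwise measure that a heat map or a 2-D scatter plot of the latent document vectors could visualize.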
Pages: 269-279
Page count: 11