SUDMAD: Sequential and Unsupervised Decomposition of a Multi-Author Document Based on a Hidden Markov Model

Authors
Aldebei, Khaled [1, 2]
He, Xiangjian [1, 3]
Jia, Wenjing [1]
Yeh, Weichang [4]
Affiliations
[1] Univ Technol Sydney, Global Big Data Technol Ctr, Sydney, NSW, Australia
[2] Minjiang Univ, Fujian Prov Key Lab Informat Proc & Intelligent C, Fuzhou 350121, Fujian, Peoples R China
[3] Northwestern Polytech Univ, Sch Software & Microelect, Xian, Shaanxi, Peoples R China
[4] Natl Tsing Hua Univ, Dept Ind Engn & Engn Management, Hsinchu, Taiwan
DOI
10.1002/asi.23956
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Decomposing a document written by more than one author into sentences grouped by authorship is of great significance because of the increasing demand for plagiarism detection, forensic analysis, civil law (i.e., disputed copyright issues), and intelligence work involving disputed anonymous documents. Some existing approaches to document decomposition are limited to specific languages, depend on topics, or are restricted to documents with two authors, and their accuracy leaves considerable room for improvement. In this paper, we consider the contextual correlation hidden among sentences and propose an algorithm for Sequential and Unsupervised Decomposition of a Multi-Author Document (SUDMAD) that applies to a document written in any language, regardless of topic, by constructing a Hidden Markov Model (HMM) that reflects the authors' writing styles. To build and learn such a model, we first propose an unsupervised statistical approach that estimates the initial values of the HMM parameters of a preliminary model; it requires no information about the authors or the document's context other than the number of authors who contributed to the document. To further improve performance, we then propose a boosted HMM learning procedure in which the initial classification results are used to create labeled training data for learning a more accurate HMM. Moreover, the contextual relationship among sentences is further used to refine the classification results. The proposed approach is empirically evaluated on three benchmark datasets that are widely used for authorship analysis of documents. Comparisons with recent state-of-the-art approaches are also presented to demonstrate the significance of our new ideas and the superior performance of our approach.
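
The abstract does not spell out the decoding step, so the following is only a generic illustration of how an HMM can exploit sentence order, not the authors' procedure. It is a minimal Viterbi sketch in which hidden states stand for authors and each sentence has a per-author likelihood. The function name `viterbi`, the "stay with the same author" probability of 0.9, and all emission values are assumptions made up for this example.

```python
import math


def viterbi(log_emis, log_trans, log_init):
    """Most likely author sequence for a list of sentences under an HMM.

    log_emis[t][a]  : log P(sentence t | author a)   (illustrative values)
    log_trans[a][b] : log P(author b writes sentence t+1 | author a wrote t)
    log_init[a]     : log P(author a writes the first sentence)
    """
    n_states = len(log_init)
    # score[a] = best log-probability of any path ending in author a
    score = [log_init[a] + log_emis[0][a] for a in range(n_states)]
    back = []  # back[t-1][b] = best predecessor of author b at sentence t
    for t in range(1, len(log_emis)):
        prev, new_score = [], []
        for b in range(n_states):
            best_a = max(range(n_states),
                         key=lambda a: score[a] + log_trans[a][b])
            prev.append(best_a)
            new_score.append(score[best_a] + log_trans[best_a][b]
                             + log_emis[t][b])
        back.append(prev)
        score = new_score
    # Trace back from the best final state to recover the full path
    state = max(range(n_states), key=lambda a: score[a])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    """Element-wise natural log of a list of probability rows."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical two-author document with six sentences.
emissions = [[0.8, 0.2], [0.7, 0.3], [0.45, 0.55],
             [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
transitions = [[0.9, 0.1], [0.1, 0.9]]  # assumed: authors rarely alternate
initial = [0.5, 0.5]

# Labeling each sentence independently by its larger likelihood
per_sentence = [max(range(2), key=lambda a: row[a]) for row in emissions]
# Labeling the whole sequence jointly with the HMM
path = viterbi(logs(emissions), logs(transitions), logs([initial])[0])

print(per_sentence)  # [0, 0, 1, 0, 1, 1]
print(path)          # [0, 0, 0, 0, 1, 1]
```

In this toy input the third sentence slightly favors author 1 when judged alone, but the sequential model assigns it to author 0 because its neighbors belong to author 0. This is the kind of contextual correlation among sentences that the abstract refers to.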
Pages: 201 - 214
Number of pages: 14
Related papers
50 in total
  • [1] Unsupervised Multi-Author Document Decomposition Based on Hidden Markov Model
    Aldebei, Khaled
    He, Xiangjian
    Jia, Wenjing
    Yang, Jie
    PROCEEDINGS OF THE 54TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2016, : 706 - 714
  • [2] Unsupervised Decomposition of a Multi-Author Document Based on Naive-Bayesian Model
    Aldebei, Khaled
    He, Xiangjian
    Yang, Jie
    PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 501 - 505
  • [3] An improved algorithm for unsupervised decomposition of a multi-author document
    Giannella, Chris
    JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2016, 67 (02) : 400 - 411
  • [4] Sequential and Unsupervised Document Authorial Clustering Based on Hidden Markov Model
    Aldebei, Khaled
    Farhood, Helia
    Jia, Wenjing
    Nanda, Priyadarsi
    He, Xiangjian
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS / 11TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA SCIENCE AND ENGINEERING / 14TH IEEE INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS, 2017, : 379 - 385
  • [5] A Generic Unsupervised Method for Decomposing Multi-Author Documents
    Akiva, Navot
    Koppel, Moshe
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2013, 64 (11) : 2256 - 2264
  • [6] RESEARCH BASED ON CONCEPT MAPS AND HIDDEN MARKOV MODEL FOR MULTI-DOCUMENT SUMMARY
    Mu, Xinguo
    Hao, Wenning
    Chen, Gang
    Zhao, Shuining
    Jin, Dawei
    2011 4TH IEEE INTERNATIONAL CONFERENCE ON BROADBAND NETWORK AND MULTIMEDIA TECHNOLOGY (4TH IEEE IC-BNMT2011), 2011, : 611 - 614
  • [7] Credit Allocation for Each Author in a Multi-Author Paper Based on PageRank
    Wang J.-P.
    Guo Q.
    Liu J.-G.
    Guo, Qiang (qiang.guo@usst.edu.cn), 1600, Univ. of Electronic Science and Technology of China (49): : 918 - 923
  • [8] Unsupervised Image Sequence Segmentation Based on Hidden Markov Tree Model
    Zhang Yinhui
    Zhang Yunsheng
    Tang Xiangyang
    He Zifen
    PROCEEDINGS OF THE 27TH CHINESE CONTROL CONFERENCE, VOL 4, 2008, : 495 - +
  • [9] CONCEPT BASED QUERY AND DOCUMENT EXPANSION USING HIDDEN MARKOV MODEL
    Zhang, Jiuling
    Liu, Zuoda
    Deng, Beixing
    Li, Xing
    WEBIST 2009: PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS AND TECHNOLOGIES, 2009, : 697 - 700
  • [10] Unsupervised scene analysis: A hidden Markov model approach
    Bicego, M
    Cristani, M
    Murino, V
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2006, 102 (01) : 22 - 41