Topic Model Based Text Similarity Measure for Chinese Judgment Document

Cited by: 4
Authors
Wang, Yue [1 ,2 ]
Ge, Jidong [1 ,2 ]
Zhou, Yemao [1 ,2 ]
Feng, Yi [1 ,2 ]
Li, Chuanyi [1 ,2 ]
Li, Zhongjin [1 ,2 ]
Zhou, Xiaoyu [1 ,2 ]
Luo, Bin [1 ,2 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210093, Jiangsu, Peoples R China
[2] Nanjing Univ, Software Inst, Nanjing 210093, Jiangsu, Peoples R China
Source
DATA SCIENCE, PT II | 2017 / Vol. 728
Keywords
Chinese judgment documents; Data science; Machine learning; Natural language processing; Text similarity; TF-IDF; Topic model; Latent Dirichlet Allocation; Labeled Latent Dirichlet Allocation;
DOI
10.1007/978-981-10-6388-6_4
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the recent informatization of Chinese courts, the huge number of law cases and judgment documents that are stored digitally provides a good foundation for research on judicial big data and machine learning. In this setting, several tasks in Chinese courts can be automated or improved through machine learning, such as similar-document recommendation, workload evaluation based on the similarity of judgment documents, and prediction of possibly relevant statutes. To achieve these goals, and in view of the characteristics of Chinese judgment documents, we propose a topic-model-based approach to measuring the text similarity of Chinese judgment documents, built on TF-IDF, Latent Dirichlet Allocation (LDA), Labeled Latent Dirichlet Allocation (LLDA), and other treatments. Taking the characteristics of Chinese judgment documents into account, we focus on the specific steps of the approach, the preprocessing of the corpus, the choice of training parameters, and the evaluation of the similarity measurement results. In addition, by applying the approach to the prediction of possibly relevant statutes and using prediction accuracy as the evaluation metric, we design experiments that demonstrate the soundness of our design decisions and the high performance of our approach on text similarity measurement. The experiments also reveal limitations of our approach that need to be addressed in future work.
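As a rough illustration of the TF-IDF component named in the abstract, the following is a minimal sketch of TF-IDF weighting followed by cosine similarity. It is not the authors' pipeline: the LDA and LLDA stages are omitted, the smoothed IDF formula is one common variant chosen here as an assumption, and the function names `tfidf_vectors` and `cosine` as well as the toy token lists are hypothetical. Real Chinese judgment documents would first need word segmentation, which is assumed to have been done already.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors.

    docs: list of token lists (assumed to be already word-segmented).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: two theft cases and one contract dispute.
docs = [
    ["defendant", "theft", "property", "sentence"],
    ["defendant", "theft", "vehicle", "sentence"],
    ["plaintiff", "contract", "breach", "damages"],
]
vecs = tfidf_vectors(docs)
sim_theft_pair = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_theft_pair, sim_unrelated)
```

In this toy corpus the two theft cases share three terms and so score higher than the theft and contract pair, which shares none. A topic-model-based measure such as the one the abstract describes would compare documents in topic space rather than on raw term overlap.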
Pages: 42 - 54
Page count: 13