Accounting for Language Changes Over Time in Document Similarity Search

Cited by: 5
Authors
Morsy, Sara [1]
Karypis, George [1]
Affiliation
[1] Univ Minnesota, Dept Comp Sci, 200 Union St SE, Minneapolis, MN 55455 USA
Keywords
Citation network; language change; longitudinal document collections; regularization; similarity search; terms usage frequency changes
DOI
10.1145/2934671
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given a query document, ranking the documents in a collection based on how similar they are to the query is an essential task with extensive applications. For collections that contain documents whose creation dates span several decades, this task is further complicated by the fact that the language changes over time. For example, many terms add or lose one or more senses to meet people's evolving needs. To address this problem, we present methods that take advantage of two types of information to account for the language change. The first is the citation network that often exists within the collection, which can be used to link related documents with significantly different creation dates (and hence different language use). The second is the changes in the usage frequency of terms that occur over time, which can indicate changes in their senses and uses. These methods utilize the preceding information while estimating the representation of both documents and terms within the context of nonprobabilistic static and dynamic topic models. Our experiments on two real-world datasets that span more than 40 years show that our proposed methods improve the retrieval performance of existing models and that these improvements are statistically significant.
Pages: 26
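
Illustrative sketch
The abstract names two ingredients: ranking documents by similarity to a query, and tracking how term usage frequency changes over time. The sketch below is not the authors' method, which estimates document and term representations within regularized nonprobabilistic topic models. It is only a minimal baseline, assuming plain bag-of-words counts, cosine similarity, and whitespace tokenization. All function names and the toy period labels are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts for one document (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    q = tf_vector(query)
    scores = [cosine(q, tf_vector(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def usage_frequency(docs_by_period):
    """Relative frequency of each term within each time period."""
    freqs = {}
    for period, docs in docs_by_period.items():
        counts = Counter()
        for d in docs:
            counts.update(tf_vector(d))
        total = sum(counts.values())
        freqs[period] = {t: c / total for t, c in counts.items()}
    return freqs


def frequency_change(freqs, term, old, new):
    """Difference in a term's relative frequency between two periods."""
    return freqs[new].get(term, 0.0) - freqs[old].get(term, 0.0)


# Toy collection (invented for illustration only).
docs_by_period = {
    "1970s": ["the cell wall of the plant", "a prison cell door"],
    "2010s": ["cell phone network", "mobile cell tower and cell phone"],
}
freqs = usage_frequency(docs_by_period)
print(frequency_change(freqs, "cell", "1970s", "2010s"))
print(rank_by_similarity("cell phone", ["a prison cell door",
                                        "cell phone network",
                                        "the plant wall"]))
```

A large shift in a term's relative frequency between periods is the kind of signal the abstract describes as a possible indicator of changed senses. A plain cosine baseline like this one ignores that signal, which is the gap the abstract's methods aim to address.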