Accounting for Language Changes Over Time in Document Similarity Search

Cited by: 5
Authors
Morsy, Sara [1]
Karypis, George [1]
Affiliations
[1] Univ Minnesota, Dept Comp Sci, 200 Union St SE, Minneapolis, MN 55455 USA
Keywords
Citation network; language change; longitudinal document collections; regularization; similarity search; term usage frequency changes
DOI
10.1145/2934671
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given a query document, ranking the documents in a collection based on how similar they are to the query is an essential task with extensive applications. For collections that contain documents whose creation dates span several decades, this task is further complicated by the fact that the language changes over time. For example, many terms add or lose one or more senses to meet people's evolving needs. To address this problem, we present methods that take advantage of two types of information to account for the language change. The first is the citation network that often exists within the collection, which can be used to link related documents with significantly different creation dates (and hence different language use). The second is the changes in the usage frequency of terms that occur over time, which can indicate changes in their senses and uses. These methods utilize the preceding information while estimating the representation of both documents and terms within the context of nonprobabilistic static and dynamic topic models. Our experiments on two real-world datasets that span more than 40 years show that our proposed methods improve the retrieval performance of existing models and that these improvements are statistically significant.
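As a rough illustration of the task and of the second signal named in the abstract, the sketch below ranks documents by plain bag-of-words cosine similarity and measures how a term's relative usage frequency differs across decades. It is not the authors' method, which relies on regularized topic models and the citation network; the function names (`rank_by_similarity`, `usage_frequency_by_decade`) and the whitespace tokenization are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term counts (assumption: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, docs):
    """Return document indices, most similar to the query first.

    Ties are broken by the original document order.
    """
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


def usage_frequency_by_decade(dated_docs, term):
    """Relative frequency of `term` per decade.

    `dated_docs` is a list of (year, text) pairs (hypothetical input format).
    """
    term_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for year, text in dated_docs:
        decade = (year // 10) * 10
        tokens = text.lower().split()
        term_counts[decade] += tokens.count(term.lower())
        total_counts[decade] += len(tokens)
    return {d: term_counts[d] / total_counts[d]
            for d in total_counts if total_counts[d] > 0}
```

A baseline like this matches only on shared surface terms, so it cannot link an older document to a newer one that describes the same concept in different vocabulary. That gap is the one the abstract's citation-network and frequency-change signals are meant to address.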
Pages: 26
Related papers
50 items in total
  • [41] Secure similarity search over encrypted cloud images
    Jiangsu Engineering Center of Network Monitoring, China
    Unknown
    Unknown
    Int. J. Secur. Appl., 8 (1-14):
  • [42] Secure Similarity Search over Encrypted Cloud Images
    Zhu, Yi
    Sun, Xingming
    Xia, Zhihua
    Xiong, Naixue
    INTERNATIONAL JOURNAL OF SECURITY AND ITS APPLICATIONS, 2015, 9 (08): : 1 - 14
  • [43] Fishing in the Stream: Similarity Search over Endless Data
    Kraus, Naama
    Carmel, David
    Keidar, Idit
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 964 - 969
  • [44] Efficient and Effective Similarity Search over Bipartite Graphs
    Yang, Renchi
    PROCEEDINGS OF THE ACM WEB CONFERENCE 2022 (WWW'22), 2022, : 308 - 318
  • [45] Journalistic corpus similarity over time
    Mota, Cristina
    CORPUS-LINGUISTIC APPLICATIONS CURRENT STUDIES, NEW DIRECTIONS, 2010, 71 : 67 - 83
  • [46] Internet: Tool for document and term search in training language specialists
    Francoeur, A
    Cormier, MC
    Lamontagne, C
    TRADUCTION ET LANGUES DE SPECIALITE: APPROCHES THEORIQUES ET CONSIDERATIONS PEDAGOGIQUES, 1998, 214 : 37 - 45
  • [47] The role of item similarity in the time course of hybrid search and memory search
    Utochkin, Igor S.
    Nartker, Makaela
    Tikhonenko, Platon
    Gronau, Nurit
    Wolfe, Jeremy M.
    PERCEPTION, 2021, 50 (1_SUPPL) : 62 - 62
  • [48] A Novel Method for Similarity Search over Meteorological Time Series Data based on the Coulomb's Law
    de Andrade, Claudinei Garcia
    Ribeiro, Marcela Xavier
    Yaguinuma, Cristiane
    Prado Santos, Marilde Terezinha
    ICEIS: PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1, 2013, : 209 - 216
  • [49] Beyond document similarity: Understanding value-based search and browsing technologies
    Paepcke, Andreas
    Garcia-Molina, Hector
    Rodriguez-Mula, Gerard
    Cho, Junghoo
    SIGMOD Record (ACM Special Interest Group on Management of Data), 2000, 29 (01): : 80 - 92
  • [50] PathEmb: Random Walk Based Document Embedding for Global Pathway Similarity Search
    Zhang, Jiao
    Kwong, Sam
    Liu, Guangming
    Lin, Qiuzhen
    Wong, Ka-Chun
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2019, 23 (03) : 1329 - 1335