Privacy-Preserving Text Similarity via Non-Prefix-Free Codes

Cited by: 1
Authors
Kulekci, M. Oguzhan [1]
Habib, Ismail [1]
Aghabaiglou, Amir [1]
Affiliations
[1] Istanbul Tech Univ, Inst Informat, Istanbul, Turkey
Keywords
INFORMATION;
DOI
10.1007/978-3-030-32047-8_9
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many methods have been proposed to compute the similarity score alpha <- S(A, B) between two plain documents A and B. However, when their contents are confidential, special processing is required to protect privacy. Most solutions offered to date are based on homomorphic encryption or secure multi-party computation techniques, whose computational cost inhibits practical use, especially on massive sets. In this study we propose an alternative: encoding the documents with non-prefix-free (NPF) coding before applying the preferred similarity metric S(). NPF coding represents the symbols with variable-length codewords, where the codeword set is generated without the prefix-free restriction. Thus, a codeword may be a prefix of another, and without explicit codeword boundary information, retrieving the original data from the encoded stream becomes hard because non-prefix-free codes are not uniquely decodable. We provide a combinatorial analysis of this hardness, and experimentally compare the similarity scores obtained on NPF-encoded documents with those obtained on the original plain-text versions. We considered normalized compression distance (NCD) and the Jaccard coefficient (JC) as the similarity metric S(). When A' and B' denote the NPF-encoded documents, experiments conducted on the METER corpus revealed that the difference between alpha' <- S(A', B') and alpha <- S(A, B) lies between 0.5% and 3% for both NCD and JC.
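The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the rank-based codeword assignment ('0', '1', '00', '01', ...), the codebook built from the two documents together, the k-gram sets used for the Jaccard coefficient, zlib as the compressor for NCD, and the function names npf_codebook, npf_encode, jaccard and ncd.

```python
import zlib
from collections import Counter


def npf_codebook(text):
    """Assign NPF codewords by frequency rank (assumed scheme, for illustration).

    Rank r (0-based) gets bin(r + 2) without its leading '1', which yields
    '0', '1', '00', '01', '10', '11', '000', ... so codewords may be
    prefixes of one another.
    """
    counts = Counter(text)
    # Most frequent symbols first; ties broken by symbol for determinism.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: bin(r + 2)[3:] for r, s in enumerate(ranked)}


def npf_encode(text, codebook):
    """Concatenate codewords with no boundary markers."""
    return "".join(codebook[s] for s in text)


def jaccard(x, y, k=8):
    """Jaccard coefficient over the sets of k-grams of two strings."""
    gx = {x[i:i + k] for i in range(len(x) - k + 1)}
    gy = {y[i:i + k] for i in range(len(y) - k + 1)}
    union = gx | gy
    return len(gx & gy) / len(union) if union else 1.0


def ncd(x, y):
    """Normalized compression distance, using zlib as the compressor."""
    cx = len(zlib.compress(x.encode()))
    cy = len(zlib.compress(y.encode()))
    cxy = len(zlib.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical documents; the codebook is built from both (an assumption).
A = "the cat sat on the mat and looked at the dog"
B = "the cat sat on a mat and looked at a dog"
book = npf_codebook(A + B)
A_enc, B_enc = npf_encode(A, book), npf_encode(B, book)

print("JC  plain:", jaccard(A, B, k=3), " encoded:", jaccard(A_enc, B_enc, k=8))
print("NCD plain:", ncd(A, B), " encoded:", ncd(A_enc, B_enc))
```

With this assignment, the two most frequent symbols get '0' and '1' and the third gets '00', so the encoded stream '00' could stand for one symbol or two; that ambiguity is the loss of unique decodability the abstract relies on.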
Pages: 94-102
Number of pages: 9