Hide and Mine in Strings: Hardness, Algorithms, and Experiments

Cited by: 3

Authors:

Bernardini, Giulia [1,2]

Conte, Alessio [3]

Gourdel, Garance [3,4]

Grossi, Roberto [5,6]

Loukides, Grigorios [7]

Pisanti, Nadia [3,6]

Pissis, Solon P. [2,8]

Punzi, Giulia [3]

Stougie, Leen [2,9]

Sweering, Michelle [2]

Affiliations:

[1] Univ Trieste, Trieste I-34127, Italy

[2] CWI, NL-1098 XG Amsterdam, Netherlands

[3] Univ Pisa, I-56126 Pisa, Italy

[4] Inria Rennes, ENS, Ecole Normale Super, Gif Sur Yvette F-91190, France

[5] Univ Pisa, Comp Sci, I-91190 Pisa, Italy

[6] ERABLE Team, F-38330 Montbonnot SaintMartin, France

[7] Kings Coll London, London WC2R 2LS, England

[8] Vrije Univ, NL-1081 HV Amsterdam, Netherlands

[9] Vrije Univ, Operat Res, NL-1081 HV Amsterdam, Netherlands

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2023 / Vol. 35 / No. 06

Funding:

European Union Horizon 2020;

Keywords:

Data mining; Bioinformatics; Genomics; DNA; Data integrity; Privacy; Resists; Data privacy; data sanitization; knowledge hiding; frequent pattern mining; string algorithms; MOTIFS; FRAMEWORK; PATTERNS; RULES;

DOI:

10.1109/TKDE.2022.3158063

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.

Pages: 5948 - 5963

Number of pages: 16
