Reconstructing Textual Documents from n-grams

Cited by: 2

Authors:

Galle, Matthias [1]

Tealdi, Matias [1,2]

Affiliations:

[1] Xerox Res Ctr Europe, Meylan, France

[2] Univ Nacl Cordoba, Cordoba, Argentina

Source:

KDD'15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | 2015

DOI:

10.1145/2783258.2783361

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

We analyze the problem of reconstructing documents when we only have access to the n-grams of the original documents and their counts, for a fixed n. Formally, we are interested in recovering the longest contiguous substrings whose presence in the original documents is certain. We map this problem onto a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycle gives a plausible reconstruction. We define two rules that reduce this graph while preserving all possible reconstructions and, at the same time, increasing the length of the edge labels. From a theoretical perspective, we prove that the iterative application of these rules yields an irreducible graph equivalent to the original one. We then apply this procedure to data from Project Gutenberg to measure the number and size of the longest substrings obtained. Moreover, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low-frequency n-grams has little impact. Instead, we propose another method that strategically adds fictitious n-grams, and show that a corpus noised in this way is much harder to reconstruct, while the perplexity of a language model trained on it increases only slightly.
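The graph formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not include their two reduction rules. It assumes word-level tokens, treats each n-gram as an edge from its (n-1)-gram prefix to its (n-1)-gram suffix, and walks an Eulerian path (rather than a cycle) with Hierholzer's algorithm to produce one plausible reconstruction. The function names `ngram_counts`, `de_bruijn_graph`, `eulerian_path` and `reconstruct` are hypothetical, and the input is assumed to form a connected graph that admits an Eulerian path.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def de_bruijn_graph(counts):
    """Map each (n-1)-gram prefix to the list of (n-1)-gram suffixes it reaches.

    An n-gram that occurs c times contributes c parallel edges.
    """
    graph = defaultdict(list)
    for gram, c in sorted(counts.items()):
        graph[gram[:-1]].extend([gram[1:]] * c)
    return graph


def eulerian_path(graph):
    """Return one Eulerian path as a list of nodes (Hierholzer's algorithm).

    Assumption: the graph is connected and an Eulerian path exists.
    """
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = Counter(v for vs in graph.values() for v in vs)
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in sorted(nodes) if out_deg.get(u, 0) - in_deg.get(u, 0) == 1),
        None,
    )
    if start is None:
        start = min(graph)
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node and backtrack
    return path[::-1]


def reconstruct(counts):
    """Spell out the token sequence along one Eulerian path."""
    path = eulerian_path(de_bruijn_graph(counts))
    return list(path[0]) + [node[-1] for node in path[1:]]


if __name__ == "__main__":
    tokens = "to be or not to be that is the question".split()
    counts = ngram_counts(tokens, 2)
    print(" ".join(reconstruct(counts)))
```

In this toy example the bigram graph has a single Eulerian path, so the sentence is recovered exactly. In general several paths exist, each yielding the same n-gram counts, which is why the abstract targets only the substrings that are certain to appear in the original.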

Pages: 329-338

Page count: 10
