TemporalDedup: Domain-Independent Deduplication of Redundant and Errant Temporal Data

Authors
Rogers, Jon [1]
Aygun, Ramazan [2]
Etzkorn, Letha [1]
Affiliations
[1] Univ Alabama, Dept Comp Sci, 301 Sparkman Dr, Huntsville, AL 35899 USA
[2] Kennesaw State Univ, Dept Comp Sci, 1100 South Marietta Pkwy SE, Marietta, GA 30060 USA
Keywords
Temporal data; record deduplication; record linkage; entity matching; data preparation
DOI
10.1142/S1793351X23500010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deduplication is a key component of the data preparation process, a bottleneck in the machine learning (ML) and data mining pipeline that is very time-consuming and often relies on domain expertise and manual involvement. Further, temporal data is increasingly prevalent and is not well suited to traditional similarity and distance-based deduplication techniques. We establish a fully automated, domain-independent deduplication model for temporal data domains, known as TemporalDedup, that infers the key attribute(s), applies a base set of deduplication techniques focused on value matches for key, non-key, and elapsed time, and further detects duplicates through inference of temporal ordering requirements using Longest Common Subsequence (LCS) for records of a shared type. Using LCS, we split each record's temporal sequence into constrained and unconstrained sequences. We flag suspicious (errant) records that are non-adherent to the inferred constrained order and we flag a record as a duplicate if its unconstrained order, of sufficient length, matches that of another record. TemporalDedup was compared against a similarity-based Adaptive Sorted Neighborhood Method (ASNM) in evaluating duplicates for two disparate datasets: (1) 22,794 records from Sony's PlayStation Network (PSN) trophy data, where duplication may be indicative of cheating, and (2) emergency declarations and government responses related to COVID-19 for all U.S. states and territories. TemporalDedup (F1-scores of 0.971 and 0.954) exhibited combined sensitivities above 0.9 for all duplicate classes whereas ASNM (0.705 and 0.732) exhibited combined sensitivities below 0.2 for all time and order duplicate classes.
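The LCS-based step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy event labels, the `min_len` threshold, and the choice to infer the constrained order by folding a pairwise LCS over a few records assumed to be clean are all assumptions made for illustration. It also assumes each event appears at most once per record.

```python
from functools import reduce


def lcs(a, b):
    """Classic dynamic-programming Longest Common Subsequence of two lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def infer_constrained_order(clean_sequences):
    """ASSUMPTION: fold pairwise LCS over records believed to be clean."""
    return reduce(lcs, clean_sequences)


def split_sequence(seq, order):
    """Split a record's events into constrained and unconstrained parts."""
    members = set(order)
    constrained = [e for e in seq if e in members]
    unconstrained = [e for e in seq if e not in members]
    return constrained, unconstrained


def analyse(records, order, min_len=2):
    """Flag errant records and duplicate candidates (hypothetical helper)."""
    errant, duplicates, seen = [], [], {}
    for rid, seq in records.items():
        constrained, unconstrained = split_sequence(seq, order)
        # Adherent if the constrained events follow the inferred order.
        expected = [e for e in order if e in set(constrained)]
        if constrained != expected:
            errant.append(rid)
        # Duplicate candidate if a long-enough unconstrained order repeats.
        if len(unconstrained) >= min_len:
            key = tuple(unconstrained)
            if key in seen:
                duplicates.append((seen[key], rid))
            else:
                seen[key] = rid
    return errant, duplicates


# Toy data: A-D are order-constrained events, x and y are free-order events.
records = {
    "r1": ["A", "x", "y", "B", "C", "D"],
    "r2": ["A", "B", "C", "y", "x", "D"],
    "r3": ["A", "B", "x", "y", "C", "D"],
    "r4": ["A", "C", "B", "D"],
}
order = infer_constrained_order([records["r1"], records["r2"]])
errant, duplicates = analyse(records, order)
```

In this toy run, `r4` violates the inferred order and `r3` repeats the unconstrained order of `r1`. A threshold of 2 is only suitable for a toy example; the abstract requires the unconstrained sequence to be "of sufficient length" without stating a value here.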
Pages: 309-343
Page count: 35