Entity Matching on Unstructured Data: An Active Learning Approach

Cited by: 9
Authors
Brunner, Ursin [1]
Stockinger, Kurt [1]
Affiliations
[1] ZHAW Zurich Univ Appl Sci, Zurich, Switzerland
Source
2019 6TH SWISS CONFERENCE ON DATA SCIENCE (SDS) | 2019
Keywords
DOI
10.1109/SDS.2019.00006
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the growing number of data sources in enterprises, entity matching becomes a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entities as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline to minimize the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques such as soft tf-idf to pre-process the record pairs before we classify them using a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system to work with any state-of-the-art classifier such as support vector machines, random forests or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring 1 to 2 orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
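The uncertainty-sampling idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' ALWUS algorithm, whose details are not given in this record; it is a generic pool-based loop under stated assumptions. Each record pair is assumed to be already reduced to a numeric similarity vector (for example by a soft tf-idf step), the classifier is a small hand-written logistic regression standing in for any pluggable model, and the names `predict_proba`, `train_logreg`, `most_uncertain`, `active_learn`, and `oracle` are hypothetical.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability that a record pair with feature vector x is a match."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent (stand-in classifier)."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        errs = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errs, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return w, b


def most_uncertain(w, b, pool, unlabeled):
    """Return the unlabeled index whose match probability is closest to 0.5."""
    return min(unlabeled, key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))


def active_learn(pool, oracle, seed, budget):
    """Pool-based active learning: retrain, query the least certain pair, repeat."""
    labels = {i: oracle(i) for i in seed}
    unlabeled = [i for i in range(len(pool)) if i not in labels]
    for _ in range(budget):
        idx = list(labels)
        w, b = train_logreg([pool[i] for i in idx], [labels[i] for i in idx])
        pick = most_uncertain(w, b, pool, unlabeled)
        labels[pick] = oracle(pick)  # the only step that costs human effort
        unlabeled.remove(pick)
    idx = list(labels)
    model = train_logreg([pool[i] for i in idx], [labels[i] for i in idx])
    return model, labels


# Synthetic pool (an assumption for illustration): two similarity scores per pair.
rng = random.Random(0)
pool = [[rng.random(), rng.random()] for _ in range(200)]


def oracle(i):
    """Simulated human labeler: a pair matches when its scores sum above 1."""
    return 1 if pool[i][0] + pool[i][1] > 1.0 else 0


(w, b), labels = active_learn(pool, oracle, seed=[0, 1], budget=10)
print(len(labels), "labels requested out of", len(pool), "pairs")
```

The design point is that the labeler is consulted only for the pairs the current model is least sure about, which is how such a loop aims to reduce labeling effort; swapping `train_logreg` for another classifier that exposes match probabilities leaves the loop unchanged.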
Pages: 97 - 102
Number of pages: 6
Related papers
50 in total
  • [21] A Framework for Classifying Unstructured Data of Cardiac Patients: A Supervised Learning Approach
    Basharat, Iqra
    Anjum, Ali Raza
    Fatima, Mamuna
    Qamar, Usman
    Khan, Shoal Ahmed
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2016, 7 (02) : 133 - 141
  • [22] Combining structured and unstructured data for predictive models: a deep learning approach
    Dongdong Zhang
    Changchang Yin
    Jucheng Zeng
    Xiaohui Yuan
    Ping Zhang
    BMC Medical Informatics and Decision Making, 20
  • [23] Impact of the Characteristics of Multi-source Entity Matching Tasks on the Performance of Active Learning Methods
    Primpeli, Anna
    Bizer, Christian
    SEMANTIC WEB, ESWC 2022, 2022, 13261 : 113 - 129
  • [24] Anonymization of Unstructured Data via Named-Entity Recognition
    Hassan, Fadi
    Domingo-Ferrer, Josep
    Soria-Comas, Jordi
    MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE (MDAI 2018), 2018, 11144 : 296 - 305
  • [25] Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization
    Zhao, Huimin
    Ram, Sudha
    DATA & KNOWLEDGE ENGINEERING, 2008, 66 (03) : 368 - 381
  • [26] Product Entity Matching via Tabular Data
    Abadi, Ali Naeim
    Nayeem, Mir Tafseer
    Rafiei, Davood
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023 : 4215 - 4219
  • [27] A private entity matching approach for multiple databases
    Han, Shumin
    Shen, Derong
    Nie, Tiezheng
    Kou, Yue
    Yu, Ge
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 38 (04) : 4403 - 4414
  • [28] An active learning approach to hyperspectral data classification
    Rajan, Suju
    Ghosh, Joydeep
    Crawford, Melba M.
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2008, 46 (04) : 1231 - 1242
  • [29] The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping
    Ashish, Naveen
    Dewan, Peehoo
    Toga, Arthur W.
    FRONTIERS IN NEUROINFORMATICS, 2016, 10 : 1 - 10
  • [30] A Framework of Data Augmentation While Active Learning for Chinese Named Entity Recognition
    Li, Qingqing
    Huang, Zhen
    Dou, Yong
    Zhang, Ziwen
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2021, PT II, 2021, 12816 : 88 - 100