Capturing Semantics for Imputation with Pre-trained Language Models

Cited by: 3
Authors
Mei, Yinan [1 ]
Song, Shaoxu [1 ]
Fang, Chenguang [1 ]
Yang, Haifeng [2 ]
Fang, Jingyun [2 ]
Long, Jiang [2 ]
Affiliations
[1] Tsinghua University, School of Software, BNRist, Beijing, China
[2] HUAWEI Cloud BU, Data Governance Innovation Lab, Beijing, China
Funding
National Natural Science Foundation of China
Keywords
Imputation; Deep Learning; Pre-trained Language Models;
DOI
10.1109/ICDE51399.2021.00013
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Existing imputation methods generally generate several possible fillings as candidates and then determine which candidate value to impute. However, these methods ignore semantics. Recently, pre-trained language models have achieved strong performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model the imputation task as a multiclass classification task, named IPM-Multi, which predicts the missing values by fine-tuning the pre-trained model. Due to the low redundancy of databases and large attribute domain sizes, IPM-Multi may suffer from over-fitting. In this case, we develop another approach named IPM-Binary, which first generates a set of uncertain candidates and then fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models candidate selection as a binary classification problem. Unlike IPM-Multi, IPM-Binary computes the probability for each candidate filling separately, taking both the complete attributes and the candidate filling as input. The attention mechanism enhances the ability of IPM-Binary to capture semantic information. Moreover, negative sampling from neighbors rather than from attribute domains accelerates training and makes it more targeted and effective; as a result, IPM-Binary requires less data to converge. We compare IPM to state-of-the-art baselines on multiple datasets, and extensive experimental results show that IPM outperforms existing solutions. The evaluation validates our intuitions and demonstrates the effectiveness of the proposed optimizations.
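
The sketch below is a minimal illustration of the IPM-Binary candidate-scoring idea described in the abstract (serialize a tuple's complete attributes together with one candidate filling, then score the pair with a binary classifier built on a pre-trained language model). It is not the authors' implementation: the bert-base-uncased checkpoint, the "attribute is value" serialization format, and the function names serialize and score_candidates are assumptions for illustration, and the classifier would still need fine-tuning (with neighbor-based negative sampling, as the abstract describes) before its scores are meaningful.

# Illustrative sketch (not the authors' code): score candidate fillings
# for one missing attribute with a binary sequence classifier.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # label 1 = candidate is the correct filling

def serialize(row: dict, target_attr: str) -> str:
    # Flatten the complete (non-missing) attributes of a tuple into text.
    return " ; ".join(f"{a} is {v}" for a, v in row.items()
                      if a != target_attr and v is not None)

def score_candidates(row: dict, target_attr: str, candidates: list) -> list:
    # Compute P(correct) for each candidate independently, feeding the
    # complete attributes and the candidate filling as a sentence pair.
    context = serialize(row, target_attr)
    probs = []
    for cand in candidates:
        enc = tokenizer(context, f"{target_attr} is {cand}",
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return probs

# Hypothetical usage: impute a missing "city" from two candidate fillings.
row = {"name": "Acme Store", "country": "China", "city": None}
print(score_candidates(row, "city", ["Beijing", "Shanghai"]))
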
Pages: 61-72
Number of pages: 12
Related Papers (50 in total)
  • [1] Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models
        Huang, James Y.; Huang, Kuan-Hao; Chang, Kai-Wei
        NAACL-HLT 2021, pp. 1372-1379
  • [2] Pre-Trained Language Models and Their Applications
        Wang, Haifeng; Li, Jiwei; Wu, Hua; Hovy, Eduard; Sun, Yu
        Engineering, 2023, 25: 51-65
  • [3] Annotating Columns with Pre-trained Language Models
        Suhara, Yoshihiko; Li, Jinfeng; Li, Yuliang; Zhang, Dan; Demiralp, Cagatay; Chen, Chen; Tan, Wang-Chiew
        SIGMOD 2022, pp. 1493-1503
  • [4] LaoPLM: Pre-trained Language Models for Lao
        Lin, Nankai; Fu, Yingwen; Yang, Ziyu; Chen, Chuwei; Jiang, Shengyi
        LREC 2022, pp. 6506-6512
  • [5] PhoBERT: Pre-trained language models for Vietnamese
        Dat Quoc Nguyen; Anh Tuan Nguyen
        Findings of EMNLP 2020, pp. 1037-1042
  • [6] HinPLMs: Pre-trained Language Models for Hindi
        Huang, Xixuan; Lin, Nankai; Li, Kexin; Wang, Lianxi; Gan, Suifu
        IALP 2021, pp. 241-246
  • [7] Evaluating Commonsense in Pre-Trained Language Models
        Zhou, Xuhui; Zhang, Yue; Cui, Leyang; Huang, Dandan
        AAAI 2020, 34: 9733-9740
  • [8] Knowledge Inheritance for Pre-trained Language Models
        Qin, Yujia; Lin, Yankai; Yi, Jing; Zhang, Jiajie; Han, Xu; Zhang, Zhengyan; Su, Yusheng; Liu, Zhiyuan; Li, Peng; Sun, Maosong; Zhou, Jie
        NAACL-HLT 2022, pp. 3921-3937
  • [9] Pre-trained Language Models in Medicine: A Survey
        Luo, Xudong; Deng, Zhiqi; Yang, Binxia; Luo, Michael Y.
        Artificial Intelligence in Medicine, 2024, 154
  • [10] Probing for Hyperbole in Pre-Trained Language Models
        Schneidermann, Nina Skovgaard; Hershcovich, Daniel; Pedersen, Bolette Sandford
        ACL-SRW 2023, Vol. 4, pp. 200-211