Survey on Distantly-Supervised Relation Extraction

Authors
Yang S.-Z. [1]
Liu Y.-X. [1]
Zhang K.-W. [1]
Hong Y. [1]
Huang H. [1]
Affiliation
[1] School of Software Engineering, South China University of Technology, Guangzhou
Funding
National Natural Science Foundation of China
Keywords
Distant supervision; Information extraction; Long tail; Noise reduction; Relation extraction; Wrong labelling
DOI
10.11897/SP.J.1016.2021.1636
Abstract
Relation extraction is a fundamental task in natural language processing and an essential part of information extraction, but its datasets are expensive to build because they require manual labelling. Distant supervision was proposed to reduce the cost of manually annotated corpora by building relation extraction datasets automatically. Owing to its value for automatic relation extraction, it has attracted wide attention from academia and industry in recent years. However, datasets constructed by distant supervision are not equivalent to manually annotated ones: they suffer from wrong labelling and long-tail distributions, and the resulting low quality hinders the improvement of relation extraction models trained on them. Consequently, most existing work on distantly-supervised relation extraction (DSRE) focuses on handling the noise generated by wrong labelling and on the long-tail distribution. In recent years, deep learning technologies such as deep neural networks, attention mechanisms and deep reinforcement learning have developed rapidly. Compared with traditional machine learning methods, e.g. feature-based methods, deep learning methods have obvious advantages in relation extraction, including DSRE, which brings the task a new round of opportunities and challenges. As research has continued, a common workflow for the task has gradually emerged. This paper summarizes existing work on DSRE, with particular attention to methods based on deep learning. It starts with an introduction to distant supervision and its vanilla assumption, analyzes the major shortcomings, and reviews methods based on traditional machine learning such as topic models and pattern correlation.
The paper then introduces the general workflow, which consists of four modules: sample collection, external information, encoder and classifier. According to the problem they target, existing methods are divided into two categories: noise reduction methods for DSRE and solutions to the long-tail distribution. Within each category, the work is summarized along the modules of the common workflow from four aspects, namely sample noise reduction, external information fusion, encoder optimization and classifier optimization. The paper also analyzes the different improvements proposed for the same module and compares their strengths and weaknesses. These four aspects are not mutually exclusive: a single method may improve two or more modules at the same time. In addition, the paper describes in detail the datasets commonly used for this task, together with their related corpora and knowledge graphs, and introduces the metrics and evaluation methods used for DSRE. Finally, the paper forecasts future development trends. To advance the task, the authors hope that DSRE can be integrated with promising technologies such as joint extraction, few-shot learning and hybrid supervision. © 2021, Science Press. All rights reserved.
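As an illustration of the vanilla distant supervision assumption mentioned in the abstract, the following minimal Python sketch labels every sentence that mentions both entities of a knowledge-base pair with that pair's relation. It is not taken from the paper: the toy knowledge base `KB`, the function `distant_label`, the relation name `born_in` and the example sentences are all hypothetical, and naive substring matching stands in for real entity linking.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (head entity, tail entity) -> relation.
KB: Dict[Tuple[str, str], str] = {
    ("Barack Obama", "Hawaii"): "born_in",
}


def distant_label(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str, str, str]]:
    """Tag each sentence mentioning both entities of a KB pair with the pair's relation."""
    labelled = []
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            # Assumption: substring matching replaces proper entity recognition.
            if head in sentence and tail in sentence:
                labelled.append((sentence, head, tail, relation))
    return labelled


sentences = [
    "Barack Obama was born in Hawaii.",
    # Mentions the entity pair but does not express born_in: a wrongly labelled sample.
    "Barack Obama visited Hawaii last week.",
    "Hawaii is an island state.",
]

labelled = distant_label(sentences, KB)
for sentence, head, tail, relation in labelled:
    print(f"{relation}({head}, {tail}) <- {sentence}")
```

In this sketch the second sentence receives the `born_in` label although it does not express that relation, which is the kind of wrong labelling noise the surveyed noise reduction methods aim to handle.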
Pages: 1636-1660
Page count: 24