Paper and Patent Data Fusion Based on Deep Text Clustering

Cited by: 0

Authors:

Xie S. [1,2]

Wang X. [1]

Affiliations:

[1] Institutes of Science and Development, Chinese Academy of Sciences, Beijing

[2] School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing

Source:

Data Analysis and Knowledge Discovery | 2024 / Vol. 8 / No. 04

Keywords:

Data Fusion; Deep Text Clustering; Papers; Patents; Research Topic Identification

DOI:

10.11925/infotech.2096-3467.2023.0232

Abstract:

[Objective] This study fuses papers and patents by research topic to bridge the language gap between the two document types. [Method] Using Wikipedia as the primary classification system, we semi-automatically constructed a small set of annotations. We then designed a semi-supervised deep text clustering model to fuse papers and patents with similar topics. Finally, we created indicators to evaluate data fusion quality. [Results] Our model’s clustering accuracy was 2.4% to 11.9% higher than that of the baseline models. Its data fusion quality score reached 0.9, and it can supplement research topics on the basis of the known topics. [Limitations] We did not conduct an empirical analysis using the fused data, and the number of clusters must be set manually. [Conclusion] The proposed model can extract topic-related features from the differing texts of papers and patents and thereby achieve effective data fusion. © 2024 Chinese Academy of Sciences. All rights reserved.

Pages: 112-124

Page count: 12
