Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents

Cited by: 0

Authors:

Bridgelall, Raj [1]

Affiliations:

[1] North Dakota State Univ, Coll Business, Dept Transportat & Supply Chain, POB 6050, Fargo, ND 58108 USA

Source:

APPLIED SCIENCES-BASEL | 2025 / Vol. 15 / Issue 05

Keywords:

document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization; manifold learning; t-distributed stochastic neighbor embedding; term co-occurrence networks; RETRIEVAL;

DOI:

10.3390/app15052357

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability in multidisciplinary terminology, leading to inefficiencies. This study addresses the problem by systematically evaluating supervised and unsupervised machine learning (ML) techniques for document relevance filtering across five technology domains: solid-state batteries, electric vehicle chargers, connected vehicles, electric vertical takeoff and landing aircraft, and light detection and ranging (LiDAR) sensors. The contributions include benchmarking the performance of 10 classical models, including extreme gradient boosting, random forest, and support vector machines; a deep artificial neural network; and three natural language processing methods: latent Dirichlet allocation, non-negative matrix factorization, and k-means clustering of a manifold-learned reduced feature dimension. Applying these methods to more than 4200 patents filtered from a database of 9.6 million patents revealed that most supervised ML models outperformed the unsupervised methods. An average of seven supervised ML models achieved significantly higher precision, recall, and F1-scores across all technology domains, while unsupervised methods showed variability depending on domain characteristics. These results offer a practical framework for optimizing document relevance filtering, enabling researchers and practitioners to efficiently manage large datasets and enhance innovation.


Pages: 25

