Discovering Research Areas in Dataset Applications through Knowledge Graphs and Large Language Models

Cited by: 0
Authors
Gerasimov, Irina [1]
Mehrabian, Armin [1]
Binita, K. C. [1]
Alfred, Jerome [1]
McGuire, Michael P. [2]
Affiliations
[1] NASA, Goddard Space Flight Ctr, ADNET Syst Inc, Greenbelt, MD USA
[2] Towson Univ, Dept Comp & Informat Sci, Towson, MD USA
Keywords
LLM; Knowledge Graph; data citation
DOI
10.1109/e-Science62913.2024.10678676
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Scientific datasets are increasingly cited in peer-reviewed journal publications, which makes it easier to find research that uses those datasets. Datasets undergo a life cycle in which older versions are replaced by newer ones, often because of improvements in data resolution, algorithms, and other factors. Unlike a peer-reviewed document, which is registered with a single Digital Object Identifier (DOI), a dataset can be updated over time, and each newer version is registered with a new DOI that is not necessarily linked to the previous version. This makes it challenging to trace the publications citing a dataset over that dataset's entire life cycle. We provide an innovative approach that links dataset versions and publications using a knowledge graph (KG). The KG can help to trace the datasets cited in publications over the entire dataset life cycle and shed light on dataset usage in various applied research areas. We fine-tuned the pretrained NASA IMPACT INDUS Large Language Model (LLM) on a set of labeled publication abstracts. Our results showed that 87% of the publications were classified into one of twenty applied research areas, while the remaining 13% were classified into non-applied research areas. By linking datasets to applied research areas through the KG and employing the Global Change Master Directory (GCMD), a well-established controlled vocabulary of scientific keywords describing Earth science datasets, we contribute to a transparent and advanced search and discovery mechanism for datasets across the Earth data ecosystem. The integrated KG and LLM approach is now incorporated and operational in dataset publication management at one of NASA's Earth science data archival centers.
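The linking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the DOIs, publication identifiers, research-area labels, and the `supersedes`/`cites` edge names below are all hypothetical, and the graph is held in plain Python structures rather than a graph database. The sketch only shows how version links let citations of any one version be gathered for the whole dataset life cycle.

```python
from collections import Counter

# Hypothetical "supersedes" edges: newer dataset DOI -> the DOI it replaces.
SUPERSEDES = {
    "doi:dataset/v7": "doi:dataset/v6",
    "doi:dataset/v6": "doi:dataset/v5",
}

# Hypothetical "cites" edges: (publication, dataset DOI cited).
CITES = [
    ("pub:A", "doi:dataset/v5"),
    ("pub:B", "doi:dataset/v6"),
    ("pub:C", "doi:dataset/v7"),
    ("pub:D", "doi:other/v1"),
]

# Hypothetical research-area label per publication (e.g. from a classifier).
AREA = {
    "pub:A": "Air Quality",
    "pub:B": "Agriculture",
    "pub:C": "Air Quality",
    "pub:D": "Water Resources",
}


def lineage(doi, supersedes):
    """Return all versions in the same life cycle as `doi`, oldest first."""
    # Walk backwards to the oldest known version.
    oldest, seen = doi, set()
    while oldest in supersedes and oldest not in seen:
        seen.add(oldest)
        oldest = supersedes[oldest]
    # Walk forwards again through the inverted edges.
    replaced_by = {old: new for new, old in supersedes.items()}
    chain = [oldest]
    while chain[-1] in replaced_by and replaced_by[chain[-1]] not in chain:
        chain.append(replaced_by[chain[-1]])
    return chain


def publications_over_lifecycle(doi, supersedes, cites):
    """Publications citing any version in the life cycle of `doi`."""
    versions = set(lineage(doi, supersedes))
    return sorted({pub for pub, cited in cites if cited in versions})


def research_areas(doi, supersedes, cites, area):
    """Count research areas of publications across the whole life cycle."""
    pubs = publications_over_lifecycle(doi, supersedes, cites)
    return Counter(area[p] for p in pubs)


if __name__ == "__main__":
    print(lineage("doi:dataset/v6", SUPERSEDES))
    print(publications_over_lifecycle("doi:dataset/v6", SUPERSEDES, CITES))
    print(research_areas("doi:dataset/v6", SUPERSEDES, CITES, AREA))
```

Querying by any version of the dataset returns the same set of publications, which is the property that per-version DOIs alone do not provide.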
Pages: 10
Published in: Proceedings - 2024 IEEE 20th International Conference on e-Science, e-Science 2024, 2024