ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 0
Authors:
Feuer, Benjamin [1]
Liu, Yurong [1]
Hegde, Chinmay [1]
Freire, Juliana [1]
Institution:
[1] NYU, New York, NY 10016 USA
Source:
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024, Vol. 17, No. 9
DOI:
10.14778/3665844.3665857
CLC Number:
TP [Automation technology; computer technology]
Subject Classification Code:
0812
Abstract:
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
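The abstract names four pipeline stages: context sampling, prompt serialization, model querying, and label remapping. The following is a minimal sketch of such a zero-shot CTA pipeline under stated assumptions; the function names, the label set, and the stubbed `query_model` are illustrative inventions, not the authors' implementation (a real system would call an actual LLM in the querying stage).

```python
# Hypothetical sketch of a four-stage zero-shot column type annotation (CTA)
# pipeline in the spirit of the method described in the abstract.
import difflib
import random

# Assumed closed label set; real benchmarks (e.g. SOTAB) use many more types.
LABEL_SET = ["country", "city", "person name", "date", "price"]

def sample_context(values, k=5, seed=0):
    """Context sampling: draw up to k distinct cell values from the column."""
    distinct = sorted(set(values))
    rng = random.Random(seed)
    return rng.sample(distinct, min(k, len(distinct)))

def serialize_prompt(col_name, samples, labels):
    """Prompt serialization: render the sampled context as one LLM query."""
    return (
        f"Column '{col_name}' contains values: {', '.join(samples)}.\n"
        f"Choose the best semantic type from: {', '.join(labels)}.\n"
        "Answer with the type only."
    )

def query_model(prompt):
    """Model querying: stand-in for an LLM call returning free-form text."""
    # Stub only: emulates a noisy answer that the remapping stage must fix.
    return "Country names" if "France" in prompt else "unknown"

def remap_label(raw_answer, labels):
    """Label remapping: snap the model's free-form answer onto the label set."""
    match = difflib.get_close_matches(raw_answer.lower(), labels, n=1, cutoff=0.4)
    return match[0] if match else labels[0]

def annotate_column(col_name, values, labels=LABEL_SET):
    """Run all four stages and return a label from the closed set."""
    samples = sample_context(values)
    prompt = serialize_prompt(col_name, samples, labels)
    return remap_label(query_model(prompt), labels)
```

Note how the remapping stage guarantees a valid label even when the model answers outside the allowed set ("Country names" → "country"), which is one reason the abstract reports remapping among the most consistent sources of gains.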
Pages: 2279-2292
Page count: 14
Related Papers (50 total)
  • [1] Servicing open-source large language models for oncology
    Ray, Partha Pratim
    ONCOLOGIST, 2024,
  • [2] A tutorial on open-source large language models for behavioral science
    Hussain, Zak
    Binz, Marcel
    Mata, Rui
    Wulff, Dirk U.
    BEHAVIOR RESEARCH METHODS, 2024, : 8214 - 8237
  • [3] ZAP: An Open-Source Multilingual Annotation Projection Framework
    Akbik, Alan
    Vollgraf, Roland
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 2180 - 2184
  • [4] Preliminary Systematic Review of Open-Source Large Language Models in Education
    Lin, Michael Pin-Chuan
    Chang, Daniel
    Hall, Sarah
    Jhajj, Gaganpreet
    GENERATIVE INTELLIGENCE AND INTELLIGENT TUTORING SYSTEMS, PT I, ITS 2024, 2024, 14798 : 68 - 77
  • [5] PharmaLLM: A Medicine Prescriber Chatbot Exploiting Open-Source Large Language Models
    Azam, Ayesha
    Naz, Zubaira
    Khan, Muhammad Usman Ghani
    HUMAN-CENTRIC INTELLIGENT SYSTEMS, 2024, 4 (4): : 527 - 544
  • [6] Automated Essay Scoring and Revising Based on Open-Source Large Language Models
    Song, Yishen
    Zhu, Qianta
    Wang, Huaibo
    Zheng, Qinhua
    IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES, 2024, 17 : 1920 - 1930
  • [7] Open-source large language models in action: A bioinformatics chatbot for PRIDE database
    Bai, Jingwen
    Kamatchinathan, Selvakumar
    Kundu, Deepti J.
    Bandla, Chakradhar
    Vizcaino, Juan Antonio
    Perez-Riverol, Yasset
    PROTEOMICS, 2024,
  • [8] Open-source large language models in medical education: Balancing promise and challenges
    Ray, Partha Pratim
    ANATOMICAL SCIENCES EDUCATION, 2024, 17 (06) : 1361 - 1362
  • [9] Accessible Russian Large Language Models: Open-Source Models and Instructive Datasets for Commercial Applications
    Kosenko, D. P.
    Kuratov, Yu. M.
    Zharikova, D. R.
    DOKLADY MATHEMATICS, 2023, 108 (SUPPL 2) : S393 - S398