ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited: 0
Authors
Feuer, Benjamin [1 ]
Liu, Yurong [1 ]
Hegde, Chinmay [1 ]
Freire, Juliana [1 ]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024, Vol. 17, No. 9
Keywords
DOI
10.14778/3665844.3665857
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and can degrade in performance when evaluated on novel datasets, even when the types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping that enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks, which we release along with this paper), and, when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
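The abstract describes ArcheType's pipeline only at a high level. The Python sketch below is a minimal, hypothetical illustration of how its four stages (context sampling, prompt serialization, model querying, and label remapping) could fit together; the function names, prompt template, and difflib-based remapping heuristic are assumptions made for this record, not the authors' released implementation, and query_model stands in for any LLM completion call.

# Minimal sketch of the four-stage zero-shot CTA pipeline named in the abstract
# (context sampling, prompt serialization, model querying, label remapping).
# All names, the prompt template, and the difflib-based remapping heuristic are
# illustrative assumptions, not ArcheType's released implementation.
import difflib
import random
from typing import Callable, List


def sample_context(column_values: List[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling: choose a small, deduplicated subset of cell values."""
    rng = random.Random(seed)
    unique_values = list(dict.fromkeys(v for v in column_values if v.strip()))
    return rng.sample(unique_values, min(k, len(unique_values)))


def serialize_prompt(samples: List[str], label_set: List[str]) -> str:
    """Prompt serialization: render the sampled values and allowed types as text."""
    return (
        "Classify the semantic type of a table column.\n"
        f"Allowed types: {', '.join(label_set)}\n"
        f"Column values: {'; '.join(samples)}\n"
        "Answer with one type from the list."
    )


def remap_label(raw_answer: str, label_set: List[str]) -> str:
    """Label remapping: snap a free-text model answer onto the closest allowed type."""
    lowered = [label.lower() for label in label_set]
    match = difflib.get_close_matches(raw_answer.strip().lower(), lowered, n=1, cutoff=0.0)
    return label_set[lowered.index(match[0])]


def annotate_column(column_values: List[str],
                    label_set: List[str],
                    query_model: Callable[[str], str]) -> str:
    """Run the full zero-shot pipeline for a single column."""
    prompt = serialize_prompt(sample_context(column_values), label_set)
    raw_answer = query_model(prompt)  # model querying: any LLM completion call
    return remap_label(raw_answer, label_set)


if __name__ == "__main__":
    labels = ["country", "currency", "date", "person name"]
    column = ["France", "Brazil", "Japan", "Canada", "Kenya", "Norway"]
    # Stand-in for a real LLM endpoint; swap in an actual completion call.
    print(annotate_column(column, labels, lambda prompt: "Country."))  # -> country

The abstract reports that context sampling and label remapping yield the most consistent gains, which is why the sketch keeps each stage behind its own function so either can be swapped out or ablated independently.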
Pages: 2279-2292
Number of pages: 14