ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 0
Authors
Feuer, Benjamin [1]
Liu, Yurong [1]
Hegde, Chinmay [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
Proceedings of the VLDB Endowment, 2024, Vol. 17, No. 9
DOI
10.14778/3665844.3665857
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
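The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt wording, the random sampling strategy, the fuzzy-matching remapper, and the fallback label are all assumptions made for illustration, and the model is passed in as a plain callable so that no particular LLM API is implied.

```python
import random
from difflib import get_close_matches
from typing import Callable, Sequence


def sample_context(column: Sequence[str], k: int = 5, seed: int = 0) -> list[str]:
    """Context sampling: keep up to k distinct non-empty values.

    Uniform random sampling is an assumption, not the paper's strategy.
    """
    unique = sorted({v.strip() for v in column if v and v.strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples: Sequence[str], labels: Sequence[str]) -> str:
    """Prompt serialization: render samples and label set as text (hypothetical wording)."""
    return (
        f"Column values: {', '.join(samples)}\n"
        f"Options: {', '.join(labels)}\n"
        "Answer with one option:"
    )


def remap_label(raw: str, labels: Sequence[str], fallback: str) -> str:
    """Label remapping: map free-form model output onto the allowed label set."""
    cleaned = raw.strip().lower()
    by_lower = {label.lower(): label for label in labels}
    if cleaned in by_lower:                      # exact match
        return by_lower[cleaned]
    for low, label in by_lower.items():          # label contained in the answer
        if low in cleaned:
            return label
    close = get_close_matches(cleaned, list(by_lower), n=1, cutoff=0.6)
    return by_lower[close[0]] if close else fallback


def annotate_column(
    column: Sequence[str],
    labels: Sequence[str],
    query_model: Callable[[str], str],
    k: int = 5,
) -> str:
    """Run the four stages in order; query_model is any prompt -> text callable."""
    samples = sample_context(column, k)
    prompt = serialize_prompt(samples, labels)
    raw = query_model(prompt)                    # model querying
    # Falling back to the first label is an arbitrary choice for this sketch.
    return remap_label(raw, labels, fallback=labels[0])
```

Because the model is just a callable, the sketch can be exercised with a stub such as `lambda prompt: "The answer is: Country."`, which the remapping step resolves to the allowed label `country`.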
Pages: 2279-2292
Page count: 14