ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited: 0
Authors
Feuer, Benjamin [1 ]
Liu, Yurong [1 ]
Hegde, Chinmay [1 ]
Freire, Juliana [1 ]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024, Vol. 17, No. 9
Keywords
DOI
10.14778/3665844.3665857
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline classification code
0812
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and can degrade in performance when evaluated on novel datasets, even when the types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping that enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks, which we release along with this paper), and, when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
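The abstract describes ArcheType's pipeline only at a high level. The Python sketch below is a minimal, hypothetical illustration of how its four stages (context sampling, prompt serialization, model querying, and label remapping) could fit together; the function names, prompt template, and difflib-based remapping heuristic are assumptions made for this record, not the authors' released implementation, and query_model stands in for any LLM completion call.

# Minimal sketch of the four-stage zero-shot CTA pipeline named in the abstract
# (context sampling, prompt serialization, model querying, label remapping).
# All names, the prompt template, and the difflib-based remapping heuristic are
# illustrative assumptions, not ArcheType's released implementation.
import difflib
import random
from typing import Callable, List


def sample_context(column_values: List[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling: choose a small, deduplicated subset of cell values."""
    rng = random.Random(seed)
    unique_values = list(dict.fromkeys(v for v in column_values if v.strip()))
    return rng.sample(unique_values, min(k, len(unique_values)))


def serialize_prompt(samples: List[str], label_set: List[str]) -> str:
    """Prompt serialization: render the sampled values and allowed types as text."""
    return (
        "Classify the semantic type of a table column.\n"
        f"Allowed types: {', '.join(label_set)}\n"
        f"Column values: {'; '.join(samples)}\n"
        "Answer with one type from the list."
    )


def remap_label(raw_answer: str, label_set: List[str]) -> str:
    """Label remapping: snap a free-text model answer onto the closest allowed type."""
    lowered = [label.lower() for label in label_set]
    match = difflib.get_close_matches(raw_answer.strip().lower(), lowered, n=1, cutoff=0.0)
    return label_set[lowered.index(match[0])]


def annotate_column(column_values: List[str],
                    label_set: List[str],
                    query_model: Callable[[str], str]) -> str:
    """Run the full zero-shot pipeline for a single column."""
    prompt = serialize_prompt(sample_context(column_values), label_set)
    raw_answer = query_model(prompt)  # model querying: any LLM completion call
    return remap_label(raw_answer, label_set)


if __name__ == "__main__":
    labels = ["country", "currency", "date", "person name"]
    column = ["France", "Brazil", "Japan", "Canada", "Kenya", "Norway"]
    # Stand-in for a real LLM endpoint; swap in an actual completion call.
    print(annotate_column(column, labels, lambda prompt: "Country."))  # -> country

The abstract reports that context sampling and label remapping yield the most consistent gains, which is why the sketch keeps each stage behind its own function so either can be swapped out or ablated independently.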
Pages: 2279-2292
Number of pages: 14