Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2, 3]
Affiliations
[1] European Commiss, Joint Res Ctr JRC, Ispra, Italy
[2] Imperial Coll London, Dept Math, London, England
[3] Univ Bologna, Dept Stat, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; Model selection; Regularization
DOI
10.1007/s10182-023-00472-0
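Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract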
Communication through websites is often characterised by short texts, made of few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this purpose, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts, as, in one case, a concise descriptive text is considered, whereas, in the other case, the text expresses an opinion.
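To make the notions of lasso selection and inclusion frequency mentioned in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary bag-of-words design matrix, a plain coordinate-descent lasso with an unpenalised intercept, and bootstrap resampling (the original stability selection procedure uses half-sized subsamples instead). All function names, the penalty value and the toy item descriptions and prices are hypothetical and invented for illustration only.

```python
import random

def bag_of_words(texts):
    """Binary document-term matrix from whitespace-tokenised texts."""
    tokens = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*tokens))
    X = [[1.0 if w in doc else 0.0 for w in vocab] for doc in tokens]
    return vocab, X

def soft_threshold(z, gamma):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # always equals y - b0 - X beta
    cols = [[row[j] for row in X] for j in range(p)]
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue  # word absent from this sample: coefficient stays 0
            # covariance of word j with the residual that excludes word j
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
        shift = sum(resid) / n  # unpenalised intercept update
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta

def inclusion_frequency(X, y, lam, n_boot=50, seed=0):
    """Share of bootstrap replications in which each word is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        _, beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]

# Invented toy data: short item descriptions and prices (illustration only).
texts = ["leather bag new", "leather wallet", "canvas bag used",
         "wooden toy new", "leather jacket new", "paper cup",
         "leather belt used", "metal box used"]
prices = [50.0, 48.0, 10.0, 12.0, 52.0, 9.0, 47.0, 8.0]

vocab, X = bag_of_words(texts)
freq = inclusion_frequency(X, prices, lam=1.0)
for word, f in sorted(zip(vocab, freq), key=lambda t: -t[1]):
    print(f"{word:10s} {f:.2f}")
```

Words with a high inclusion frequency are those the lasso keeps across most resamples. In a real analysis the penalty would normally be tuned, for example by cross-validation, rather than fixed as it is here.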
Pages: 69-99
Number of pages: 31