Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2, 3]
Affiliations
[1] European Commiss, Joint Res Ctr JRC, Ispra, Italy
[2] Imperial Coll London, Dept Math, London, England
[3] Univ Bologna, Dept Stat, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; MODEL SELECTION; REGULARIZATION;
DOI
10.1007/s10182-023-00472-0
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Communication through websites is often characterised by short texts, made of few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this purpose, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts, as, in one case, a concise descriptive text is considered, whereas, in the other case, the text expresses an opinion.
Pages: 69-99
Number of pages: 31