Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2, 3]
Affiliations
[1] European Commission, Joint Research Centre (JRC), Ispra, Italy
[2] Imperial College London, Department of Mathematics, London, England
[3] University of Bologna, Department of Statistics, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; Model selection; Regularization
DOI
10.1007/s10182-023-00472-0
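Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714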
Abstract
Communication through websites is often characterised by short texts of only a few words, such as image captions or tweets. This paper explores supervised learning methods for the analysis of short texts, as an alternative to the unsupervised methods widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in the social sciences when they are used as explanatory variables in regression models. To this end, we compare different variable selection procedures when text regression models are fitted to real, short text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of the number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are also considered. The first case study aims to explain price variations based on the information contained in the descriptions of items on sale on e-commerce platforms. The second concerns open questions in surveys on satisfaction ratings. The case studies differ in nature and represent different kinds of short texts: in one case a concise descriptive text is considered, whereas in the other the text expresses an opinion.
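As a rough illustration of the kind of procedure the abstract describes, the sketch below fits a lasso text regression on a bag-of-words matrix and then estimates per-word inclusion frequencies over random half-samples, loosely in the spirit of stability selection. It is not the authors' implementation: the item descriptions, prices, penalty value `lam`, and all function names are invented for this example, and a plain coordinate-descent solver stands in for whatever software the paper used.

```python
import random

# Toy data, invented for illustration only: short item descriptions and prices.
texts = [
    "leather bag brown", "leather wallet black", "leather belt brown",
    "leather bag black", "leather wallet brown", "leather belt black",
    "bag brown", "wallet black", "belt brown",
    "bag black", "wallet brown", "belt black",
]
prices = [52.0, 49.0, 50.0, 51.0, 48.0, 50.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0]


def bag_of_words(docs):
    """Return the sorted vocabulary and the document-term count matrix."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = [[0.0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i][index[w]] += 1.0
    return vocab, X


def soft_threshold(z, g):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - b0 - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [y[i] - b0 for i in range(n)]
    for _ in range(n_iter):
        # Unpenalised intercept: absorb the mean of the current residuals.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue  # word absent from this sample
            rho = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [resid[i] - col[i] * delta for i in range(n)]
                beta[j] = new
    return b0, beta


def inclusion_frequency(X, y, lam, n_rep=50, seed=0):
    """Share of random half-samples in which each word gets a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rep):
        idx = rng.sample(range(n), n // 2)
        _, beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_rep for c in counts]


vocab, X = bag_of_words(texts)
b0, beta = lasso_cd(X, prices, lam=1.0)
freq = dict(zip(vocab, inclusion_frequency(X, prices, lam=1.0)))
for word in vocab:
    print(word, round(beta[vocab.index(word)], 2), freq[word])
```

In this toy setting the word "leather" drives the price difference, so it is the term one would expect to survive the penalty and to be selected in most half-samples; on real short texts the vocabulary is far larger and sparser, which is where the choice among selection procedures matters.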
Pages: 69-99
Number of pages: 31