Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance

Cited by: 0
Authors
Antonio Irpino
Rosanna Verde
Affiliation
[1] Second University of Naples, Department of Political Sciences "J. Monnet"
Keywords
Modal symbolic variables; Probability distribution function; Histogram data; Regression; Wasserstein distance; 62J05; 62G30; 46F10;
DOI
Not available
Abstract
In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. In order to measure the error between the observed and the predicted distributions, the $\ell_2$ Wasserstein distance is proposed. Some properties of such a metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Based on the metric, the model uses the quantile functions associated with the data and thus is subject to a positivity constraint of the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. Therefore, we estimate the model parameters according to two separate models, one for the averages of the data and one for the centered distributions by a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated by two applications, one on simulated data and one on two real-world datasets.
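The decomposition of the squared distance mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' implementation: it assumes each distribution is represented by its quantile function evaluated on a uniform grid of probability levels, and the function names (`quantile_grid`, `w2_squared`, `w2_decomposition`) and the grid size `m` are hypothetical choices made for illustration. It checks that the squared $\ell_2$ Wasserstein distance splits into a squared difference of means plus a squared distance between the centered quantile functions, which is what allows one model for the averages and another for the centered distributions.

```python
import numpy as np


def quantile_grid(sample, m=200):
    """Evaluate the empirical quantile function of `sample` at m midpoint levels in (0, 1).

    Assumption: a uniform grid of probability levels is an adequate
    discretization of the quantile function for this illustration.
    """
    t = (np.arange(m) + 0.5) / m
    return np.quantile(sample, t)


def w2_squared(qx, qy):
    """Approximate the squared l2 Wasserstein distance between two 1-D
    distributions as the mean squared difference of their quantile
    functions on a common grid."""
    return float(np.mean((qx - qy) ** 2))


def w2_decomposition(qx, qy):
    """Split the squared distance into a part due to the means and a
    part due to the centered quantile functions."""
    mean_part = float((qx.mean() - qy.mean()) ** 2)
    centered_part = float(np.mean(((qx - qx.mean()) - (qy - qy.mean())) ** 2))
    return mean_part, centered_part


# Two simulated distributions that differ in location and spread.
rng = np.random.default_rng(0)
qx = quantile_grid(rng.normal(0.0, 1.0, size=1000))
qy = quantile_grid(rng.normal(2.0, 3.0, size=1000))

total = w2_squared(qx, qy)
mean_part, centered_part = w2_decomposition(qx, qy)

# The two components add up to the total squared distance.
print(total, mean_part + centered_part)
```

On a common grid the identity holds exactly up to floating-point error, because the cross term between the mean difference and the centered difference averages to zero. The positivity constraint described in the abstract applies only to the coefficients of the centered part; it would require a non-negative least squares solver, which is not shown here.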
Pages: 81-106
Page count: 25