Bayesian multivariate linear regression


In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

Details

Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an $m$-length vector of correlated real numbers. As in the standard regression setup, there are $n$ observations, where each observation $i$ consists of $k-1$ explanatory variables, grouped into a vector $\mathbf{x}_i$ of length $k$ (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of $m$ related regression problems for each observation $i$:

\[
\begin{aligned}
y_{i,1} &= \mathbf{x}_i^{\mathsf T}\boldsymbol\beta_1 + \epsilon_{i,1} \\
&\;\;\vdots \\
y_{i,m} &= \mathbf{x}_i^{\mathsf T}\boldsymbol\beta_m + \epsilon_{i,m}
\end{aligned}
\]

where the set of errors $\{\epsilon_{i,1},\ldots,\epsilon_{i,m}\}$ are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector $\mathbf{y}_i^{\mathsf T}$ and the regression coefficient vectors are stacked next to each other, as follows:

\[
\mathbf{y}_i^{\mathsf T} = \mathbf{x}_i^{\mathsf T}\mathbf{B} + \boldsymbol\epsilon_i^{\mathsf T}.
\]

The coefficient matrix $\mathbf{B}$ is a $k \times m$ matrix where the coefficient vectors $\boldsymbol\beta_1,\ldots,\boldsymbol\beta_m$ for each regression problem are stacked horizontally:

\[
\mathbf{B} =
\begin{bmatrix}
\boldsymbol\beta_1 & \cdots & \boldsymbol\beta_m
\end{bmatrix}
=
\begin{bmatrix}
\beta_{1,1} & \cdots & \beta_{1,m} \\
\vdots & \ddots & \vdots \\
\beta_{k,1} & \cdots & \beta_{k,m}
\end{bmatrix}.
\]

The noise vector $\boldsymbol\epsilon_i$ for each observation $i$ is jointly normal, so that the outcomes for a given observation are correlated:

\[
\boldsymbol\epsilon_i \sim N(0, \boldsymbol\Sigma_\epsilon).
\]

We can write the entire regression problem in matrix form as

\[
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},
\]

where $\mathbf{Y}$ and $\mathbf{E}$ are $n \times m$ matrices. The design matrix $\mathbf{X}$ is an $n \times k$ matrix with the observations stacked vertically, as in the standard linear regression setup:

\[
\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1^{\mathsf T} \\
\mathbf{x}_2^{\mathsf T} \\
\vdots \\
\mathbf{x}_n^{\mathsf T}
\end{bmatrix}
=
\begin{bmatrix}
x_{1,1} & \cdots & x_{1,k} \\
x_{2,1} & \cdots & x_{2,k} \\
\vdots & \ddots & \vdots \\
x_{n,1} & \cdots & x_{n,k}
\end{bmatrix}.
\]
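As a running illustration (not part of the source), the following Python sketch simulates data from this model; the dimensions $n$, $k$, $m$, the coefficients, and the noise covariance are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 3, 2   # observations, predictors (incl. intercept), outcomes

# Design matrix with an intercept column of ones, and true coefficients.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # n x k
B_true = rng.normal(size=(k, m))                                # k x m

# The m outcomes of each observation share a correlated noise vector.
Sigma_eps = np.array([[1.0, 0.6],
                      [0.6, 1.5]])                              # m x m
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)     # n x m

Y = X @ B_true + E                                              # n x m
```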

The classical, frequentist linear least-squares solution is simply to estimate the matrix of regression coefficients $\hat{\mathbf{B}}$ using the Moore-Penrose pseudoinverse:

\[
\hat{\mathbf{B}} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{Y}.
\]
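Continuing the sketch, $\hat{\mathbf{B}}$ fits all $m$ regressions at once; np.linalg.lstsq is a numerically safer alternative to forming $(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}$ explicitly:

```python
# Least-squares fit of all m regressions in one call.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The textbook normal-equations form agrees for full-rank X.
assert np.allclose(B_hat, np.linalg.solve(X.T @ X, X.T @ Y))
```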

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write our conditional likelihood as[1]

\[
\rho(\mathbf{E}\mid\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{E}^{\mathsf T}\mathbf{E}\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).
\]

Writing the error $\mathbf{E}$ in terms of $\mathbf{Y}$, $\mathbf{X}$, and $\mathbf{B}$ yields

\[
\rho(\mathbf{Y}\mid\mathbf{X},\mathbf{B},\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).
\]
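For concreteness, this likelihood can be evaluated (up to an additive constant) as a log-density; the helper below is an illustrative sketch continuing the running example, not a function from the cited reference:

```python
# Log conditional likelihood, up to an additive constant:
# -(n/2) log|Sigma| - (1/2) tr((Y - XB)^T (Y - XB) Sigma^{-1}).
def log_likelihood(B, Sigma, X, Y):
    resid = Y - X @ B
    _, logdet = np.linalg.slogdet(Sigma)
    # solve(Sigma, R^T R) = Sigma^{-1} R^T R; its trace equals
    # tr(R^T R Sigma^{-1}) by the cyclic property of the trace.
    quad = np.trace(np.linalg.solve(Sigma, resid.T @ resid))
    return -0.5 * X.shape[0] * logdet - 0.5 * quad

print(log_likelihood(B_hat, Sigma_eps, X, Y))
```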

We seek a natural conjugate prior, that is, a joint density $\rho(\mathbf{B},\boldsymbol\Sigma_\epsilon)$ of the same functional form as the likelihood. Since the likelihood is quadratic in $\mathbf{B}$, we re-write the likelihood so it is normal in $(\mathbf{B}-\hat{\mathbf{B}})$, the deviation from the classical sample estimate.

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need to use matrix differential calculus (specifically, the Kronecker product and vectorization transformations).

First, let us apply the sum-of-squares decomposition to obtain a new expression for the likelihood:

\[
\rho(\mathbf{Y}\mid\mathbf{X},\mathbf{B},\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-(n-k)/2} \exp\left(-\operatorname{tr}\left(\tfrac{1}{2}\mathbf{S}^{\mathsf T}\mathbf{S}\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right)\,|\boldsymbol\Sigma_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right),
\]

where $\mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}$.
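The decomposition is easy to confirm numerically on the simulated data above; B_any below is an arbitrary coefficient matrix, and the cross terms vanish because $\mathbf{X}^{\mathsf T}\mathbf{S} = 0$:

```python
# Check: (Y - XB)^T (Y - XB) = S^T S + (B - B_hat)^T X^T X (B - B_hat).
B_any = rng.normal(size=(k, m))        # arbitrary coefficient matrix
S = Y - X @ B_hat                      # residual at the LS estimate
lhs = (Y - X @ B_any).T @ (Y - X @ B_any)
rhs = S.T @ S + (B_any - B_hat).T @ X.T @ X @ (B_any - B_hat)
assert np.allclose(lhs, rhs)
```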

We would like to develop a conditional form for the priors:

\[
\rho(\mathbf{B},\boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\,\rho(\mathbf{B}\mid\boldsymbol\Sigma_\epsilon),
\]

where $\rho(\boldsymbol\Sigma_\epsilon)$ is an inverse-Wishart distribution and $\rho(\mathbf{B}\mid\boldsymbol\Sigma_\epsilon)$ is some form of normal distribution in the matrix $\mathbf{B}$. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices $\mathbf{B}, \hat{\mathbf{B}}$ to a function of the vectors $\boldsymbol\beta = \operatorname{vec}(\mathbf{B}),\ \hat{\boldsymbol\beta} = \operatorname{vec}(\hat{\mathbf{B}})$.

Write

\[
\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right) = \operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\operatorname{vec}\left(\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right),
\]

using the identity $\operatorname{tr}(\mathbf{A}^{\mathsf T}\mathbf{C}) = \operatorname{vec}(\mathbf{A})^{\mathsf T}\operatorname{vec}(\mathbf{C})$.

Next,

\[
\operatorname{vec}\left(\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right) = (\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^{\mathsf T}\mathbf{X})\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}),
\]

which follows from $\operatorname{vec}(\mathbf{A}\mathbf{M}\mathbf{C}) = (\mathbf{C}^{\mathsf T} \otimes \mathbf{A})\operatorname{vec}(\mathbf{M})$ and the symmetry of $\boldsymbol\Sigma_\epsilon^{-1}$. Here $\mathbf{A} \otimes \mathbf{B}$ denotes the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$, a generalization of the outer product which multiplies an $m \times n$ matrix by a $p \times q$ matrix to generate an $mp \times nq$ matrix, consisting of every combination of products of elements from the two matrices.

Then

\[
\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}(\boldsymbol\Sigma_\epsilon^{-1}\otimes\mathbf{X}^{\mathsf T}\mathbf{X})\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}) = (\boldsymbol\beta-\hat{\boldsymbol\beta})^{\mathsf T}(\boldsymbol\Sigma_\epsilon^{-1}\otimes\mathbf{X}^{\mathsf T}\mathbf{X})(\boldsymbol\beta-\hat{\boldsymbol\beta}),
\]

which will lead to a likelihood which is normal in $(\boldsymbol\beta - \hat{\boldsymbol\beta})$.
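The vec-Kronecker identity used in these steps can be checked numerically with the column-stacking convention for $\operatorname{vec}$; the shapes below are arbitrary:

```python
# vec(A M C) = (C^T kron A) vec(M), with vec stacking columns.
def vec(M):
    return M.reshape(-1, order="F")    # column-major (Fortran) flattening

A = rng.normal(size=(4, 3))
M = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 5))
assert np.allclose(vec(A @ M @ C), np.kron(C.T, A) @ vec(M))
```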

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.

Conjugate prior distribution

The natural conjugate prior using the vectorized variable $\boldsymbol\beta$ is of the form:[1]

\[
\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\,\rho(\boldsymbol\beta\mid\boldsymbol\Sigma_\epsilon),
\]

where

\[
\rho(\boldsymbol\Sigma_\epsilon) \sim \mathcal{W}^{-1}(\mathbf{V}_0,\boldsymbol\nu_0)
\]

and

\[
\rho(\boldsymbol\beta\mid\boldsymbol\Sigma_\epsilon) \sim N(\boldsymbol\beta_0,\,\boldsymbol\Sigma_\epsilon\otimes\boldsymbol\Lambda_0^{-1}).
\]
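Sampling from this prior is a two-step procedure: draw $\boldsymbol\Sigma_\epsilon$ from the inverse-Wishart, then draw $\operatorname{vec}(\mathbf{B})$ given $\boldsymbol\Sigma_\epsilon$. A sketch using SciPy's invwishart (whose scale and df arguments correspond to $\mathbf{V}_0$ and $\boldsymbol\nu_0$) follows, continuing the running example; the hyperparameter values are illustrative assumptions, not from the source:

```python
from scipy.stats import invwishart

nu0 = m + 2                  # prior degrees of freedom (must exceed m - 1)
V0 = np.eye(m)               # prior scale matrix
B0 = np.zeros((k, m))        # prior mean of B
Lambda0 = 0.1 * np.eye(k)    # prior precision over the rows of B

# Step 1: Sigma_eps ~ inverse-Wishart(V0, nu0).
Sigma_draw = invwishart.rvs(df=nu0, scale=V0, random_state=rng)

# Step 2: vec(B) | Sigma_eps ~ N(vec(B0), Sigma_eps kron Lambda0^{-1}),
# i.e. B is matrix normal MN(B0, Lambda0^{-1}, Sigma_eps).
beta = rng.multivariate_normal(
    B0.reshape(-1, order="F"),
    np.kron(Sigma_draw, np.linalg.inv(Lambda0)),
)
B_prior_draw = beta.reshape((k, m), order="F")
```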

Posterior distribution

Using the above prior and likelihood, the posterior distribution can be expressed as:[1]

\[
\begin{aligned}
\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \propto{} & |\boldsymbol\Sigma_\epsilon|^{-(\nu_0+m+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}(\mathbf{V}_0\boldsymbol\Sigma_\epsilon^{-1})\right) \\
& \times |\boldsymbol\Sigma_\epsilon|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}-\mathbf{B}_0)\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right) \\
& \times |\boldsymbol\Sigma_\epsilon|^{-n/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right),
\end{aligned}
\]

where $\operatorname{vec}(\mathbf{B}_0) = \boldsymbol\beta_0$. The terms involving $\mathbf{B}$ can be grouped (with $\boldsymbol\Lambda_0 = \mathbf{U}^{\mathsf T}\mathbf{U}$) using:

\[
\begin{aligned}
& (\mathbf{B}-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}-\mathbf{B}_0) + (\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}) \\
&= \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right)^{\mathsf T}\left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right) \\
&= \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right)^{\mathsf T}\left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right) + (\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n) \\
&= (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_0-\mathbf{B}_n)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_0-\mathbf{B}_n) + (\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n),
\end{aligned}
\]

with

\[
\mathbf{B}_n = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{X}\hat{\mathbf{B}}+\boldsymbol\Lambda_0\mathbf{B}_0) = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{Y}+\boldsymbol\Lambda_0\mathbf{B}_0).
\]
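The grouping can be verified numerically by taking $\mathbf{U}$ to be a Cholesky-type factor of $\boldsymbol\Lambda_0$ (any factorization with $\boldsymbol\Lambda_0 = \mathbf{U}^{\mathsf T}\mathbf{U}$ works), continuing the sketch above:

```python
# U such that Lambda0 = U^T U (upper-triangular Cholesky-type factor).
U = np.linalg.cholesky(Lambda0).T
B_n = np.linalg.solve(X.T @ X + Lambda0, X.T @ Y + Lambda0 @ B0)

lhs = (B_any - B0).T @ Lambda0 @ (B_any - B0) \
      + (Y - X @ B_any).T @ (Y - X @ B_any)

# The stacked-matrix form of the same quadratic.
stacked = np.vstack([Y, U @ B0]) - np.vstack([X, U]) @ B_any
assert np.allclose(lhs, stacked.T @ stacked)

# The completed-square form.
rhs = (Y - X @ B_n).T @ (Y - X @ B_n) \
      + (B0 - B_n).T @ Lambda0 @ (B0 - B_n) \
      + (B_any - B_n).T @ (X.T @ X + Lambda0) @ (B_any - B_n)
assert np.allclose(lhs, rhs)
```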

This grouping now allows us to write the posterior in a more useful form:

\[
\begin{aligned}
\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \propto{} & |\boldsymbol\Sigma_\epsilon|^{-(\nu_0+m+n+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\left(\mathbf{V}_0+(\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n)+(\mathbf{B}_n-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0)\right)\boldsymbol\Sigma_\epsilon^{-1}\right)\right) \\
& \times |\boldsymbol\Sigma_\epsilon|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n)\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).
\end{aligned}
\]

This takes the form of an inverse-Wishart distribution times a matrix normal distribution:

\[
\rho(\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \sim \mathcal{W}^{-1}(\mathbf{V}_n,\boldsymbol\nu_n)
\]

and

\[
\rho(\mathbf{B}\mid\mathbf{Y},\mathbf{X},\boldsymbol\Sigma_\epsilon) \sim \mathcal{MN}_{k,m}(\mathbf{B}_n,\boldsymbol\Lambda_n^{-1},\boldsymbol\Sigma_\epsilon).
\]

The parameters of this posterior are given by:

\[
\mathbf{V}_n = \mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0),
\]
\[
\boldsymbol\nu_n = \boldsymbol\nu_0 + n,
\]
\[
\mathbf{B}_n = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{Y}+\boldsymbol\Lambda_0\mathbf{B}_0),
\]
\[
\boldsymbol\Lambda_n = \mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0.
\]
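In code, the posterior updates and a joint draw of $(\boldsymbol\Sigma_\epsilon, \mathbf{B})$ mirror the prior sampling above, continuing the same illustrative sketch:

```python
# Posterior hyperparameters.
Lambda_n = X.T @ X + Lambda0
B_n = np.linalg.solve(Lambda_n, X.T @ Y + Lambda0 @ B0)
V_n = (V0
       + (Y - X @ B_n).T @ (Y - X @ B_n)
       + (B_n - B0).T @ Lambda0 @ (B_n - B0))
nu_n = nu0 + n

# One joint posterior draw: Sigma first, then B given Sigma.
Sigma_post = invwishart.rvs(df=nu_n, scale=V_n, random_state=rng)
beta_post = rng.multivariate_normal(
    B_n.reshape(-1, order="F"),
    np.kron(Sigma_post, np.linalg.inv(Lambda_n)),
)
B_post = beta_post.reshape((k, m), order="F")
```

Because the prior is conjugate, repeating this two-step draw yields independent samples from the exact joint posterior; no Markov chain Monte Carlo is needed.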

See also

  * Bayesian linear regression
  * Matrix normal distribution

References


  1. Peter E. Rossi, Greg M. Allenby, and Rob McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2012, p. 32.