Bayesian multivariate linear regression

In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article MMSE estimator.

Details

Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k − 1 explanatory variables, grouped into a vector $\mathbf{x}_i$ of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i:
$$y_{i,1} = \mathbf{x}_i^{\mathsf T}\boldsymbol\beta_{1} + \epsilon_{i,1}$$
$$\vdots$$
$$y_{i,m} = \mathbf{x}_i^{\mathsf T}\boldsymbol\beta_{m} + \epsilon_{i,m}$$
where the set of errors $\{\epsilon_{i,1},\ldots,\epsilon_{i,m}\}$ are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector $\mathbf{y}_i^{\mathsf T}$ and the regression coefficient vectors are stacked next to each other, as follows:
$$\mathbf{y}_i^{\mathsf T} = \mathbf{x}_i^{\mathsf T}\mathbf{B} + \boldsymbol\epsilon_i^{\mathsf T}.$$

The coefficient matrix $\mathbf{B}$ is a $k \times m$ matrix where the coefficient vectors $\boldsymbol\beta_1,\ldots,\boldsymbol\beta_m$ for each regression problem are stacked horizontally:
$$\mathbf{B} = \begin{bmatrix} \boldsymbol\beta_1 & \cdots & \boldsymbol\beta_m \end{bmatrix} = \begin{bmatrix} \beta_{1,1} & \cdots & \beta_{1,m} \\ \vdots & \ddots & \vdots \\ \beta_{k,1} & \cdots & \beta_{k,m} \end{bmatrix}.$$

The noise vector $\boldsymbol\epsilon_i$ for each observation i is jointly normal, so that the outcomes for a given observation are correlated:
$$\boldsymbol\epsilon_i \sim N(0, \boldsymbol\Sigma_\epsilon).$$

We can write the entire regression problem in matrix form as:
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},$$
where $\mathbf{Y}$ and $\mathbf{E}$ are $n \times m$ matrices. The design matrix $\mathbf{X}$ is an $n \times k$ matrix with the observations stacked vertically, as in the standard linear regression setup:
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^{\mathsf T} \\ \mathbf{x}_2^{\mathsf T} \\ \vdots \\ \mathbf{x}_n^{\mathsf T} \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\ x_{2,1} & \cdots & x_{2,k} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,k} \end{bmatrix}.$$
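For concreteness, the following sketch simulates data from this model in Python with NumPy. The dimensions n = 50, k = 3, m = 2 and all numerical values are illustrative assumptions, not part of the article's development.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 50, 3, 2   # observations, predictors (including intercept), outcomes

# Design matrix X: a column of ones (intercept) plus k - 1 explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

B_true = rng.normal(size=(k, m))          # true coefficient matrix (k x m)
Sigma_eps = np.array([[1.0, 0.6],         # noise covariance across the m outcomes
                      [0.6, 2.0]])

# Each row of E is drawn from N(0, Sigma_eps), so the m outcomes of one observation are correlated
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)
Y = X @ B_true + E                        # Y = X B + E, an n x m matrix
```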

The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients $\hat{\mathbf{B}}$ using the Moore–Penrose pseudoinverse:
$$\hat{\mathbf{B}} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{Y}.$$
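Continuing the sketch above, this estimate can be computed directly; solving the normal equations (or calling a least-squares routine) is numerically preferable to forming the inverse explicitly and gives the same result.

```python
# Classical estimate B_hat = (X^T X)^{-1} X^T Y, solved via the normal equations
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimate via a least-squares routine (numerically preferable in practice)
B_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(B_hat, B_hat_lstsq)
```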

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write our conditional likelihood as[1]
$$\rho(\mathbf{E}\mid\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{E}^{\mathsf T}\mathbf{E}\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).$$
Writing the error $\mathbf{E}$ in terms of $\mathbf{Y}$, $\mathbf{X}$, and $\mathbf{B}$ yields
$$\rho(\mathbf{Y}\mid\mathbf{X},\mathbf{B},\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-n/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).$$

We seek a natural conjugate prior, that is, a joint density $\rho(\mathbf{B},\boldsymbol\Sigma_\epsilon)$ of the same functional form as the likelihood. Since the likelihood is quadratic in $\mathbf{B}$, we re-write the likelihood so it is normal in $(\mathbf{B}-\hat{\mathbf{B}})$, the deviation from the classical sample estimate.

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need to use matrix differential calculus (the Kronecker product and vectorization transformations).

First, let us apply the sum-of-squares decomposition to obtain a new expression for the likelihood:
$$\rho(\mathbf{Y}\mid\mathbf{X},\mathbf{B},\boldsymbol\Sigma_\epsilon) \propto |\boldsymbol\Sigma_\epsilon|^{-(n-k)/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{S}^{\mathsf T}\mathbf{S}\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right) \, |\boldsymbol\Sigma_\epsilon|^{-k/2} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right),$$
where $\mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}$.
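Continuing the sketch above, the decomposition underlying this factorization, $(\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}) = \mathbf{S}^{\mathsf T}\mathbf{S} + (\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})$, can be checked numerically; the trial matrix B_trial below is an arbitrary illustrative value.

```python
# Check (Y - XB)^T (Y - XB) = S^T S + (B - B_hat)^T X^T X (B - B_hat)
# for an arbitrary trial value of B; it holds because X^T (Y - X B_hat) = 0.
B_trial = rng.normal(size=(k, m))
S = Y - X @ B_hat                         # residuals at the least-squares fit

lhs = (Y - X @ B_trial).T @ (Y - X @ B_trial)
rhs = S.T @ S + (B_trial - B_hat).T @ (X.T @ X) @ (B_trial - B_hat)
assert np.allclose(lhs, rhs)
```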

We would like to develop a conditional form for the priors:
$$\rho(\mathbf{B},\boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\,\rho(\mathbf{B}\mid\boldsymbol\Sigma_\epsilon),$$
where $\rho(\boldsymbol\Sigma_\epsilon)$ is an inverse-Wishart distribution and $\rho(\mathbf{B}\mid\boldsymbol\Sigma_\epsilon)$ is some form of normal distribution in the matrix $\mathbf{B}$. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices $\mathbf{B},\hat{\mathbf{B}}$ to a function of the vectors $\boldsymbol\beta = \operatorname{vec}(\mathbf{B})$, $\hat{\boldsymbol\beta} = \operatorname{vec}(\hat{\mathbf{B}})$.

Write
$$\operatorname{tr}\left((\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right) = \operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\operatorname{vec}\left(\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right).$$

Let
$$\operatorname{vec}\left(\mathbf{X}^{\mathsf T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\,\boldsymbol\Sigma_\epsilon^{-1}\right) = \left(\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^{\mathsf T}\mathbf{X}\right)\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}),$$
where $\mathbf{A} \otimes \mathbf{B}$ denotes the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$, a generalization of the outer product which multiplies an $m \times n$ matrix by a $p \times q$ matrix to generate an $mp \times nq$ matrix, consisting of every combination of products of elements from the two matrices.
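This is the standard identity $\operatorname{vec}(\mathbf{A}\mathbf{D}\mathbf{C}) = (\mathbf{C}^{\mathsf T} \otimes \mathbf{A})\operatorname{vec}(\mathbf{D})$, applied with $\mathbf{A} = \mathbf{X}^{\mathsf T}\mathbf{X}$ and $\mathbf{C} = \boldsymbol\Sigma_\epsilon^{-1}$ (which is symmetric). A small numerical check, continuing the sketch above; note that vec here means column stacking, hence the Fortran-order reshape.

```python
def vec(M):
    """Column-stacking vectorization vec(M)."""
    return M.reshape(-1, order="F")

Sigma_inv = np.linalg.inv(Sigma_eps)
D = B_trial - B_hat                       # plays the role of B - B_hat

# vec(A D C) = (C^T kron A) vec(D); here A = X^T X and C = Sigma_eps^{-1} (symmetric)
lhs = vec(X.T @ X @ D @ Sigma_inv)
rhs = np.kron(Sigma_inv, X.T @ X) @ vec(D)
assert np.allclose(lhs, rhs)
```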

Then
$$\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}})^{\mathsf T}\left(\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^{\mathsf T}\mathbf{X}\right)\operatorname{vec}(\mathbf{B}-\hat{\mathbf{B}}) = (\boldsymbol\beta-\hat{\boldsymbol\beta})^{\mathsf T}\left(\boldsymbol\Sigma_\epsilon^{-1} \otimes \mathbf{X}^{\mathsf T}\mathbf{X}\right)(\boldsymbol\beta-\hat{\boldsymbol\beta}),$$
which will lead to a likelihood which is normal in $(\boldsymbol\beta-\hat{\boldsymbol\beta})$.

With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.

Conjugate prior distribution

The natural conjugate prior using the vectorized variable $\boldsymbol\beta$ is of the form:[1]
$$\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon) = \rho(\boldsymbol\Sigma_\epsilon)\,\rho(\boldsymbol\beta\mid\boldsymbol\Sigma_\epsilon),$$
where
$$\rho(\boldsymbol\Sigma_\epsilon) \sim \mathcal{W}^{-1}(\mathbf{V}_0,\nu_0)$$
and
$$\rho(\boldsymbol\beta\mid\boldsymbol\Sigma_\epsilon) \sim N(\boldsymbol\beta_0, \boldsymbol\Sigma_\epsilon \otimes \boldsymbol\Lambda_0^{-1}).$$
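Equivalently, in matrix form, $\mathbf{B}\mid\boldsymbol\Sigma_\epsilon \sim \mathcal{MN}_{k,m}(\mathbf{B}_0, \boldsymbol\Lambda_0^{-1}, \boldsymbol\Sigma_\epsilon)$ with $\operatorname{vec}(\mathbf{B}_0) = \boldsymbol\beta_0$, matching the matrix normal form of the posterior given below. A sketch of drawing from this prior with SciPy, continuing the example above; the hyperparameter values here are illustrative assumptions only.

```python
from scipy.stats import invwishart, matrix_normal

# Illustrative prior hyperparameters (not prescribed by the text)
nu0 = m + 2                  # inverse-Wishart degrees of freedom
V0 = np.eye(m)               # inverse-Wishart scale matrix
B0 = np.zeros((k, m))        # prior mean of the coefficient matrix, vec(B0) = beta0
Lambda0 = 0.1 * np.eye(k)    # prior precision over the rows of B

# Sigma_eps ~ W^{-1}(V0, nu0), then B | Sigma_eps ~ MN(B0, Lambda0^{-1}, Sigma_eps),
# i.e. beta | Sigma_eps ~ N(beta0, kron(Sigma_eps, Lambda0^{-1}))
Sigma_prior = invwishart.rvs(df=nu0, scale=V0, random_state=rng)
B_prior = matrix_normal.rvs(mean=B0, rowcov=np.linalg.inv(Lambda0),
                            colcov=Sigma_prior, random_state=rng)
```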

Posterior distribution

Using the above prior and likelihood, the posterior distribution can be expressed as:[1]
$$\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \propto |\boldsymbol\Sigma_\epsilon|^{-(\nu_0+m+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\mathbf{V}_0\boldsymbol\Sigma_\epsilon^{-1}\right)\right)$$
$$\times\, |\boldsymbol\Sigma_\epsilon|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}-\mathbf{B}_0)\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right)$$
$$\times\, |\boldsymbol\Sigma_\epsilon|^{-n/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B})\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right),$$
where $\operatorname{vec}(\mathbf{B}_0) = \boldsymbol\beta_0$. The terms involving $\mathbf{B}$ can be grouped (with $\boldsymbol\Lambda_0 = \mathbf{U}^{\mathsf T}\mathbf{U}$) using:
$$(\mathbf{B}-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}-\mathbf{B}_0) + (\mathbf{Y}-\mathbf{X}\mathbf{B})^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B})$$
$$= \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right)^{\mathsf T}\left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}\right)$$
$$= \left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right)^{\mathsf T}\left(\begin{bmatrix}\mathbf{Y}\\ \mathbf{U}\mathbf{B}_0\end{bmatrix} - \begin{bmatrix}\mathbf{X}\\ \mathbf{U}\end{bmatrix}\mathbf{B}_n\right) + (\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n)$$
$$= (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_0-\mathbf{B}_n)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_0-\mathbf{B}_n) + (\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n),$$
with
$$\mathbf{B}_n = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{X}\hat{\mathbf{B}}+\boldsymbol\Lambda_0\mathbf{B}_0) = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{Y}+\boldsymbol\Lambda_0\mathbf{B}_0).$$
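Continuing the sketch, this completing-the-square identity can be verified numerically for an arbitrary coefficient matrix (the trial matrix from above), with $\mathbf{B}_n$ computed as defined in the text.

```python
# Posterior mean B_n = (X^T X + Lambda0)^{-1} (X^T Y + Lambda0 B0)
Bn = np.linalg.solve(X.T @ X + Lambda0, X.T @ Y + Lambda0 @ B0)

# Completing-the-square identity for an arbitrary coefficient matrix B
lhs = (B_trial - B0).T @ Lambda0 @ (B_trial - B0) \
      + (Y - X @ B_trial).T @ (Y - X @ B_trial)
rhs = (Y - X @ Bn).T @ (Y - X @ Bn) \
      + (B0 - Bn).T @ Lambda0 @ (B0 - Bn) \
      + (B_trial - Bn).T @ (X.T @ X + Lambda0) @ (B_trial - Bn)
assert np.allclose(lhs, rhs)
```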

This now allows us to write the posterior in a more useful form:
$$\rho(\boldsymbol\beta,\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \propto |\boldsymbol\Sigma_\epsilon|^{-(\nu_0+m+n+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\left(\mathbf{V}_0+(\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n)+(\mathbf{B}_n-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0)\right)\boldsymbol\Sigma_\epsilon^{-1}\right)\right)$$
$$\times\, |\boldsymbol\Sigma_\epsilon|^{-k/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left((\mathbf{B}-\mathbf{B}_n)^{\mathsf T}(\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)(\mathbf{B}-\mathbf{B}_n)\,\boldsymbol\Sigma_\epsilon^{-1}\right)\right).$$

This takes the form of an inverse-Wishart distribution times a matrix normal distribution:
$$\rho(\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X}) \sim \mathcal{W}^{-1}(\mathbf{V}_n,\nu_n)$$
and
$$\rho(\mathbf{B}\mid\mathbf{Y},\mathbf{X},\boldsymbol\Sigma_\epsilon) \sim \mathcal{MN}_{k,m}(\mathbf{B}_n,\boldsymbol\Lambda_n^{-1},\boldsymbol\Sigma_\epsilon).$$

The parameters of this posterior are given by:
$$\mathbf{V}_n = \mathbf{V}_0 + (\mathbf{Y}-\mathbf{X}\mathbf{B}_n)^{\mathsf T}(\mathbf{Y}-\mathbf{X}\mathbf{B}_n) + (\mathbf{B}_n-\mathbf{B}_0)^{\mathsf T}\boldsymbol\Lambda_0(\mathbf{B}_n-\mathbf{B}_0),$$
$$\nu_n = \nu_0 + n,$$
$$\mathbf{B}_n = (\mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0)^{-1}(\mathbf{X}^{\mathsf T}\mathbf{Y}+\boldsymbol\Lambda_0\mathbf{B}_0),$$
$$\boldsymbol\Lambda_n = \mathbf{X}^{\mathsf T}\mathbf{X}+\boldsymbol\Lambda_0.$$
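Putting the pieces together, a sketch (under the same illustrative hyperparameters as above) of computing the posterior parameters and drawing one sample from the joint posterior, first $\boldsymbol\Sigma_\epsilon$ from the inverse-Wishart and then $\mathbf{B}$ from the matrix normal:

```python
# Posterior hyperparameters
Lambda_n = X.T @ X + Lambda0
Bn = np.linalg.solve(Lambda_n, X.T @ Y + Lambda0 @ B0)
Vn = V0 + (Y - X @ Bn).T @ (Y - X @ Bn) + (Bn - B0).T @ Lambda0 @ (Bn - B0)
nu_n = nu0 + n

# Draw from the joint posterior
Sigma_post = invwishart.rvs(df=nu_n, scale=Vn, random_state=rng)
B_post = matrix_normal.rvs(mean=Bn, rowcov=np.linalg.inv(Lambda_n),
                           colcov=Sigma_post, random_state=rng)
```

Because the factorization into $\rho(\boldsymbol\Sigma_\epsilon\mid\mathbf{Y},\mathbf{X})\,\rho(\mathbf{B}\mid\mathbf{Y},\mathbf{X},\boldsymbol\Sigma_\epsilon)$ is exact, repeating the last two draws yields independent samples from the joint posterior without any Markov chain Monte Carlo.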

References

  1. Peter E. Rossi, Greg M. Allenby, Rob McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2012, p. 32.