# Partial Least Squares

## Introduction

The goal of the least-squares method is to find parameter estimates for a function, f(x), that fit a set of data, [math]x_1, \ldots, x_n[/math]. The method requires that the estimated function deviate as little as possible from f(x) in the sense of the 2-norm. Broadly speaking, least-squares methods fall into two categories, linear and non-linear. We can also classify these methods further: ordinary least squares (OLS), weighted least squares (WLS), alternating least squares (ALS), and partial least squares (PLS).

To best fit a set of data, the least-squares method minimizes the sum of squared residuals (also called the sum of squared errors, SSE),

[math] S=\sum_{i=1}^{m}r_i^2[/math],

where the residual, [math]r_i[/math], is the difference between the observed data point and the value predicted by the regression line, and is defined as

[math] r_i= y_i - f(x_i)[/math]

where the m data pairs are [math] (x_i, y_i)\! [/math], and the model function is [math] f(x_i) [/math].

Here, we can choose **n** different parameters for f(x) so that the approximating function best fits the data set.
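As a minimal sketch of the quantity being minimized, the following Python snippet computes S for a candidate model against a small data set. The data and the candidate model here are illustrative assumptions, not taken from the text.

```python
# Sketch: the sum of squared residuals (SSE) for a candidate model f(x).

def sse(f, xs, ys):
    """Return S = sum of squared residuals r_i = y_i - f(x_i)."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (assumed for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

# Candidate model f(x) = x; a better choice of parameters would make S smaller.
s = sse(lambda x: x, xs, ys)
```

Least-squares fitting amounts to searching over the model's parameters for the choice that makes this S as small as possible.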

## Theory

The linear least-squares (LLS) method assumes that the data set falls on a straight line. Therefore, [math] f(x) = ax + b [/math], where a and b are constants. However, due to experimental error, some data points will not lie exactly on the line; there is an error (residual) between the estimated function and the real data. The linear least-squares method (or [math] l_2 [/math] approximation) defines the best-fit function as the function that minimizes [math] S=\sum_{i=1}^{n}(y_i - (ax_i + b))^2[/math]

The advantages of LLS:

1. If we assume that the errors have a normal probability distribution, then minimizing S gives us the best approximation of a and b.

2. We can easily determine the approximate values of a and b using calculus.

To minimize S, the following conditions must be satisfied: [math] \frac{\partial S}{\partial a}=0 [/math] and [math] \frac{\partial S}{\partial b}=0 [/math].

Taking the partial derivatives, we obtain [math] \sum_{i=0}^{n}2(ax_i + b - y_i)x_i = 0[/math] and [math] \sum_{i=0}^{n}2(ax_i + b - y_i) = 0[/math].

This system consists of two simultaneous linear equations in the two unknowns a and b. (These are the so-called normal equations.)

Solving these normal equations for a and b, we find that

[math] a = \frac{1}{c}\left[(n+1)\sum_{i=0}^{n}x_i y_i-\left(\sum_{i=0}^{n}x_i\right)\left(\sum_{i=0}^{n}y_i\right)\right] [/math] and [math] b = \frac{1}{c}\left[\left(\sum_{i=0}^{n}x_i^2\right)\left(\sum_{i=0}^{n}y_i\right)-\left(\sum_{i=0}^{n}x_i\right)\left(\sum_{i=0}^{n}x_i y_i\right)\right] [/math]

where

[math] c = (n+1)\sum_{i=0}^{n}x_i^2-\left(\sum_{i=0}^{n}x_i\right)^2 [/math] .
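The closed-form solution above translates directly into code. The following is a sketch in pure Python (no external libraries), with the data points chosen purely for illustration.

```python
# Sketch of the closed-form solution of the normal equations for n+1
# data points (x_0, y_0), ..., (x_n, y_n).

def linear_fit(xs, ys):
    """Return (a, b) minimizing sum((y_i - (a*x_i + b))**2)."""
    m = len(xs)                        # m = n + 1 points
    sx = sum(xs)                       # sum of x_i
    sy = sum(ys)                       # sum of y_i
    sxx = sum(x * x for x in xs)       # sum of x_i^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    c = m * sxx - sx * sx              # c = (n+1)*sum(x^2) - (sum x)^2
    a = (m * sxy - sx * sy) / c
    b = (sxx * sy - sx * sxy) / c
    return a, b

# Points lying exactly on y = 2x + 1 recover a = 2, b = 1.
a, b = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
```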

Thus, for the data set [math] (i, y_i) [/math], where i is an integer from 1 to n, the best estimated function is

[math] y = ax + b [/math], where [math] a = \frac{1}{n^3-n}\left[12\sum_{i=1}^{n}i\,y_i-6(n+1)\sum_{i=1}^{n}y_i\right] [/math] and [math] b = \frac{1}{n^2-n}\left[(4n+2)\sum_{i=1}^{n}y_i-6\sum_{i=1}^{n}i\,y_i\right] [/math].
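These special-case formulas for equally spaced data can be sketched as follows; the test data y_i = 3i - 2 is an illustrative assumption chosen so the exact answer is known.

```python
# Sketch of the special-case formulas for the data set (i, y_i), i = 1..n.

def fit_indexed(ys):
    """Return (a, b) for the equally spaced data set (i, y_i), i = 1..n."""
    n = len(ys)
    s_y = sum(ys)                                       # sum of y_i
    s_iy = sum(i * y for i, y in enumerate(ys, start=1))  # sum of i * y_i
    a = (12 * s_iy - 6 * (n + 1) * s_y) / (n**3 - n)
    b = ((4 * n + 2) * s_y - 6 * s_iy) / (n**2 - n)
    return a, b

# y_i = 3i - 2 for i = 1..5 recovers a = 3, b = -2.
a, b = fit_indexed([1, 4, 7, 10, 13])
```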

### Least-Squares Method in statistical view

Writing the normal equations in matrix form, [math] [[X]^T[X]]\{A\} = \{[X]^T[Y]\} [/math], we can solve for the coefficient vector: [math]\{A\} = [[X]^T[X]]^{-1}\{[X]^T[Y]\}[/math].

From this equation, we can determine not only the coefficients but also their statistical properties.
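The matrix form can be sketched with NumPy as follows. Note that the code solves the normal equations with `numpy.linalg.solve` rather than forming the inverse explicitly, which is the numerically preferable route; the data is again an illustrative assumption.

```python
# Sketch of the matrix form of the normal equations, X^T X A = X^T Y.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])        # exactly y = 2x + 1

# Design matrix: one column of x values (for a), one column of ones (for b).
X = np.column_stack([x, np.ones_like(x)])

# Solve (X^T X) A = X^T Y instead of inverting X^T X explicitly.
A = np.linalg.solve(X.T @ X, X.T @ y)
a, b = A
```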

Using calculus, the following formulas for coefficients can be obtained:

[math] a = \frac{S_{xy}}{S^2_x} [/math] and [math] b = \bar{y} - a\bar{x} [/math]

where

[math] S_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} [/math]

[math] S^2_x = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} [/math]

[math] \bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} [/math]

[math] \bar{y} = \frac{\sum_{i=1}^{n}y_i}{n} [/math].
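The statistical formulas above can be sketched in pure Python; the data is the same illustrative straight line used earlier, so the result should agree with the normal-equation solution.

```python
# Sketch of the statistical form: a = S_xy / S_x^2, b = ybar - a * xbar.

def stat_fit(xs, ys):
    """Return (a, b) from the sample covariance and variance formulas."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    a = s_xy / s_xx
    b = ybar - a * xbar
    return a, b

# Points on y = 2x + 1 again recover a = 2, b = 1.
a, b = stat_fit([0, 1, 2, 3], [1, 3, 5, 7])
```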

Moreover, the diagonal and off-diagonal entries of the matrix [math][[X]^T[X]]^{-1}[/math] represent the variances and covariances of the coefficients [math] a_i [/math], respectively.

If the diagonal entries of [math][[X]^T[X]]^{-1}[/math] are [math] x_{i,i} [/math] and the corresponding coefficients are [math] a_{i-1} [/math], then

[math]var(a_{i-1}) = x_{i,i}*s^2_{y/x}[/math] and [math]cov(a_{i-1}, a_{j-1}) = x_{i,j}*s^2_{y/x}[/math]

where [math] s_{y/x} [/math] is called the standard error of the estimate, and [math] s_{y/x} = \sqrt{\frac{S}{n - 2}} [/math].

(Here, the subscript y/x indicates that the error measures the spread of the observed y values around the y predicted for a given x.)

These two quantities have many applications. For example, we can derive upper and lower bounds (confidence intervals) for the intercept and slope.
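Putting the pieces together, the following sketch computes the coefficient variances from [math][[X]^T[X]]^{-1}[/math] and [math]s^2_{y/x}[/math], then forms approximate bounds on the slope. The data is illustrative, and the critical value 2.0 is a placeholder standing in for the exact t-quantile, not a value from the text.

```python
# Sketch: coefficient variances from (X^T X)^{-1} * s_{y/x}^2, then
# approximate confidence bounds on the slope a.
import math
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])      # roughly y = 2x + 1

X = np.column_stack([x, np.ones_like(x)])    # columns: x, 1 -> A = [a, b]
A = np.linalg.solve(X.T @ X, X.T @ y)        # least-squares coefficients

resid = y - X @ A
S = float(resid @ resid)                     # sum of squared residuals
s2 = S / (len(x) - 2)                        # s_{y/x}^2 = S / (n - 2)

cov = np.linalg.inv(X.T @ X) * s2            # covariance matrix of (a, b)
var_a, var_b = cov[0, 0], cov[1, 1]

t_crit = 2.0                                 # placeholder critical value
lower = A[0] - t_crit * math.sqrt(var_a)     # lower bound on the slope
upper = A[0] + t_crit * math.sqrt(var_a)     # upper bound on the slope
```

A real confidence interval would replace the placeholder 2.0 with the t-quantile for n - 2 degrees of freedom at the desired confidence level.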