Partial Least Squares

Introduction
The goal of the Least-Squares Method is to find a good estimate of the parameters of a function, f(x), that fits a set of data, $$x_1, \ldots, x_n$$. The Least-Squares Method requires that the estimated function deviate as little as possible from f(x) in the sense of the 2-norm. Broadly speaking, least-squares methods fall into two categories, linear and non-linear. They can be classified further as ordinary least squares (OLS), weighted least squares (WLS), alternating least squares (ALS), and partial least squares (PLS).



To best fit a set of data, the least-squares method minimizes the sum of squared residuals (also called the Sum of Squared Errors, SSE),

$$ S=\sum_{i=1}^{i=m}r_i^2$$,

with $$r_i$$ the residual, i.e., the difference between an actual data point and the value given by the regression line, defined as

$$ r_i= y_i - f(x_i)$$

where the m data pairs are $$ (x_i, y_i) $$ and the model function is $$ f(x_i) $$.

Here we can choose the n parameters of f(x) so that the approximating function fits the data set as well as possible.
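As a concrete illustration, the following Python sketch computes the residuals and the SSE for a candidate model function (the linear model f(x) = 2x + 1 and the sample values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical sample data (x_i, y_i); values chosen only for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def f(x):
    """Candidate model function f(x) = 2x + 1 (an assumed example)."""
    return 2.0 * x + 1.0

residuals = y - f(x)          # r_i = y_i - f(x_i)
S = np.sum(residuals ** 2)    # sum of squared residuals (SSE)
print(S)
```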

Theory
The Linear Least-Squares (LLS) Method assumes that the data fall on a straight line, so $$ f(x) = ax + b $$, where a and b are constants. However, due to experimental error, some data points will not lie exactly on the line; there is an error (residual) between the estimated function and the real data. The Linear Least-Squares Method (or $$ l_2 $$ approximation) defines the best-fit function as the function that minimizes $$ S=\sum_{i=1}^{i=n}(y_i - (ax_i + b))^2 $$.

The advantages of LLS:

1. If we assume that the errors have a normal probability distribution, then minimizing S gives the best approximation of a and b.

2. We can easily use calculus to determine the approximate values of a and b.

To minimize S, the following conditions must be satisfied: $$ \frac{\partial S}{\partial a}=0 $$ and $$ \frac{\partial S}{\partial b}=0 $$.

Taking the partial derivatives, we obtain $$ \sum_{i=0}^{i=n} 2\,(ax_i + b - y_i)\,x_i = 0 $$ and $$ \sum_{i=0}^{i=n} 2\,(ax_i + b - y_i) = 0 $$.

This system consists of two simultaneous linear equations in the two unknowns a and b (the so-called normal equations).
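Dividing by 2 and rearranging, the normal equations can be written out explicitly:

$$ a\sum_{i=0}^{i=n} x_i^2 + b\sum_{i=0}^{i=n} x_i = \sum_{i=0}^{i=n} x_i y_i $$ and $$ a\sum_{i=0}^{i=n} x_i + b\,(n+1) = \sum_{i=0}^{i=n} y_i $$.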

Solving this system by straightforward manipulation of the summations, we find that

$$ a = \frac{1}{c}\left[(n+1)\sum_{i=0}^{i=n} x_i y_i - \left(\sum_{i=0}^{i=n} x_i\right)\left(\sum_{i=0}^{i=n} y_i\right)\right] $$ and $$ b = \frac{1}{c}\left[\left(\sum_{i=0}^{i=n} x_i^2\right)\left(\sum_{i=0}^{i=n} y_i\right)-\left(\sum_{i=0}^{i=n} x_i\right)\left(\sum_{i=0}^{i=n} x_i y_i\right)\right] $$

where

$$ c = (n+1)\sum_{i=0}^{i=n} x_i^2 - \left(\sum_{i=0}^{i=n} x_i\right)^2 $$.
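A minimal Python sketch of these closed-form expressions (assuming NumPy; the data arrays are hypothetical, and there are n+1 points indexed i = 0, ..., n) might look like:

```python
import numpy as np

def linear_least_squares(x, y):
    """Closed-form slope a and intercept b for y ~ ax + b.

    x, y hold the n+1 data points indexed i = 0..n.
    """
    m = len(x)  # m = n + 1 points
    sx, sy = np.sum(x), np.sum(y)
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    c = m * sxx - sx ** 2                # c = (n+1)*sum(x^2) - (sum x)^2
    a = (m * sxy - sx * sy) / c          # slope
    b = (sxx * sy - sx * sxy) / c        # intercept
    return a, b

# Hypothetical data for a quick check.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(linear_least_squares(x, y))   # agrees with np.polyfit(x, y, 1)
```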

Thus, the best estimated function for the data set $$ (i, y_i) $$, where i is an integer in [1, n], is

$$ y = ax + b $$, where $$ a = \frac{1}{n^3-n}\left[12\sum_{i=1}^{i=n} i\,y_i - 6(n+1)\sum_{i=1}^{i=n} y_i\right] $$ and $$ b = \frac{1}{n^2-n}\left[(4n+2)\sum_{i=1}^{i=n} y_i - 6\sum_{i=1}^{i=n} i\,y_i\right] $$.
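For this equally spaced case, a short sketch (with hypothetical y values) can confirm that the specialized formulas agree with the general ones:

```python
import numpy as np

n = 5
i = np.arange(1, n + 1)                    # x_i = i for i = 1..n
y = np.array([2.0, 3.9, 6.1, 8.0, 9.9])    # hypothetical data

siy, sy = np.sum(i * y), np.sum(y)
a = (12 * siy - 6 * (n + 1) * sy) / (n**3 - n)   # specialized slope formula
b = ((4 * n + 2) * sy - 6 * siy) / (n**2 - n)    # specialized intercept formula
print(a, b)                                # matches np.polyfit(i, y, 1)
```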

The Least-Squares Method from a statistical point of view
From the equation $$ [X]^T[X]\{A\} = [X]^T\{Y\} $$, we can derive the following equation: $$ \{A\} = ([X]^T[X])^{-1}[X]^T\{Y\} $$.

From this equation, we can determine not only the coefficients but also statistical quantities associated with them.
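In matrix form, a sketch of this computation (assuming NumPy; X is a hypothetical design matrix with a column of x values and a column of ones) could be:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

X = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]
# Solve the normal equations X^T X A = X^T y for A = (a, b).
# Solving the system is numerically preferable to inverting X^T X.
A = np.linalg.solve(X.T @ X, X.T @ y)
print(A)   # [a, b]
```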

Using calculus, the following formulas for coefficients can be obtained:

$$ a = \frac{S_{xy}}{S_x^2} $$ and $$ b = \bar{y} - a\bar{x} $$

where

$$ S_{xy} = \frac{\sum_{i=1}^{i=n}(x_i-\bar{x})(y_i-\bar{y})}{n-1} $$

$$ S^2_x = \frac{\sum_{i=1}^{i=n}(x_i-\bar{x})^2}{n-1} $$

$$ \bar{x} = \frac{\sum_{i=1}^{i=n} x_i}{n} $$

$$ \bar{y} = \frac{\sum_{i=1}^{i=n} y_i}{n} $$.
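A direct Python transcription of these statistical formulas (with hypothetical data; the n-1 denominators are written out explicitly):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
n = len(x)

xbar, ybar = np.mean(x), np.mean(y)
S_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)   # sample covariance
S_xx = np.sum((x - xbar) ** 2) / (n - 1)           # sample variance of x

a = S_xy / S_xx          # slope
b = ybar - a * xbar      # intercept
print(a, b)
```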

Moreover, the diagonal and off-diagonal entries of the matrix $$ ([X]^T[X])^{-1} $$ determine the variances and the covariances, respectively, of the coefficients $$ a_i $$.

Assume the diagonal entries of $$ ([X]^T[X])^{-1} $$ are $$ x_{i,i} $$ and the corresponding coefficients are $$ a_i $$; then

$$ \operatorname{var}(a_{i-1}) = x_{i,i}\, s^2_{y/x} $$ and $$ \operatorname{cov}(a_{i-1}, a_{j-1}) = x_{i,j}\, s^2_{y/x} $$

where $$ s_{y/x} $$ is called the standard error of the estimate, and $$ s_{y/x} = \sqrt{\frac{S}{n-2}} $$.

(Here the subscript y/x indicates that the quantity measures the error of the estimated y for a given x, i.e., the spread of the observed y values around the regression line.)

These two pieces of information have many applications. For example, we can derive upper and lower bounds for the intercept and the slope.
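As a sketch of such an application (assuming SciPy is available for the t quantile; the 95% confidence level and the data are illustrative choices), bounds on the slope and the intercept can be computed as:

```python
import numpy as np
from scipy import stats

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
n = len(x)

X = np.column_stack([x, np.ones_like(x)])
XtX_inv = np.linalg.inv(X.T @ X)           # (X^T X)^{-1}
A = XtX_inv @ X.T @ y                      # A = (a, b)

residuals = y - X @ A
s2 = np.sum(residuals ** 2) / (n - 2)      # s_{y/x}^2 = S / (n - 2)

var_a, var_b = np.diag(XtX_inv) * s2       # variances of slope and intercept
t = stats.t.ppf(0.975, df=n - 2)           # two-sided 95% t quantile
print("slope:", A[0], "+/-", t * np.sqrt(var_a))
print("intercept:", A[1], "+/-", t * np.sqrt(var_b))
```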