# Partial Least Squares

## Introduction

The goal of the Least-Squares Method is to find a good estimate of the parameters of a function f(x) that fits a set of data, $x_1 ... x_n$. The Least-Squares Method requires that the estimated function deviate as little as possible from f(x) in the sense of the 2-norm. Generally speaking, least-squares methods fall into two categories, linear and non-linear. They can be classified further into ordinary least squares (OLS), weighted least squares (WLS), alternating least squares (ALS), and partial least squares (PLS).

To best fit a set of data, the least-squares method minimizes the sum of squared residuals (also called the Sum of Squared Errors, SSE),

$S=\sum_{i=1}^{i=m}r_i^2$,

where $r_i$ is the residual, the difference between an actual data point and the value predicted by the model, defined as

$r_i= y_i - f(x_i)$

where the m data pairs are $(x_i, y_i)$, and the model function is $f(x_i)$.

Here we can choose the n parameters of f(x) so that the approximating function best fits the data set.
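As a concrete sketch of the quantities above, the residuals and the SSE can be computed directly; the data set and the candidate model below are made up for illustration:

```python
# Sum of squared residuals S for a candidate model f(x) = 2x + 1,
# evaluated against a small made-up data set (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

def f(x):
    return 2.0 * x + 1.0

# r_i = y_i - f(x_i), then S = sum of r_i^2
residuals = [y - f(x) for x, y in zip(xs, ys)]
S = sum(r ** 2 for r in residuals)
print(S)  # the SSE for this candidate model
```

A different choice of parameters in f would give a different S; least squares picks the parameters that make S as small as possible.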

## Theory

The Linear Least-Squares (LLS) Method assumes that the data fall approximately on a straight line. Therefore, $f(x) = ax + b$, where a and b are constants. However, due to experimental error, some data will not lie on the line exactly; there is an error (residual) between the estimated function and the real data. The Linear Least-Squares Method (or $l_2$ approximation) defines the best-fit function as the function that minimizes $S=\sum_{i=1}^{i=n}(y_i - (ax_i + b) )^2$

1. If we assume that the errors have a normal probability distribution, then minimizing S gives us the best approximation of a and b.

2. We can easily determine the approximate values of a and b using calculus.

To minimize S, the following conditions must be satisfied: $\frac{\partial S}{\partial a}=0$ and $\frac{\partial S}{\partial b}=0$

Taking the partial derivatives, we obtain $\sum_{i=1}^{i=n}2((ax_i + b) - y_i)x_i = 0$, and $\sum_{i=1}^{i=n}2((ax_i + b) - y_i) = 0$.

This system consists of two simultaneous linear equations in the two unknowns a and b. (These are the so-called normal equations.)

Carrying out the summations and solving this system, we find that

$a = \frac{1}{c}\left[n\sum_{i=1}^{i=n}x_iy_i-\left(\sum_{i=1}^{i=n}x_i\right)\left(\sum_{i=1}^{i=n}y_i\right)\right]$ and $b = \frac{1}{c}\left[\left(\sum_{i=1}^{i=n}x_i^2\right)\left(\sum_{i=1}^{i=n}y_i\right)-\left(\sum_{i=1}^{i=n}x_i\right)\left(\sum_{i=1}^{i=n}x_iy_i\right)\right]$

where

$c = n\sum_{i=1}^{i=n}x_i^2-\left(\sum_{i=1}^{i=n}x_i\right)^2$.
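These closed-form coefficient formulas can be sketched in a few lines of Python. The data points below are made up and constructed to lie exactly on the line $y = 3x + 2$, so the fit should recover $a = 3$ and $b = 2$:

```python
# Closed-form least-squares coefficients for f(x) = a*x + b.
# The leading factor in the numerator of a, and in c, is the
# number of data points being summed over.
xs = [0.0, 1.0, 2.0, 4.0, 7.0]   # made-up sample points
ys = [3 * x + 2 for x in xs]     # exact line y = 3x + 2, so a -> 3, b -> 2

m = len(xs)                       # number of data points
Sx = sum(xs)
Sy = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

c = m * Sxx - Sx ** 2
a = (m * Sxy - Sx * Sy) / c
b = (Sxx * Sy - Sx * Sxy) / c
print(a, b)  # slope and intercept
```

Because the data here are noise-free, the residuals are zero and the recovered coefficients match the generating line exactly.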

Thus, for the data set $(i, y_i)$, where i is an integer in $[1, n]$ (i.e., equally spaced x-values $x_i = i$), the best estimated function is

$y = ax + b$, where $a = \frac{1}{n^3-n}\left[12\sum_{i=1}^{i=n}iy_i-6(n+1)\sum_{i=1}^{i=n}y_i\right]$ and $b = \frac{1}{n^2-n}\left[(4n+2)\sum_{i=1}^{i=n}y_i-6\sum_{i=1}^{i=n}iy_i\right]$.
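A quick sketch of this specialization for equally spaced data (the y-values are made up and generated from $y_i = 2i + 3$, so the formulas should return $a = 2$ and $b = 3$):

```python
# Specialized formulas for data (i, y_i), i = 1..n (equally spaced x):
#   a = [12*sum(i*y_i) - 6*(n+1)*sum(y_i)] / (n^3 - n)
#   b = [(4n+2)*sum(y_i) - 6*sum(i*y_i)]  / (n^2 - n)
ys = [5.0, 7.0, 9.0, 11.0, 13.0]   # y_i = 2*i + 3, so a -> 2, b -> 3
n = len(ys)

Sy = sum(ys)
Siy = sum(i * y for i, y in enumerate(ys, start=1))

a = (12 * Siy - 6 * (n + 1) * Sy) / (n ** 3 - n)
b = ((4 * n + 2) * Sy - 6 * Siy) / (n ** 2 - n)
print(a, b)
```

Only the two sums $\sum y_i$ and $\sum i\,y_i$ are needed, since the sums over the x-values $1, \ldots, n$ have been evaluated in closed form.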

### The Least-Squares Method in a statistical view

Writing the problem in matrix form, with $[X]$ the design matrix, $\{Y\}$ the vector of observations, and $\{A\}$ the vector of coefficients, the normal equations read $[[X]^T[X]]\{A\} = \{[X]^T[Y]\}$, from which we can derive the following equation: $\{A\} = [[X]^T[X]]^{-1}\{[X]^T[Y]\}$.
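The matrix form can be sketched with NumPy for a straight-line fit (the data values are made up; the design matrix has a column of ones corresponding to the intercept):

```python
import numpy as np

# Solve the normal equations [X]^T [X] {A} = [X]^T {Y} for a line fit.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 5.0, 7.5])        # made-up data

X = np.column_stack([np.ones_like(x), x])  # design matrix, rows [1, x_i]
A = np.linalg.solve(X.T @ X, X.T @ y)      # {A} = [X^T X]^{-1} X^T {Y}
b, a = A                                   # intercept, slope
print(a, b)
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is preferred over forming $[X]^T[X]$ explicitly, since it is numerically more stable, but the solve above follows the equation in the text directly.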

From this equation, we can determine not only the coefficients, but also their statistical properties.

Using calculus, the following formulas for coefficients can be obtained:

$a = \frac{S_{xy}}{S^2_x}$ and $b = \bar{y} - a\bar{x}$

where

$S_{xy} = \frac{\sum_{i=1}^{i=n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$

$S^2_x = \frac{\sum_{i=1}^{i=n}(x_i-\bar{x})^2}{n-1}$

$\bar{x} = \frac{\sum_{i=1}^{i=n}x_i}{n}$

$\bar{y} = \frac{\sum_{i=1}^{i=n}y_i}{n}$.
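A brief sketch of these statistical formulas, with made-up data:

```python
# Slope and intercept via the statistical formulas
#   a = S_xy / S_x^2,  b = ybar - a * xbar
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]   # made-up, roughly on y = 2x

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
Sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)

a = Sxy / Sx2           # slope: sample covariance over sample variance
b = ybar - a * xbar     # intercept: line passes through (xbar, ybar)
print(a, b)
```

Note that the factors of $n-1$ cancel in the ratio $S_{xy}/S^2_x$, so these formulas give the same a and b as the summation formulas derived earlier.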

Moreover, the diagonal and off-diagonal entries of the matrix $[[X]^T[X]]^{-1}$ represent the variances and covariances, respectively, of the coefficients $a_i$.

Assume the diagonal entries of $[[X]^T[X]]^{-1}$ are $x_{i,i}$ and the corresponding coefficients are $a_{i-1}$; then

$var(a_{i-1}) = x_{i,i}s^2_{y/x}$ and $cov(a_{i-1}, a_{j-1}) = x_{i,j}s^2_{y/x}$

where $s_{y/x}$ is called the standard error of the estimate, and $s_{y/x} = \sqrt{\frac{S}{n - 2}}$.

(Here the subscript y/x indicates that the error measures the spread of the y values about the regression line for given x.)

These two pieces of information have many applications. For example, we can derive upper and lower confidence bounds for the intercept and the slope.
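As an illustrative sketch, the diagonal of $[[X]^T[X]]^{-1}$ (written out explicitly for the 2×2 straight-line case) gives the coefficient variances, from which rough confidence bounds follow. The data are made up, and the t-value 4.303 is an assumed two-sided 95% critical value for $n - 2 = 2$ degrees of freedom:

```python
import math

# Variances of intercept b and slope a from the diagonal of [X^T X]^{-1},
# then confidence bounds  b +/- t*sd(b)  and  a +/- t*sd(a).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.5, 5.0, 7.5]        # made-up data
n = len(xs)

Sx = sum(xs)
Sxx = sum(x * x for x in xs)
Sy = sum(ys)
Sxy = sum(x * y for x, y in zip(xs, ys))
c = n * Sxx - Sx ** 2
a = (n * Sxy - Sx * Sy) / c      # slope
b = (Sxx * Sy - Sx * Sxy) / c    # intercept

S = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # SSE
s2 = S / (n - 2)                 # s_{y/x}^2, squared standard error

# For X with rows [1, x_i], [X^T X]^{-1} = (1/c) [[Sxx, -Sx], [-Sx, n]]:
var_b = (Sxx / c) * s2           # diagonal entry for the intercept
var_a = (n / c) * s2             # diagonal entry for the slope

t = 4.303                        # assumed 95% t-value for dof = 2
print(f"a = {a:.3f} +/- {t * math.sqrt(var_a):.3f}")
print(f"b = {b:.3f} +/- {t * math.sqrt(var_b):.3f}")
```

The same diagonal-times-$s^2_{y/x}$ recipe extends to polynomial or multiple regression, where $[[X]^T[X]]^{-1}$ is larger and is best computed numerically.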