The intricacies of linear regression

y = a + bx + e

a = intercept, b = slope, e = errors

In R, lm(y ~ x) fits the linear model.

The fitted values are the values of y predicted by the model. The differences between the actual observed values and the fitted values are the residuals.

The fitted values are denoted ŷ (y-hat).
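As a minimal sketch of the definitions above, the following R code fits the model and checks by hand that the fitted values equal a + bx and that the residuals equal observed minus fitted. The built-in `cars` dataset and the variable names are assumptions chosen for illustration; they are not part of these notes.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
x <- cars$speed
y <- cars$dist

fit <- lm(y ~ x)            # fit y = a + b*x + e by least squares

a <- coef(fit)[1]           # estimated intercept
b <- coef(fit)[2]           # estimated slope

y_hat <- fitted(fit)        # fitted values (y-hat)
res   <- residuals(fit)     # residuals = observed - fitted

# Check both definitions by hand; each comparison should be TRUE
all.equal(unname(y_hat), unname(a + b * x))
all.equal(unname(res), unname(y - y_hat))
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick sanity check on any fit.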

1. The **p** (probability) values for the constant (a) and X, actually the slope of the line (b). Each p value is the probability of obtaining an estimate at least as far from zero as the one observed if the true value of the coefficient were zero; a small p value therefore suggests that the estimate of a or b is unlikely to have arisen by chance alone. These p values are not a measure of ‘goodness of fit’ *per se*; rather, they indicate how much confidence one can have that each coefficient differs from zero, given the constraints of the regression analysis (*i.e.*, linear with all data points having equal influence on the fitted line).
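The p values for a and b can be read from the coefficient table of the model summary. The sketch below also reproduces them by hand as two-sided t tests with n − 2 degrees of freedom. The `cars` dataset and the variable names are illustrative assumptions, not part of these notes.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coefs    <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]      # one p value for the intercept (a), one for the slope (b)

# Reproduce by hand: t = estimate / standard error, two-sided test on n - 2 df
t_stat   <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_manual <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

all.equal(p_values, p_manual)        # should be TRUE
```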

2. The **R-squared** and **adjusted R-squared** values are estimates of the ‘goodness of fit’ of the line. They represent the **% variation of the data explained by the fitted line**; the closer the points to the line, the better the fit. Adjusted R-squared corrects R-squared for the number of explanatory variables relative to the number of data points, so it does not rise simply because more terms are added to the model. R-squared is derived from:

R-squared = 100 * SS(regression) / SS(total)

For linear regression with one explanatory variable like this analysis, R-squared (expressed as a proportion rather than a percentage) is the same as the square of *r*, the correlation coefficient.

R^2 is the **coefficient of determination**: the amount of variation in the data that is explained by the model.
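The relationships above can be checked numerically. This is a minimal sketch that computes R-squared from the sums of squares and compares it with the value reported by R and with the squared correlation coefficient. The `cars` dataset and the variable names are illustrative assumptions. The adjusted R-squared line uses the standard formula for one explanatory variable, which is not stated in these notes.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
x <- cars$speed
y <- cars$dist
fit <- lm(y ~ x)

ss_total      <- sum((y - mean(y))^2)             # SS(total)
ss_regression <- sum((fitted(fit) - mean(y))^2)   # SS(regression)

r2         <- ss_regression / ss_total            # R-squared as a proportion
r2_percent <- 100 * r2                            # R-squared as a percentage

# All three should agree
all.equal(r2, summary(fit)$r.squared)
all.equal(r2, cor(x, y)^2)

# Adjusted R-squared for one explanatory variable (standard formula, an addition)
n      <- length(y)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```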

3. The sum of squares (SS) represents variation from several sources.

**SS(regression)** (**SSR**) describes the variation within the fitted values of **Y**, and is the sum of the squared differences between each **fitted** value of **Y** and the mean of **Y**. The differences are squared to ‘remove’ the sign (+ or -), so that positive and negative differences do not cancel when summed.

**SS(error)** (**SSE**) describes the variation of observed Y from estimated (fitted) Y. It is the sum of the squares of the **residuals**, where a residual is the distance of a data point above or below the fitted line (see **Fig 2.2**).

**SS(total)** (**SST**) describes the variation within the values of **Y**, and is the sum of the squared differences between each value of **Y** and the mean of **Y**.
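Written out as formulas, the three sums of squares look as follows. The symbols are assumptions introduced here for compactness: y_i is the i-th observed value, ŷ_i the corresponding fitted value, ȳ the mean of Y, and n the number of data points. The final identity holds for a least-squares fit that includes an intercept.

```latex
\begin{aligned}
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 = \mathrm{SSR} + \mathrm{SSE}
\end{aligned}
```

Because SST = SSR + SSE, R-squared as a proportion can be written either as SSR / SST or as 1 − SSE / SST.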

| Source of variation | df | Sum of squares | Mean square | F |
|---|---|---|---|---|
| Regression | 1 | SSR | SSR / 1 | SSR / s^2 |
| Error | n − 2 | SSE | s^2 = SSE / (n − 2) | |
| Total | n − 1 | SST | | |
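In R, anova() applied to a fitted model returns this table (without the Total row). The sketch below pulls out SSR and SSE and rebuilds s^2 and F by hand to show that they match the layout above. The `cars` dataset and the variable names are illustrative assumptions, not part of these notes.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
fit <- lm(dist ~ speed, data = cars)
tab <- anova(fit)     # rows: "speed" (regression) and "Residuals" (error)
print(tab)

n   <- nrow(cars)
ssr <- tab["speed", "Sum Sq"]        # SSR, 1 df
sse <- tab["Residuals", "Sum Sq"]    # SSE, n - 2 df
sst <- ssr + sse                     # SST, n - 1 df

s2     <- sse / (n - 2)              # mean square error, s^2
f_stat <- (ssr / 1) / s2             # F = SSR / s^2

all.equal(s2, tab["Residuals", "Mean Sq"])   # should be TRUE
all.equal(f_stat, tab["speed", "F value"])   # should be TRUE
```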

References:

http://www.le.ac.uk/bl/gat/virtualfc/Stats/regression/regr1.html

http://www.math.umbc.edu/~kofi/Courses/Stat355/Chap12.pdf

http://www.econ.ucsb.edu/~pjkuhn/AEASTP/Lecture Notes (a good starting point for linear regression notes)

From Wood, GAM.pdf

The results of the linear model need to be checked to see whether the assumptions about the errors are met, namely independent errors and homoscedasticity (constant variance). This is done using residual plots:

1. In the plot of residuals against fitted values, the points should be scattered around 0 without any discernible pattern.

2. The QQ plot of the residuals should lie close to a straight line, indicating approximately normally distributed residuals.
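Both diagnostic plots can be drawn with base R graphics. This is a minimal sketch; the `cars` dataset, the variable names, and the plot labels are illustrative assumptions, not part of these notes.

```r
# Illustrative data: the built-in 'cars' dataset (an assumption, not from these notes)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# 1. Residuals against fitted values: look for a patternless scatter around 0
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)     # dashed reference line at 0

# 2. Normal QQ plot of the residuals: points should lie close to the line
qqnorm(res)
qqline(res)
```

Calling plot(fit, which = 1:2) produces R's built-in versions of the same two plots.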