Monthly Archives: December 2012

testing relationships

Some variable, or set of variables, y, is predicted to have a particular relationship with
some predictor variable (or variables), denoted x.

In the simplest case, when both x and y are continuous variables, the analysis is called a regression analysis. If x contains more than one predictor variable, it is called a multiple regression, and if y is binary, it is a logistic regression.

However, if the predictor variable is categorical, the model is called an analysis of variance (ANOVA), with many variants depending upon the number and relationship of the categorical predictor variables in x. Finally, if the predictor variables consist of both categorical and continuous variables, it is called an analysis of covariance (ANCOVA).
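As a rough sketch of how these cases map onto R model calls: the data below are simulated, and the names d, x1, x2, group, y and y_bin are made up for illustration only.

```r
# Simulated data: two continuous predictors, one categorical predictor,
# a continuous response y and a binary response y_bin (all hypothetical).
set.seed(42)
n <- 60
d <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  group = factor(rep(c("a", "b", "c"), each = n / 3))
)
d$y     <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
d$y_bin <- as.integer(d$y > median(d$y))

simple   <- lm(y ~ x1, data = d)                          # regression
multiple <- lm(y ~ x1 + x2, data = d)                     # multiple regression
logistic <- glm(y_bin ~ x1, data = d, family = binomial)  # logistic regression
anova1   <- aov(y ~ group, data = d)                      # analysis of variance
ancova   <- lm(y ~ group + x1, data = d)                  # analysis of covariance

summary(multiple)
```

Only the formula and the fitting function change between the cases; the data frame stays the same.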

A linear probability model (LPM) is one in which the probability of an event occurring is modelled as a linear function of the predictors. The dependent variable is binary.

Linear regression is suitable when the response is a measurement variable. It is not suitable for a nominal response, nor for bounded quantities such as ratios and proportions. Treating a nominal response as if it were a measurement variable in a linear regression would produce erroneous results. For a nominal response, logistic regression is more appropriate.
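One way to see the problem is to fit both models to the same binary response. This is only a sketch on simulated data (x, y and the slope 1.5 are arbitrary choices): the linear probability model is free to predict "probabilities" outside the 0 to 1 range, while the logistic model cannot.

```r
# Simulated binary response whose success probability rises with x.
set.seed(1)
x <- seq(-4, 4, length.out = 100)
y <- rbinom(100, 1, plogis(1.5 * x))

lpm   <- lm(y ~ x)                      # linear probability model
logit <- glm(y ~ x, family = binomial)  # logistic regression

range(fitted(lpm))    # can fall below 0 or above 1
range(fitted(logit))  # always stays between 0 and 1
```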


Managing data frames

  1. attributes(): to get all attributes of a data frame
  2. merge(): to join two data frames that have at least one common column (description) that the merge can be based on
  3. complete.cases(): to extract only those rows of a data frame that are without NAs
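A small illustration of the three functions above. The data frames samples and sites, and their columns, are invented for the example.

```r
# Two hypothetical tables sharing the column "site".
samples <- data.frame(site  = c("A", "B", "C", "D"),
                      count = c(12, NA, 7, 3))
sites   <- data.frame(site    = c("A", "B", "C"),
                      habitat = c("forest", "pasture", "stream"))

attributes(samples)                           # names, class and row.names

merged <- merge(samples, sites, by = "site")  # keeps only sites found in both tables
merged

complete <- merged[complete.cases(merged), ]  # drops rows containing any NA
complete
```

By default merge() keeps only the rows whose key appears in both tables; all = TRUE would keep unmatched rows and fill the gaps with NA.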

linear regression

The intricacies of linear regression
 y = a + bx + e
a = intercept, b = slope, e = errors
In R, lm(y ~ x) fits the linear model.
The fitted values are the values of y predicted by the model. The differences between the actual observed values and the fitted values are the residuals.
The fitted values are denoted ŷ (y-hat).
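A minimal sketch of these pieces in R, on simulated data (the true intercept 3 and slope 0.5 are arbitrary values chosen for the example):

```r
# Simulate y = a + b*x + e with a = 3, b = 0.5 and normal errors.
set.seed(123)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)

fit <- lm(y ~ x)

coef(fit)               # estimated intercept (a) and slope (b)
y_hat <- fitted(fit)    # fitted values, y-hat
res   <- residuals(fit) # observed minus fitted

# The residuals are exactly the observed values minus the fitted values.
all.equal(unname(res), unname(y - y_hat))
```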
1. The p (probability) values are reported for the constant (a) and for X, which is actually the slope of the line (b). Each p value is the probability of obtaining an estimate at least as far from zero as the one observed if the true value of a (or b) were zero, so a small p value suggests that the estimate did not arise by chance. These p values are not a measure of ‘goodness of fit’ per se, rather they state the confidence that one can have that the estimated values differ from zero, given the constraints of the regression analysis (i.e., linear with all data points having equal influence on the fitted line).
2. The R-squared and adjusted R-squared values are estimates of the ‘goodness of fit’ of the line. They represent the % variation of the data explained by the fitted line; the closer the points to the line, the better the fit. Adjusted R-squared corrects R-squared for the number of explanatory variables relative to the number of data points, so it does not rise simply because more predictors are added. R-squared is derived from

R-squared = 100 * SS(regression) / SS(total)

For linear regression with one explanatory variable, like this analysis, R-squared (expressed as a proportion) is the same as the square of r, the correlation coefficient.
R^2 is the coefficient of determination: the amount of variation in Y that is explained by the model.
3. The sum of squares (SS) represents variation from several sources.

SS(regression), or SSR, describes the variation within the fitted values of Y, and is the sum of the squared differences between each fitted value of Y and the mean of Y. The squares are taken to ‘remove’ the sign (+ or -) from the differences to make the calculation easier.

SS(error), or SSE, describes the variation of observed Y from estimated (fitted) Y. It is derived from the cumulative addition of the square of each residual, where a residual is the distance of a data point above or below the fitted line (see Fig 2.2).

SS(total), or SST, describes the variation within the values of Y, and is the sum of the squared differences between each value of Y and the mean of Y.

Source of Variation    df       Sum of Squares    Mean Square            F
Regression             1        SSR               SSR / 1                SSR / s^2
Error                  n − 2    SSE               s^2 = SSE / (n − 2)
Total                  n − 1    SST
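The quantities in the table can be computed by hand and compared with what R reports. This is a sketch on simulated data; the variable names (sst, ssr, sse, s2, f_stat, r2) are my own labels for the table entries.

```r
set.seed(123)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)
n <- length(y)

sst <- sum((y - mean(y))^2)            # SS(total)
ssr <- sum((fitted(fit) - mean(y))^2)  # SS(regression)
sse <- sum(residuals(fit)^2)           # SS(error)

s2     <- sse / (n - 2)                # mean square error
f_stat <- (ssr / 1) / s2               # F statistic
r2     <- 100 * ssr / sst              # R-squared as a percentage

all.equal(sst, ssr + sse)                               # the decomposition holds
all.equal(r2 / 100, summary(fit)$r.squared)             # matches lm's R-squared
all.equal(r2 / 100, cor(x, y)^2)                        # equals r^2 for one predictor
all.equal(f_stat, summary(fit)$fstatistic[["value"]])   # matches lm's F

anova(fit)                 # the same table as produced by R
summary(fit)$coefficients  # estimates with their p values
```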

References: Notes is a good starting point for linear regression.

From Wood, GAM.pdf

The results of the linear model need to be checked to see whether the assumptions for the errors are met, namely independent errors and homoscedasticity. This is done by using residual plots:

1. In the residuals vs. fitted plot, the residuals should be scattered around the 0 value without any discernible pattern.

2. The QQ plot of the residuals should follow a straight line, indicating an approximately normal distribution.
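The two checks can be drawn with base R graphics. A sketch on simulated data follows; plot(fit, which = 1:2) produces equivalent diagnostic plots directly from the fitted model.

```r
set.seed(123)
x <- 1:50
y <- 3 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# 1. Residuals vs fitted: look for a patternless band around zero.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2. Normal QQ plot: points close to the line suggest normal residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
```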

Ivanek: modelling environmental and meteorological factors influencing pathogen

Objective: to isolate spatially explicit determinants affecting pathogen prevalence

Normally, LR (logistic regression) would be used in such cases to predict pathogen prevalence as affected by covariates. Here they compare LR with classification trees (CT).

They used weather data for a few days prior to sample collection, summarised over time windows of different sizes, and used these summaries in analyses such as correlation, LR and CT.

Freeze-thaw cycles were calculated as described in Williams, C. et al, 2006. A day with the minimum below 0 and the maximum above 0, or a run of consecutive days with such conditions, was counted as one cycle. Such cycles were also counted for different periods, from 1 day up to 5 days, during which the cycle could have occurred.
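My reading of that counting rule can be sketched as follows. This is an assumption about the method, not the authors' code: count_freeze_thaw is a hypothetical helper, the temperatures are invented, and the windows are taken as the last k days before sampling.

```r
# Count freeze-thaw cycles: a run of consecutive days with min < 0 and
# max > 0 counts as a single cycle (hypothetical helper).
count_freeze_thaw <- function(tmin, tmax) {
  ft_day <- tmin < 0 & tmax > 0   # TRUE on days that cross zero
  runs <- rle(ft_day)             # collapse consecutive days into runs
  sum(runs$values)                # number of TRUE runs = number of cycles
}

# Invented daily temperatures, oldest first; the last entry is the day
# before sample collection.
tmin <- c(-2, -1, 1, -3, 2)
tmax <- c( 3,  4, 5,  1, 6)

count_freeze_thaw(tmin, tmax)

# Cycles within windows of 1 to 5 days before sampling.
cycles_by_window <- sapply(1:5, function(k) {
  count_freeze_thaw(tail(tmin, k), tail(tmax, k))
})
cycles_by_window
```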

PCA was done for the weather variables. Since the units of measurement differ between the variables, the data were standardised by subtracting the mean and dividing by the sd. Is this equivalent to performing the eigenanalysis of the correlation matrix, as they claim?
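A quick numerical check suggests that it is: the covariance matrix of standardised data is the correlation matrix of the original data, so the two analyses should give the same variances and the same loadings up to sign. The weather-like variables below are simulated purely for illustration.

```r
# Three simulated variables on very different scales (hypothetical data).
set.seed(1)
X <- cbind(temp = rnorm(50, mean = 10, sd = 5),
           rain = rexp(50, rate = 0.2),
           wind = runif(50, min = 0, max = 30))

pca <- prcomp(X, center = TRUE, scale. = TRUE)  # PCA on standardised data
eig <- eigen(cor(X))                            # eigenanalysis of the correlation matrix

all.equal(pca$sdev^2, eig$values)                         # same component variances
all.equal(abs(unname(pca$rotation)), abs(eig$vectors))    # same loadings up to sign
```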

Holt Winters exponential smoothing

Holt Winters
It is different from other smoothing techniques since it applies exponentially decreasing weights to observations as they get older.

It may not be useful at the exploratory stage in my case, since I am interested in a moving average that applies equal weights to all periods. However, when it comes to modelling and forecasting based on past data, it might be useful to apply reduced weights to older data.
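The contrast between the two weighting schemes can be sketched in R. The series x, the window k = 5 and the smoothing parameter alpha = 0.3 are arbitrary choices for the example; with beta and gamma switched off, HoltWinters() reduces to simple exponential smoothing.

```r
set.seed(7)
x <- 20 + cumsum(rnorm(60))   # simulated series

# Equal-weight moving average over the current and previous k - 1 points.
k  <- 5
ma <- stats::filter(x, rep(1 / k, k), sides = 1)

# Exponential smoothing with a fixed alpha (no trend, no seasonality).
alpha <- 0.3
hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)

# Implied weights on the current and the nine preceding observations.
w <- alpha * (1 - alpha)^(0:9)
round(w, 3)     # decreasing geometrically with age
rep(1 / k, k)   # moving average: the same weight for every point in the window

plot(x, type = "l")
lines(as.numeric(ma), lty = 2)
lines(c(NA, as.numeric(hw$fitted[, "xhat"])), lty = 3)
```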


  1. symnum(): create matrices using symbols – useful for large correlation matrices, or logical ones, where numbers are replaced by symbols for easy identification
    symnum(x, cutpoints = c(0.3, 0.6, 0.8, 0.9, 0.95),
    symbols = if(numeric.x) c(" ", ".", ",", "+", "*", "B")
    else c(".", "|"),
    legend = length(symbols) >= 3,
    na = "?", eps = 1e-5, numeric.x = is.numeric(x),
    corr = missing(cutpoints) && numeric.x,
    show.max = if(corr) "1", show.min = NULL,
    abbr.colnames = has.colnames,
    lower.triangular = corr && is.numeric(x) && is.matrix(x),
    diag.lower.tri = corr && !is.null(show.max))
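
A short usage sketch, relying on the defaults shown in the signature above. The built-in mtcars data set and the 0.8 threshold are just convenient choices for the example.

```r
# Correlation matrix of a few columns from the built-in mtcars data.
cors <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

sym <- symnum(cors)   # default cutpoints and symbols for correlations
print(sym)            # prints the symbol matrix together with its legend

# Logical matrices work too: TRUE and FALSE each get their own symbol.
symnum(abs(cors) > 0.8)
```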