Assessing fit and comparing models

Continuing our discussion of modelling and simulation from the last unit: a "model" is a mathematical construct whose objective is to mimic a situation in the "real world". This definition may seem a bit abstract, but in practice the model is usually some kind of equation.

In Statistics, our models are usually of the form:

y = β_{0} + β_{1}X_{1} + β_{2}X_{2} + ... + β_{n}X_{n} + ε

where y is an observed value, also called the response variable; the X's are measured effects (age, treatment, etc.), also called explanatory variables; and ε is an overall error. Often, y is replaced by some function of y, as we saw in the last unit, where we worked with log(y) as the observed value. Recall that in the previous video we had n=1, and we called β_{0} the intercept (it was "a" then) and β_{1} the slope (our "b" then).

Nowadays, we tend to write models in a vector format. So, we use the vectors **β** and **y**, and **X** (a matrix with rows and columns):

**β** = (β_{0},β_{1},β_{2},...,β_{n}), **X**=(**1**,X_{1},X_{2},...,X_{n}), **y**=(y_{1},y_{2},...,y_{N})

where N is the number of observations and vector-matrix multiplication is as shown below in the graphic.

Then the model above, in vector form, is:

**y = Xβ + ε**

It is crucial to note here that, so far, all our observations y must be independent of one another. So, if we make repeated measurements on, say, each of a group of patients, the statistics must be adapted to the correlation within each patient. This consideration will be discussed later in the course.

In Unit 9, we saw that R has the "lm" function that estimates the betas. Estimating betas was achieved by either the least squares method or the maximum likelihood method. We will come back to more detail on this later.
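For the one-variable case (n=1), the least squares estimates have a simple closed form. Here is a minimal sketch in Python rather than R, using made-up data; the function name `fit_line` and the numbers are illustrative assumptions, not part of the course material. On the same data, R's `lm(y ~ x)` should report the same intercept and slope up to rounding.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-products divided by the sum of squares of x about its mean.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical observations, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # intercept = 1.15, slope = 1.94
```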

Once we have the betas, we can ask how well the model fits the data. There are several approaches to answering this question. We will look at two:

- Test to see if β_{i} is zero, or
- Leave the chosen β_{i} out, fit the new model and compare old and new.

We'll come back to these questions after the video.

Review of and comments on the video:

Contact me at: dtudor@germinalknowledge.com

To compare two models, we use the Anova table:

| Source of variation | Degrees of freedom | Sum of squares | Mean square | F | Probability |
|---|---|---|---|---|---|
| Model with β_{0} | q | b_{0}^{T}X_{0}^{T}y | | | |
| Change from β_{0} to β_{1} | p-q | b_{1}^{T}X_{1}^{T}y - b_{0}^{T}X_{0}^{T}y | (SS at left)÷(p-q) | (MS at left)÷(MS residual) | Is the F at left large enough to be significant at α=0.05? |
| Residual | N-p | y^{T}y - b_{1}^{T}X_{1}^{T}y | (SS at left)÷(N-p) | | |
| Total | N | y^{T}y | | | |
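The entries of this table can be computed by hand for a small case. The sketch below assumes the simplest comparison: the smaller model has only an intercept (q=1) and the larger model adds one slope (p=2). The data and the function name `anova_nested` are invented for illustration. The probability column is left out because it needs the F distribution, which the Python standard library does not provide; instead, F is compared with a tabulated critical value.

```python
def anova_nested(x, y):
    """Compare y = b0 (q = 1) with y = b0 + b1*x (p = 2) via the Anova table."""
    N = len(y)
    q, p = 1, 2
    x_bar = sum(x) / N
    y_bar = sum(y) / N
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # least squares slope of the larger model
    b0 = y_bar - b1 * x_bar           # least squares intercept of the larger model

    ss_total = sum(yi * yi for yi in y)                      # y^T y
    ss_small = N * y_bar ** 2                                # b_0^T X_0^T y (intercept only)
    ss_large = b0 * sum(y) + b1 * sum(xi * yi for xi, yi in zip(x, y))  # b_1^T X_1^T y
    ss_change = ss_large - ss_small
    ss_resid = ss_total - ss_large

    ms_change = ss_change / (p - q)
    ms_resid = ss_resid / (N - p)
    return {
        "df_change": p - q,
        "df_resid": N - p,
        "ss_change": ss_change,
        "ss_resid": ss_resid,
        "F": ms_change / ms_resid,
    }


# Hypothetical observations, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

table = anova_nested(x, y)
print(f"F = {table['F']:.1f} on {table['df_change']} and {table['df_resid']} df")
# F = 459.0 on 1 and 2 df
```

In this made-up example F is far above 18.51, the tabulated 5% critical value of the F distribution with 1 and 2 degrees of freedom, so adding the slope would be judged significant at α=0.05.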

How to multiply a matrix by a vector:
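The rule can also be written out in a few lines of code: each entry of the result is the sum of the products of one row of the matrix with the vector. This is a sketch in plain Python; the function name `mat_vec` and the numbers are assumptions made for illustration.

```python
def mat_vec(X, beta):
    """Multiply an N x p matrix X (a list of rows) by a vector beta of length p."""
    if any(len(row) != len(beta) for row in X):
        raise ValueError("each row of X must have as many entries as beta")
    # One output value per row: multiply entry by entry, then add up.
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


# Hypothetical design matrix: a column of ones (for the intercept)
# and one explanatory variable.
X = [[1.0, 2.0],
     [1.0, 3.0],
     [1.0, 5.0]]
beta = [0.5, 2.0]   # illustrative values: intercept 0.5, slope 2.0

fitted = mat_vec(X, beta)
print(fitted)  # [4.5, 6.5, 10.5]
```

Each row of X holds one observation, so the result has one fitted value per observation.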

Classic Anova table:

So, if we divide the expected mean square (EMS) for "among" by the EMS for "within", we see that the quotient, F, is 1 if and only if all the μ_{i} are equal.
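In practice we do not know the expected mean squares, so we compute the sample mean squares and take their ratio. The following sketch does this for a one-way layout; the function name `one_way_f` and the three small groups are invented for illustration.

```python
def one_way_f(groups):
    """Return (MS among, MS within, F) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Among groups: spread of the group means around the grand mean.
    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: spread of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_among = ss_among / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_among, ms_within, ms_among / ms_within


# Hypothetical data: three groups of three observations each.
groups = [[1.0, 2.0, 3.0],
          [2.0, 3.0, 4.0],
          [3.0, 4.0, 5.0]]

ms_among, ms_within, f_ratio = one_way_f(groups)
print(ms_among, ms_within, f_ratio)  # 3.0 1.0 3.0
```

An observed F well above 1 suggests that the group means differ; how far above 1 it must be is judged against the F distribution with k-1 and N-k degrees of freedom.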

Here is a zip file with the R-code for our examples, including the simulations, and the outline for this unit's video. Please download it and have a look. There is more material than just presented in the video.

I have been asked to do simulations of clinical trials for presentation at conferences in marketing. This is not a good idea and here is why:

First, let me say that simulation is a wonderful exercise to try to understand the characteristics of diseases and substances used to treat them. The reason here is that, when you do a simulation, you have to build in all relationships that you want to study (sex and weight factors, treatment effects per arm, etc.). You are forced to research the literature to find out what is known about these effects. Then you build them into the model.

Because you have built the factors into the model, you have "hidden your easter egg". Now, you can test out statistical methods on the simulated data (best practice for the link between statistics and simulation). If your presumptions about the relationships simulated are correct, and your statistical methodology is pertinent and adequate, you will find the easter egg that you hid in the data.
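As a small illustration of hiding and finding the egg, the sketch below builds a known intercept and slope into simulated data and then checks whether a least squares fit recovers them. All names and values (`TRUE_B0`, `TRUE_B1`, `SIGMA`, the sample size, the seed) are assumptions chosen for the example, and the code is in Python rather than R.

```python
import random

# The hidden "egg": the relationship we build into the simulated data.
TRUE_B0, TRUE_B1, SIGMA = 1.0, 2.0, 1.0


def simulate(n, rng):
    """Simulate n observations from y = TRUE_B0 + TRUE_B1*x + normal error."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [TRUE_B0 + TRUE_B1 * xi + rng.gauss(0.0, SIGMA) for xi in x]
    return x, y


def fit_line(x, y):
    """Least squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


rng = random.Random(2024)          # fixed seed so the run is repeatable
x, y = simulate(200, rng)
b0_hat, b1_hat = fit_line(x, y)
print(f"true slope {TRUE_B1}, estimated slope {b1_hat:.3f}")
print(f"true intercept {TRUE_B0}, estimated intercept {b0_hat:.3f}")
```

The estimates land close to the built-in values only because the fitted model has the same form as the model used to generate the data.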

Thus, you will have shown that, to the best of your knowledge and simulation, the chosen and "simulation tested" statistical methodology will discover the truth about your assumptions in the real data.

The simulation HAS NOT discovered the truth. It has merely said that "if our assumptions are true (which remains to be discovered in a real study), then we have a good chance of finding it with the experimental methodology tested in a hypothetical situation."

© Germinal Knowledge. All rights reserved