## Unit 10 - Modeling and Simulation, part 2: Assessing fit and comparing models

### A modern way of looking at statistical testing

Continuing our discussion of modelling and simulation from the last unit, a "model" is a mathematical construct that has the objective of mimicking a situation in the "real world". This definition seems a bit arcane, but suffice it to say that the model is usually some kind of equation.

In Statistics, our models are usually of the form:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$

where y is an observed value, also called the response variable, the X's are measured effects (age, treatment, etc.), also called explanatory variables, and ε is an overall error. Often, y is replaced by some function of y, as we saw in the last unit, where we worked with log(y) as the observed value. Recall that in the previous video we had n=1, and we called β0 the intercept (it was "a" then) and β1 the slope (our "b" then).

Nowadays, we tend to write models in vector format. So, we use the β and y vectors, and X (a matrix with one row per observation and one column per β):

$$\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n), \qquad X = (1, X_1, X_2, \ldots, X_n), \qquad y = (y_1, y_2, \ldots, y_N)$$

where N is the number of observations and vector-matrix multiplication is as shown below in the graphic.

Then the model above, in vector form, is:

$$y = X\beta + \varepsilon$$

It is crucial to note that, so far, all our observations y must be independent of one another. So, if we make repeated measurements on, say, each of a group of patients, the statistics must be adapted to the correlation within each patient. This consideration will be discussed later in the course.

In Unit 9, we saw that R has the "lm" function that estimates the betas. Estimating betas was achieved by either the least squares method or the maximum likelihood method. We will come back to more detail on this later.
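To make "estimating the betas" concrete, here is a minimal Python sketch of the least squares solution for the n=1 case (intercept and slope). It is not R's `lm`, only the textbook closed-form formulas; the function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                # slope
    b0 = mean_y - b1 * mean_x     # intercept
    return b0, b1


# Hypothetical data, invented for this example only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
print("intercept:", round(b0, 4), "slope:", round(b1, 4))
```

For normally distributed errors, these least squares estimates coincide with the maximum likelihood estimates of the betas.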

Once we have the betas, we can ask how well the model fits the data. There are several approaches to answering this question. We will look at two:

1. Test to see if βi is zero, or
2. Leave the chosen βi out, fit the new model and compare old and new.

We'll come back to these questions after the video.

Review of and comments on the video:

Contact me at: dtudor@germinalknowledge.com

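To compare two models, we use the Anova table. In this table, β0 and β1 stand for the parameter vectors of the smaller model (q parameters) and the larger model (p parameters), with estimates b0 and b1 and design matrices X0 and X1; they are not the intercept and slope from above.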

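| Source of variation | Degrees of freedom | Sum of squares | Mean square | F | Probability |
|---|---|---|---|---|---|
| Model with $\beta_0$ | $q$ | $b_0^T X_0^T y$ | | | |
| Change from $\beta_0$ to $\beta_1$ | $p-q$ | $b_1^T X_1^T y - b_0^T X_0^T y$ | (SS at left) ÷ $(p-q)$ | (MS at left) ÷ (MS residual) | Is the F at left large enough to be significant at $\alpha = 0.05$? |
| Residual | $N-p$ | $y^T y - b_1^T X_1^T y$ | (SS at left) ÷ $(N-p)$ | | |
| Total | $N$ | $y^T y$ | | | |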
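The entries of the table can be computed by hand for a small case. The following Python sketch assumes the simplest comparison: the smaller model has only an intercept (q=1) and the larger model adds one slope (p=2). The function name `anova_compare` and the data are hypothetical, and the sketch stops at F because the Python standard library has no F distribution.

```python
def anova_compare(x, y):
    """Anova entries for intercept-only (q=1) versus intercept + slope (p=2)."""
    N = len(y)
    q, p = 1, 2
    sum_y = sum(y)
    yty = sum(yi * yi for yi in y)                  # total sum of squares, y'y

    # Smaller model: the estimate is mean(y), so b0' X0' y = mean(y) * sum(y)
    ss_model0 = (sum_y / N) * sum_y

    # Larger model: least squares intercept and slope
    mean_x = sum(x) / N
    mean_y = sum_y / N
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    ss_model1 = intercept * sum_y + slope * sum_xy  # b1' X1' y

    ss_change = ss_model1 - ss_model0
    ss_residual = yty - ss_model1
    ms_change = ss_change / (p - q)
    ms_residual = ss_residual / (N - p)
    return {
        "ss_change": ss_change,
        "ss_residual": ss_residual,
        "df_change": p - q,
        "df_residual": N - p,
        "F": ms_change / ms_residual,
    }


# Hypothetical data, invented for this example only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

table = anova_compare(x, y)
# With 1 and 3 degrees of freedom, the tabulated 5% critical value of F is about 10.13.
print("F =", round(table["F"], 1), "on", table["df_change"], "and", table["df_residual"], "df")
```

An F far above the tabulated critical value says that adding the slope improves the fit by more than the residual variation alone would explain.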

How to multiply a matrix by a vector:
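The row-by-column rule can also be written out in code. This is a minimal Python sketch with a made-up design matrix (a column of 1s for the intercept plus one explanatory variable) and made-up betas; the helper name `mat_vec` is hypothetical.

```python
def mat_vec(X, beta):
    """Multiply a matrix X (given as a list of rows) by a vector beta.

    Each element of the result is the sum of products of one row of X
    with the corresponding elements of beta.
    """
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


# Hypothetical design matrix: one row per observation
X = [[1.0, 2.0],
     [1.0, 3.0],
     [1.0, 5.0]]
beta = [0.5, 2.0]   # invented intercept and slope

fitted = mat_vec(X, beta)
print(fitted)  # [4.5, 6.5, 10.5]
```

Each fitted value is the intercept plus the slope times that observation's X value, which is the model equation without the error term.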

Classic Anova table:

So, if we divide the expected mean square (EMS) for "among" by the EMS for "within", we see that the quotient, F, is 1 if and only if all the μi are equal.

Here is a zip file with the R code for our examples, including the simulations, and the outline for this unit's video. Please download it and have a look. There is more material there than is presented in the video.

### Simulation: finding the Easter egg you hid yourself

I have been asked to do simulations of clinical trials for presentation at conferences in marketing. This is not a good idea and here is why:

First, let me say that simulation is a wonderful exercise to try to understand the characteristics of diseases and substances used to treat them. The reason here is that, when you do a simulation, you have to build in all relationships that you want to study (sex and weight factors, treatment effects per arm, etc.). You are forced to research the literature to find out what is known about these effects. Then you build them into the model.

Because you have built the factors into the model, you have "hidden your Easter egg". Now, you can test out statistical methods on the simulated data (best practice for the link between statistics and simulation). If your presumptions about the simulated relationships are correct, and your statistical methodology is pertinent and adequate, you will find the Easter egg that you hid in the data.
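The idea of hiding and then finding the Easter egg can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the "true" intercept and slope, the noise level, the sample size, the seed, and the function name `simulate_and_fit`. It is not the code from the course's zip file.

```python
import random


def simulate_and_fit(true_intercept, true_slope, noise_sd, n_obs, seed):
    """Simulate y = intercept + slope*x + noise, then estimate both by least squares."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable

    # Hide the Easter egg: the relationship is built into the simulated data
    x = [rng.uniform(0, 10) for _ in range(n_obs)]
    y = [true_intercept + true_slope * xi + rng.gauss(0, noise_sd) for xi in x]

    # Look for the Easter egg with the statistical method under test
    mean_x = sum(x) / n_obs
    mean_y = sum(y) / n_obs
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


TRUE_INTERCEPT = 1.0   # invented values, chosen by the person simulating
TRUE_SLOPE = 2.0

b0, b1 = simulate_and_fit(TRUE_INTERCEPT, TRUE_SLOPE, noise_sd=1.0, n_obs=2000, seed=42)
print("hidden slope:", TRUE_SLOPE, "estimated slope:", round(b1, 3))
```

The estimate lands close to the hidden value only because the fitted model matches the one used to generate the data; the sketch says nothing about whether a real study follows that model.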

Thus, you will have shown that, to the best of your knowledge and as far as the simulation goes, the chosen and "simulation tested" statistical methodology will discover the truth about your assumptions in the real data.

The simulation HAS NOT discovered the truth. It has merely said: "if our assumptions are true (which remains to be discovered in a real study), then we have a good chance of finding it with the experimental methodology tested in a hypothetical situation."