We usually do statistics on measurements of something. The key element in the definition of a measurement is that there is only one result per object measured. The result should also be independent of the person doing the measurement and independent of when it is taken. So, if I measure an object today, and again tomorrow (under the same conditions), I should get the same answer.
Next, we name our measurements. For example, we will call temperature "t", and t is then called a variable.
There are two broad categories of variables: continuous and discrete. Within each of these two types, there are other types.
I could measure temperature in degrees Celsius as a continuous variable (theoretically, we can have 37.3° or 37.29°, etc.).
I could have the number, n, of children in a family as my variable, which would take on values from 0 to something like 20 (whatever the largest family in human history has been so far). We do not have fractions of children, so n is always a whole number.
Another discrete variable could be, for example, the number of days, m, after my 60th birthday. On my birthday m=0 (not yet "after"). The day after, m=1, and, as far as we know, m will go on infinitely far (so the set of possible values is infinite).
Among the continuous variables, we could have a bounded set of values, like the numbers between 0 and 1, bounded below by 0 and above by 1, or the set of values could be unbounded in either direction (positive or negative).
Each type of variable has its own type of statistical analysis. Here are some types:
|Measure||Type||Analysis / probability distribution|
|Systolic blood pressure||continuous, bounded||ANOVA or t-test|
|Visual analogue scale (how do you feel on a scale of 0 to 10, marked on a line segment)||continuous, bounded|| |
|Time to event (death, onset, etc.)||continuous, time to event, censored data||Cox model|
|"Success" (yes or no variable)||discrete, binomial||chi-squared, log odds, logistic regression|
|Odds of success||continuous||log odds|
|Number of heart attacks||discrete||Poisson|
|Number of tries before succeeding||discrete||geometric / negative binomial|
Now, sometimes we have a variable for which the best method of analysis is not yet known. Systolic blood pressure is continuous on a bounded interval (0 would mean the person is dead, and 500 seems impossible; there is a maximum, I just don't know it). But we notice that, although the normal distribution (bell curve) is unbounded (values from -∞ to +∞), the curve of many patients' data will look like the bell curve, with the tails "dying out" within the bounds of death and the person with the highest pressure in the world.
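To see how little of a bell curve can fall outside a physically plausible range, the sketch below computes the probability that a normal variable lands outside a bounded interval. The mean of 120 and standard deviation of 15 (in mmHg), and the bounds 0 and 500, are illustrative assumptions, not values from a study.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def mass_outside(lower, upper, mean, sd):
    """Probability that a normal variable falls below `lower` or above `upper`."""
    below = normal_cdf(lower, mean, sd)
    above = 1.0 - normal_cdf(upper, mean, sd)
    return below + above

# Assumed, illustrative values (mmHg); not taken from real data.
MEAN, SD = 120.0, 15.0
LOWER, UPPER = 0.0, 500.0

outside = mass_outside(LOWER, UPPER, MEAN, SD)
print(f"Probability outside [{LOWER}, {UPPER}]: {outside:.3e}")
```

Under these assumed values the mass outside the interval is negligible, which is why an unbounded curve can still be a reasonable description of a bounded measurement.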
Survival data are a good example. Historically, there was no way to properly take the censored data into account. That is to say, in a survival study where death is the endpoint, the patients who do not die during the study have (logically) no endpoint. Cox found a way to include censored data in the analysis, so survival data now have their own methods of analysis (see Kalbfleisch and Prentice).
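A minimal sketch of how such data are often recorded: each patient gets a follow-up time and a flag saying whether death was observed. The patient records and the `Observation` class below are invented for illustration, and the calculation is only a naive comparison, not Cox's method.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    months: float  # follow-up time in months
    died: bool     # True if death was observed, False if no endpoint was reached

# Hypothetical records, invented for illustration only.
study = [
    Observation(5.0, True),
    Observation(12.0, True),
    Observation(24.0, False),  # still alive at the end of the study
    Observation(8.0, True),
    Observation(24.0, False),  # still alive at the end of the study
]

deaths = [o for o in study if o.died]
no_endpoint = [o for o in study if not o.died]

# Naive approach 1: drop the patients without an endpoint.
mean_deaths_only = sum(o.months for o in deaths) / len(deaths)

# Naive approach 2: pretend those patients died at their last follow-up.
mean_all_as_deaths = sum(o.months for o in study) / len(study)

print(f"{len(deaths)} deaths, {len(no_endpoint)} without endpoint")
print(f"mean survival, no-endpoint patients dropped:    {mean_deaths_only:.2f} months")
print(f"mean survival, no-endpoint patients as deaths:  {mean_all_as_deaths:.2f} months")
```

In this example both numbers understate survival, because the two patients without an endpoint lived at least 24 months, longer than every observed death. Neither shortcut uses the information that those patients were still alive, which is the gap the dedicated survival methods address.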
This video showed the graphs of f(x)=x/(1-x), for x in [0,1) and g(x)=log(f(x)) for x in (0,1).
These functions are important when x is the probability, p, of a binary event.
For example, if we test a drug and study the event "healed" versus "not healed", the probability of healing, p, is the type of variable where our f and g functions above are important. Here's why:
The odds of healing are p/(1-p), the f function above, and the log of the odds is the g function above.
The log odds takes on values from -∞ to +∞, so it can be modelled (Unit 9) by the bell curve (Unit 4).
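The two functions can be sketched in a few lines of Python to see how a probability between 0 and 1 is stretched over the whole real line. The function names `odds` and `log_odds` and the sample probabilities are chosen here for illustration.

```python
import math

def odds(p):
    """f(p) = p / (1 - p), defined for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)

def log_odds(p):
    """g(p) = log(p / (1 - p)), defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    return math.log(odds(p))

# Sample probabilities of "healed", chosen only for illustration.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p = {p:.2f}  odds = {odds(p):8.4f}  log odds = {log_odds(p):8.4f}")
```

The log odds is 0 at p = 0.5 and symmetric around it, so g(1 - p) = -g(p); as p approaches 0 or 1, the log odds grows without bound in the negative or positive direction.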