Ever got confused with difference between the two GLMs? With why ANCOVA and GLM do the same thing? Ever had trouble remembering assumptions? Stop thinking in tests and start thinking in mathematical models, and everything will be much clearer.

Math can be complicated, but it also simplifies talking about complicated concepts. The clearest way to define is by writing down the equations. That gives no room for confusion about what type of model you are talking about.

# Outline

Starting with the most general formulation of GLM (the generalized linear model), passing through the definition of the general linear model we will arrive at the familiar $y = mx+b$ of the simple linear model. We will also  see how the linear model relates to ANOVA and ANCOVA. I will provide animations and examples on how to generate synthetic data conforming to each specific model. Playing around with such simulated data is a good way to understand how certain models behave, so I encourage you to change the parameters and see what happens. Finally, I will refer you to some further reading, as there's an excellent, graphic summary of all things linear models.

# Generalized linear models

## Notation and variables

Let $Y$ be our dependent (or response) variable, while let $X$ be our independent (or explanatory) variable. In this writing, we will only have one response variable, but we might have several explanatory variables. In the latter case, I will index them as $X_a, X_b$, etc. Capital letters will denote the variable in a theoretical sense (as a random variable), while lowercase letters $y, x$ (or $x_{a}, x_{b}$,  etc.) will denote actual measurements of that variable.

This distinction between the variable and its measurements is an important concept. The $Y$ response variable has a probability distribution, in most cases it has a mean and a variance. But if I measure it say three times I will get three distinct values of $y$, $y_1$, $y_2$, and $y_3$. These values can now be used to estimate for example the mean and variance of $Y$, or more generally, the distribution of $Y$. In fact, the entire concept of mathematical modeling is trying to figure out what the distribution of the response variable is, and how the explanatory variables influence this distribution.

The parameters of the models will be denoted by $\beta$'s. These are fix numbers, but we do not know them and must be estimated based on our data.

## Formulating the GLM

With that out of the way, we can now formulate the generalized linear model as two equations:

(Y \mid X = x) \sim f(x,\Theta(x))

g(E(Y \mid X = x)) = \eta(x)

Conceptually, this is probably the hardest part, so let's break this down carefully.

In Eq. 1 the right hand side and the left hand side is divided by the $\sim$ sign and it reads: $Y$, when $X$ is set to a specific value $x$, is drawn from the distribution on the right hand side. The right hand side is one of the probability distributions from exponential family of distributions. This is a big family, but includes the most common distributions, for example the normal (Gaussian) and the Poisson distributions.

$\Theta$ stands for all the parameters these distributions can have, like $\mu$ and $\sigma$ in the normal or $\lambda$ in the Poisson distribution. It's written as $\Theta(x)$ to show that these parameters also change as $X$ changes.

The linear in the GLM comes from Eq. 2. Here $g$ is the so-called link function (more on this later, but this is usually a simple function like the logarithm or the identity function, i.e. the do nothing function), $E(Y \mid X = x)$ is the expected value (mean) of $Y$, when $X$ is set to a specific $x$, and $\eta$ is the linear predictor. $\eta$ is called that, because it is a linear combination of the explanatory variables. Think $\eta = \beta_a\cdot x_a + \beta_b\cdot x_b + \beta_0$, where the $\beta$'s are parameters of the model and $X_a$ and $X_b$ are two explanatory variables.

I also wrote only one $X$, but there could be many more. In that case, each of them would be set to a specific value in the above equations.

## But what does this mean?

Let's look at simple examples, which we can actually visualize.

### Example with normal distribution

Taking only one explanatory variable, we can set the simplest linear predictor as $\eta = \beta x + \beta_0$. We choose our distribution (Eq. 1, right hand side), to be the normal (Gaussian) distribution and the link function to be the identity function. This sets the left hand side of Eq. 2 as:

g(E(Y \mid X = x)) = E(Y\mid X = x) = \mu(x)

thus from Eq. 2, with our simple $\eta$ we have $\mu(x) = \eta = \beta x + \beta_0$. Eq. 1, thus becomes

(Y \mid X = x) \sim \mathcal{N}(\mu(x),\sigma) = \mathcal{N}(\beta x + \beta_0, \sigma)

where $\mathcal{N}(\mu,\sigma)$ is the usual way to denote a normal distribution with a mean of $\mu$ and a variance of $\sigma$.  We can write it more simply by leaving off the given $X = x$ part:

Y \sim  \mathcal{N}(\beta x + \beta_0, \sigma)

To put it in words: if we fix the $X$ (say $X$ is the temperature and it doesn't change) then the measurement $y$ of $Y$ will be drawn from the distribution $\mathcal{N}(\beta x + \beta_0, \sigma)$. If we change $X$, then we change the mean of this distribution, thus the distribution of $Y$ will also change. It also means that whatever the change in $X$, the $\beta$'s and $\sigma$ will stay constant. Actually, the latter is where the homoscedasticity assumption comes from in linear regression. From the form Eq. 5, the normality assumption for the residuals should also be readily apparent.

This is something we can now visualize. For each given $x$, the mean of $Y$ is a straight line, and the error on it is a normal distribution with a variance of $\sigma$. It  also shows why $\beta_0$ is called the intercept: if you set $x=0$, the blue line will intercept the $y$ axis at $y = \beta_0$. Unless you are absolutely sure that your model (the blue line) goes through the origin ($x = 0$ and $y = 0$), always include an intercept.

We can also write this into a more familiar formula. Taking into account, that adding to a normal distribution is the same as shifting its mean, we can split up Eq. 5 as thus:

$$Y \sim \beta x + \beta_0 + \mathcal{N}(0,\sigma)$$

Setting the usual $\epsilon \sim \mathcal{N}(0,\sigma)$, we arrive at

$$y = \beta x + \beta_0 + \epsilon$$.

Let's look at a simulation for this in R. Step one is to create synthetic data based on our model. We take our linear predictor, $\eta = \beta x + \beta_0$ and choose the parameters as $\beta = 3 and \beta_0 = 5$. Then we generate values of $X$ and place them in the vector x. It doesn't matter how we generate the $x$ values, it could have been evenly spaced as well, but it looks a bit more natural to use a uniform distribution, like here. From the $x$ values we use vector operations to calculate eta and y, so the elements of x are $x_1, x_2, ..., x_N$ and the elements of y are $y_1,y_2, ..., y_N$.  Then we use rnormto generate values based on $\eta$ as in Eq. 5.

With the synthetic data is ready, we go on to step two, and use glm to estimate the parameters we just set. To specify the model, we set the distribution by specifying the Gaussian family (i.e. we set $Y \sim \mathcal{N}$), pass the link function to be the identity function, and specify the formula in the R model mini-language to be y ~ x. The formula tells R, that the left hand side will be the response variable, and that the right hand side will be the linear predictor $\beta x + \beta_0$. Note, for the formula y ~ x are automatically creates the $\beta_0$ intercept. To have a model without $\beta_0$, you would need to specify y ~ x - 1, explicitly telling R to NOT use it.

Input:

N = 40
x = runif(N,0,15)
eta = 3*x + 5
y = rnorm(N,eta)


Output:

Call:
glm(formula = y ~ x, family = gaussian(link = "identity"))

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.03481  -0.54737   0.00221   0.45503   2.47848

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.20135    0.32163   16.17   <2e-16 ***
x            2.99943    0.03665   81.85   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.9330834)

Null deviance: 6286.109  on 39  degrees of freedom
Residual deviance:   35.457  on 38  degrees of freedom
AIC: 114.69

Number of Fisher Scoring iterations: 2


In the output you can see the estimate of $\beta_0$ as (Intercept) and $\beta$ as x, which is rather close to the original we put in, showing we choose the correct model in glm.  I suggest to play around with different values of the parameters and different sample sizes.

## aov, anova, Anova in R

Yes, actually, there is basically two things that are called ANOVA. One is doing a linear model with a categorical variable (as we did above), which is done with the aov command in R (which actually calls lm in the backgroun). The other is hypothesis testing based on sum of squares to determine whether the linear predictors we fitted have actual significance to the model or not, which is the anova or the car::Anova function in R.

There is a more in-depth article, aptly titled the "The ANOVA Controversy", which you can find here.

# In summary:

Linear models with a normal distribution and the identity link function:

• ANOVA: categorical explanatory variable
• ANCOVA = general linear model: categorical and continuous explanatory variables

And it's a generalized linear model if either the distribution is not normal or the link function is not the identity.

# Many other things are GLM as well

Originally I wanted to show how some other well known tests are also linear models, but when researching material that, I came across an excellent page by Lindeløv. He does a much better job at it than I planned, with a graphical summary table, non-parametric tests and all, so definitely worth to read!