Joseph Bulbulia
August 13, 2013
Tutorial 2
=======================================================
Last time we discussed some elementary R commands, simulated some data, fit some models, and graphed them.
Today we begin by tying up some loose threads from those discussions.
## Linear Regression
Linear regression models the covariance between two types of data, "outcomes" (dependent variables) and "predictors" (independent variables). Sometimes these terms are not so easily distinguished. For example, we might model "arousal" and "valence" as outcomes, but arousal might also affect valence. What to do? We'll come back to that. For now, think of the covariance of one thing with (at least) one other thing.
```{r fig.width=7, fig.height=6}
# simulate a predictor and an outcome that depends on it linearly
set.seed(1)
predictor <- runif(100, 20, 40)
outcome <- predictor + rnorm(100, -1, 1)  # noise with mean -1, sd 1
library(ggplot2)
qplot(predictor, outcome, geom = "point") + geom_smooth(method = "lm")
```
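The line that `geom_smooth` draws above comes from an ordinary least-squares fit. We can obtain it directly with `lm()`; a minimal sketch, with the data simulated afresh so the chunk stands alone.

```{r}
# simulate data as above, then fit the line explicitly
set.seed(1)
predictor <- runif(100, 20, 40)
outcome <- predictor + rnorm(100, -1, 1)
fit <- lm(outcome ~ predictor)
coef(fit)  # intercept (a) and slope (b)
```

Because the simulated slope is 1 and the noise has mean -1, the estimated slope should sit near 1 and the intercept near -1.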
## Writing a linear model.
It's good intellectual hygiene to write your model down in math or quasi-math. Words are fine too. Just know what your model is predicting, and how.
There are many ways to write a linear model, e.g.
$$
\overbrace{Y}^{outcome}= \overbrace{a}^{intercept} + \overbrace{b}^{slope} ~\overbrace{X}^{predictor} + \overbrace{e}^{residual}
$$
Alternatively,
$$
y_i = \alpha + \beta_1 x_i + e_i
$$
This equation can be expanded for multiple predictors
$$
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_P x_{Pi} + e_i, ~~~\forall i= 1 \dots N~\text{datapoints}, ~~~\forall p=1\dots P~\text{predictors}
$$
Or in matrix notation
$$
\textbf{y} = \mathbf{XB} + \mathbf{e}
$$
which amounts to
$$
\mathbf{y} = \begin{bmatrix} X_{intercept} & X_{predictor 1} & X_{predictor 2} & \dots \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \end{bmatrix} + \boldsymbol\varepsilon
$$
Thinking of models in terms of matrices is *nice* because it forces you to think about what your model is doing, and what your coefficients mean.
$$
\begin{bmatrix}\text{y}_1\\\vdots \\ \text{y}_n\end{bmatrix} = \begin{bmatrix} {X}_{I_1} & {X}_{R_1} & {X}_{P_{1_1}} & {X}_{P_{2_1}} & {X}_{R \times P_{1_1}} & {X}_{{R \times P_{2_1}}} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {X}_{I_n} & {X}_{R_n} & {X}_{P_{1_n}} & {X}_{P_{2_n}} & {X}_{R \times P_{1_n}} & {X}_{{R \times P_{2_n}}} \end{bmatrix} \begin{bmatrix} \beta_{I}\\ \beta_{R} \\ \beta_{P_1} \\ \beta_{P_2} \\ \beta_{R \times P_1} \\ \beta_{R \times P_2} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
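R builds exactly this kind of design matrix for you from a model formula. A minimal sketch with `model.matrix()`, using small illustrative variables (the names `R`, `P1`, `P2` here are made up for the example, not the tutorial data):

```{r}
# a two-level factor R and two numeric predictors P1, P2
d <- expand.grid(R = factor(c("a", "b")), P1 = 1:3)
d$P2 <- d$P1^2
# the formula expands into intercept, main effects, and interactions
X <- model.matrix(~ R + P1 + P2 + R:P1 + R:P2, data = d)
colnames(X)  # six columns, one per beta in the matrix equation above
```

Note the first column is a column of 1s: that is the "intercept predictor" $X_I$.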
The expectation of $\mathbf{y}$, $E(\mathbf{y} \mid \mathbf{X})$, is the deterministic component of the model, **XB**, or if you prefer, a + bX. Often the expectation is written $\hat{y}$.
$$
E(\textbf{y})= \hat{\mathbf{y}} =\boldsymbol\eta = \mathbf{XB}\\
\mathbf{y}\sim N\left(\boldsymbol\eta,\boldsymbol\sigma^2\right)
$$
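We can verify in R that the fitted values are exactly $\mathbf{XB}$; a quick sketch with simulated data (all numbers illustrative).

```{r}
set.seed(3)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
X <- model.matrix(fit)  # the design matrix
B <- coef(fit)          # the estimated coefficients
eta <- X %*% B          # the linear predictor, E(y)
all.equal(as.vector(eta), unname(fitted(fit)))
```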
where $N$ is the normal distribution (though the normal is only one of many distributions, as we shall soon see).
The residuals $\varepsilon$ are assumed to be distributed
$$
\varepsilon_i \overset{\text{i.i.d.}}{\sim} N\left(0,\sigma^2\right)
$$
where 'i.i.d.' means "independently and identically distributed." This assumption is frequently violated, yet it is central to the usual linear model we are discussing here. Hence, we will later dispense with the usual model.
Note that residuals are the difference between the observed and fitted values
$$
\varepsilon_i = y_i - \hat{y}_i
$$
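Again easy to check in R (a tiny sketch on simulated data):

```{r}
set.seed(4)
x <- runif(30)
y <- 5 - 2 * x + rnorm(30)
fit <- lm(y ~ x)
# residuals are observed minus fitted, exactly
all.equal(unname(resid(fit)), unname(y - fitted(fit)))
```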
Say you have a complicated linear predictor. You might be tempted to write
$$
\mathbf{y}\sim N\left(\textbf{b}_{0}+\textbf{b}_{1}\text{time}+\textbf{b}_{2}\text{time}^2+\textbf{b}_{3}\text{role}+\textbf{b}_{4}\text{role}\times\text{time} + \textbf{b}_{5}\text{role}\times\text{time}^2,\boldsymbol\sigma^2\right)\\
$$
But you could more simply write
$$
\boldsymbol\eta=
\textbf{b}_{0}\text{}+\textbf{b}_{1}\text{time}+\textbf{b}_{2}\text{time}^2+\textbf{b}_{3}\text{role}+\textbf{b}_{4}\text{role}\times\text{time} + \textbf{b}_{5}\text{role}\times\text{time}^2 \\
\mathbf{y}\sim N\left(\boldsymbol\eta,\boldsymbol\sigma^2 \right)
$$
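In R, the whole six-term $\boldsymbol\eta$ above can be written compactly in the model formula. A sketch with simulated data (the variables `time` and `role` and their effects are invented for illustration):

```{r}
set.seed(5)
n <- 120
time <- rep(1:6, length.out = n)
role <- factor(rep(c("student", "teacher"), each = n / 2))
y <- 1 + 0.5 * time - 0.05 * time^2 +
  (role == "teacher") * (2 + 0.3 * time) + rnorm(n)
# role * (time + I(time^2)) expands to all six terms of eta:
# intercept, time, time^2, role, role:time, role:time^2
fit <- lm(y ~ role * (time + I(time^2)))
coef(fit)
```

The `*` operator expands to main effects plus interactions, so the formula mirrors the equation without writing each term out.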
When we get into multi-level models you'll see that there are still more ways to write a model. Gelman and Hill (2006), Chapter 12, Section 5, discusses various conventions.
## Bayesian versus Frequentist estimation
I'm going to save a full treatment of this topic for later. For now, notice that frequentists estimate the likelihood of the data conditional on a fixed parameter. Suppose we have a standard normal distribution with a mean of 0 and a standard deviation of 1, whose parameters we'll collectively write $\theta$. The probability of a data point plucked from this distribution is
$$
\Pr(y\mid \theta)
$$
So we are conditioning on a fixed parameter. As Bayesians point out (e.g. Hadfield, Course Notes, p. 6), this is weird because we're certain of the data (which we've observed) and uncertain of the parameter (which we haven't). Bayesian data analysis assumes that the parameters come from a random distribution, and conditions on the observations
$$
\Pr(\theta\mid y)
$$
However, this probability is proportional to the likelihood multiplied by the prior
$$
\Pr(\theta \mid y)\propto \Pr(y\mid \theta)\Pr(\theta)
$$
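We can see this proportionality by brute force. A grid sketch for the mean of normal data with known sd = 1 (all numbers here are illustrative): evaluate the likelihood and a vague prior at many candidate values of $\theta$, multiply, and normalise.

```{r}
set.seed(6)
y <- rnorm(20, mean = 3, sd = 1)
theta <- seq(0, 6, length.out = 1001)  # candidate means
prior <- dnorm(theta, 0, 100)          # a vague prior
lik <- sapply(theta, function(m) prod(dnorm(y, m, 1)))
post <- lik * prior / sum(lik * prior) # Pr(theta | y), up to the grid
theta[which.max(post)]                 # near the sample mean
```

With a prior this flat, the posterior is dominated by the likelihood, so its peak lands essentially on the frequentist estimate.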
In supplying a prior to the model we give a mean and variance for $\theta$, generally with very wide boundaries, e.g.
$$
\theta \sim N\left(0,10^8\right)
$$
and we let the MCMC algorithm search parameter space to find satisfactory values for our parameter estimates. When we do that with such a vague prior, there's little difference between frequentist and Bayesian estimates. Many (e.g. Gelman) argue for the use of informative priors. Others worry this introduces too many researcher degrees of freedom. We'll come back to this.
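To make the "search parameter space" idea concrete, here is a toy random-walk Metropolis sampler for the mean of normal data under a vague $N(0, 10^8)$ prior. This is a hand-rolled sketch, not the sampler any package actually uses; everything in it is for illustration.

```{r}
set.seed(7)
y <- rnorm(50, mean = 2, sd = 1)
# log posterior = log likelihood + log prior (variance 10^8, so sd 10^4)
log_post <- function(m) sum(dnorm(y, m, 1, log = TRUE)) +
  dnorm(m, 0, 1e4, log = TRUE)
draws <- numeric(5000)
cur <- 0
for (i in seq_along(draws)) {
  prop <- cur + rnorm(1, 0, 0.5)  # propose a small random step
  # accept the step with the Metropolis probability, else stay put
  if (log(runif(1)) < log_post(prop) - log_post(cur)) cur <- prop
  draws[i] <- cur
}
mean(draws[-(1:1000)])  # posterior mean after burn-in, close to mean(y)
```

With the prior this flat, the chain settles around the sample mean, illustrating why vague-prior Bayesian estimates track frequentist ones.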
## Assumptions of the Usual Linear Model
Four assumptions, about which... later...
## Model checking
## A Poisson model in R
## This week's R tips