Chapter 3 Fitting Bayesian regression models with brms

In this chapter we’re going to start answering basic research questions with Bayesian regression models using the brms package in R. The model we’ll use initially is not ‘correct’ for our data, but it’s simple enough to work as an introduction to Bayesian regression models. In the next chapter we’ll use brms to build models that are closer to ‘correct’ given the structure of our data. Before using a Bayesian regression model to investigate our data, we’ll explain what we mean by regression model and what specifically makes the models in this book Bayesian. We’ll leave a discussion of the ‘multilevel’ aspect of our models for the following chapter.

Throughout this book, we will present formal descriptions of our models. The relationship between understanding statistical concepts and reading the formal notation used to represent them is very similar to the relationship between being able to play music and being able to read musical notation. Someone who can play a song undoubtedly knows that song. However, in the absence of formal musical training that same person might not recognize sheet music representing the song. This person may also lack the vocabulary to discuss components of the song, and may find it difficult to learn to play new pieces without instruction. In the same way, most people have an excellent intuitive understanding of many statistical concepts (to be discussed in upcoming chapters) such as slopes, interactions, error, and so on. However, they may lack the knowledge of formal statistical notation that enables a deeper understanding of these concepts, and the ability to generalize these concepts to new situations. As a result, learning to read/write this notation will help you describe your models efficiently, and understand the models used by other people more effectively.

3.1 Chapter pre-cap

In this chapter we introduce the concept of a regression model and explain what makes a regression model ‘Bayesian’. We discuss posterior probabilities, and also brms, an R package that can be used to fit Bayesian regression models using the Stan programming language. After this, we fit an ‘intercept only’ Bayesian regression model to our experimental data. We present topics related to model fitting such as thinning, chains, and iterations, and also discuss the specification of prior probabilities. After that, we outline the interpretation of the brm model print statement, and introduce the concepts of errors and residuals. Finally, we provide information regarding how to work directly with the model posterior samples, and discuss the log prior and posterior densities of our models.

3.2 What are regression models?

It’s difficult to offer a precise definition for regression because the term is so broad, but regression models are often models that help you understand systematic variation in the mean parameter (\(\mu\)) of a normal distribution. Actually, you can model variation in other parameters and use a variety of other probability distributions, and we will do so later in this book. However, for now we’ll focus on models based on the normal distribution. Here is a general summary of the concepts underlying a regression that assumes normally-distributed errors:

  • You have a variable you are interested in, \(y\), which is a vector containing \(n\) observations. We can refer to any one of these observations like this: \(y_{[i]}\) for the \(i^{th}\) observation. Although it’s a bit atypical and, strictly speaking, not necessary, we’re going to put the index variables associated with trial number (\(i\)) in brackets like this \(y_{[i]}\). This is just to make it easier to identify, and to highlight the similarity to vectors in R (e.g., mens_height[i]).

  • You assume that the random variation in your data is well described by a normal probability distribution centered at \(\mu\). This is a mathematical function (\(\mathrm{N}(\mu,\sigma)\)) that describes what is and is not probable based on two parameters. You also assume that the random variation in each observation is independent from the random variation in each other observation. This means, for example, that you can’t really say why any two observations are above or below the mean.

  • The mean of this distribution is either fixed, or varies in a systematic manner. The standard deviation of the error distribution is usually fixed, but can vary (more on this in chapter 8).

  • The variation in the mean of this distribution can be understood using some predictor variables, and regression is a tool for modeling these relations.

We can write our model more formally as in equation (3.1). This says that we think the tokens of our dependent variable (\(y\)) are distributed according to (\(\sim\)) a normal distribution with parameters equal to \(\mu\) and \(\sigma\). Notice that \(y\) gets a subscript while \(\mu\) and \(\sigma\) do not. That means that we are modeling all of the observations with fixed parameters, while the value of \(y\) changes for each observation based on the \(i\) subscript.

\[ \begin{equation} y_{[i]} \sim \mathrm{N}(\mu,\sigma) \tag{3.1} \end{equation} \]

Equation (3.1) formalizes the fact that we think the shape of our data distribution will be like that of a normal distribution with a mean equal to \(\mu\) and a standard deviation equal to \(\sigma\). When you see this, \(\mathrm{N}(\mu,\sigma)\), picture in your mind the shape of a normal distribution, just as seeing \(y=x^2\) might make you imagine a parabola. \(\mathrm{N}(\mu,\sigma)\) really just represents the shape of the normal distribution and associated expectations about more and less probable outcomes. The above relationship can also be presented as in equation (3.2).

\[ \begin{equation} y_{[i]} = \mu + \mathrm{N}(0,\sigma) \tag{3.2} \end{equation} \]

Notice that we got rid of the \(\sim\) symbol, moved \(\mu\) out of the distribution function, and that the mean of the distribution function is now 0. This representation breaks up our variable into two components:

  1. A systematic component, systematic variation in the expected value \(\mu\) of the variable \(y\).

  2. A random component, the error \(\mathrm{N}(0,\sigma)\), that causes unpredictable variation around the expected value, often \(\mu\).

Regression models separate observed variables into their systematic and random components. In this case, the systematic component is predictable for all observations in the data. The random component represents the noise, or error in our data, random variation around our expected values that isn’t explained. This doesn’t mean that it’s inexplicable in general, it only means that we’ve structured our model in a way that doesn’t let us explain it. In other words, a model like this treats all deviations from the mean as noise simply because of the way it is structured.
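To make the equivalence of representations (3.1) and (3.2) concrete, here is a minimal R sketch, using made-up values for \(\mu\) and \(\sigma\), showing that drawing values from \(\mathrm{N}(\mu,\sigma)\) and adding \(\mathrm{N}(0,\sigma)\) noise to a fixed \(\mu\) produce variables with the same distribution.

# hypothetical parameter values, chosen only for illustration
mu = 174
sigma = 8

set.seed (1)
y1 = rnorm (10000, mean = mu, sd = sigma)      # as in (3.1): y ~ N(mu, sigma)
y2 = mu + rnorm (10000, mean = 0, sd = sigma)  # as in (3.2): y = mu + N(0, sigma)

# the two simulated variables have (nearly) identical means and spreads
c (mean (y1), mean (y2))
c (sd (y1), sd (y2))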

In regression models, we can try to understand variation in \(\mu\) using predictor variables \(x\). These predictor variables co-vary (vary with) our \(y\) variable, and, we think (or hope), help explain the variation in \(y\). For example, in (3.3) we’re saying we think \(\mu\) is equal to some combination of the three predictor variables \(x_{1}\), \(x_{2}\), and \(x_{3}\). For example, we might expect that the apparent height of a speaker is affected by their fundamental frequency (\(x_{1}\)), vocal-tract length (\(x_{2}\)), and resonance (\(x_{3}\)). So, when we combine these predictors we think we can come up with a pretty good estimate of how tall someone will sound.

\[ \begin{equation} \mu = x_{1} + x_{2} + x_{3} \tag{3.3} \end{equation} \]

The values of the predictor variables will vary across observations, and are not fixed. In fact, often the whole point of running an experiment is to predict differences in observations based on differing predictor values. If we expect our predictors to vary from trial to trial, that means that the equation above should include \(i\) subscripts indicating that the equation refers to the value of the predictors for that trial rather than overall. If we expect the predictors to vary across observations, naturally it’s possible that \(\mu\) may take on different values from trial to trial, and it therefore also needs an \(i\) subscript. This update is reflected in equation (3.4).

\[ \begin{equation} \mu_{[i]} = x_{1[i]} + x_{2[i]} + x_{3[i]} \tag{3.4} \end{equation} \]

The predicted value (\(\mu_{[i]}\)) for a given trial is very unlikely to be an equal combination of the predictors (as in equation (3.4)), so that a weighting of the predictors will be necessary. We can use the symbol \(\alpha\) for these weights as in equation (3.5). For example, maybe \(x_{1}\) is twice as important as the other two predictors and so \(\alpha_1=2\), while \(\alpha_2=1\) and \(\alpha_3=1\). Actually, maybe one predictor has a negative effect so that \(\alpha_3=-1\). The ‘weights’ associated with each predictor are the coefficients (or parameters) of our model. Note that the weight terms (\(\alpha\)) do not get an \(i\) subscript. This is because they do not change from trial to trial. The values of the predictors change from trial to trial, but the way that they are combined to produce an expected value for any given observation is a stable property of the model.

\[ \begin{equation} \mu_{[i]} = \alpha_1 \cdot x_{1[i]} + \alpha_2 \cdot x_{2[i]} + \alpha_3 \cdot x_{3[i]} \tag{3.5} \end{equation} \]

We can insert equation (3.5) into equation (3.2), resulting in equation (3.6). At this point our model consists of an average value that has been broken up into three component parts and the random component represented by normally-distributed noise.

\[ \begin{equation} y_{[i]} = (\alpha_1 \cdot x_{1[i]} + \alpha_2 \cdot x_{2[i]} + \alpha_3 \cdot x_{3[i]} ) + \mathrm{N}(0,\sigma) \tag{3.6} \end{equation} \]

Often, \(\varepsilon\) is used to represent the random component, the error term, as in equation (3.7). Notice that the error term does get an \(i\) subscript, as in \(\varepsilon_{[i]}\). This is because the exact value of the error does change from trial to trial, even though the statistical characteristics of the error (i.e., \(\mathrm{N}(0,\sigma)\)) do not. This is a typical way to express a regression equation or a regression model. Fitting a regression model to data basically consists of finding the ‘best’ values of its coefficients, \(\alpha_1\), \(\alpha_2\), and \(\alpha_3\) and the parameter governing the error, \(\sigma\), given our data and model structure.

\[ \begin{equation} y_{[i]} = \alpha_1 \cdot x_{1[i]} + \alpha_2 \cdot x_{2[i]} + \alpha_3 \cdot x_{3[i]}+ \varepsilon_{[i]} \tag{3.7} \end{equation} \]

Notice that according to equation (3.6), regression models do not require that our data (\(y\)) be normally distributed, but only that the random error in our data (\(\varepsilon\)) be normally distributed. In (3.8), we see the sort of representation of our model that we will use in this book.

\[ \begin{equation} \begin{split} y_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma)\\ \mu_{[i]} = \alpha_1 \cdot x_{1[i]} + \alpha_2 \cdot x_{2[i]} + \alpha_3 \cdot x_{3[i]} \end{split} \tag{3.8} \end{equation} \] This representation presents the random and systematic components of our regression model separately, and it clearly and succinctly describes the structure of our model. In plain English this is:

“Our observations are expected to be randomly distributed around the mean value according to a normal distribution with a standard deviation equal to sigma (\(\sigma\)), and a mean of mu (\(\mu\)). We expect the mean of our variable to vary from trial to trial based on three predictors. The combination of these predictors is based on model-specific coefficients (\(\alpha_1,\alpha_2,\alpha_3\)) that are static across trials.”

By the way, the combination of parameters we use is a linear combination, one in which terms (\(x\)) are added together after being multiplied by a real number (\(\alpha\)). The fact that our regression models make predictions using linear combinations is what makes them linear regression models. We will be focusing entirely on linear regression models in this book.
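As a quick illustration of equation (3.7), the sketch below simulates data from a linear regression with three predictors. All of the coefficient values, predictor distributions, and the error standard deviation are invented for the example.

set.seed (1)
n = 675

# made-up predictor values and coefficients, for illustration only
x1 = rnorm (n); x2 = rnorm (n); x3 = rnorm (n)
alpha1 = 2; alpha2 = 1; alpha3 = -1
sigma = 8

# systematic component (mu) plus normally-distributed error (epsilon), as in (3.7)
mu = alpha1*x1 + alpha2*x2 + alpha3*x3
epsilon = rnorm (n, 0, sigma)
y = mu + epsilon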

3.3 What’s ‘Bayesian’ about these models?

In one common ‘traditional’ approach to statistics, model parameters are estimated by trying to find the values that maximize the value of likelihood functions based on a theoretical probability distribution of interest, and given a particular model structure (i.e., maximum-likelihood estimation, as discussed in Chapter 2). We’re not going to discuss this approach to statistical inference in any detail as there are hundreds, if not thousands, of books available on the subject (for recommended readings, see the preface). Rather than dwell on these ‘traditional’ approaches to statistical inference, we’re going to talk about what makes Bayesian inference Bayesian, focusing on practical rather than ‘philosophical’ differences.

Instead of estimating parameters using only information from likelihood functions (and data), Bayesian models estimate the posterior probabilities of parameters. To explain what posterior probabilities are, we need to talk about joint probabilities again. We’ll begin by stating something obvious: The probability of events \(A\) and \(B\) occurring is the same as the probability of \(B\) and \(A\) occurring.

\[ \begin{equation} P(A \,\&\, B) = P(B \,\&\, A) \tag{3.9} \end{equation} \]

Equation (3.9) can be reformulated as in (3.10) (see section 2.3.2).

\[ \begin{equation} P(A|B) \cdot P(B) = P(B|A) \cdot P(A) \tag{3.10} \end{equation} \]

Recall that \(P(A)\) and \(P(B)\) are simply placeholders for the probability of some event. We can replace these with \(P(y)\) and \(P(\mu)\), which represent “the probability of observing your data” and “the probability of observing a certain parameter value” respectively. This is seen in equation (3.11), which states that “the probability of the parameter and the data is the same as the probability of the data and parameter”.

\[ \begin{equation} P(\mu|y) \cdot P(y) = P(y|\mu) \cdot P(\mu) \tag{3.11} \end{equation} \]

We can isolate \(P(\mu|y)\) on the left hand side by dividing both sides by \(P(y)\), resulting in equation (3.12). When structured in this way, this is a common formulation of Bayes theorem, and it underlies Bayesian statistical inference. Note however, that all we did was state a basic principle of probability theory (equation (3.9)) and then expand and re-arrange some terms.

\[ \begin{equation} P(\mu|y) = \frac{P(y|\mu) \cdot P(\mu)}{P(y)} \tag{3.12} \end{equation} \]

Each of the components in (3.12) has a name:

  • \(P(\mu)\) is the prior probability of the parameter \(\mu\). This is the a priori (before observation) probability of parameter values independent of the data \(y\). This a priori expectation can come from world knowledge, previous experiments, common sense, or some combination thereof. For example, before you measure the height of adults in San Francisco, you know the average is not 4 feet and it is not 7 feet.

  • \(P(y|\mu)\) is the likelihood, discussed at length in chapter 2. This is the product of the probability density values corresponding to the data points for a given value of \(\mu\). The likelihood reflects the probability that the data, \(y\), would be observed or generated for particular values of \(\mu\). As a result, the likelihood tells us about the distribution of possible/credible parameter values given the data and probability model.

  • \(P(\mu|y)\) is the posterior probability of the parameter given the data, the probability model, and the prior. This is the a posteriori (after observation) probability that the parameter is \(\mu\) given your data \(y\), and the structure of your model. You get the posterior probability by combining the prior distribution and the likelihood, and in doing so incorporating your current observations into your prior beliefs.

  • \(P(y)\) is the marginal probability of the data. This is necessary to scale the numerator so that the posterior density has a total area under the curve equal to one. However, note that the marginal probability does not vary as a function of \(\mu\). As a result, this does not affect the relative posterior probability of different values of \(\mu\). For this reason, and for computational reasons described later, you don’t typically need to worry about the marginal probability. In fact, you can sample from the posterior even if you don’t calculate the denominator of (3.12).

As noted above, ‘traditional’ models focus primarily (or exclusively) on how likely different conclusions are given your data. In contrast, Bayesian models focus on the posterior probability of different parameter values, that is on the combination of the likelihood and prior probabilities of parameters.

3.3.1 Prior probabilities

In a Bayesian model, every parameter whose value is being estimated needs a prior probability distribution to be specified. For example, imagine you are interested in estimating the mean of a set of values, \(\mu\). You decide to use a Bayesian model and decide that \(\mu\) will have a normal prior with a mean and standard deviation equal to \(\mu_{prior}\) and \(\sigma_{prior}\). In other words, the prior distribution of \(P(\mu)\) in (3.12) is \(\mathrm{N}(\mu_{prior},\sigma_{prior})\) such that \(\mu \sim \mathrm{N}(\mu_{prior},\sigma_{prior})\). To estimate this model you would need to provide fixed values for \(\mu_{prior}\) and \(\sigma_{prior}\), for example we could use \(\mu_{prior}=3\) and \(\sigma_{prior}=5\) so that \(\mu \sim \mathrm{N}(3, 5)\). Note that the parameters of the prior do not get prior distributions themselves. This is because we are not estimating values of \(\mu_{prior}\) and \(\sigma_{prior}\), but are instead setting them to fixed values such as 3 and 5. Only estimated parameters need priors.
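In brms, fixed prior parameters like these are provided when the model is fit. For example, a normal prior with a mean of 3 and a standard deviation of 5 for a model intercept might be specified roughly as in the sketch below (my_data and the formula are hypothetical placeholders; prior specification in brms is discussed further later in the book).

# a sketch: a N(3, 5) prior for the model intercept
priors = brms::set_prior ("normal(3, 5)", class = "Intercept")

# the prior would then be passed to brm via its `prior` argument, for example:
# brms::brm (y ~ 1, data = my_data, prior = priors)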

The use of prior probabilities is sometimes said to make Bayesian models inherently ‘subjective’ but this concern is a bit overblown in most situations. First, when there is plenty of data (as is often the case for repeated measures data), prior probabilities have little to no effect on outcomes. This is because when you have many observations, the posterior probability of parameters is dominated by the likelihood (as will be discussed in 3.3.2). Second, a researcher will always use ‘common sense’ (i.e. their prior expectations) to interpret their data.

For example, if a listener was asked to judge apparent height and reported that all adult males were 90 cm tall, a researcher would have to wonder if this subject understands height in centimeters or if they were carrying out the experiment in good faith. So, even when they do not explicitly assign prior probabilities to parameter values, researchers still often use their expectations to ‘screen’ results in a manner broadly consistent with the use of prior probabilities in Bayesian reasoning. A Bayesian model requires you to build your expectations into your model. It formalizes them, makes them definable and replicable.

Finally, every model involves arbitrary decisions which can substantially affect our results so that the design of a model can never be said to be strictly ‘objective’. As a result, there is no particular reason to worry about the objectivity involved in establishing a prior in Bayesian modeling without also worrying about the objectivity involved in model building more generally.

3.3.2 Posterior distributions

The calculation of posterior distributions involves the combination of the likelihood function with the prior probability distribution and the marginal probability. The marginal probability does not affect the ‘shape’ of the posterior distribution, and exists to scale the posterior so that the area under the curve is equal to one (to satisfy the requirements of probability theory). As a result, we will focus on the combination of the likelihood and prior probabilities.

The combination of the likelihood and prior probability density functions is straightforward conceptually: You multiply the values of the two functions (i.e. curves) at each x-axis location. This works because we are interested in the joint probability of the data and the parameter, which can be expressed as the product of a conditional probability (the likelihood) and a marginal probability (the prior). The resulting curve represents the joint density of the two distributions. In figure 3.1, several likelihoods and priors are combined, showing the effects of variations in priors, and the number of observations, on posterior distributions.
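A minimal sketch of this ‘multiply the curves’ operation is given below, using a grid of candidate \(\mu\) values and a few hypothetical observations; the specific numbers are invented for illustration, but the prior and \(\sigma\) match the values used in figure 3.1.

# hypothetical data and a grid of candidate values for mu
y = c (172, 178, 169)
mu_grid = seq (150, 200, by = 0.1)

# likelihood of the data at each candidate mu (assuming sigma = 7.8)
likelihood = sapply (mu_grid, function (mu) prod (dnorm (y, mean = mu, sd = 7.8)))

# prior density for mu: normal with a mean of 180 and a standard deviation of 15
prior = dnorm (mu_grid, mean = 180, sd = 15)

# unnormalized posterior: likelihood times prior at each grid point
posterior = likelihood * prior

# normalize so the posterior sums to one across the grid
posterior = posterior / sum (posterior)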


Figure 3.1: Demonstration of the effect of different prior probabilities and number of observations on resulting posterior distributions. In each case, the posterior is the product of the likelihood and the prior. All curves have been scaled to have the same height in the figure. This makes the figures visually interpretable but does not affect any of the points made in the discussion.

The different standard deviations used for the prior probabilities of \(\mu\) encode different levels of prior belief regarding expected apparent heights for adult male speakers. Researchers often distinguish between different types of priors based on how much they are expected to affect conclusions. For example, priors are often referred to as vague, weakly informative, or informative. The boundaries between these categories are fuzzy, and the ‘informativeness’ of a prior is more continuous than discrete, therefore, we will discuss these categories in terms of how they might differently affect your models. It’s important to note however, that there is no generally accepted definition for these terms and what follows is just an informal guide to thinking about these categories of priors.

Recall that the ‘width’ of the likelihood depends on the amount of underlying variability and the sample size. The underlying variability in a measurement or parameter determines the underlying ‘width’ of the likelihood. However, as the sample size increases the likelihood becomes narrower and narrower with respect to the underlying variation in the measurement/parameter (see 2.7.3). As a result, a prior that is wider than the underlying variability in the measurement has little chance of meaningfully affecting outcomes, even given only a small number of observations. This is shown in the top row of figure 3.1, where even three observations can overwhelm a very broad prior. For this reason, we will be thinking of the ‘informativeness’ of priors with respect to the variation we expect in them a priori, with priors going from vague to informative as they become narrower than the expected random variation.

Each column of figure 3.1 differs in terms of the number of observations used to calculate the likelihood (n = 3, 10, 675), and rows differ in terms of the standard deviation of the prior probabilities of the mean with \(\mu\) equal to 180 cm and \(\sigma\) = 100, 15, 1. In each case, the calculation of the likelihood assumes that the data has the same standard deviation as our apparent height data (7.8 cm). Note that the yellow curve (the likelihood) is the same in each column. What changes is that priors become narrower across rows top to bottom, and therefore have an increasing effect on posterior densities.

In the middle row of Figure 3.1 we see the influence of a weakly informative prior. This prior is a normal distribution with a mean of 180 and a standard deviation of 15 (i.e. \(P(\mu)=\mathrm{N}(180,15)\)). The standard deviation was set to twice the value of the standard deviation of adult male heights in the United States. As a result, it places reasonable constraints on our expectations but it will not strongly influence results within that reasonable range. We see that in this case, the likelihood still dominates the posterior even when only three observations are available.

In the top row of Figure 3.1 we see the influence of a vague or diffuse prior on inference. Since this distribution has a standard deviation of 100 and a mean of 180, this means that basically all average heights from zero to 400 cm are plausible a priori. A prior probability this wide is only going to place very minimal constraints on the posterior probabilities that fall inside this range. This is reflected in the top row of figure 3.1 where even in the case of only three observations the posterior probability almost exactly matches the likelihood. So, in the presence of very weak prior beliefs, the most credible values for your parameters a posteriori (after incorporating your data) will be dominated by the most likely values. This makes sense.

In the bottom row of Figure 3.1 we see the influence of an informative prior. This distribution has a standard deviation of 1 and a mean of 180. A prior distribution this narrow basically says that we are sure that the average height is around 180 cm. This is because the prior probability of any value outside of 178-182 cm is nearly zero, meaning that we will be very hesitant to believe values outside this bound before collecting our data. We can see that under these conditions, the prior distribution can actually have a strong effect on the posterior, especially in cases with few observations. However, with a large enough sample size the likelihood can still come to dominate even the narrowest prior distributions.

3.3.3 Posterior distributions and shrinkage

If we focus on the general characteristics of posterior distributions with respect to the characteristics of priors and likelihoods, a pattern emerges. Specifically, the posterior is a mix of the prior and likelihood which takes their relative ‘variances’ into account. We know that the width of the likelihood function is dependent on the underlying error in the data and on the sample size. Greater error increases the width of the likelihood, since a wider spread of data will be compatible with a wider range of parameter values. On the other hand, more observations will tend to decrease the width of the likelihood, since it consists of the product of density values across all the observations in a data set. So, we see that the characteristics of the posterior distribution will depend simultaneously on the ‘width’ of the prior distribution, the amount of noise in your dependent variable, and the size of the sample involved in the calculation of your likelihood.
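For a normal likelihood and a normal prior on \(\mu\) (treating \(\sigma\) as known), the posterior mean can be written as a precision-weighted average of the prior mean and the sample mean. The sketch below uses this standard conjugate result, with invented numbers that mirror figure 3.1, to show the posterior mean moving from the prior toward the sample mean as the number of observations grows.

# prior on mu: N(180, 15); hypothetical sample mean of 174; sigma assumed known (7.8)
prior_mean = 180; prior_sd = 15
ybar = 174; sigma = 7.8
n = c (3, 10, 675)

# precisions (1/variance) of the prior and of the sample mean
prior_prec = 1 / prior_sd^2
data_prec = n / sigma^2

# posterior mean: a precision-weighted average of the prior mean and sample mean
(prior_prec * prior_mean + data_prec * ybar) / (prior_prec + data_prec)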

A consequence of the ‘merging’ of prior information with the likelihood is that posterior distributions can be shrunk towards the prior. Recall from chapter 2 that the maximum likelihood estimate of a parameter can be thought of as the ‘best’ parameter in that it is the parameter that makes your data as probable as it can be given the model structure. In figure 3.1 we see that in some cases the posterior distribution is not exactly like the likelihood and has been pulled towards the prior. This means that the maximum a posteriori parameter value is not necessarily equal to the maximum likelihood parameter value.

The pull exerted by priors is referred to as shrinkage, because it tends to shrink the magnitude of effects by pulling them towards the prior mean (which is often zero). Broadly speaking, deviations from prior expectations are maintained when there is good enough evidence for them, and shrunk when there is not. What constitutes ‘enough’ evidence is based on the structure of the model and the nature of the data. As seen in figure 3.1, a wide enough prior will not meaningfully affect estimates even for extremely small sample sizes. You may wonder, is shrinkage a good thing? It turns out that shrinkage can help models arrive at more reliable parameter estimates by reducing weakly supported values that deviate substantially from prior expectations. This will be discussed further in chapter 4.

3.4 Sampling from the posterior using Stan and brms

We want to understand the posterior distribution of parameters. How do we get this information? For very simple models, posterior distributions can be derived analytically, that is by finding exact solutions to a series of equations. However, for more complicated models, such as the ones we’ll discuss in this book, it can be difficult (if not impossible) to understand the characteristics of the posterior distributions of parameters in this way. As a result, these questions are answered numerically using software that samples from posterior distributions using algorithms designed to do so as accurately and efficiently as possible. Given enough samples from the posterior distribution, we can estimate the properties of the distribution (just as we could by sampling from the distribution of any other variable).

Many different software approaches to sampling from posterior distributions have been developed through the years including winBUGS, JAGS, PyMC, and Stan. The software used in this book is Stan. We use this because it is (relatively) fast, reliable, extremely flexible, and widely adopted. However, the modeling and statistical principles explained in this book apply to all Bayesian models regardless of the software used to fit them, and the central concepts extend to linear modeling more generally.

One downside with working with Stan directly is that you need to write your own models. This is not too difficult, but it is not particularly easy either, and it can be a bit time consuming, especially for complicated models. In this book we will rely on the brms package to use Stan. brms simplifies the use of Stan by making the specification of highly-efficient models very simple, and providing us with a great deal of flexibility in doing so. It also includes many helper functions that make working with Bayesian models very convenient. So, even though we will use brms for simplicity, the topics we discuss in this book could be used to directly write your own models for Stan (or any other statistical software).

In order to sample from the posterior distribution using software like Stan, the user provides data and a description of a model which specifies:

  • The relations between the variables in the data. For example, what is the dependent variable? What are the independent variables? How do these relate?

  • The nature of random variation in the model. To this point we have only discussed single sources of normally-distributed noise in our models.

  • Prior distributions for all estimated parameters.

Given this information, Stan samples from the posterior distributions of your parameters and returns vectors containing these samples (one vector for each estimated parameter), rather than looking for the single ‘best’ estimate. The result of this process is a chain of samples from the posterior distribution for each of the parameters you are estimating in your model. Together, the samples in these chains tell us about the characteristics of the parameter they represent. Incredibly, under a very reasonable set of conditions, randomly sampling from the posterior distribution will result in a distribution of sampled values of parameters (e.g. \(\mu\)) that will converge on the posterior distribution of the parameter given your data and model structure (including prior probabilities).

3.5 Estimating a single mean with the brms package

3.5.1 Data and Research Questions

We’re going to use the same experimental data we looked at last chapter: The height judgments collected for the adult male speakers in our experiment. For more information on the experiment, see section 1.3.2. Below we load the book package and extract only those trials including adult male speakers. In addition, we will focus only on the natural productions (contained in exp_data), excluding those trials involving the manipulated ‘big’ resonance level.

# load book package and brms
library (bmmb)
library (brms)

# load and subset experimental data
data (exp_data)
men = exp_data[exp_data$C_v=='m',]
mens_height = men$height

We’re going to revisit the research questions posed at the beginning of Chapter 2:

(Q1) How tall does the average adult male sound?

(Q2) Can we set limits on credible average apparent heights based on the data we collected?

However, this time we’re going to approach these questions more formally using a Bayesian regression model using brms (and Stan).

3.5.2 Description of the model

We’re beginning with a model that treats all of our data as random deviations drawn from a single, undifferentiated normal distribution. Our model for a single group of normally distributed values can be thought of in several different ways. In (3.13), the value of your dependent variable for any given trial (\(y_{[i]}\)) is thought of as being a normally-distributed variable with a constant mean of \(\mu\), and a fixed standard deviation \(\sigma\).

\[ \begin{equation} y_{[i]} \sim \mathrm{N}(\mu,\sigma) \tag{3.13} \end{equation} \]

Our trial-specific mean will be constant across trials for now, but we introduce the formalism of the trial-dependent mean (\(\mu_{[i]}\)) in (3.14) because it will be useful in our discussion below. So, we can still present our variable as coming from a trial-specific value of \(\mu\) even if we expect \(\mu_{[i]}=\mu_{[j]}\) for any two observations \(i\) and \(j\), i.e. that all observations share the same \(\mu\).

\[ \begin{equation} y_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma) \tag{3.14} \end{equation} \]

We can also think of this model as in (3.15), which says that your dependent variable is the sum of some expected value for that trial, (\(\mu_{[i]}\)) and some specific random error for that trial (\(\varepsilon_{[i]}\)). The random error is expected to be normally distributed with a mean of 0 and some unknown standard deviation (as in: \(\varepsilon_{[i]} \sim \mathrm{N}(0,\sigma)\)).

\[ \begin{equation} y_{[i]} = \mu_{[i]} + \varepsilon_{[i]} \tag{3.15} \end{equation} \]

In general, we use regression models to understand orderly variation in \(\mu_{[i]}\) from trial to trial by breaking it up into predictors (\(x_{1}, x_{2},...\)) that are combined based on weights as determined by the model coefficients (e.g., \(\alpha_1, \alpha_2,...\)). However, in this case we expect the value of \(\mu_{[i]}\) to actually be equal for all trials. When we are only trying to estimate a single average, we don’t have any predictors to explain variation in \(\mu_{[i]}\). In fact, our model structure suggests we expect no variation in \(\mu_{[i]}\) from trial to trial. However, mathematically we can’t just say ‘we have no predictor’ since everything needs to be represented by a number. As a result, we use a single ‘predictor’ \(x\) with a value of 1 so that our regression equation is as in (3.16). Now, our model is trying to guess the value of a single coefficient (\(\alpha_1\)), and we expect this coefficient to be equal to \(\mu_{[i]}\) since it is being multiplied by a ‘predictor’ with a constant value of 1.

\[ \begin{equation} \mu_{[i]} = \alpha_1 \cdot 1 \tag{3.16} \end{equation} \]

This kind of model is called an Intercept only model. Regression models are really about representing differences, differences between groups and across conditions. When you are encoding differences, you need an overall reference point. For example, saying that something is ‘5 miles north’ is only interpretable given some reference point. The ‘reference point’ used by your model is called your ‘Intercept’, and it is the center of your model’s universe. At this point our model consists only of a single reference point, and the \(\alpha_1\) parameter reflects its value (as shown in equation (3.16)). As a result, the \(\alpha_1\) coefficient is called the ‘Intercept’ in our model. When a coefficient is just being multiplied by a ‘fake’ predictor that always equals one, we can omit it from the regression model (but it’s still secretly there). So, our model investigating the apparent heights of adult males can be formalized like this:

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma) \\ \mu_{[i]} = \mathrm{Intercept} \\ \end{split} \tag{3.17} \end{equation} \]

Put in plain English, each line in the model says the following:

  • We expect that apparent height for a given observation \(i\) is normally distributed according to some trial-specific expected value and some unknown (but fixed) standard deviation.

  • The expected value for any given trial (\(\mu_{[i]}\)) is equal to the intercept of the model for all trials. This means it’s fixed and we have the same expected value for all tokens.
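You can actually see the hidden ‘predictor’ of ones mentioned above directly in R. For an intercept-only formula, the design matrix built by base R’s model.matrix function is just a single column of 1s (a quick sketch using the men data loaded in section 3.5.1).

# the design matrix for an intercept-only model is a single column of ones
X = model.matrix (height ~ 1, data = men)
head (X)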

3.5.3 Errors and residuals

Our model above implicitly says that the error, the random variation around \(\mu\), is drawn from a normal distribution with a mean of 0 and a standard deviation of \(\sigma\). This distribution represents all deviations in apparent height around the mean apparent height for the sample (\(\mu_{[i]}\)). In other words, the error for this model is expected to look like:

\[ \begin{equation} \varepsilon_{[i]} \sim \mathrm{N}(0,\sigma) \tag{3.18} \end{equation} \]

We can rearrange the terms in (3.15) to isolate the random term on the left side, as in (3.19). When we do this, we see that error is what we call the difference between the value of an observation and the expected value for that observation.

\[ \begin{equation} \varepsilon_{[i]} = y_{[i]} - \mu_{[i]} \tag{3.19} \end{equation} \]

In practice, you never know the true expected value, the real exact parameter for whatever distribution you are working with. Instead, you work with an estimate of the predicted value \(\hat{\mu}\). As a consequence, you do not have access to the exact errors but instead to estimated errors \(\hat{\varepsilon}\), as seen in (3.20). Estimated errors are called residuals.

\[ \begin{equation} \hat{\varepsilon}_{[i]} = y_{[i]} - \hat{\mu}_{[i]} \tag{3.20} \end{equation} \]

As noted in section 3.2, regression models assume that the random errors, the unpredictable deviations about the expected value across trials, are independent and identically distributed. We can now be more specific and say that we expect our residuals to be independent and identically distributed. This assumption is obviously violated for our experimental data since we have multiple observations from each listener, each of whom had their own tendencies (as discussed in section 2.3.1). For this reason, we can say that this model is ‘wrong’; it’s built in such a way that we know it is not a good fit for our data. We will discuss this, and the problems it causes, in the following chapter.

3.5.4 The model formula

Model structures are expressed in R using a very specific syntax. Think of writing a model formula as a sub-language within R. Generally, model formulas in R have the form:

y ~ predictors

The variable we are interested in understanding (y) goes on the left hand side of the tilde (~) and our predictors go on the right hand side. Notice that information regarding the random term (\(\varepsilon\)) is not included in the model formula. The formula above can be read as ‘y is distributed according to some predictor’, which really means “we think there is systematic variation in our y variable that can be understood by considering its joint variation with our predictor variable(s).”

For intercept-only models, the number 1 is included in the model formula to indicate that a single constant value is being estimated (as in (3.16)). As a result, our model formula will have the form seen below.

height ~ 1

This model formula could be said out loud like “we are trying to estimate the mean height” or “we are predicting mean height given only an intercept”.
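Because this model estimates only a single constant, a useful sanity check is that the estimated intercept should end up very close to the ordinary sample mean of the height judgments. The sketch below computes that sample mean for comparison with the model output later in this section.

# the sample mean, for comparison with the model's Intercept estimate
mean (mens_height)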

3.5.5 Fitting the model: Calling the brm function

The brms package contains the brm (Bayesian regression models) function, which we will use to fit our models. The brm function takes a model specification, data, and some other information, and fits a model that estimates all the model parameters. Unless otherwise specified, brm assumes that the error component (\(\varepsilon\)) of your model is normally distributed. The first argument in the function call below is the model formula, and the second argument tells the function where to find the data (a data frame called men). The other arguments tell the function to estimate a single set of samples (chains = 1) using a single core on your CPU (cores = 1). These arguments will be discussed in more detail in the next chapter.

model = brms::brm (height ~ 1, data = men, chains = 1, cores = 1)

## Compiling Stan program...
## Start sampling
## 
## SAMPLING FOR MODEL '03859e54349182b6cd9cd51aa7ca25d3' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.103 seconds (Warm-up)
## Chain 1:                0.057 seconds (Sampling)
## Chain 1:                0.16 seconds (Total)

By default, brms takes 2000 samples, throwing out the first 1000 and returning the last 1000. The first 1000 samples are the warmup, the time the model uses to find appropriate parameter values for the model, and to tune the behavior of the sampling algorithm. The output above shows you that the sampler is working, and tells you about the progress as it works. This is a small amount of data and a simple model so it should fit pretty quickly. You can fit the models discussed in this book on your own computer using the code provided, or you can download them directly from the book GitHub repo using the code below.

# Download the model above from the GitHub page.
# File names are in the format: 'chapterNumber_modelName'
model = bmmb::get_model ('3_model.RDS')

3.5.6 Interpreting the model: The print statement

Typing the model name into the console and hitting enter prints the default brms model print statement:

# inspect model
model

The first part provides you with some basic information and tells you some technical details that we don’t have to worry about for now (though some are obvious).

## Family: gaussian 
##  Links: mu = identity; sigma = identity 
##Formula: height ~ 1 
##   Data: men (Number of observations: 675) 
##  Draws: 1 chains, each with iter = 2000; warmup = 1000; thin = 1;
##         total post-warmup draws = 1000

Next we see estimated effects for our predictors, in this case only an intercept. This is an estimated population-level effect because it is shared by all observations in our sample, and it is not specific to any particular subset of observations.

## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   173.78      0.30   173.16   174.33 1.00     1055      714

The information above provides the mean (Estimate) and standard deviation (Est. Error) of the posterior distribution of \(\mu\) (Intercept), i.e. \(P(\mu | y)\). The values of l-95% CI and u-95% CI represent the lower and upper 95% credible interval of the posterior distribution. An \(x\%\) credible interval of a parameter is an interval such that the parameter has an \(x\%\) chance (\(0.x\) probability) of falling inside the interval. The credible intervals provided by brms are based on quantiles so that the l-95% CI and u-95% CI represent 2.5% and 97.5% quantiles of the posterior samples of a parameter. Based on its 95% credible interval, we see that there is a 95% probability that \(\mu\) is between 173.2 and 174.3 cm given our data and model structure.

Our model also provides us an estimate of the error standard deviation (\(\sigma\)), under Family Specific Parameters: sigma. This estimate closely matches our sample standard deviation estimate of 7.77 cm. We also get a 95% credible interval for this parameter, spanning from 2.5% = 7.37 to 97.5% = 8.21. Although our focus is often on the estimation of mean parameters, it’s very important to keep in mind that our model in (3.15) involves the estimation of two parameters: \(\mu\) and \(\sigma\).

## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.77      0.22     7.37     8.21 1.00     1139      741

This last section is just boilerplate and contains some basic reminders which will generally look the same.

## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

3.5.7 Seeing the samples

In section 3.4, we discussed that Bayesian modeling software (like Stan) takes samples of the posterior distributions of parameters given the data and model structure. It’s helpful to see that our model is really just a series of posterior samples, that is, samples of values from the distribution of \(\mu\) given your data and priors, i.e. \(P(\mu | y)\).

Compact descriptions of our models, such as the one in the print statement described above, are just summaries of the information contained in the posterior samples. Below we get the posterior samples from the model we fit above, in the form of a data frame, using the get_samples function from the bmmb package. As expected, we have 1000 samples of each parameter. The first column represents the model intercept (b_Intercept), the second column is the error standard deviation (sigma) (i.e. \(P(\sigma | y)\)). The third and fourth columns (lprior, lp__) are the log prior and log posterior densities, to be discussed in section 3.8.

# get posterior samples from model
samples = bmmb::get_samples (model)

# check number of samples
nrow (samples)
## [1] 1000

# see first six samples
head (samples)
##   b_Intercept sigma lprior  lp__
## 1       173.2 7.518 -5.885 -2347
## 2       174.1 7.613 -5.880 -2345
## 3       173.3 7.940 -5.945 -2346
## 4       174.2 7.564 -5.872 -2346
## 5       173.7 7.879 -5.924 -2345
## 6       174.0 7.859 -5.916 -2345

We can plot the individual samples for the intercept parameter on the left in figure 3.2, and on the right we can see a histogram of the samples.


Figure 3.2: (left) Individual samples from the posterior distribution of the model intercept parameter. (right) A histogram of the samples on the left. The curve shows the density of a normal distribution with a mean of 173.8 and a standard deviation of 0.30.

Recall that our model output provides information about 95% credible intervals for the mean parameter: It was expected to fall between 173.2 and 174.3 cm. We know that this interval simply corresponds to the 2.5% and 97.5% quantiles of the posterior samples. We can confirm this by checking the quantiles on the vector containing our posterior samples and see that these exactly correspond to the values of Estimate, l-95% CI, and u-95% CI in the model print statement above.

quantile (samples[,"b_Intercept"], c(.025, .975))
##  2.5% 97.5% 
## 173.2 174.3

One of the great things about Bayesian models is that you can make your own summaries of the posterior samples, summarize them in several ways as required, and ask different questions easily. For example, there is no special status for the 2.5 and 97.5% quantiles, and we can easily check the values of other ones, such as the first and third quartiles:

quantile (samples[,"b_Intercept"], c(.25, .75))
##   25%   75% 
## 173.6 174.0

We can also use the posterior distribution to find the probability that the mean parameter is over/under any arbitrary value:

# probability that the intercept is less than 174 cm
mean (samples[,"b_Intercept"] < 174)
## [1] 0.76

Let’s take a second to think about why this works. Recall that a probability reflects how often an outcome occurs relative to all possible outcomes. Our vector samples[,"b_Intercept"] contains 1000 samples from the posterior distribution of the mean apparent height of adult males (\(P(\mu | y)\)). If we find the number of these samples that fall below 174 cm and then divide by the total number of samples (1000), we are effectively calculating the probability of observing a mean estimate below 174 cm. As a result, the calculation above says that there is a 0.76 probability (a 76% chance) that the mean apparent height of adult male speakers in this population is under 174 cm, given our data and model structure. We come to this conclusion by finding that 76% of the posterior samples of the parameter of interest are below 174 cm.
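The same logic extends to any comparison you care about. For example, the sketch below asks a hypothetical follow-up question: what is the probability that the mean apparent height falls between 173.5 and 174.5 cm?

# probability that the intercept is between 173.5 and 174.5 cm
mean (samples[,"b_Intercept"] > 173.5 & samples[,"b_Intercept"] < 174.5)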

3.5.8 Getting the residuals

We can get the model residuals using the residuals function. By default it returns a matrix in which each row corresponds to an observation in your data, and the columns summarize the posterior distribution of that observation’s residual. You will notice that you also get credible intervals for each of the estimated residuals. This is because there is one predicted value for each posterior sample. Since we have 1000 posterior samples, that means we have 1000 slightly different predicted values for each observation, and therefore 1000 slightly different estimated errors for each observation.

model_residuals = residuals (model)
head (model_residuals)
##      Estimate Est.Error     Q2.5    Q97.5
## [1,]  -3.8844    0.2988  -4.4347  -3.2638
## [2,]  -0.2844    0.2988  -0.8347   0.3362
## [3,]  -1.7844    0.2988  -2.3347  -1.1638
## [4,] -16.0844    0.2988 -16.6347 -15.4638
## [5,] -20.3844    0.2988 -20.9347 -19.7638
## [6,]   0.4156    0.2988  -0.1347   1.0362

By default, information regarding the posterior distribution of residuals for each data point is presented. However, you can get all the posterior estimates for each individual residual by setting summary=FALSE. When you do this, you will get a matrix of size m (rows) by n (columns) for m posterior samples and n data points. We can see below that we get 1000 rows representing our posterior samples and 675 columns representing our data points.

model_residuals = residuals (model, summary=FALSE)
dim (model_residuals)
## [1] 1000  675

In the left plot of figure 3.3 we show a histogram of our residuals and compare this to a normal distribution centered at zero, with a standard deviation equal to the error in our model (sigma = 7.78). It’s no surprise that these match because this is precisely what the sigma parameter in our model is an estimate of, the standard deviation of the residuals. Since our model consists of only an intercept, a single expected value (\(\mu\)) for all instances of the variable, all variation around the mean constitutes error. We can show this by comparing our residual estimates to the centered data (right plot of figure 3.3) and seeing that these are basically the same.


Figure 3.3: (left) Histogram of the residuals for model. (right) A comparison of our residuals and centered height judgments shows that these are nearly equal.
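The comparison in the right panel of figure 3.3 can be sketched in code by averaging the residual estimates for each observation and plotting them against the centered height judgments.

# average estimated residual for each observation (one column per observation)
resid_hat = colMeans (model_residuals)

# centered height judgments: each observation minus the sample mean
height_centered = mens_height - mean (mens_height)

# these two quantities should be nearly identical
plot (resid_hat, height_centered)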

3.6 Checking model convergence

Our parameter estimates are based on a set of samples from the posterior distribution of a parameter. As with any other inference based on samples, our parameter estimates will be unreliable if we don’t have enough samples, or if our samples do not represent the population we are trying to understand. For this reason, it’s important to look at the ESS values (the effective sample size), and the Rhat values provided for brm model print statements.

Rhat tells you about whether your chains have converged, and it basically compares the variation between chains to the variation within chains. Ideally, you want your chains to overlap almost entirely in the values that they span, since they are all supposed to be sampling from the same posterior distribution. As noted in the ‘boilerplate’ at the end of the brm model print statement, values of Rhat near 1 are good, and values higher than around 1.1 are a bad sign.
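If you want to inspect these diagnostics outside of the print statement, brms provides helper functions for extracting them. A quick sketch is below (function names follow the brms and bayesplot packages; exact behavior may differ across versions).

# Rhat values for all model parameters (values near 1 indicate convergence)
brms::rhat (model)

# ratio of effective sample size to the total number of posterior samples
brms::neff_ratio (model)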

ESS tells you approximately how many independent samples you have taken from the posterior. Bulk ESS tells you how many samples the sampler took in the ‘thick’ part of the density, and Tail ESS reflects how much time the sampler spent in the ‘thin’ part, in the tails of the distribution. Ideally we would like several hundred samples (at least) for mean estimates, and thousands to be confident in the 95% credible intervals. You may sometimes fit a model and get a warning message like this:

## Warning messages:
## 1: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and  
## medians may be unreliable. Running the chains for more iterations may help. See:
## http://mc-stan.org/misc/warnings.html#bulk-ess
## 2: Tail Effective Samples Size (ESS) is too low, indicating posterior variances 
## and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

That is brms telling you that you need to collect more samples in order to be confident in your parameter estimates. To get more samples, we can run the model for more iterations, or we can use more chains. Each chain is a separate set of samples for your parameter values. A model can be fit in parallel across several cores on your computer, resulting in several independent chains in roughly the same amount of computing time. Since these chains are all supposed to be sampling from the same posterior distribution, their samples can be merged across chains after sampling. There is a fixed number of samples a single core of your computer can make in a given amount of time. When you do this across \(n\) cores, you can get (approximately) \(n\) times as many samples in the same amount of time. Since many personal computers these days have 4-8 (or more) cores, we can take advantage of parallel processing to fit models faster. Before fitting a model across multiple cores, you should confirm how many you have. You can use the following command (you may need to install the parallel package):

parallel::detectCores()

The example code throughout this book will use four cores to fit models. If you only have four total cores on your computer, you should change the models to use 2-3 chains and cores so that your computer can take care of other necessary functions while you fit your model(s). One thing to keep in mind is that these models can be computationally intensive to fit. As the data sets become larger and the models become more complicated, more powerful computers are needed in order to fit a model in a reasonable amount of time. Below, we re-fit our initial model but run it on four chains, and on four cores at once.

# Fit the model yourself
model_multicore =  
  brms::brm (height ~ 1, data = men, chains = 4, cores = 4)
# Or download it from the GitHub page:
model_multicore = bmmb::get_model ('3_model_multicore.RDS')

We print the model below, and can see that using four chains has substantially increased our ESS, without taking up much more computing time. Towards the top of our print statement we see that 4 chains have collected total post-warmup draws = 4000. This means our model has 4000 samples for every parameter in the model. However, for some parameters we have only about 3000 ‘effective samples’. This means some of our samples are basically dead weight, taking up space and slowing down future computations for no good reason. The discrepancy between the number of samples and the ‘effective’ number of samples is due to something called autocorrelation, the self-similarity of nearby observations in a series of observations (discussed further in section 5.8.2).

# inspect model
model_multicore
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 
##    Data: men (Number of observations: 675) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   173.79      0.30   173.21   174.38 1.00     3139     2534
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.78      0.21     7.39     8.21 1.00     3671     2476
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Sometimes consecutive samples can be too similar and so don’t give you that much independent information. When this happens you end up with less information about a parameter than you might think based on the number of samples you have. Think of it like measuring the temperature of a place to get an idea of its average annual weather. Measurements need to be well separated in order to be really independent and to give an accurate picture of the true average. If you were to measure the temperature every 5 minutes these measurements would have a high autocorrelation, and would not give you a good impression of the range of temperatures that place tends to experience in a calendar year.
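
If you want to see this self-similarity directly, one option is to look at the autocorrelation of the samples for a single parameter. Below is a minimal sketch, assuming the model_multicore model from above and the get_samples function from bmmb used elsewhere in this chapter. Autocorrelation values near zero at lags greater than zero indicate nearly-independent draws, and an ESS close to the actual number of samples.

# a minimal sketch: extract the posterior samples and plot the
# autocorrelation of the intercept samples
samples_multicore = bmmb::get_samples (model_multicore)
acf (samples_multicore$b_Intercept)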

One way to increase the ESS without increasing the final number of posterior samples is to run longer chains and keep only every \(n^{th}\) sample. This strategy is called thinning, and it lets your models be smaller while containing approximately the same information. To do this you have to change the iter, warmup and thin parameters when you fit your model. The default behavior is that the models you fit keep every sample after the warmup is done, up to the iter maximum. So if iter=3000 and warmup=1000 you will end up with 2000 samples. If you set thin to some value other than 1, you keep only one out of every thin samples. As a result, you will end up with (iter-warmup) / thin samples per chain. If you are doing this across several cores (one chain per core), then you will end up with ((iter-warmup) / thin) \(\times\) cores samples in total. Below, we ask for 3000 iterations per chain. Since the warmup is 1000, this means we would keep 2000 post-warmup samples per chain, for 8000 total samples across all four chains. However, since thin=2, we keep only half of these. As a result, we will end up with 4000 samples in total (i.e. ((3000-1000) / 2) \(\times\) 4).

# Fit the model yourself
model_thinned =  
  brms::brm (height ~ 1, data = men, chains = 4, cores = 4,
       warmup = 1000, iter = 3000, thin = 2)
# Or download it from the GitHub page:
model_thinned = bmmb::get_model ('3_model_thinned.RDS')

We inspect the model print statement and see that, despite having the same number of samples as model_multicore, the ESS for this model is higher than for the previous model, in particular for the sigma parameter. Before moving on from ESS, we want to note that it is possible for your ESS to be higher than your actual (real) number of samples. This is simply due to the way that ESS is calculated and is nothing to worry about; in fact, it may mean that your model is sampling the posterior very efficiently.

# inspect model
model_thinned
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: height ~ 1 
##    Data: men (Number of observations: 675) 
##   Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 2;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   173.78      0.30   173.21   174.39 1.00     3772     3556
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.77      0.21     7.38     8.20 1.00     3933     3568
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Before moving on we want to talk about another problem you may run into: divergent transitions. You may see a warning like this:

## There were n divergent transitions after warmup. Increasing adapt_delta 
## above 0.8 may help. See 
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

This means your model encountered \(n\) (some integer) divergent transitions during sampling. Your model has very specific expectations regarding parameter likelihoods and samples from the posterior distribution on this basis. A divergent transition indicates that, as it sampled, the model came across things that were not as expected. The difference between what your model thought would happen and what actually happened calls the reliability of the entire simulation into question. One way to fix this problem is to increase adapt_delta to 0.95 or 0.99. The adapt_delta parameter is set to 0.8 by default and has a maximum value of one. This can be done as seen below.

brms::brm (height ~ 1, data = men, chains = 4, cores = 4, warmup = 1000, 
           iter = 3000, thin = 2, control = list(adapt_delta = 0.9))

Increasing the adapt_delta parameter results in fewer divergences and a more robust, but slower, sampler. Think of it like walking on ice taking little tiny steps; you are less likely to fall but will take longer to get there. If you raise adapt_delta to 0.999 or something and are still getting divergent transitions, there may be an error in your data or in the way that your model is specified. We can’t really be more specific than that at this point (see section 8.8), however, we can say that divergent transitions are often not a Stan problem but rather a data or model structure problem.

3.7 Specifying prior probabilities

In section 3.3 we mentioned that in Bayesian models all estimated parameters must have prior probability distributions specified for them. And yet, to this point we’ve been fitting models without explicitly specifying prior probability distributions for their parameters. If you don’t specify prior probabilities for your parameters, brm will use its own default priors, chosen based on the characteristics of your data. We can use the get_prior function in brms to see what the default priors are for our model, and to see which parameters in our model require priors. Of course, we should know this based on the structure of our model, but this method is useful for verifying our expectations.

Below we can see that our model requires priors for our two estimated parameters, the Intercept (\(\mu\)) and sigma (\(\sigma\)) parameters, and that these have been given default values. The default values use a t distribution (student_t()), which we will discuss in section 5.9. We’ve omitted a few (empty) columns below so the printout will fit on the page, but we encourage you to erase the [,-c(7:9)] part in the code below and run it yourself so you can see the full output.

brms::get_prior (height ~ 1, data = men)[,-c(7:9)]
##                     prior     class coef group resp dpar  source
##  student_t(3, 174.5, 7.1) Intercept                      default
##      student_t(3, 0, 7.1)     sigma                      default

brms makes it easy to specify prior probabilities for specific parameters or whole ‘classes’ of parameters. Setting priors for entire classes of parameters is faster for you and makes the model run faster, so it is a good idea to do it where possible. Right now, our model only includes the following classes of parameters:

  • Intercept: This is a unique class, only for intercepts.
  • sigma: This is for the standard deviation of our error parameters. Our model only has one for now, sigma (\(\sigma\)), but it can have more.

We’re going to set weakly informative prior probabilities for our parameters. This means we’re going to set our priors to be about the same size as the variation we expect in the data itself. To set these you have to use what you know about your variables and the world in general. Since we know that the average male over 20 in the US is 176 cm tall, this seems like a reasonable prior expectation for how tall the adult males in our sample will sound. We also know that the standard deviation of adult male heights in the US is 7.5 cm, and will double this for our priors. This is to account for the fact that there may be more variation in how tall people ‘sound’ compared to how tall they are. The code to set the priors for our model looks like this:

prior = c(brms::set_prior("normal(176, 15)", class = "Intercept"),
          brms::set_prior("normal(0, 15)", class = "sigma"))

The code above tells our model to use a normal distribution with a mean of 176 and a standard deviation of 15 (normal(176, 15)) for the prior distribution of the intercept. About 95% of the mass of a normal distribution lies within two standard deviations of the mean. This means that we are saying we expect, a priori, that the intercept should be between around 146 (176 - 15 \(\times\) 2) and 206 cm (176 + 15 \(\times\) 2). This is probably too broad, but it places the expected outcomes within reasonable human ranges.
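
As a quick check on this range, we can ask R for the central 95% interval implied by this prior. This is just a minimal sketch of the arithmetic above:

# central 95% interval implied by the normal(176, 15) prior on the intercept
qnorm (c(0.025, 0.975), mean = 176, sd = 15)  # roughly 146.6 and 205.4 cm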

The random error, sigma, was given a prior with a normal distribution with a mean of 0 and a standard deviation of 15 (normal(0, 15)). Again, this is likely an overestimation of the magnitude of the random error in this data. However, it is likely to be in the ballpark. Our prior specifies a normal distribution centered at 0 for the standard deviation. Since standard deviations, like variances, can only be positive, the sampler (Stan) used by brm ignores the negative half and uses only the positive half of the prior distribution. This prior basically says that we expect the average variation around the mean to be less than 30 cm, which it is very likely to be.
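
We can also check what this prior implies about how large \(\sigma\) is likely to be. A minimal sketch: we double the normal probability between 0 and 30 because only the positive half of the distribution is used.

# prior probability that sigma is less than 30 cm under the half-normal(0, 15)
2 * (pnorm (30, mean = 0, sd = 15) - 0.5)  # about 0.95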

The left plot in figure 3.4 compares the normal distribution we used (blue line) to a histogram of our height judgments. As we can see, the prior distribution we used for the intercept is much broader (more vague) than the data distribution so that it will probably have little to no practical effect on our posterior distribution (but will help our model fit properly). The right plot compares the prior for the standard deviation parameters to the absolute value of the centered apparent heights. This presentation shows how far each observation is from the mean apparent height (at 174 cm), and again we see that most of these deviations are in the thicker part of the prior density. As a result, neither of these priors is going to have much of an effect on our parameter estimates given the size of our sample (see figure 3.1).


Figure 3.4: (left) The densities of the prior probability for our model intercept, compared to a histogram of height judgments for male speakers. (right) The distribution of absolute deviations from the mean height judgment, compared to the prior distribution for the error parameter (\(\sigma\)) in our model.

We can update the description of our model to include the specification of prior distributions for each estimated parameter, as in (3.21). In the future, our model descriptions will always include these. Our model specification now makes it clear: We expect height judgments to be normally distributed, we expect the mean to always equal the intercept, and we have specific prior distributions in mind for all estimated model parameters (\(\mathrm{Intercept}\) and \(\sigma\)).

\[ \begin{equation} \begin{split} \\ height_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma) \\ \mu_{[i]} = \mathrm{Intercept} \\ \\ \textrm{Priors:} \\ \mathrm{Intercept} \sim \mathrm{N}(176, 15) \\ \sigma \sim \mathrm{N}(0, 15) \\ \end{split} \tag{3.21} \end{equation} \]

We can fit this new model below, passing the lines that specify the prior distributions for our parameters to the prior parameter of the brm function.

# Fit the model yourself, or
model_priors =  
  brms::brm (height ~ 1, data = men, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 15)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))
# Download the model above from the GitHub page.
model_priors = bmmb::get_model ('3_model_priors.RDS')

We can use the short_summary function in the bmmb package to get an abridged version of the model print statements. These shorter versions are not a replacement for the complete statement as they omit important information about our models. However, these abbreviated print statements will help us compare models while minimizing redundant information on the page, and so are useful for this book. If we compare the output of model_thinned:

# inspect model
bmmb::short_summary (model_thinned)
## Formula:  height ~ 1
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI
## Intercept    173.8       0.3    173.2    174.4
## 
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI
## sigma     7.77      0.21     7.38      8.2

To that of the model where we specified our own priors (model_priors), we see that there is no noticeable effect on our results.

bmmb::short_summary (model_priors)
## Formula:  height ~ 1
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI
## Intercept    173.8      0.31    173.2    174.4
## 
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI
## sigma     7.77      0.21     7.37     8.19

This is because the prior matters less and less when you have a lot of data, and because we have set priors that are appropriate (but vague) given our data. Although the priors may not matter much for models as simple as these, they can be very important when working with more complex data, and are a necessary component of Bayesian modeling.

3.8 The log prior and log posterior densities

The information presented in this section is not strictly necessary to understand the models outlined in this book. However, it is useful information to know, and so we put it here so that you will know where to find it when you want it.

When we got our posterior samples with the get_samples function, in addition to our mean and standard deviation estimates we got estimates of something called lprior and lp__. We can see these below:

samples = bmmb::get_samples (model_priors)
head (samples)
##   b_Intercept sigma lprior  lp__
## 1       173.4 8.041 -6.719 -2347
## 2       173.5 7.997 -6.716 -2346
## 3       173.8 8.043 -6.715 -2346
## 4       174.8 7.936 -6.704 -2351
## 5       173.4 7.630 -6.705 -2346
## 6       173.8 7.965 -6.712 -2346

These are the log prior and the (unnormalized) log posterior densities, respectively. To explain what these are, let’s generalize equation (3.12), which presented the posterior probability of a \(\mu\) parameter given your data and priors, to encompass all of our model parameters represented by \(\theta\) in (3.22). Now, (3.22) represents the posterior probability of all our model parameters given our data.

\[ \begin{equation} P(\theta|y) = \frac{P(y|\theta) \cdot P(\theta)}{P(y)} \tag{3.22} \end{equation} \]

First, we can discuss lprior, the log prior density. This is the logarithm of the joint prior density of all of your parameters (i.e. the sum of their log densities), in other words \(\log(P(\theta))\). Specifically, for our model this is equal to \(\log(P(\mu \: \& \: \sigma))\). We can see the prior probabilities of our model parameters below:

model_priors$prior
##            prior     class coef group resp dpar nlpar lb ub source
##  normal(176, 15) Intercept                                    user
##    normal(0, 15)     sigma                             0      user

And we can use this information to calculate lprior for the first and fourth values of lprior seen in the samples above (-6.719 and -6.704). In the lines below we find the log density of the mean, and the log density of the error term (sigma), for the first and fourth posterior estimates of \(\mu\) and \(\sigma\). For example, dnorm (173.4,176,15,log=TRUE) returns the log (since log=TRUE) density over a point at 173.4 (the first posterior sample of the mean) given a normal distribution with the characteristics of the prior of our mean (i.e., a mean of 176 cm and a standard deviation of 15 cm). Since standard deviations are bounded below by zero, Stan ignores the negative half of the prior density for sigma. As a result, we add log(2) (i.e., multiply the density by two) so that the area under the remaining half of the curve is still equal to one. By adding these terms together, we can recreate the values of lprior above.

dnorm (173.4,176,15,log=TRUE) + dnorm (8.041,0,15,log=TRUE) + log(2)
## [1] -6.72
dnorm (174.8,176,15,log=TRUE) + dnorm (7.936,0,15,log=TRUE) + log(2)
## [1] -6.704

The log posterior density (lp__) is basically the overall posterior probability of your model parameters given your data. The unnormalized posterior density ignores the marginal likelihood (\(P(y)\)) and so returns a density that is proportional to the posterior density, up to some constant, as in (3.23).

\[ \begin{equation} \begin{split} P(\theta|y) \propto P(y|\theta) \cdot P(\theta) \end{split} \tag{3.23} \end{equation} \]

Here is what we mean by saying that it’s defined up to a proportionality constant. We cannot, or do not bother to, calculate the exact posterior density. However, we can calculate a value that, when multiplied by some constant \(C\) (which we happen to know is equal to \(1/P(y)\)), equals the posterior density. This value is \([P(y|\theta) \cdot P(\theta)]\), presented in square brackets in (3.24). When we work with the unnormalized posterior density we take this approach: we estimate \(P(\theta|y)\) while ignoring \(C\) (i.e., \(1/P(y)\)).

\[ \begin{equation} \begin{split} P(\theta|y) = [P(y|\theta) \cdot P(\theta)] \cdot C \end{split} \tag{3.24} \end{equation} \]

If we log-transform both sides of (3.24), we get the log-posterior density, shown in (3.25).

\[ \begin{equation} \log (P(\theta|y)) = \log (P(y|\theta)) + \log(P(\theta)) + \log(C) \tag{3.25} \end{equation} \]

Let’s pause to think about what (3.25) means and why it’s a useful measure. First, let’s talk about \(P(y|\theta)\), which is the model likelihood: the joint density of the data given the values of the parameters. We can think about the probability of observing each of the observations in our height data given the posterior mean estimates of \(\mu\) and \(\sigma\) provided by model_priors. Below, we first see the values of the probability density evaluated over the first six observations, followed by the logarithm of these same values.

# density over the first 6 observations
head (dnorm (mens_height, 173.8, 7.77))
## [1] 0.045267 0.051306 0.049985 0.006000 0.001636 0.051276

# log density over first 6 observations
head (dnorm (mens_height, 173.8, 7.77, log = TRUE))
## [1] -3.095 -2.970 -2.996 -5.116 -6.416 -2.971

To find the joint density of the data given the model parameters, we can sum the log-densities as seen below. This provides us with an estimate of the log likelihood of the model given all parameter values (\(\log (P(y|\theta))\)).

sum (dnorm (mens_height, 173.8, 7.77, log = TRUE))
## [1] -2341

We can then add the logarithm of the priors for our parameters to this. By combining our priors and likelihoods in this way, we are calculating the unnormalized posterior density.

sum (dnorm (mens_height, 173.8, 7.77, log = TRUE)) + # the likelihood
  dnorm (173.8,176,15,log=TRUE) +   # prior probability of mu 
  dnorm (7.77,0,15,log=TRUE)+log(2) # prior probability of sigma 
## [1] -2347

We can see that our calculation is very close to the average value of the lp__ estimates provided by our model.

mean (samples$lp__)
## [1] -2346

The log posterior density, or values related to it, will come up again in chapter 6 when we discuss model comparison. This is because this value captures the posterior probability of the parameters given the data. Broadly speaking, models with a higher posterior probability are more believable because they are more likely to generate the observed data, and/or they have parameter values that are more probable a priori.

3.9 Answering our research questions

Finally, let’s return again to the research questions we posed initially in chapter 2, and again at the beginning of this chapter:

(Q1) How tall does the average adult male sound?

(Q2) Can we set limits on credible average apparent heights based on the data we collected?

Here’s what we had to say about this at the end of chapter 2:

“the average male speaker is most likely to sound about 174 cm tall. We can also conclude informally based on Figure 2.8 that the most likely mean values fall between (approximately) 173 and 174.5 cm”

We can reconsider the answers to these questions provided by our final model, model_priors. Usually, parameters should be reported with at least the mean/median and standard deviation of the posterior distribution, in addition to some useful credible interval (e.g. 50%, 95%) around that parameter. Based on the results of our final model, an answer to each question might be something like this:

(A1) Based on our model the average apparent height for adult males is likely to be 174 cm. In a paper we might report this like: “The mean height is 174 cm (s.d. = 0.3, 95% CI = [173.2, 174.4])”.

(A2) Yes we can. There is a 95% probability that the population mean is between 173.2 and 174.4 given our data and model structure. In other words, 95% of the posterior density is concentrated between the values of 173.2 and 174.4.
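
The values reported above can be recovered directly from the posterior samples. Below is a minimal sketch using the samples extracted from model_priors earlier with bmmb::get_samples:

# posterior mean, standard deviation, and 95% credible interval for the
# intercept, computed from the posterior samples of model_priors
mean (samples$b_Intercept)                       # about 173.8 cm
sd (samples$b_Intercept)                         # about 0.3 cm
quantile (samples$b_Intercept, c(0.025, 0.975))  # about 173.2 to 174.4 cm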

Notice that our answers correspond closely to what we concluded at the end of the last chapter. The reason for this correspondence is that we made our inferences at the end of chapter 2 using only the likelihood and, due to the shape of the prior and the number of observations in our data, the posterior distribution of our model is dominated by the likelihood. As a result, the two approaches converge on approximately the same solution in this situation.

3.10 ‘Traditionalists’ corner

In traditionalists’ corner, we’re going to compare the output of brms to some more ‘traditional’ approaches. We’re not going to talk about the traditional models in any detail; the focus of this section is simply to highlight the similarities between different approaches, and to point out where to find equivalent information in the different models. If you are already familiar with these approaches, these sections may be helpful. If not, some of the information provided here may not make much sense, although it may still be useful. If you want to know more about the statistical methods being discussed here, please see the preface for a list of suggested background reading in statistics.

3.10.1 One-sample t-test vs. intercept-only Bayesian models

A one sample t-test helps investigate whether the mean of a set of observations is consistent with a true underlying value of zero. The t-test answers questions using the likelihood of parameters given the data (priors and posteriors are not involved). In addition, t-tests are calculated using analytic (exact) methods, and do not involve sampling from distributions. The t-test assumes that all of your observations are independent, so it is not appropriate to use for our experimental data. However, we just want to compare it to our initial Bayesian model (which made the same assumptions). We can apply a one-sample t-test to our vector of apparent-height judgments:

t.test (mens_height)
## 
##  One Sample t-test
## 
## data:  mens_height
## t = 582, df = 674, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  173.2 174.4
## sample estimates:
## mean of x 
##     173.8

Notice that the interval provided around the mean (95 percent confidence interval: 173.2 174.4) corresponds very well to the 95% credible interval around the intercept provided by our initial model (173.16, 174.33). The reason they align so well is that both models have the same general structure and make similar mathematical assumptions. In addition, our Bayesian estimate is being dominated by the likelihood, and the more ‘traditional’ t-test only considers information from the likelihood.
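
To line the two intervals up directly, we can extract the population-level summary from one of our Bayesian models with the fixef function in brms, and compare it to the confidence interval stored in the t-test object. This is only a minimal sketch; we use model_multicore here because it is already in our workspace, so its interval will differ very slightly from the one quoted above.

# posterior mean, error, and 95% credible interval for the intercept
brms::fixef (model_multicore)

# the 95% confidence interval from the one-sample t-test
t.test (mens_height)$conf.int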

3.10.2 Intercept-only ordinary-least-squares regression vs. intercept-only Bayesian models

Ordinary-least-squares (OLS) regression is an approach to fitting regression models using likelihoods (without priors or posteriors). OLS regression assumes that your residuals are independent and that your error variation is normally distributed. We can fit an OLS model using the lm (linear model) function in R, using the same model formula we used for our Bayesian models.

ols_model = lm (mens_height ~ 1)
summary (ols_model)
## 
## Call:
## lm(formula = mens_height ~ 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34.09  -4.59   0.71   5.31  18.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  173.788      0.299     582   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.76 on 674 degrees of freedom

Again, we see a close similarity to our initial model. The Std.error for the Intercept above (0.299) corresponds closely to the Est.error of the intercept below (0.30), and the Residual standard error above (7.76) corresponds closely to the sigma estimate below (7.77).

## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   173.78      0.30   173.16   174.33 1.00     1055      714
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     7.77      0.22     7.37     8.21 1.00     1139      741

Just as with the t-test, these similarities are due to the models having the same general structure, making the same assumptions, and being dominated by the likelihood of the parameters.

3.11 Exercises

The analyses in the main body of the text all involve only the unmodified ‘actual’ resonance level (in exp_data). Responses for the stimuli with the simulated ‘big’ resonance are reserved for exercises throughout. You can get the ‘big’ resonance data in the exp_ex data frame, or all of the data in the exp_data_all data frame.

Fit and interpret one or more of the suggested models:

  1. Easy: Analyze the (pre-fit) model that’s exactly like model_priors, except using the data in exp_ex (bmmb::get_model("3_model_priors_ex.RDS")).

  2. Medium: Fit a model just like model_priors, but for the data from some other group, for either the original or big resonance levels.

  3. Hard: Fit two models like model_priors for two arbitrary groups, and compare results across models.

In any case, describe the model, present and explain the results, and include some figures.

3.12 Plot Code


################################################################################
### Figure 3.1
################################################################################

library (bmmb)
# load and subset experimental data
data (exp_data, package = "bmmb")
men = exp_data[exp_data$C_v=='m',]
mens_height = men$height

par(mfcol = c(3,3), mar =c(.1,.25,.1,.25), oma = c(4,3,3,1))  

x = seq (135, 210, 0.05)
####
likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 3 ) )
prior = dnorm (x, 180, 100) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='Density',lwd=4,yaxs='i', 
      xlim = c(140, 210), ylim = c(0,1.1),xlab='',xaxt='n',cex.axis=1.3,yaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side=3,outer=FALSE,text="Observations = 3",line=1)
mtext (side=2,outer=FALSE,text=expression(paste("P(", mu,") = N(180,100)")),
       line=1,cex=0.9)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 3 ) )
prior = dnorm (x, 180, 15) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(140, 210), ylim = c(0,1.1),xaxt='n',cex.axis=1.3,yaxt='n',
      col=bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side=2,outer=FALSE,text=expression(paste("P(", mu,") = N(180,15)")),
       line=1,cex=0.9)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 3 ) )
prior = dnorm (x, 180, 1) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(140, 210), ylim = c(0,1.1),cex.axis=1.3,yaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side=2,outer=FALSE,text=expression(paste("P(", mu,") = N(180,1)")),
       line=1,cex=0.9)

####
likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 10 ) )
prior = dnorm (x, 180, 100) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='Density',lwd=4,yaxs='i', 
      xlim = c(160, 185), ylim = c(0,1.1),xlab='',yaxt='n',xaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior) ,col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side=3,outer=FALSE,text="Observations = 10",line=1)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 10 ) )
prior = dnorm (x, 180, 15) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(160, 185), ylim = c(0,1.1),yaxt='n',xaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 10 ) )
prior = dnorm (x, 180, 1) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(160, 185), ylim = c(0,1.1),yaxt='n',cex.axis=1.3,
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

####
likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 675 ) )
prior = dnorm (x, 180, 100) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='Density',lwd=4,yaxs='i', 
      xlim = c(170, 183), ylim = c(0,1.1),xlab='',yaxt='n',xaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side=3,outer=FALSE,text="Observations = 675",line=1)

legend (174.75,.9,legend=c(expression(paste("Prior  ","P(",mu,")")),
                        expression(paste("Likelihood  ","P(y|",mu,")")),
                        expression(paste("Posterior  ","P(",mu,"|y)"))),
        lwd=4,lty=c(1,1,2),
        col=c(bmmb::cols[4],bmmb::cols[3],bmmb::cols[7]),bty = "n",cex=1.2)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 675 ) )
prior = dnorm (x, 180, 15) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(170, 183), ylim = c(0,1.1),yaxt='n',xaxt='n',
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

likelihood = dnorm (x, mean (mens_height), sd (mens_height) / sqrt ( 675 ) )
prior = dnorm (x, 180, 1) ; posterior = likelihood * prior
plot (x, likelihood / max(likelihood), type = 'l', ylab='',lwd=4,yaxs='i', 
      xlim = c(170, 183), ylim = c(0,1.1),yaxt='n',cex.axis=1.3,
      col = bmmb::cols[3])
grid()
lines (x, prior / max(prior),col=bmmb::cols[4],lwd=4)
lines (x, posterior / max (posterior),col=bmmb::cols[7], lty=2,lwd=4)

mtext (side = 1, outer = TRUE, text = "Apparent height (cm)", line = 2.5)


################################################################################
### Figure 3.2
################################################################################

par (mfrow = c(1,2), mar = c(4,4,1,1))
plot (samples[,1], xlab = 'Sample number',ylab = 'Apparent height (cm)',col=teal,pch=16)
abline (h = mean (samples[,1]), lwd=4,col=yellow)
hist (samples[,1], freq = FALSE, breaks = 15,main='',xlab='Apparent height (cm)',col=maroon,ylim=c(0,1.3))
curve (dnorm (x, 173.80, 0.30), xlim = c(172,175), lwd=4,add=TRUE,col=darkorange)


################################################################################
### Figure 3.3
################################################################################

par (mfrow = c(1,2), mar = c(4,4,1,1))
hist(model_residuals[1,],main="", col = bmmb::cols[14], 
     freq=FALSE, xlim = c(-35,30),xlab="Model Residuals")
curve (dnorm (x,0,7.8), xlim=c(-40,45),add=TRUE,lwd=4,col=bmmb::cols[9])

centered_height = mens_height - mean(mens_height)
plot (model_residuals[1,], centered_height,lwd=2,col=bmmb::cols[4],cex=1.5,
      xlab="Model Residuals",ylab="Centered apparent height")
abline (0,1,col=bmmb::cols[3],lwd=2)

################################################################################
### Figure 3.4
################################################################################

par (mfrow = c(1,2), mar = c(4,4,1,1))

## plot prior for mean
hist (mens_height, xlim = c(140,200), freq = FALSE, col = deepgreen, ylab = 'Density', xlab = 'Apparent height (cm)',
      ylim = c(0,0.07), main="")
curve (dnorm(x,176,15),lwd = 2, col = 4, add = TRUE)

## plot prior for standard deviations
hist (abs (mens_height - mean (mens_height)), freq = FALSE, col = yellow, main="",
      ylab = 'Density', xlab = 'Centimeters',ylim = c(0,0.12), xlim = c(0,50))
curve (dnorm(x,0,15),lwd = 2, col = 4, add = TRUE)