Chapter 8 Varying variances, more about priors, and prior predictive checks

So far our models have been a bit ‘old-fashioned’ in one important way: They have featured a single error term (\(\sigma\)) for all of our observations. This means that \(\sigma\) is the same for all speakers, listeners, and conditions in our experiment. However, we might imagine a situation where one listener’s responses (and residuals) are more widely distributed than another, resulting in a situation where, for example, \(\sigma_{[\mathsf{L}_{[i]}=1]} \neq \sigma_{[\mathsf{L}_{[j]}=2]}\) for observations \(i\) and \(j\) contributed by listeners one and two. In such a situation, we may not want to use a single value of \(\sigma\) for all listeners and may instead prefer to use a different value for each listener, e.g. \(\sigma_{[i]}\) for listener \(i\). In this chapter we’re going to discuss models that allow for more variation in their \(\sigma\) parameters. In addition, we’re going to go into more detail about setting priors for our models, and the use of prior predictive checks.

8.1 Chapter pre-cap

In this chapter we introduce prior predictive checks and discuss their importance for model building. We then provide examples of how different prior settings affect model results, and discuss how to specify priors for individual model parameters. After that we introduce heteroscedastic models, that is, models with error terms that vary from observation to observation. First, we present a ‘simple’ model that allows the error term to vary across only two conditions. Following that, we present a ‘complex’ model whose error term also includes listener-dependent effects fit using shrinkage, i.e. the equivalent of listener ‘random effects’ for the error term. Finally, we discuss building identifiable models, and models supported by the available data, and describe the problems of collinearity, linear dependence, and saturated models.

8.2 Data and research questions

Below we load the data for our experiment investigating apparent speaker height, in addition to the brms and bmmb packages.

library (brms)
library (bmmb)
data (exp_data)
options (contrasts = c('contr.sum','contr.sum'))

The models we will consider below use the following variables from our data frame:

  • L: A number from 1-15 indicating which listener responded to the trial.
  • height: A number representing the height (in centimeters) reported for the speaker on each trial.
  • S: A number from 1-139 indicating which speaker produced the trial stimulus.
  • G: The apparent gender of the speaker indicated by the listener, f (female) or m (male).
  • A: The apparent age of the speaker indicated by the listener, a (adult) or c (child).
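For reference, a quick way to preview just these columns after loading the data (a minimal sketch; the exact values shown will depend on the ordering of exp_data):

# inspect the first few rows of the variables used in this chapter's models
head (exp_data[, c("L", "height", "S", "G", "A")])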

We’re going to use models whose structure is similar to the final model we fit in chapter 7 (model_interaction). However, in this chapter we’re going to focus on questions related to variation in our standard deviation parameters. We would like to know two things:

(Q1) Does our error standard deviation vary as a function of apparent speaker age?

(Q2) Does our error standard deviation vary as a function of listener?

8.3 More about priors

To this point our focus has been on understanding the components that make up our models, and how these are represented using our model parameters. Now that we’ve covered most of the essentials of using categorical predictors, we can focus a bit more on the prior distributions of the parameters in our models. The reason we’ve been able to get away with not talking about priors very much is that we have substantial domain knowledge regarding the distribution of human height as a function of age and gender. In addition, our models thus far have been fairly simple, making the consideration of prior distributions relatively straightforward.

For example, we will lay out some expected coefficient values based on the assumption that apparent height basically corresponds to veridical height, on average. Based on the information in height_data, we know that 10-12 year-old boys and girls are about 150 cm tall, and women and men are 163 and 176 cm tall respectively, on average. This means that there is a difference of about 19.5 cm ((13+26)/2) between adults and 10-12 year-old children, on average. A difference of 19.5 cm between age groups suggests about a 10 cm distance (half the group difference) between each age group and the overall intercept. In other words, if apparent height is similar to veridical height, we expect an apparent age (\(A\)) predictor of about 10 cm in magnitude given sum coding. In contrast, we have an expected height difference of 0 cm across genders for children, and 13 cm across genders for adults. This is an average difference of about 6.5 cm between males and females across both ages. As a result, we expect an effect of about 3 cm for gender averaged across ages. Based on this, our prior standard deviation of 12 cm for our b (fixed-effects) parameters seems very reasonable, although perhaps a bit large for some effects.
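The arithmetic above can be sketched out in a few lines of R. The heights are hard-coded from the values quoted in the text rather than read from height_data:

# expected effect sizes under sum coding, assuming apparent height
# roughly tracks veridical height
child = 150                 # approximate height of 10-12 year-old children (cm)
adult = (163 + 176) / 2     # average of adult female and male heights (cm)
adult - child               # ~19.5 cm difference between age groups
(adult - child) / 2         # ~10 cm expected apparent age (A) effect
gender = ((176 - 163) + 0) / 2   # ~6.5 cm gender difference, averaged over ages
gender / 2                  # ~3 cm expected apparent gender (G) effect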

Are they too large? An effect of 12 cm is reasonable for something related to human height. If we were astronomers, our priors might be on the order of 12 light-years rather than cm, where a light year is equal to \(9.461 \cdot 10^{17}\) cm. If we were studying variation in the size of E. coli bacteria, our priors might be on the order of 0.2 micrometers, or about \(2 \cdot 10^{-5}\) cm. When viewed from this perspective, worrying about a prior of 12 cm vs. 6 cm is less important than providing your model with information about the general magnitude of variation it can expect in your data and in parameter estimates. Furthermore, in many situations involving repeated measures data, you will have so many observations that slight differences in the settings of the prior will not have any noticeable effect on the posterior distributions of the model parameters (see figure 3.1). This fact makes arguments regarding which prior is ‘better’ in many situations more philosophical than practical. That being said, it is still useful and important to provide priors that conform to plausible ranges of expected parameter values.

As we noted above, this is generally straightforward for our experiment involving the perception of apparent height. However, imagine a situation where there was less certainty about reasonable values for our priors. This can happen for many reasons. For example, imagine you carry out a lexical decision task where participants listen to a combination of ‘real’ (e.g., ‘map’) and ‘fake’ words (e.g., ‘marp’) and have to decide if the word they heard is ‘real’ or not. You divide your real words into 5 groups based on a numerical measure of their frequency (how often they tend to appear in speech). To complicate matters further, you see that your reaction times are heavily skewed and so decide to model the logarithm of the reaction times. What should your priors be for your frequency groups? The prior distribution of a parameter represents the values we expect for that parameter a priori. So, what do we think are reasonable group differences in this situation? Even in this relatively simple case, setting prior distributions using only your intuitions would require understanding plausible variation in the logarithm of reaction times across word groups. This may not be realistic in this situation, or in many others.

8.3.1 Prior predictive checks

A prior predictive check can help us understand the consequences of the prior distributions of our parameters, especially in situations where these are otherwise difficult to grasp. A prior predictive check consists of generating fake data (i.e., \(\tilde{y}\)) based on linear predictors (i.e. \(\mu\)) simulated by sampling only from the prior. Conceptually, this is very similar to a posterior predictive check (discussed in section 7.4), save for the fact that prior predictive checks are not influenced by the model likelihood (or the data). To carry out a prior predictive check using brms you fit your model in the same way you normally would, except that you set sample_prior="only". When you do this, the model generates parameter values and expected values (e.g., \(\mu\)) based only on the prior distributions of the model parameters.

Below we fit model_interaction from chapter 7 again, except this time we sample only from the prior. Imagine that we did not have much knowledge about speaker heights other than that humans tend to be between 1 and 2 meters tall. Based on this we decided to be cautious and use relatively uninformative priors and set all prior standard deviations to 1000 cm (10 meters).

# Fit the model yourself
priors = c(brms::set_prior("student_t(3,156, 1000)", class = "Intercept"),
           brms::set_prior("student_t(3,0, 1000)", class = "b"),
           brms::set_prior("student_t(3,0, 1000)", class = "sd"),
           brms::set_prior("lkj_corr_cholesky (1000)", class = "cor"), 
           brms::set_prior("student_t(3,0, 1000)", class = "sigma"))

prior_uninformative =  
  brms::brm (height ~ A + G + A:G + (A + G + A:G|L) + (1|S), 
             sample_prior="only", data = exp_data, chains = 4, cores = 4, 
             warmup = 1000, iter = 5000, thin = 4, prior = priors)
# Or download it from the GitHub page:
prior_uninformative = bmmb::get_model ('8_prior_uninformative.RDS')

We can use the predict function to get the prior predictions made by our model.

pp_uninformative = predict (prior_uninformative, summary = FALSE)

When we get ‘unsummarized’ predictions we get a prediction for each set of sampled parameter values in our model. These predictions vary across samples by row and across data points by column. So, to see the predictions associated with a single sample from the prior, we need to look at a single row of this matrix, for example using a histogram as below:

hist (pp_uninformative[1,])
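To confirm the orientation of this matrix we can check its dimensions: there is one row per sample drawn from the prior and one column per observation in exp_data.

# rows correspond to prior samples, columns to individual observations
dim (pp_uninformative)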

However, the bmmb package has a simple function called p_check that can be used to consider the density of multiple samples at once. By default it compares 10 random predictions; however, the number of predictions can be changed, and the user may also specify which samples to consider.

bmmb::p_check (pp_uninformative)

In the left plot of figure 8.1 we see the result of using p_check on ten random samples from our ‘uninformative’ prior predictions. As can be seen, our ‘cautious’ approach results in a range of simulated heights that makes no sense even given our relatively limited prior knowledge, in this case that most humans are between 100 and 200 cm tall. In fact, our simulated data contains not only large negative values, which make no sense in the context of height, but also substantial values above 5000 cm, which is about the height of a 10-story office building.


Figure 8.1: Densities of 10 prior predictions of apparent height for the uninformative, mildly informative, and conservative priors.

It is difficult to prove that such a wide prior is ‘bad’, and in fact some authors have recommended the use of very uninformative priors (Kruschke 2014). However, it’s difficult to deny that a prior that results in prior predictions that are more in line with our domain knowledge is generally better, for at least two reasons. First, more informative priors can help the model converge on credible parameter values, especially for complicated models with many parameters. Second, more informative priors can help improve out-of-sample prediction by providing actual information about plausible parameter values for our model. Basically, no fancy statistical model is going to convince anyone that 50 meters is a plausible apparent height for a human speaker, because human speakers are simply not nearly that tall. As a result, a model that acts as if 50 meters is a plausible apparent height for human speakers is likely to offer worse out-of-sample prediction than a model that correctly assigns little to no prior belief to this range of apparent heights. For a more in-depth discussion of the role and function of prior probabilities in multilevel Bayesian models see Gelman et al. (2017).

Below we again sample from the prior using the setting we’ve been using so far, and that we used in the previous chapter.

# Fit the model yourself
priors = c(brms::set_prior("student_t(3,156, 12)", class = "Intercept"),
           brms::set_prior("student_t(3,0, 12)", class = "b"),
           brms::set_prior("student_t(3,0, 12)", class = "sd"),
           brms::set_prior("lkj_corr_cholesky (12)", class = "cor"), 
           brms::set_prior("student_t(3,0, 12)", class = "sigma"))

prior_mildly_informative =  
  brms::brm (height ~ A + G + A:G + (A + G + A:G|L) + (1|S), 
             sample_prior="only", data = exp_data, chains = 4, cores = 4, 
             warmup = 1000, iter = 5000, thin = 4, prior = priors)
# Or download it from the GitHub page:
prior_mildly_informative = 
  bmmb::get_model ('8_prior_mildly_informative.RDS')

We can use the p_check function again to make plots of our prior predictions, as shown in the middle plot of figure 8.1. As we can see, the priors we’ve been using allow for some implausibly large or small apparent height judgments. However, they constrain the bulk of these to values between 100 and 200 cm, which is our approximate target. In our opinion, these priors are ‘good enough’. Still, we may be unhappy that even some implausible values are being generated and want to test out even tighter priors. Below, we sample from priors that are half as wide as those of our mildly informative model above.

# Fit the model yourself
priors = c(brms::set_prior("student_t(3,156, 6)", class = "Intercept"),
           brms::set_prior("student_t(3,0, 6)", class = "b"),
           brms::set_prior("student_t(3,0, 6)", class = "sd"),
           brms::set_prior("lkj_corr_cholesky (2)", class = "cor"), 
           brms::set_prior("student_t(3,0, 6)", class = "sigma"))

prior_conservative =  
  brms::brm (height ~ A + G + A:G + (A + G + A:G|L) + (1|S), 
             sample_prior="only", data = exp_data, chains = 1, cores = 1, 
             warmup = 1000, iter = 5000, thin = 1, prior = priors)
# Or download it from the GitHub page:
prior_conservative = bmmb::get_model ('8_prior_conservative.RDS')

The prior predictions made by this model are presented in the right plot of figure 8.1. This time, we see that our prior predictions are even more tightly clustered within our desired range. For the sake of comparison, we fit the three models specified above by erasing sample_prior="only" from the respective function calls. We’ve named these models model_uninformative, model_mildly_informative, and model_conservative respectively. Below we load the models:

model_uninformative = 
  bmmb::get_model ("8_model_uninformative.RDS")

model_mildly_informative = 
  bmmb::get_model ("8_model_mildly_informative.RDS")

model_conservative = 
  bmmb::get_model ("8_model_conservative.RDS")

We use brmplot to compare the fixed effects estimates for the three models in figure 8.2. As can be seen, there is little to no difference in the parameter estimates, or in the credible intervals around them, as a function of our prior probabilities. For example, although it may seem that the A1 effect shrinks as the prior narrows, the posterior means of these effects differ by less than 0.2 cm across the three models. This is due in part to the large amount of data we have, and to the fact that our model is fairly simple relative to the information contained in our data.
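As a quick check of that claim, the A1 estimates can be pulled directly from each model. This is a minimal sketch assuming the three models are loaded as above:

# posterior means of the apparent age (A1) effect for each of the three models
round (c(UI = brms::fixef (model_uninformative)["A1","Estimate"],
         MI = brms::fixef (model_mildly_informative)["A1","Estimate"],
         C  = brms::fixef (model_conservative)["A1","Estimate"]), 2)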


Figure 8.2: Comparison of fixed effects estimates and 95% credible intervals for the uninformative (UI), mildly informative (MI), and conservative (C) models.

8.3.2 More specific priors

The final model above was named ‘conservative’ because its priors may be too narrow. As noted above, we actually expect an effect of about 10 cm for apparent age, based on variation in veridical height between children and adults, and of about 3 cm for apparent gender. When we set our priors using only the class argument of the set_prior function we can specify priors for many parameters at once, but we lose the ability to fine-tune priors for specific parameters. For example, we might want a model with a prior standard deviation of 10 cm for the age effect, and a prior standard deviation of 3 cm for the gender effect. We can use the prior_summary function in brms to see which priors can be set for our model. The output of this function is formatted in the same way as that of the get_prior function, discussed in sections 3.7 and 4.5.3. Actually, we’re going to use a clone of prior_summary in the bmmb package, just because the one in brms doesn’t like it when you omit columns from the output when printing.

# we omit empty columns to let the output fit on the page 
bmmb::prior_summary(model_mildly_informative)[,-c(5:9)]
##                  prior     class      coef group  source
##     student_t(3,0, 12)         b                    user
##     student_t(3,0, 12)         b        A1       default
##     student_t(3,0, 12)         b     A1:G1       default
##     student_t(3,0, 12)         b        G1       default
##   student_t(3,156, 12) Intercept                    user
##  lkj_corr_cholesky (2)         L                    user
##  lkj_corr_cholesky (2)         L               L default
##     student_t(3,0, 12)        sd                    user
##     student_t(3,0, 12)        sd               L default
##     student_t(3,0, 12)        sd        A1     L default
##     student_t(3,0, 12)        sd     A1:G1     L default
##     student_t(3,0, 12)        sd        G1     L default
##     student_t(3,0, 12)        sd Intercept     L default
##     student_t(3,0, 12)        sd               S default
##     student_t(3,0, 12)        sd Intercept     S default
##     student_t(3,0, 12)     sigma                    user

By the way, you can get the same information from your models by running the command model_mildly_informative$prior. We can copy the values of the class, coef, and group columns in the output above to set more specific prior probabilities for the different parameters in our model. Below, we set some arbitrarily selected priors just to show how priors can be set for more specific groups of parameters.

# we omit the 'brms::' part so that the lines below will fit on the page
priors = 
  c(set_prior("student_t(3,156, 6)", class = "Intercept"),
    set_prior("student_t(3,0, 6)", class = "b"),
    set_prior("student_t(3,0, 10)", class = "b", coef = "A1"),
    set_prior("student_t(3,0, 3)", class = "b", coef = "G1"),
    set_prior("student_t(3,0, 10)", class = "sd"),
    set_prior("student_t(3,0, 5)", class = "sd", coef = "A1", group="L"),
    set_prior("student_t(3,0, 1.5)", class = "sd", coef = "G1", group="L"),
    set_prior("lkj_corr_cholesky (2)", class = "cor"), 
    set_prior("student_t(3,0, 6)", class = "sigma"))

options (contrasts = c('contr.sum','contr.sum'))
prior_informative =  
  brms::brm (height ~ A + G + A:G + (A + G + A:G|L) + (1|S), 
             sample_prior = "only", data = exp_data, chains = 1, cores = 1, 
             warmup = 1000, iter = 5000, thin = 1, prior = priors)

If we inspect the priors for this model and compare this to the previous one, we can see that now there is between-parameter variation in our priors.

bmmb::prior_summary(prior_informative)[,-c(5:8)]
##                  prior     class      coef group  source
##      student_t(3,0, 6)         b                    user
##     student_t(3,0, 10)         b        A1          user
##     student_t(3,0, 10)         b     A1:G1       default
##      student_t(3,0, 3)         b        G1          user
##    student_t(3,156, 6) Intercept                    user
##  lkj_corr_cholesky (2)         L                    user
##  lkj_corr_cholesky (2)         L               L default
##     student_t(3,0, 10)        sd                    user
##     student_t(3,0, 10)        sd               L default
##      student_t(3,0, 5)        sd        A1     L    user
##      student_t(3,0, 5)        sd     A1:G1     L default
##    student_t(3,0, 1.5)        sd        G1     L    user
##    student_t(3,0, 1.5)        sd Intercept     L default
##    student_t(3,0, 1.5)        sd               S default
##    student_t(3,0, 1.5)        sd Intercept     S default
##      student_t(3,0, 6)     sigma                    user

8.4 Heteroscedasticity and distributional (or mixture) models

Most ‘typical’ regression models assume that the error variance is the same for all observations, all conditions, all listeners, and so on. The property of having a homogeneous variance is called homoscedasticity, and models that make this assumption are homoscedastic. Historically, the homogeneity of error variance has been assumed in models because it makes them simpler to work with and understand, not because it is ‘true’ for all (or any) data. With Bayesian models it is straightforward to relax and test this assumption, and to build models that exhibit heteroscedasticity, i.e. differences in the error variance across different subsets of the data. These sorts of models are sometimes called mixture models, because they can be thought of as a mixture of different distributions, or distributional models, because they allow us to model variation in all of the parameters of the data distribution (e.g. \(\sigma\) in addition to \(\mu\)).

Consider the residuals for the random effects model we fit in chapter 7, which we can get from the model using the residuals function:

# download model
model_interaction = bmmb::get_model ('7_model_interaction.RDS')
# get residuals
residuals_interaction = residuals (model_interaction)

If we inspect the residuals in figure 8.3 we see that their distribution is not exactly equal for all listeners, and also appears to vary based on apparent speaker age. For example, the interquartile range for listener 9 is nearly as wide as the entire distribution of residuals for listener 12. In this section we will fit models with age-specific and listener-specific error variances (\(\sigma\)). By allowing the random error to vary as a function of listener and apparent age, our model may be able to provide more reliable information regarding our data, in addition to letting us ask questions like “is listener 12’s error variance actually smaller than listener 9’s, or does it just seem that way?” in a more formal manner.


Figure 8.3: (top left) Distribution of residuals for each listener. (top right) Distribution of residuals for apparent adults (a) and apparent children (c). (bottom) Distribution of residuals for each listener, divided according to apparent children and apparent adults. Each color represents a different listener and the left box in each pair represents apparent adults.
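A minimal sketch of how distributions like those in figure 8.3 might be drawn from the residuals (this is not the authors' plotting code, and the axis labels are illustrative):

# residuals() returns a matrix whose "Estimate" column holds the posterior mean
# residual for each observation; plot these by listener and by apparent age
boxplot (residuals_interaction[,"Estimate"] ~ exp_data$L,
         xlab = "Listener", ylab = "Residual (cm)")
boxplot (residuals_interaction[,"Estimate"] ~ exp_data$A,
         xlab = "Apparent age", ylab = "Residual (cm)")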

To this point our models have all looked something like (8.1), with the generic predictors \(x_p\). In these models \(\mu\) got a subscript because it varied from trial to trial and was being predicted, but \(\sigma\) has never received a subscript because it hasn’t varied in our models.

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma) \\ \mu_{[i]} = \mathrm{x}_1 + \mathrm{x}_2+ \dots + \mathrm{x}_p \\ \end{split} \tag{8.1} \end{equation} \]

Instead, we can fit models that help us understand systematic variation in both the mean and the \(\sigma\) parameters in our data. We can imagine this by adding a subscript to \(\sigma\) and predicting its value from trial to trial using variables, as seen in (8.2). We’ve given these predictor variables little \(\sigma\) subscripts just to distinguish them conceptually from those predicting \(\mu\), but they could be the same variables used to understand \(\mu\) or a different set.

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{N}(\mu_{[i]},\sigma_{[i]}) \\ \mu_{[i]} = \mathrm{x}_1 + \mathrm{x}_2+ \dots + \mathrm{x}_p \\ \sigma_{[i]} = \mathrm{x}_{\sigma1} + \mathrm{x}_{\sigma2} + \dots + \mathrm{x}_{\sigma p} \\ \\ \end{split} \tag{8.2} \end{equation} \]

8.5 A ‘simple’ model: Error varies according to a single fixed effect

We’ll begin with a ‘simple’ heteroscedastic model that allows for variation in \(\sigma\) based on apparent age. After this, we’ll fit a model that includes listener-dependent variation in \(\sigma\) as well.

8.5.1 Description of our model

The brms package makes it easy to fit heteroscedastic models by letting you write formulas for the error just as you do for the mean. For example, the model formula for the last model we fit in chapter 7 (model_interaction) was:

height ~ A + G + A:G + (A + G + A:G|L) + (1|S)

This told brms “model apparent height as a function of an intercept, an effect for apparent age, an effect for apparent gender, the interaction of the two, listener-specific adjustments to all predictors, and speaker-specific intercepts”. Implicitly, we know that all of this models variation in the \(\mu\) parameter of a normal distribution with a fixed \(\sigma\). We want our model to consider the possibility that the value of \(\sigma\) may vary based on whether the speaker is judged to be an adult or a child. We can do this by including a separate formula for our error (called sigma by brms) using the following formula:

sigma ~ A

This formula tells brms “model the standard deviation of the error as varying around an overall intercept and deviations of this based on A, a predictor indicating apparent age”. We can ‘stick’ our two model formulas together using the bf (brmsformula) function as shown below:

model_formula = brms::bf(height ~ A + G + A:G + (A + G + A:G|L) + (1|S),
                         sigma ~ A)

The problem with modeling the standard deviation parameter directly is that negative values are impossible: standard deviations are bounded at zero. To deal with this, brms models log(sigma), the logarithm of the error standard deviation, rather than sigma itself. The logarithm maps values of \(\sigma\) between 0 and 1 to negative numbers and values above 1 to positive numbers, so the model can estimate log(sigma) anywhere on the real line, including very small values of \(\sigma\), without ever having to worry about estimating a negative standard deviation (for more on logarithms see section 2.7.2).
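For example, the model is free to estimate any real value for log(sigma), and exponentiating always returns a positive value of \(\sigma\):

# any real value on the log scale maps back to a positive standard deviation;
# 1.72 is roughly the sigma intercept estimated later in this chapter
round (exp (c(-3, 0, 1.72)), 4)
## [1] 0.0498 1.0000 5.5845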

Since we’re modeling log-transformed sigmas, we now specify the priors for sigma on the logarithmic scale. Previously, our prior for sigma was a t-distribution with a standard deviation of 12, meaning that the majority of the mass of the distribution fell between 0 and 24 cm (within two standard deviations of the mean). Now we set the prior standard deviation to 1.5, meaning that we expect most values related to \(\sigma\) to fall between \(\exp(-3) \approx 0.05\) and \(\exp(3) \approx 20.1\) cm on the original scale. The full model specification is provided below (omitting the ‘construction’ of the \(\mathrm{\Sigma}\) term):

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{t}(\nu, \mu_{[i]},\sigma_{[i]}) \\ \mu_{[i]} = \mathrm{Intercept} + A + G + A \colon G + \\ L_{[\mathsf{L}_{[i]}]} + A \colon L_{[\mathsf{L}_{[i]}]} + G \colon L_{[\mathsf{L}_{[i]}]} + A \colon G \colon L_{[\mathsf{L}_{[i]}]} + S_{[\mathsf{S}_{[i]}]} \\ \\ \log (\sigma_{[i]}) = \mathrm{Intercept_{\sigma}} + A_{\sigma} \\ \\ \mathrm{Priors:} \\ S_{[\bullet]} \sim \mathrm{t}(3,0,\sigma_S) \\ \begin{bmatrix} L_{[\bullet]} \\ A \colon L_{[\bullet]} \\ G \colon L_{[\bullet]} \\ A \colon G \colon L_{[\bullet]} \end{bmatrix} \sim \mathrm{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}, \mathrm{\Sigma} \right) \\ \\ \mathrm{Intercept} \sim \mathrm{t}(3,156,12) \\ A, G, A \colon G \sim \mathrm{t}(3,0,12) \\ \sigma_L, \sigma_{A \colon L}, \sigma_{G \colon L}, \sigma_{A \colon G \colon L}, \sigma_S \sim \mathrm{t}(3,0,12) \\ \nu \sim \mathrm{gamma}(2, 0.1) \\ R \sim \mathrm{LKJCorr} (2) \\ \mathrm{Intercept_{\sigma}} \sim \mathrm{N}(0,1.5) \\ A_\sigma \sim \mathrm{N}(0,1.5) \\ \end{split} \tag{8.3} \end{equation} \]

A plain English description of this model is:

We are modeling apparent height as coming from a t distribution with unknown nu (\(\nu\)), mean (\(\mu\)), and scale (\(\sigma\)) parameters. The expected value for any given trial (\(\mu\)) is modeled as the sum of an intercept, an effect for apparent speaker age (\(A\)), gender (\(G\)), and their interaction (\(A \colon G\)), a listener effect (\(L\)), a listener dependent effect for apparent age (\(A \colon L\)), apparent gender (\(G \colon L\)), their interaction (\(A \colon G \colon L\)), and a speaker effect (\(S\)). The expected value of the logarithm of the error standard deviation (\(\sigma\)) is modeled as the sum of an intercept (\(Intercept_{\sigma}\)) and an effect for apparent age (\(A_\sigma\)) that are specific to the error. The speaker effects were drawn from a univariate normal distribution with a standard deviation (\(\sigma_S\)) estimated from the data. The four listener effects were drawn from a multivariate normal distribution with individual standard deviations (\(\sigma_L,\sigma_{A \colon L},\sigma_{G \colon L},\sigma_{A \colon G \colon L}\)) and a correlation matrix (\(R\)) that was estimated from the data. The remainder of the ‘fixed’ effects and correlations were given prior distributions appropriate for their expected range of values, including the terms related to the error (\(Intercept_{\sigma}\), and \(A_\sigma\)).

This model is quite large, but it only contains four new (or modified) lines compared to the model presented in (7.8). These lines are presented in (8.4), and effectively represent a simple two-group model for our \(\sigma\) parameter that has been added to our model predicting \(\mu\). Our model now uses a trial-specific error term for trial \(i\) (\(\sigma_{[i]}\)), whereas it previously used a single constant standard deviation. Further, we are now modeling this trial-specific \(\sigma\) parameter using an intercept specific to our sigma term (\(\mathrm{Intercept_{\sigma}}\)), and a deviation from this based on apparent age (\(A_{\sigma}\)).

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{t}(\nu, \mu_{[i]},\sigma_{[i]}) \\ \log (\sigma_{[i]}) = \mathrm{Intercept_{\sigma}} + A_{\sigma} \\ \\ \mathrm{Intercept_{\sigma}} \sim \mathrm{N}(0,1.5) \\ A_\sigma \sim \mathrm{N}(0,1.5) \\ \end{split} \tag{8.4} \end{equation} \]

8.5.2 Prior predictive checks

Recall that when we use the set_prior function we can use the class parameter to specify priors for whole classes of parameters at a time. The classes we have discussed so far are:

  • Intercept: this is a unique class, only for intercepts.
  • b: this is for the ‘fixed effect’ (population-level) predictors, e.g. A1 or G1.
  • sd: this is for our standard deviation parameters that relate to ‘batches’ of parameters, e.g. sd(Intercept) for L (\(\sigma_{L}\)).
  • sigma: the error term.
  • cor: priors for ‘random effect’ correlation terms.

For the first time we need to specify priors for terms related to \(\sigma\) rather than \(\mu\). We can do this by setting dpar="sigma" in the set_prior function. As before, you can see which priors need setting, and which parameters need to be specified for a model, by using the get_prior function and providing your formula, data, and family.

brms::get_prior (brms::bf(height ~ A*G + (A*G|L) + (1|S),
                          sigma ~ A),
                 data = exp_data, family="student")

If you run the command above you will see output that includes the lines below:

##                      prior     class      coef group resp  dpar
##                     (flat)         b                      sigma            
##                     (flat)         b        A1            sigma            
##       student_t(3, 0, 2.5) Intercept                      sigma            

We have two parameters where dpar=sigma: An overall intercept (class = "Intercept", dpar = "sigma"), and an age term for sigma (class = "b",coef="A1",dpar="sigma"). We can set priors for these elements using the lines:

brms::set_prior("normal(0, 1.5)", class = "Intercept", dpar = "sigma")
brms::set_prior("normal(0, 1.5)", class = "b", dpar = "sigma")

Notice that we set these priors using the classes we have already discussed, Intercept and b respectively. However, now we set dpar (distributional parameter) to sigma to tell brm that these priors are specifically for parameters related to the model error term, and not to variation in predicted values (i.e. \(\mu\)). Below we fit the model sampling only from our priors:

# Fit the model yourself
model_formula = brms::bf(height ~ A*G + (A*G|L) + (1|S),
                         sigma ~ A)
priors = 
  c(set_prior("student_t(3, 156, 12)", class = "Intercept"),
    set_prior("student_t(3, 0, 12)", class = "b"),
    set_prior("student_t(3, 0, 12)", class = "sd"),
    set_prior("gamma(2, 0.1)", class = "nu"),
    set_prior("normal(0, 1.5)", class = "Intercept", dpar = "sigma"),
    set_prior("normal(0, 1.5)", class = "b", dpar = "sigma"),
    set_prior("lkj_corr_cholesky (2)", class = "cor"))

prior_A_sigma = 
  brms::brm (model_formula, data = exp_data, chains = 4, cores = 4,
             warmup = 1000, iter = 3500, thin = 2, family="student",
             prior = priors, sample_prior = "only")
# Or download it from the GitHub page:
prior_A_sigma = bmmb::get_model ('8_prior_A_sigma.RDS')

We can investigate our prior predictions using the code below. We don’t plot them here, but we think the predictions are reasonable.

p_check (prior_A_sigma)

8.5.3 Fitting and interpreting the model

Below, we specify both the model formula (using the bf function) and the priors (using the set_prior function) outside of the call to brm, to keep the code legible.

# Fit the model yourself
model_formula = brms::bf(height ~ A*G + (A*G|L) + (1|S),
                         sigma ~ A)
priors = 
  c(set_prior("student_t(3, 156, 12)", class = "Intercept"),
    set_prior("student_t(3, 0, 12)", class = "b"),
    set_prior("student_t(3, 0, 12)", class = "sd"),
    set_prior("gamma(2, 0.1)", class = "nu"),
    set_prior("normal(0, 1.5)", class = "Intercept", dpar = "sigma"),
    set_prior("normal(0, 1.5)", class = "b", dpar = "sigma"),
    set_prior("lkj_corr_cholesky (2)", class = "cor"))

model_A_sigma = 
  brms::brm (model_formula, data = exp_data, chains = 4, cores = 4,
             warmup = 1000, iter = 3500, thin = 2, family="student",
             prior = priors)
# Or download it from the GitHub page:
model_A_sigma = bmmb::get_model ('8_model_A_sigma.RDS')

If we look at the model ‘fixed’ (population-level) effects:

fixef (model_A_sigma)
##                 Estimate Est.Error    Q2.5    Q97.5
## Intercept        158.383   1.10922 156.193 160.6196
## sigma_Intercept    1.724   0.03265   1.659   1.7881
## A1                11.271   1.19021   8.915  13.6901
## G1                -3.067   0.59091  -4.271  -1.9084
## A1:G1             -1.528   0.43464  -2.403  -0.6654
## sigma_A1          -0.247   0.02199  -0.290  -0.2036

We see two new lines reflecting our new model parameters, both in the ‘Population-level effects’: sigma_Intercept and sigma_A1. These terms represent the intercept of our sigma term and the effect of apparent adultness on sigma. These effects can be interpreted in much the same way as the corresponding effects for \(\mu\), Intercept and A1. The value of sigma_Intercept represents the overall mean of our (log) sigma term. The value of sigma_A1 reflects the difference in (log) sigma between the intercept and the value when speakers were perceived as adults. Since we are using sum coding, A2, the effect for apparent children, is not estimated. However, we can recover A2 since A2 = -A1. One easy way to do this is using the hypothesis function (or its clone in bmmb) as follows:

sigmas = short_hypothesis(
  model_A_sigma, 
  c("exp(sigma_Intercept) = 0",                # overall sigma
    "exp(sigma_Intercept + sigma_A1) = 0",     # adult sigma
    "exp(sigma_Intercept - sigma_A1) = 0"))    # child sigma
sigmas[,-5]
##    Estimate Est.Error  Q2.5 Q97.5
## H1    5.609    0.1830 5.253 5.978
## H2    4.382    0.1526 4.085 4.680
## H3    7.184    0.3115 6.573 7.793

Since our model estimates log(sigma) and not sigma directly, we need to exponentiate our parameter estimates in order to recover the actual model error. We can do this directly in short_hypothesis, as seen above. The three estimated error terms are presented in figure 8.4. Clearly, there are substantial differences in the value of \(\sigma\) based on the perceived adultness of the speaker: The value of \(\sigma\) was about 64% greater (7.18/4.38) when listeners identified the speaker as a child.
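The ratio behind that statement, computed from the estimates above:

# ratio of the apparent-child to apparent-adult error estimates
round (7.184 / 4.382, 2)
## [1] 1.64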


Figure 8.4: (left) Estimates of overall data-level error (\(\sigma\)), compared to estimates of this for apparent adults and for apparent children. (right) Comparison of some fixed-effects parameters shared by our heteroscedastic and homoscedastic models.

We might wonder two things about this new model. First, is there any difference in our fixed-effect parameters? Second, is the additional model complexity justified? We can compare this heteroscedastic model to the equivalent homoscedastic model we fit in chapter 7 (model_interaction, loaded above). We can assess the first question visually using the brmplot function, as seen in the right plot of figure 8.4. It doesn’t seem like the change has had much impact on our parameter estimates, or on the intervals around them. We can make a more formal comparison using cross-validation (discussed in section 6.4.3). We add the loo (leave-one-out) criterion to each model and make a comparison.

model_interaction = add_criterion(model_interaction,"loo")
model_A_sigma = add_criterion(model_A_sigma,"loo")

loo_compare (model_interaction, model_A_sigma)
##                   elpd_diff se_diff
## model_A_sigma       0.0       0.0  
## model_interaction -53.7      12.8

We can see that there is a substantial difference in \(\mathrm{elpd}\) between models, and that this difference is just over four times the size of the standard error of the difference. Based on this, it would seem that using the more complex model is justifiable. We will leave a discussion of whether the difference makes sense or matters for the section on answering our research questions.

8.6 A ‘complex’ model: Error varies according to fixed and random effects

In the previous section we fit a relatively ‘simple’ heteroscedastic model. This model allowed for error variances to differ based on the perceived adultness of the speaker, and so was very similar to the models we fit in chapter 5. Here, we’re going to fit a substantially more complex model that predicts variation in \(\sigma\) using listener-dependent ‘random effects’ in the same way we would for our prediction of \(\mu\).

8.6.1 Description of our model

Our previous model had the following formula for the sigma term:

sigma ~ A

This told brms “model sigma as a function of an intercept and an effect for apparent age”. We can add to our formula by specifying a ‘random effect’ term for listeners, and a ‘random slope’ for age (i.e. an interaction between apparent age and listener), both estimated using adaptive pooling.

sigma ~ A + (A|L)

This formula tells brms “model sigma as varying around an overall intercept, an effect for apparent age, listener-specific deviations, and the interaction of apparent age and listener”. Effectively, we have two separate model formulas: One for a model with two categorical predictors and an interaction (like those in chapter 7), and one for a model comparing two groups (like those in chapter 6). We can ‘stick’ our two model formulas together using the bf (or brmsformula) function as shown below:

model_formula = brms::bf(height ~ A*G + (A*G|L) + (1|S),
                         sigma ~ A + (A|L))

If we were to fit the model using the formula above, our model would include independent multivariate normal distributions for the listener effects related to the mean and those related to the error term. So, we would have a four-dimensional normal distribution related to the expected mean (\(\mu\)), representing the listener intercept, the listener-dependent effects for apparent age and gender, and the listener-dependent interaction between the two. We would then have a separate two-dimensional normal distribution representing the listener-dependent intercepts and effects for apparent age related to the error term (\(\sigma\)). This may be desirable in some situations; however, it may also be useful to estimate all of these listener-dependent coefficients from a single six-dimensional multivariate normal distribution.

Right now, our model formula consists of two separate sub-formulas, each of which contains a term representing listeners (L). If we want our model to know that the L term predicting the mean is related to the L term predicting our error, then we can do this by adding the same label in each corresponding random effects term in parentheses, in between two pipes. Below, we use the label |x| to tell brms to estimate all six listener-dependent parameters jointly. The label can be almost any string, it just needs to be the same thing for every batch of random effects terms that you want to estimate together.

model_formula = brms::bf(height ~ A*G + (A*G|x|L) + (1|S),
                         sigma ~ A + (A|x|L))

Just keep in mind, you should not use the same label for things you do not want grouped. So, if we wanted to estimate speaker effects for our error and wanted to associate these with the speaker effects for our means, we should use different labels as seen below.

model_formula = brms::bf(height ~ A*G + (A*G|x|L) + (1|y|S),
                         sigma ~ A + (A|x|L) + (1|y|S))

We’re not actually going to fit a model like that, as we don’t think we have enough observations per speaker (15) to justify the estimation of speaker-dependent errors given the complexity of our model. However, we wanted to provide an example of how this would be done. The full model specification is provided below:

\[ \begin{equation} \begin{split} height_{[i]} \sim \mathrm{t}(\nu, \mu_{[i]},\sigma_{[i]}) \\ \mu_{[i]} = \mathrm{Intercept} + A + G + A \colon G + \\ L_{[\mathsf{L}_{[i]}]} + A \colon L_{[\mathsf{L}_{[i]}]} + G \colon L_{[\mathsf{L}_{[i]}]} + A \colon G \colon L_{[\mathsf{L}_{[i]}]} + S_{[\mathsf{S}_{[i]}]} \\ \\ \log (\sigma_{[i]}) = \mathrm{Intercept_{\sigma}} + A_\sigma + L_{\sigma[\mathsf{L}_{[i]}]} + A_\sigma \colon L_{\sigma[\mathsf{L}_{[i]}]} \\ \\ \mathrm{Priors:} \\ \begin{bmatrix} L_{[\bullet]} \\ A \colon L_{[\bullet]} \\ G \colon L_{[\bullet]} \\ A \colon G \colon L_{[\bullet]} \\ L_{\sigma[\bullet]} \\ A_\sigma \colon L_{\sigma[\bullet]} \end{bmatrix} \sim \mathrm{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{bmatrix}, \mathrm{\Sigma} \right) \\ \\ S_{[\bullet]} \sim \mathrm{N}(0,\sigma_S) \\ \\ \mathrm{Intercept} \sim \mathrm{t}(3,156,12) \\ A, G, A \colon G \sim \mathrm{t}(3,0,12) \\ \sigma_L, \sigma_{A \colon L}, \sigma_{G \colon L}, \sigma_{A \colon G \colon L}, \sigma_S \sim \mathrm{t}(3,0,12) \\ \nu \sim \mathrm{gamma}(2, 0.1) \\ \\ \mathrm{Intercept_{\sigma}} \sim \mathrm{N}(0,1.5) \\ \mathrm{A_{\sigma}} \sim \mathrm{N}(0,1.5) \\ \sigma_{L_\sigma}, \sigma_{A_\sigma \colon L_\sigma} \sim \mathrm{N}(0,1.5) \\\\ R \sim \mathrm{LKJCorr} (2) \\ \end{split} \tag{8.5} \equation} \]

Since our plain English description is nearly identical to the one provided for our previous model, the description below focuses on the components related to our error term \(\sigma\):

We are modeling apparent height as coming from a t distribution with unknown nu (\(\nu\)), mean (\(\mu\)), and scale (\(\sigma\)) parameters. The expected value of the logarithm of the error standard deviation (\(\sigma\)) is modeled as the sum of an intercept (\(Intercept_{\sigma}\)), an effect for apparent age (\(A_\sigma\)), an effect for listener (\(L_\sigma\)), and the interaction between apparent age and listener (\(A_\sigma \colon L_\sigma\)). The two listener effects related to the error term were drawn from a multivariate normal distribution with individual standard deviations (\(\sigma_{L_\sigma}\),\(\sigma_{A_\sigma \colon L_\sigma}\)) and a correlation matrix (\(R\)) that was estimated from the data. The remainder of the ‘fixed’ effects and correlations were given prior distributions appropriate for their expected range of values.

8.6.2 Fitting and interpreting the model

We need to specify a prior distribution for the standard deviation of our listener-dependent sigma parameters, \(\sigma_{L_\sigma}\) and \(\sigma_{A_\sigma \colon L_\sigma}\). We can do this by specifying a prior for class = "sd" and dpar = "sigma". Note that this is very similar to the way we fit priors for the intercept and fixed effects for sigma.

# Fit the model yourself
model_formula = brms::bf(height ~ A*G + (A*G|x|L) + (1|S),
                         sigma ~ A + (A|x|L))
priors = 
  c(set_prior("student_t(3, 156, 12)", class = "Intercept"),
    set_prior("student_t(3, 0, 12)", class = "b"),
    set_prior("student_t(3, 0, 12)", class = "sd"),
    set_prior("gamma(2, 0.1)", class = "nu"),
    set_prior("normal(0, 1.5)", class = "Intercept", dpar = "sigma"),
    set_prior("normal(0, 1.5)", class = "b", dpar = "sigma"),
    set_prior("normal(0, 1.5)", class = "sd", dpar = "sigma"),
    set_prior("lkj_corr_cholesky (2)", class = "cor"))

model_A_L_sigma = 
  brms::brm (model_formula, data = exp_data, chains = 4, cores = 4,
             warmup = 1000, iter = 3500, thin = 2, family="student",
             prior = priors)

Below we print an abridged model summary. The only change is that there are a few new lines in the listener Group-Level Effects. First, we see sd(sigma_Intercept) and sd(sigma_A1), which represent the standard deviations of the listener-specific intercepts and of the age effects (for sigma). Second, we see the correlations between our new random effects (i.e. cor(sigma_Intercept,sigma_A1)).

bmmb::short_summary (model_A_L_sigma)
## ...
## sd(sigma_Intercept)                0.36      0.08     0.24     0.54
## sd(sigma_A1)                       0.17      0.04     0.10     0.27
## ...
## cor(A1,sigma_A1)                  -0.29      0.22    -0.67     0.16
## cor(G1,sigma_A1)                   0.22      0.24    -0.27     0.65
## cor(A1:G1,sigma_A1)                0.12      0.24    -0.35     0.58
## cor(sigma_Intercept,sigma_A1)     -0.44      0.21    -0.79     0.05
## ...
## 
## Population-Level Effects:
##                 Estimate Est.Error l-95% CI u-95% CI
## Intercept         158.29      1.10   156.15   160.51
## sigma_Intercept     1.74      0.10     1.55     1.93
## A1                 11.28      1.18     8.98    13.64
## G1                 -2.92      0.57    -4.06    -1.80
## A1:G1              -1.63      0.42    -2.47    -0.77
## sigma_A1           -0.23      0.05    -0.33    -0.14
## 
## Family Specific Parameters:
##    Estimate Est.Error l-95% CI u-95% CI
## nu     8.07      1.54     5.71    11.72

Our listener-specific adjustments to the sigma intercept were fit with adaptive pooling, i.e. they are ‘random’ effects. As such, we can get these in the same way as other random effects terms (see section 5.8.2). Below, we use short_hypothesis to get the listener random effects for sigma (\(L_{\sigma}\)) by setting scope = "ranef". We get the sum of the listener sigma effects and the sigma intercept, \(\mathrm{Intercept}_{\sigma} + L_{\sigma[i]}\) for listener \(i\), by setting scope = "coef".

# listener random effects for sigma
log_sigmas_centered = short_hypothesis(model_A_L_sigma, "sigma_Intercept=0", 
                                   scope = "ranef",group="L")

# the sum of the sigma intercept and the listener sigma random effect
log_sigmas = short_hypothesis(model_A_L_sigma, "sigma_Intercept=0",
                              scope = "coef",group="L")

# the exponent of the sum of the sigma intercept 
# and the listener sigma random effect
sigmas = short_hypothesis(model_A_L_sigma, "exp(sigma_Intercept)=0",
                              scope = "coef",group="L")

The log_sigmas calculated above represent the logarithm of the listener-specific error parameter. We can get the error terms themselves by exponentiating the parameter within the hypothesis function, as shown for the calculation of sigmas above. Figure 8.5 compares the listener-dependent error effects (log_sigmas_centered) to the listener-dependent error terms on their original scale (sigmas).


Figure 8.5: (left) Values of \(L_{\sigma}\), listener-dependent variations from the sigma intercept (expressed as a logarithm). (right) Listener-specific sigma parameters. The result of \(\exp(\mathrm{Intercept}_{\sigma} + L_{\sigma[i]})\) for listener \(i\). We exponentiate the value in the plot so it reflects sigma rather than log(sigma).

It certainly seems as though there is legitimate between-listener variation in the size of \(\sigma\). We might wonder, is this added complexity a good thing? We can add the loo criterion to our model:

model_A_L_sigma = add_criterion(model_A_L_sigma, "loo")

And compare this model, the previous ‘simple’ heteroscedastic model (model_A_sigma), and the homoscedastic model we fit in the previous chapter (model_interaction).

loo_compare (model_interaction, model_A_sigma, model_A_L_sigma)
##                   elpd_diff se_diff
## model_A_L_sigma      0.0       0.0 
## model_A_sigma     -122.1      15.6 
## model_interaction -175.8      20.8

This time we find a very large difference in \(\mathrm{elpd}\) that is 7.8 times greater than the standard error of the difference. Again, it seems like the more complex model is justified and arguably ‘better’.

8.7 Answering our research questions

The research questions we posed above were:

(Q1) Does our error standard deviation vary as a function of apparent speaker age?

(Q2) Does our error standard deviation vary as a function of listener?

Based on the information we’ve already provided, we can answer these questions: Yes and yes. Further, we think these results are reasonable based on what we know about the heights of adults and children, and about the average speaker’s knowledge of the heights of adults and children. For example, we think most people have a good handle on how tall adults tend to be. Does the average person know how tall 10-12 year-old children tend to be? We don’t think so. In addition, 10-12 year old children can show substantial variation in height based on differences in growth rates, which might also lead people to provide more variable estimates for the height of children. We think it also ‘makes sense’ that different listeners could be more or less systematic (i.e. predictable) than others, for many reasons.

We’re going to add another, more meta, question: Which model should we report? We can approach this from the perspective that the best model according to \(\mathrm{elpd}\) is the ‘real’ model. Based on the comparison of \(\mathrm{elpd}\) above, it seems that the last model we fit (model_A_L_sigma) is the ‘best’ model. Does this mean it is the ‘real’ model? And does it mean that this is the one we must report? The short answers to these questions are no and no. The fit between a model and data can’t ‘prove’ that the model represents the ‘real’ relationship underlying the data. The reason for this is that there are potentially a large number of other, slightly different, models that provide just about as good a fit to the data, or perhaps even a better one. This problem is called the underdetermination of theory by data: basically, there are often (or always) multiple compatible interpretations of a set of observations, and our observations alone are not enough to distinguish between them.

Specifically with respect to our data, if the best model is the real one, how can we ever know we have the best model? How many did we try before settling on the ‘best’ one? Further, even if we did somehow find the best model for the data we do have, how could we guarantee that it will also be the best model for the data we don’t have? It’s impossible to know if some new data might come along that will not fit your model as well as one of the other models, making the ‘best’ model contingent on the data we have seen so far.

Ok, so we can’t prove that our best model is the real one, fine. But it’s still the best one we have. Do we necessarily need to report the ‘best’ model? This sort of reasoning can lead to serious problems because there is almost always a better model out there, and we often don’t try to find it. For example, many people would never think to fit a heteroscedastic model and so would not even worry about reporting it. Does the existence of a hypothetical better model mean that the model they report is invalid? We don’t think so. If it doesn’t invalidate their simpler model, then why should this be the case for the researcher that did think to try the heteroscedastic model? This would seem to penalize people for bothering to look for such variation in their data, or for knowing how to fit heteroscedastic models.

In general, we can imagine that 10 people might approach any given research question in 10 different ways, a concept sometimes referred to as researcher degrees of freedom (Gelman and Loken 2013) since it reflects model variation due to variation in researcher decisions rather than the data. Slight differences in model structure and data processing result in slight differences in model outputs, resulting in a sort of ‘distribution’ for any given result. How can a fixed underlying reality result in a distribution of results? When they are all slightly wrong. Rather than think of our models as representations of reality, we can think of them as mathematical implementations of our research questions. The model we report should include the information and structure that we think are necessary to represent and investigate our research questions. Using a different model can result in different results given the same data, but asking a different question can also lead to different results given the same data. When interpreting a statistical analysis, it is very useful to keep in mind that results are contingent on:

  1. The data you collected. Given other data you may have come to different conclusions.

  2. The model you chose. Given another model you may have come to different conclusions.

So which model should we report? One way to think about our models is that they provide different information about our data. If we are primarily interested in the fixed effects then maybe we are not interested in additional information related to the error term. However, if we want to know about differences in error variation then we need one of the more complicated models. It’s also worth considering how the differing model structures might affect the information they share in common. Figure 8.6 compares the fixed effects, random effects, and predicted values made by three models: model_interaction from chapter 7, and model_A_sigma and model_A_L_sigma from this chapter.


Figure 8.6: (left) Comparison of fixed effect estimates and 95% credible intervals for three models. We subtracted 159 from the model intercept (Int.) so that it would fit on the same scale as the other parameters. (right) Plots comparing parameter estimates, or predictions, for pairs of the models presented in the left plot. Each row compares a different pair of models and each column presents a different set of parameters.

These three models have the same structure when it comes to predicting the expected value of the normal distribution (\(\mu\)); however, they differ in the prediction of the error term. The prediction equation for sigma for each model is presented below. We didn’t actually specify a prediction equation for sigma in model_interaction, but that model is equivalent to one using the first formula below, which fits a single sigma for all observations. As we can see, these formulas correspond to ‘intercept only’ (chapters 3-4), two-group fixed effects (chapter 5), and two-group random effects (chapter 6) structures for our error term.

sigma ~ 1             # model_interaction
sigma ~ A             # model_A_sigma
sigma ~ A + (A|L)     # model_A_L_sigma
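
For reference, these sigma formulas are combined with the formula for \(\mu\) using brms::bf when the model is fit. The sketch below shows roughly how model_A_L_sigma could be specified; the formula for \(\mu\) (and the object name formula_A_L_sigma) is our assumption based on the description of model_interaction above, so treat this as illustrative rather than the exact call used to fit the chapter’s models.

# a sketch: the formula for mu below is assumed to mirror model_interaction
formula_A_L_sigma = brms::bf (height ~ A*G + (A*G|L) + (1|S),
                              sigma ~ A + (A|L))

# the model could then be fit with something like:
# brms::brm (formula_A_L_sigma, data = exp_data, chains = 4, cores = 4,
#            warmup = 1000, iter = 3500, thin = 2)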

We can see in figure 8.6 that the differences in the fixed effects are so small as to be largely meaningless. The three models also provide nearly identical listener ‘random effects’ and predicted values. Based on this, if we care primarily about the fixed effects and understanding apparent height judgments, any of the three models will work equally well. Where the models differ is in their estimates of the speaker effects. The inclusion of heteroscedasticity seems to reduce the magnitude of the speaker effects relative to the homoscedastic model (model_interaction). If we were specifically interested in these effects then we would need to take the time to understand the source, and meaning, of these differences. Of course, the experiment we ran was not designed to investigate speaker differences, so it does not make much sense to delve too far into these in our analysis. Our experiment was designed to investigate the perception of apparent height and the role of apparent speaker age and gender in these judgments. From this perspective, any of the three models is suitable.

8.8 Building identifiable and supportable models

Before finishing this chapter we want to talk about an extremely important issue that we have not discussed yet: Whether you can, or should, fit the complex model you are thinking of for your data. To this point, we haven’t discussed this, but as our models get more and more complicated it is something we need to think about. There are generally two constraints on the sorts of models you can fit to your data.

The first of these is the identifiability of your model parameters, that is, the ability of your model to estimate unique and independent values for each of them. When model parameters are not identifiable, they cannot be estimated no matter how much data you have available. We have actually seen several examples of this issue already. When we discussed contrasts (section 5.6), and when we introduced modeling factors with many levels (section 7.2.2), we talked about the fact that you can’t estimate parameters for all levels of a factor. The cause of these problems is a lack of identifiability of the parameters.

The second general constraint is related to whether your model is supported by your data, meaning you have enough data to realistically estimate all of your model parameters. In many cases involving repeated measures data this will not be an issue as there may be a large number of observations for each group (e.g. subject), and a medium to large number of groups (>30) in the experiment. However, it is still necessary to be aware of the limitations and the problems that (a lack of) support can cause for model fitting.

Because of the incredible array of regression models that can be designed, it’s very difficult to give a ‘high-level’ set of guidelines that will ensure that any particular model will be identifiable and supported by your data. In addition, it’s difficult to really explain model identifiability without talking in detail about matrix algebra (in addition to other mathematical topics). However, we will provide some examples of general cases that we hope are illustrative. You can get the models presented below by running the following line:

bmmb::get_model ('8_badmodels.Rda')

8.8.1 Collinearity

We’ve referenced the idea of linear dependence before, and we can provide a more formal definition of it now. Each of your predictors is a vector: a sequence of \(n\) values, one for each of your \(n\) observations. A set of predictor vectors (\(x_1, x_2, \dots, x_p\)) is linearly independent when there is no set of real numbers (\(a_1, a_2, \dots, a_p\)), not all equal to zero, that can be multiplied by the vectors so that the weighted sum of the vectors equals zero for every element. This is presented below:

\[ \begin{equation} \begin{split} 0 = x_1 \cdot a_1 + x_2 \cdot a_2 + \dots + x_p \cdot a_p \end{split} \tag{8.6} \end{equation} \]

Basically, if the vectors (\(x_1, x_2, \dots, x_p\)) are linearly independent, the \(a\) values above must all be zero for the equation to be true. Including linearly dependent predictors in our models can cause serious problems. For example, imagine we wanted to predict apparent height based on speaker vocal-tract length (VTL) measured both in centimeters (cm) and in meters (m). We only introduce the use of quantitative predictors (such as VTL) in the next chapter, but we can still show how including both of these predictors will cause problems for your model.

We can add a new predictor to our data, speaker VTL specified in meters rather than centimeters.

exp_data$vtl_m = exp_data$vtl / 100

Since our original VTL predictor equals 100 times this new variable, it’s clear that the following operation will result in a string of zeros (or values near zero because of rounding errors).

exp_data$vtl + exp_data$vtl_m * -100

In other words, \(\mathrm{vtl}_{cm} - 100 \cdot \mathrm{vtl}_{m} = 0\), meaning that \(\mathrm{vtl}_{cm}\) and \(\mathrm{vtl}_{m}\) are linearly dependent. We can fit a model including both of these predictors, as shown below.
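
We can also verify the linear dependence numerically. A quick check, sketched below using base R rather than anything specific to brms, is to compare the rank of the predictor matrix to its number of columns: when columns are linearly dependent, the rank is smaller.

X = cbind (exp_data$vtl, exp_data$vtl_m)
ncol (X)      # 2 columns
qr (X)$rank   # rank is 1, so the columns are linearly dependent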

model_bad_1 =  
  brms::brm (height ~ vtl_m + vtl, data = exp_data, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 50)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))

However, our model will not be able to find a satisfactory solution, and you will see a large list of warnings when trying to fit it. If you check out the model print statement you will see strange things in the population-level effects:

## Population-Level Effects: 
##           Estimate Est.Error   l-95% CI  u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    45.69      2.37      41.17     50.47 1.00     4031     4425
## vtl_m      1383.91 246800.93 -531293.80 343951.54 2.12        5       13
## vtl          -5.28   2468.01   -3430.88   5321.56 2.12        5       13

Above we see comically small effective sample sizes (ESS) indicating that we have basically no information about our parameters, and large Rhat values indicating that our chains have not converged. We also see an astronomically large estimation error for our vtl_m predictor, especially given that human speaker height tends to range from about 100 cm to 200 cm at most. Also, what does it mean that the predictors for VTL in meters and VTL in centimeters differ in sign? How can one be positively related and the other negatively related to apparent height? Obviously, there are serious problems with these parameter estimates.

Linear dependence is binary: a set of vectors is linearly dependent or it is not. However, we can also talk about the degree to which vectors are linearly related, with linear dependence being one extreme. We measure linear relatedness using correlation coefficients (section 6.3.2), where correlations of 1 or -1 indicate linear dependence. Our model above failed because our two predictors are linearly dependent: they have a correlation of 1.

cor (exp_data$vtl_m, exp_data$vtl)
## [1] 1

In the code below we add a small amount of random error to our VTL predictor (in meters). This is set to 1/10th the value of the standard deviation of the predictor itself. This small amount of random noise reduces the correlation between VTL in meters and VTL in centimeters to 0.9946, almost perfectly correlated but not, strictly speaking, linearly dependent.

set.seed(1)
exp_data$vtl_m_noise = exp_data$vtl_m + 
  rnorm (length(exp_data$vtl_m),0,sd(exp_data$vtl_m)/10)

cor (exp_data$vtl, exp_data$vtl_m_noise)
## [1] 0.9946

We can fit the model with the new predictor:

model_bad_2 =  
  brms::brm (height ~ vtl_m_noise + vtl, data = exp_data, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 50)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))

If we inspect the population-level effects we see that this tiny change has made our model substantially better. We get no warnings or error messages when we try to fit this model, and we have a very respectable ESS for our parameters.

## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      45.75      2.36    41.16    50.35 1.00     4523     4087
## vtl_m_noise  -201.86    175.11  -544.11   139.03 1.00     3504     3697
## vtl            10.57      1.76     7.14    14.01 1.00     3439     3866

At this point our model might not seem that bad except for the large effect (and huge interval) around vtl_m_noise (VTL in meters, plus noise). The estimate for vtl, and the interval around it, may seem plausible, especially in comparison to our last model. However, we might also consider the estimate for our vtl predictor in a model without this redundant parameter:

model_good =  
  brms::brm (height ~ vtl, data = exp_data, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 50)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))

We see that we get a much narrower credible interval around the VTL parameter.

## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    45.72      2.38    41.04    50.40 1.00     5052     4679
## vtl           8.55      0.18     8.21     8.90 1.00     5072     4831

The above is not to say that correlated quantitative predictors should never be included in the same regression model. However, it’s important to think about the consequences of doing this. In addition, when possible, it’s a good idea not to paint yourself into a corner by designing an experiment that will specifically lead to, or require, highly correlated predictors.
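
One way to see where the huge intervals come from is to inspect the posterior correlations between the coefficients. The sketch below converts the posterior covariance matrix of the population-level effects (from the vcov method for brms models) into a correlation matrix. For two positively correlated predictors, the correlation between their coefficients will typically be strongly negative: the model cannot decide how to split the effect between the nearly-redundant predictors, and it is this trade-off that inflates the credible intervals.

# a sketch: posterior correlations between the population-level coefficients
cov2cor (vcov (model_bad_2))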

8.8.2 Predictable values of categorical predictors

Linear dependence can also cause problems, in ways that are less obvious, when making predictions using categorical variables. Your model parameters will not be identifiable if you include categorical predictors whose values can be predicted based on the values of other categorical predictors in your model. For example, we know that we can’t estimate both group means and the intercept in two-group models. We can now say that this is because if you know that an observation is not in the first group, you know it must be in the second. This is, once again, the problem of linear dependence.

For example, below we make a small matrix representing predictor variables in a simple regression comparing two groups. The first column represents the intercept for every observation (section 3.5.2). The second column is a variable equal to 1 when the observation was in the first group and 0 if not. The third column is equal to 1 when the observation was in the second group and 0 if not. Obviously, the second column will equal 1 if the third column is zero and 0 if the third column is 1.

x = cbind (intercept=rep(1,4), x1=rep(c(1,0),2), x2=rep(c(0,1),2))
x
##      intercept x1 x2
## [1,]         1  1  0
## [2,]         1  0  1
## [3,]         1  1  0
## [4,]         1  0  1

If we look at the numbers above, we can take the first column and subtract the second and third columns from it, and the result will be zero in every row.

x[,1] + x[,2]*(-1) + x[,3]*(-1)
## [1] 0 0 0 0

That means that these three predictors are linearly dependent, and that’s why we can’t include separate predictors for the intercept and each group in a two-group model. This also explains why we can’t estimate all four levels of our C (apparent category) parameter, as we discussed in the previous chapter. Below we make a small matrix of ‘predictors’ for each of our four categories and the intercept. The predictors with C in their name equal 1 when the observation belongs to the first, second, third, and fourth groups respectively.

x = cbind (intercept=rep(1,4), C1=c(1,0,0,0), C2=c(0,1,0,0),
           C3=c(0,0,1,0),C4=c(0,0,0,1))
x
##      intercept C1 C2 C3 C4
## [1,]         1  1  0  0  0
## [2,]         1  0  1  0  0
## [3,]         1  0  0  1  0
## [4,]         1  0  0  0  1

We can see that the same approach can be used to make all rows add up to zero, indicating that these predictors are linearly dependent. The predictors cannot be made to sum to zero if we omit the fifth column of the matrix above, which is why our models will not estimate C4 when C is included as a predictor in our models.

x[,1] + x[,2]*(-1) + x[,3]*(-1) + x[,4]*(-1) + x[,5]*(-1)
## [1] 0 0 0 0

This sort of thing also causes problems when you include multiple different factors that mutually explain each other. For example, last chapter we initially fit a model using C, apparent speaker category, as a predictor. We then fit models using A and G, apparent age and gender as predictors. However, we did not fit models that included apparent age, gender, and speaker category. This is because if a speaker is a woman, they are an adult female. If they are a male child they are a boy, and if they are a female child they are a girl. So, given combinations of some categorical predictors, we have perfect knowledge of the values of other categorical predictors. Mathematically, this means that the ‘secret’ predictors representing each category are linearly dependent. Below we make our small predictor matrix containing predictors for apparent category, apparent age and gender, and their interaction.

x = cbind (intercept=rep(1,4), C1=c(1,0,0,0), C2=c(0,1,0,0),
           C3=c(0,0,1,0),A1=c(0,0,1,1), G1=c(0,1,0,1),A1G1=c(1,0,0,1))
x
##      intercept C1 C2 C3 A1 G1 A1G1
## [1,]         1  1  0  0  0  0    1
## [2,]         1  0  1  0  0  1    0
## [3,]         1  0  0  1  1  0    0
## [4,]         1  0  0  0  1  1    1

This matrix actually features several different combinations of predictors that can be combined to equal zero. For example we can combine the first, second, third and fifth predictors:

x[,1]*1 + x[,2]*(-1) + x[,3]*(-1) + x[,5]*(-1)
## [1] 0 0 0 0

Or the first, third, fourth, and seventh:

x[,1]*1 + x[,3]*(-1) + x[,4]*(-1) + x[,7]*(-1)
## [1] 0 0 0 0

If we try to fit a model that includes all of these predictors:

model_bad_3 =  
  brms::brm (height ~ C + A*G, data = exp_data, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 15)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))

We get a model with the low ESS values, the large Rhat values, and the very large coefficient values and intervals that we saw in our model featuring linearly dependent predictors above.

## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   157.98      0.20   157.57   158.35 1.06       61       78
## C1         1066.31   2604.72 -4431.79  5158.68 2.07        5       18
## C2          496.91   1751.32 -2543.19  3708.83 1.87        6       13
## C3           74.39   2086.76 -4113.31  2801.74 1.99        5       21
## A1          793.68   1363.24 -2383.78  3019.96 1.90        6       12
## G1          567.63   2171.69 -4107.90  3364.23 2.04        5       18
## A1:G1       283.89   1112.60 -2295.30  2141.73 2.26        5       18

This concludes our discussion of linear dependence and some of the many ways it can cause problems for your models. If you try to fit a model and you get errors, divergent transitions, low ESS values and high Rhats, you may have a problem with linear dependencies in your predictors, and it is worth carefully considering this possibility. In addition, a better understanding of model identifiability and linear dependence can be gained by learning a little linear algebra.
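
Short of learning the linear algebra, a practical pre-flight check is sketched below: build the design matrix that your fixed effects formula implies and compare its rank to its number of columns. This only checks the population-level part of the formula, not the random effects, but it will catch problems like the one in model_bad_3.

# a sketch of a pre-flight identifiability check for the fixed effects
X = model.matrix (~ C + A*G, data = exp_data)
ncol (X)       # the number of fixed effect parameters the formula requests
qr (X)$rank    # if this is smaller than ncol(X), the parameters are not identifiable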

8.8.3 Saturated, and ‘nearly-saturated’, models

The models in this chapter sought to estimate five parameters for each listener (\(\mathrm{Intercept}, A, G, A:G, \sigma\)) based on 139 observations for each listener. This seems like a reasonable ratio of observations to parameters. If we had only 10 observations per listener, this model might still be identifiable. However, it might not be such a good idea to fit a model with so many listener-dependent parameters given so little data. In general, just because you can fit a model doesn’t mean you should: it’s a good idea to think about whether a model makes sense given the nature of your data and the number of observations you have overall (and in different relevant subsets).

We can think of the logical extreme of a model with too many parameters and too little data. Imagine we wanted to fit a model with speaker effects, listener effects, and the interaction between the two. This model would need to represent the 15 listener effects, the 139 speaker effects, and all of the listener-dependent speaker effects. The interaction alone requires 15 \(\cdot\) 139 = 2085 parameters. However, we only have 2085 data points, one observation per speaker per listener. As a result, we would have a different parameter for every single observation, leading to what’s known as a saturated model.

In a model without shrinkage, saturated models lead to the perfect prediction of the data. This means that there is no data-level error (\(\sigma\)), or in fact, any random variation at all. This is problematic because the existence of a non-zero \(\sigma\) is a prerequisite for the calculation of the likelihood function of our regression parameters (see chapter 2). For example, the model below doesn’t report errors because it doesn’t fit at all: it repeatedly crashed R and Stan when we attempted to fit it.

exp_data$S = factor (exp_data$S)
exp_data$L = factor (exp_data$L)

model_bad_4  =  
  brms::brm (height ~ S*L, data = exp_data, chains = 4, cores = 4,
       warmup = 1000, iter = 3500, thin = 2,
       prior = c(brms::set_prior("normal(176, 15)", class = "Intercept"),
                 brms::set_prior("normal(0, 15)", class = "sigma")))

This may be easier to conceive of if we use a much smaller example. Imagine two people run the coffee and reading times experiment we discussed in earlier chapters. Each of the two speakers reads a passage after drinking decaf, and another after drinking coffee, resulting in four observations. We can estimate the speaker effect: this will be the average of the reading times for each speaker across both drink conditions. The error for this estimate will be the distribution of the coffee and decaf reading times around the speaker average. We can also estimate the effect of drink condition: the average of both speakers’ reading times within each condition. The error for this will be the distribution of speaker reading times around the mean for each drink condition. However, we cannot estimate the speaker by drink condition interaction. This is because once we know the mean reading time for each speaker for each drink condition, there is no error: we only have one observation in each speaker-by-drink cell, so that observation is the mean. There is no error distribution because we have no other observations.
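
To make this concrete, below is a toy version of that design with made-up reading times, fit with lm rather than brms purely for illustration. With one observation per speaker per drink condition there are as many parameters as observations: the fitted values reproduce the data exactly and there are no residual degrees of freedom left over from which to estimate \(\sigma\).

# toy data (hypothetical values): 2 speakers x 2 drinks, 1 observation per cell
toy = data.frame (time    = c(210, 180, 250, 205),
                  speaker = factor (c("s1","s1","s2","s2")),
                  drink   = factor (c("coffee","decaf","coffee","decaf")))

fit = lm (time ~ speaker * drink, data = toy)
df.residual (fit)         # 0: nothing left over after estimating the 4 parameters
round (resid (fit), 10)   # every residual is (numerically) zero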

Ok, so what if we had not one but two observations per speaker per drink condition? This would not be a saturated model, but we could think of it as a nearly-saturated model, a term that may not have a formal definition but gets the point across. A model with only two observations per speaker per drink condition may technically allow us to model the speaker by drink condition interaction, however each of the interaction parameters would be based on only two observations.

Back to our height data and the estimation of listener-dependent speaker effects. We have 139 observations per listener but only one per listener per speaker. Rather than think of our \(n\) (number of observations) as a single value, we can think of our effective \(n\) for different parameters. If we plan to investigate the effect of apparent age on apparent height, we have >40 observations per listener per apparent age group (table (exp_data$A, exp_data$L)). We can think of our \(n\) per listener for the estimation of the apparent age parameter as >40, a reasonable amount of data. On the other hand, our \(n\) for the estimation of the listener by speaker interaction is only 1 (table (exp_data$S, exp_data$L)), regardless of the fact that we have 139 observations per listener.
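
The cross-tabulations mentioned above make this concrete; they simply count observations per cell, so the effective \(n\) for a parameter can be read off directly.

# observations per listener within each apparent age group (the effective n
# for listener-dependent apparent age effects)
table (exp_data$A, exp_data$L)

# observations per speaker per listener: the effective n for a hypothetical
# listener by speaker interaction is 1
range (table (exp_data$S, exp_data$L))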

If we wanted to make claims about listener-dependent judgments of individual speakers and variation in these, we would need to design an experiment that featured a ‘reasonable’ number of observations per listener per speaker. Establishing a ‘reasonable’ number of repetitions depends on the amount of variation, the type of variation, and the number of groups in your data, among other things. In general, the noisier your data the more observations you will need to get precise (and useful) estimates for your parameters.

8.9 Exercises

The analyses in the main body of the text all involve only the unmodified ‘actual’ resonance level (in exp_data). Responses for the stimuli with the simulated ‘big’ resonance are reserved for exercises throughout. You can get the ‘big’ resonance data in the exp_ex data frame, or all of the data in the exp_data_all data frame.

Fit and interpret one or more of the suggested models:

  1. Easy: Analyze the (pre-fit) model that’s exactly like model_A_L_sigma, except using the data in exp_ex (bmmb::get_model("8_model_A_L_sigma_ex.RDS")).

  2. Medium: Fit a model like model_A_L_sigma, but replace any of the predictors in the model with the resonance factor.

  3. Medium: Fit a model like model_A_L_sigma, but add another predictor to the prediction equation for sigma, possibly with an interaction with apparent age.

  4. Hard: Fit a model like model_A_L_sigma but including apparent age, apparent gender, and resonance. Include all interactions between these factors for both fixed and random effects.

In any case, describe the model, present and explain the results, and include some figures.

8.10 References

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.

Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19(10), 555.

Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.

8.11 Plot Code


###############################################################################
### Figure 8.1
###############################################################################

prior_uninformative = readRDS ("../models/8_prior_uninformative.RDS")
prior_mildly_informative  = readRDS ("../models/8_prior_mildly_informative.RDS")
prior_conservative = readRDS ("../models/8_prior_conservative.RDS")

par (mfrow = c(1,3), mar = c(4.5,.5,2,1), oma = c(0,2,.1,1))
p_check (prior_uninformative, xlab="", cex.lab=1.2,yaxt="n", ylab = "Density",
         samples = seq(1,4000,length.out=10))
title ("Uinformative")
p_check (prior_mildly_informative, xlim = c(-100,400), xlab="", ylab="",yaxt="n",
         samples = seq(1,4000,length.out=10))
title ("Mildly Informative")
p_check (prior_conservative, xlim = c(-100,400), xlab="", ylab="",yaxt="n",
         samples = seq(1,4000,length.out=10))
title ("Conservative")
mtext ("Predicted Apparent Height", outer = TRUE, side=1,line=-1.5, cex = 0.8)
mtext ("Density", outer = TRUE, side=2,line=.7, cex = 0.9, adj = 0.6)

###############################################################################
### Figure 8.2
###############################################################################

model_uninformative = readRDS ("../models/8_model_uninformative.RDS")
model_mildly_informative = readRDS ("../models/8_model_mildly_informative.RDS")
model_conservative = readRDS ("../models/8_model_conservative.RDS")

fixef_1 = fixef(model_uninformative)
fixef_2 = fixef(model_mildly_informative)
fixef_3 = fixef(model_conservative)

par (mfrow = c(1,4), mar = c(4,4,2.5,1),oma = c(0,1,0,0))
brmplot (rbind(fixef_1[1,],fixef_2[1,],fixef_3[1,]), ylim = c(154.5,161.5),
         main = "Intercept",labels = c("UI", "MI", "C"),cex.lab=1.3,cex.axis=1.3)
brmplot (rbind(fixef_1[2,],fixef_2[2,],fixef_3[2,]), ylim = c(6.5,13.5),
         main = "A1",labels = c("UI", "MI", "C"),cex.lab=1.3,cex.axis=1.3)
brmplot (rbind(fixef_1[3,],fixef_2[3,],fixef_3[3,]), ylim = c(-6,1),
         main = "G1",labels = c("UI", "MI", "C"),cex.lab=1.3,cex.axis=1.3)
brmplot (rbind(fixef_1[4,],fixef_2[4,],fixef_3[4,]), ylim = c(-6,1),
         main = "A1:G1",labels = c("UI", "MI", "C"),cex.lab=1.3,cex.axis=1.3)
mtext (side = 2, outer = TRUE, "Centimeters",adj = .55, cex=0.9, line = -0.5)


################################################################################
### Figure 8.3
################################################################################

par (mar = c(2,3,1,1), oma = c(1,2,.1,1))
layout (mat = matrix (c(1,3,1,3,2,3),2,3))
boxplot(residuals_interaction[,1] ~ model_interaction$data$L, cex.lab = 1.3,cex.axis = 1.2,
        col = cols, ylim = c(-20,20), outline=FALSE,xlab='',ylab='')
grid()
boxplot(residuals_interaction[,1] ~ model_interaction$data$L, 
        col = cols, ylim = c(-20,20), outline=FALSE, add=TRUE,xlab='',ylab='',
        cex.lab = 1.3,cex.axis = 1.2)

boxplot(residuals_interaction[,1] ~ model_interaction$data$A, 
        col = cols[2:3], ylim = c(-20,20), outline=FALSE,xlab='',ylab='',
        cex.lab = 1.3,cex.axis = 1.3)
grid()
boxplot(residuals_interaction[,1] ~ model_interaction$data$A, 
        col = cols[2:3], ylim = c(-20,20), outline=FALSE, add=TRUE,xlab='',ylab='',
        cex.lab = 1.3,cex.axis = 1.3)
#axis (side=1, at = 1:2, labels = c()
boxplot(residuals_interaction[,1] ~ model_interaction$data$A + model_interaction$data$L, 
        col = rep(cols, each = 2), ylim = c(-32,30), outline=FALSE,xlab='',
        xaxt='n',ylab='', cex.lab = 1.3,cex.axis = 1.3)
grid()
boxplot(residuals_interaction[,1] ~ model_interaction$data$A + model_interaction$data$L,ylab='', 
        col = rep(cols, each = 2), ylim = c(-32,30), outline=FALSE, add=TRUE,
        xlab='',xaxt='n', cex.lab = 1.3,cex.axis = 1.3)
axis (side = 1, at = seq(1.5,29.5,2), labels = 1:15,cex.axis = 1.3)

mtext (side=2, line = 0.3,outer = TRUE, "Residuals (in centimeters)")

################################################################################
### Figure 8.4
################################################################################

par (mfrow = c(1,2), mar = c(2.2,4,0.5,1))
brmplot (sigmas, col = cols[4:6], labels = c("Overall","Adults","Children"),
         ylab = "Centimeters", ylim = c(3.7,8))
brmplot (fixef(model_A_sigma)[c(3:5),], col = cols,nudge= -.1,pch=17,
         labels = c("A1","G1","A1:G1"),ylab = "Centimeters")
brmplot (fixef(model_interaction)[2:4,], col = cols, add = TRUE,
         nudge=.1,labels="")
#abline (h = 7.71)
legend (1.7,10, legend = c("Heteroscedastic","Homoscedastic"), pch = c(16,17), bty='n')

################################################################################
### Figure 8.5
################################################################################

par (mfrow = c(1,2), mar = c(4,2,1,1),oma=c(0,2,0,0))
brmplot (log_sigmas_centered, col = cols,labels = "", ylim = c(-0.9,0.8),
         xlab="Listener")
axis (side=1, at = 1:15)
brmplot (sigmas[,1:4], col = cols, ylim = c(1,11),labels="",
         xlab="Listener")
abline (h = exp(1.74))
brmplot (sigmas[,1:4], col = cols, add=TRUE, labels="")
axis (side=1, at = 1:15)
mtext (side = 2, outer = TRUE, "Centimeters",adj = .6, cex=1.1, line = 0.75)


################################################################################
### Figure 8.6
################################################################################

par (mar = c(2,4,1,2.5), oma = c(1,1,1,.1))

layout (m = matrix (c(1,1,1,1,1,1,2,5,8,3,6,9,4,7,10),3,5))

brmplot (fixefs1, col=cols[15],labels = "", nudge= -0.1, ylim = c(-5,15),
         ylab="Centimeters",cex.lab=1.4)
brmplot (fixefs2[c(1,3,4,5),], add=TRUE, nudge=0, col=cols[3],labels = "")
brmplot (fixefs3[c(1,3,4,5),], add=TRUE, nudge=0.1, col=cols[4],labels = "")

axis (side=1,at=1:4,c("Int.","A1","G1","A1:G1"),cex.axis=1.3)
legend (1.5,6,legend = c("M1:model_interaction", "M2:model_A_sigma", "M3:model_A_L_sigma"), 
        col = cols[c(15,3,4)], pch=16,pt.cex=1.5, bty = "n",cex=1.2)
mtext (side=3,outer=FALSE, "Fixed Effects",cex=.9,line=0.9)

par (mar = c(1.5,2,1.3,1))

plot (ranefs1$L[,1,], ranefs2$L[,1,1:4],pch=16,col=bmmb::cols[5],
      xlim=c(-10,10),ylim=c(-10,10),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
mtext (side=3,outer=FALSE, "Listener Effects",cex=.9,line=0.9)
text (-2,-6,"x=M1,y=M2",pos=4)
plot (ranefs1$S[,1,], ranefs2$S[,1,], pch=16,col=bmmb::cols[7],
      xlim=c(-11,5),ylim=c(-11,5),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
mtext (side=3,outer=FALSE, "Speaker Effects",cex=.9,line=0.9)
plot (yhat1[,1],yhat2[,1],pch=16,col=bmmb::cols[8],cex=1.25,
      xlim=c(120,190), ylim=c(120,190),xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
mtext (side=3,outer=FALSE, "Predictions",cex=.9,line=0.9)

plot (ranefs1$L[,1,], ranefs3$L[,1,1:4], pch=16,col=bmmb::cols[9],
      xlim=c(-10,10),ylim=c(-10,10),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
text (-2,-6,"x=M1,y=M3",pos=4)
plot (ranefs1$S[,1,], ranefs3$S[,1,], pch=16,col=bmmb::cols[10],
      xlim=c(-11,5),ylim=c(-11,5),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
plot (yhat1[,1],yhat3[,1],pch=16,col=bmmb::cols[2],cex=1.25,
      xlim=c(120,190), ylim=c(120,190),xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)

plot (ranefs2$L[,1,1:4], ranefs3$L[,1,1:4], pch=16,col=bmmb::cols[12],
      xlim=c(-10,10),ylim=c(-10,10),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
text (-2,-6,"x=M2,y=M3",pos=4)
plot (ranefs2$S[,1,], ranefs3$S[,1,], pch=16,col=bmmb::cols[13],
      xlim=c(-11,5),ylim=c(-11,5),cex=1.25,lwd=2,xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)
plot (yhat2[,1],yhat3[,1],pch=16,col=bmmb::cols[14],cex=1.25,
      xlim=c(120,190), ylim=c(120,190),xlab="",ylab="")
abline (0,1,col=2,lwd=1,lty=2)