---
title: "Using Plot2WayAnova"
author: "Chuck Powell"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using Plot2WayAnova}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Background
The CGPfunctions package includes functions that I find useful for teaching
statistics especially to novices (as well as an opportunity to sharpen my own R
skills). I only write functions when I have a real need -- no theory -- just
help for actually practicing the art. They typically are not "new" methods but
rather wrappers around either base R or other packages and are very task
focused. This vignette covers one function from the package that tries to help
users (especially students) do one thing well by pulling together pieces from a
variety of places in `R`. `Plot2WayANOVA`, which as the name implies conducts a
2 way ANOVA and plots the results. I always try and find the right balance
between keeping the number of dependencies to a minimum and not reinventing the
wheel and writing functions that others have done for me. The function makes use
of the following non base r packages.
- `ggplot2` as the work horse for all the actual plotting
- `car` for it's ability to compute Type II sums of squares, we'll address why
that's important in more detail later in the scenario. We'll also make use of
it's `leveneTest`.
- `sjstats` which takes out ANOVA table and gives us other important
information such as the effect sizes ($\eta^2$ and $\omega^2$ ) through use of
its `anova_stats` function. Prior to this version I had been using my own local
function but this runs rings around what I could do.
- `broomExtra::glance` will also help us grab very important results like $R^2$
and display them
- `DescTool::PostHocTest` for accomplishing post hoc tests
## Vignette Info
The ANOVA (Analysis of Variance) family of statistical techniques allow us to
compare mean differences of one outcome (dependent) variable across two or more
groups (levels) of one or more independent variables (factor). It is also true
that ANOVA is a special case of the GLM or regression models so as the number of
levels increase it might make more sense to try one of those approaches. The 2
Way ANOVA allows for comparisons of mean differences across 2 independent
variables `factors` with a varying numbers of `levels` in each `factor`.
If you prefer a more regression based approach with a very similar plotted
result I highly recommend the `interactions` package which I was unaware of
until just recently.
It is [available through CRAN](https://CRAN.R-project.org/package=interactions).
The `Plot2WayANOVA` function conducts a classic analysis of variance (ANOVA) in
a sane and defensible, albeit opinionated, manner, not necessarily the only one.
It's real strength (*I hope*) lies in the fact that it is pulled together in
one function and more importantly allows you to visualize the results
concurrently with no additional work.
## Scenario and data
Imagine that you are interested in understanding whether a car's fuel efficiency
(mpg) varies based upon the type of transmission (automatic or manual) and the
number of cylinders the engine has. Let's imagine that the `mtcars` data set is
actually a random sample of 32 cars from different manufacturers and use the
mean `mpg` grouped by `am` and `cyl` to help inform our thinking. While we
expect variation across our sample we're interested in whether the differences
between the means by grouping of transmission type and cylinders is
significantly different than what we would expect in random variation across the
data.
In simplistic terms we want to know whether `am` matters, `cyl` matters or if it
depends on the interaction of the two. It's this interaction term that typically
confuses novices or is difficult to "see". That's where a good interaction graph
can hopefully play a key role, and that's what the `Plot2WayANOVA` focuses on.
There's no lack or tools or capabilities in base R or in the many packages to do
this task. What this function tries to do is pull together the disparate pieces
with a set of sane defaults and a simple interface to work with it. At its
simplest you would require the library and then enter this command:
`Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)` which lays our
question out in R's vernacular with a formula and a dataframe. Optionally we
can specify a different confidence level and choose a line or a bar graph. Over
time the function has gained a plethora of formatting options.
"Under the hood", however there's a lot of nice features at work.
1. Some basic error checking to ensure a valid formula and dataframe. The
function accepts only a fully crossed formula to check for an interaction
term
1. It ensures the dependent (outcome) variable is numeric and that the two
independent (predictor) variables already are or can be coerced to factors
– the user is warned on the console if there are problems.
1. A check is conducted to see if any of the variables of interest have
missing cases – the user is warned on the console if there are problems.
1. Balance is checked, that is whether the number of observations per cell
when we cross `am` and `cyl` is equal. There's some debate as to how much
imbalance is permissible. But you'll be given fair warning if there is any.
1. In addition to the classic ANOVA table information available in `aov` or
`Anova` you'll be presented with information about effect sizes like eta
squared $\eta^2$. They're calculated and appended as additional columns.
If you're unfamiliar with them and want to know more especially where the
numbers come from I recommend a good introductory stats text. I recommend
*Learning Statistics with R*
[LSR](https://learningstatisticswithr.com/) see Table 14-1
on page 432.
1. A summarized table of means, standard deviations, standard errors of the
means, confidence intervals, and group sizes for each of the crossed
combinations in our example that's 6 groupings 3 levels of cylinder and 2 levels
of automatic or manual.
1. Some measures of overall fit including $R^2$
1. If any of the effect terms (main or interaction) are significant you'll
be presented with a post-hoc comparison of the means. By default a
Scheffe test is run but the user can choose from several supported options.
1. The Homogeneity of Variance assumption is tested with Brown-Forsythe
1. The normality assumption is tested with Shapiro-Wilk
## Installation
```{r eval = FALSE}
# Install from CRAN
install.packages("CGPfunctions")
# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("ibecav/CGPfunctions")
```
then load the library.
```{r LoadLibrary, warning = FALSE}
library(CGPfunctions)
```
## Example of using the function
The only two required parameters are a formula and a dataframe (like) object. If
we run the function in its simplest form here's what we get.
```{r Plot2WayANOVA, echo=TRUE, message=TRUE, warning=FALSE, fig.width=7, fig.height=4}
Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)
```
In the console you'll receive a series of messages chronicling your progress
and any diagnostic information. In this case `am` and `cyl` are being coerced to
factors and you're being prompted to make sure that's what is intended.
Next you receive a warning because you have a very unbalanced design. There are
only two 8 cylinder cars with a manual transmission and twelve 8 cylinder cars
with automatics. Whereas there are eight 4 cylinders with manual and only three
that are automatics. Imbalance in our design worries us for two reasons. One is
that it causes a lack of statistical power and creates some math challenges in
deciding how to divide up the sums of the squared differences. This data set
causes the more troublesome worry. Are the number of cylinders and manual versus
automatic related systematically which would call our whole design into
question. Make sure you can answer questions about which is at work here, or
make sure you have a balanced design.
A table follows that is intended to summarize the findings. We'll discuss it
more later when we examine the plot.
The overall measures table can be very handy for comparing numerous models. For
example how does the AIC number change if we were to eliminate `am` ?
The table of group means is useful for looking at summary by group. Want the
best gas mileage? Buy a 4 cylinder manual car.
In our simple example the only statistically significant effect is for the main
effect of number of cylinders. Accordingly the Scheffe test is run against the
three types of cars with 4, 6, or 8 cylinders and we can see that with a
difference of 7.3 mpg eight and four cylinder cars are statistically
significant even as we control for the multiple simultaneous comparisons.
The next step is to test homogeneity of variance also known as
(homoscedasticity). Since the math in our ANOVA rely on the assumption that the
variance for the different groupings of cars is more or less equal, we need to
check that assumption.
We'll use the Brown-Forsythe test. When you run the `leveneTest` in R the default
is actually a Brown-Forsythe, to get a true Levene you must specify `center =
mean`. Brown-Forsythe is actually more robust since it tests differences from
the median. Not surprisingly when we consult out table of group results we have
some reason for concern sine the standard deviations vary widely.
Finally, let’s address the assumption that our errors or residuals are normally
distributed. We’re looking for evidence that our residuals are skewed or tailed
or otherwise misshapen in a way that would influence our results. Surprisingly,
there is actually quite a bit of controversy on this point since on the one hand
we have strong reason to believe that our sample will be imperfect and that our
population will not necessarily be “perfectly normal” either. Some argue that
some simple plotting is all that is necessary looking for an unspecifiable
amount of non normality that will trigger a search for the source. Other prefer
a more formal approach using one or more statistical tests.
`Plot2WayANOVA` runs the most common test of the normality assumption (there are
many) the Shapiro-Wilk test The statistics look good, no strong evidence in the
data we have.
The default settings for the resultant plot are deliberately minimalistic,
allowing you to focus visually on the pattern of means and the connecting lines.
If you're already used to looking at this sort of plot it is immediately
apparent from the separation between the lines that the number of cylinders is
having a significant impact on mileage. Automatic versus manual transmission
seems to have less impact (judged by the relative lack of slope except for 4
cylinder models) and there does seem to be at least the start of an interaction
between the two. (bear in mind this is a small data set and we are very
unbalanced).
One other easy tip is warranted. Order matters and sometimes it is helpful to
run the command simply reversing the order of the independent variables to help
you better "see" the results, e.g. `Plot2WayANOVA(formula = mpg ~ cyl * am,
dataframe = mtcars)`
**Note that if you want to "save" all these tables of data and information all
you need to do is store the results in an object as in `MyResults <-
Plot2WayANOVA(formula = mpg ~ am * cyl, dataframe = mtcars)` then `MyResults`
can be accessed as a list.**
### Some common tweaks
Let's make some changes that are likely to be quite common:
1. Change the order of the factors
1. Change to p < .01 or 99% CI
1. Add title and axis labels
1. Add a label with the actual group mean values
1. **New** display the standard error of the mean (SEM) instead of
confidence intervals
1. Change the plotted shape for the group mean to a square not a diamond
1. Change our post hoc test method to something less conservative than Scheffe
```{r Plot2WayANOVA2, echo=TRUE, fig.height=4, fig.width=7, message=FALSE, warning=FALSE}
Plot2WayANOVA(formula = mpg ~ cyl * am,
dataframe = mtcars,
confidence = .99,
title = "MPG by cylinders and type transmission",
xlab = "Cylinders",
ylab = "Miles per gallon",
mean.label = TRUE,
mean.shape = 22,
posthoc.method = "lsd",
errorbar.display = "SEM"
)
```
**Please don't fail to notice how liberal Fisher's LSD is compared to Scheffe
especially given we've demanded more confidence **
### Less common more custom tweaks
Although the defaults are minimalistic to allow you to focus on the interaction
pattern the function also has any number of optional ways of increasing
complexity and showing more information.
Let's make some custom changes that are more uncommon:
1. Display standard deviation errorbars
1. Show individual data points
1. Superimpose a boxplot
1. **New** use an entirely different theme (minimal)
1. **New** custom text for the factor labels (bigger and blue)
```{r Plot2WayANOVA3, echo=TRUE, fig.height=4, fig.width=7, message=FALSE, warning=FALSE}
# Create a new dataset
library(dplyr)
library(ggplot2)
library(stringi)
newmpg <- mpg %>%
filter(cyl != 5) %>%
mutate(am = stringi::stri_extract(trans, regex = "auto|manual"))
Plot2WayANOVA(formula = hwy ~ am * cyl,
dataframe = newmpg,
ylab = "Highway mileage",
xlab = "Transmission type",
plottype = "line",
offset.style = "wide",
overlay.type = "box",
mean.label = TRUE,
mean.shape = 20,
mean.size = 3,
mean.label.size = 3,
show.dots = TRUE,
errorbar.display = "SD",
ggtheme = ggplot2::theme_minimal(),
ggplot.component = theme(axis.text.x = element_text(size=14, color="darkblue"))
)
```
## Credits
Many thanks to Danielle Navarro and the book * [Learning Statistics with
R](https://learningstatisticswithr.com/).*
### Leaving Feedback
If you like CGPfunctions, please consider Filing a GitHub issue by [leaving
feedback here](https://github.com/ibecav/CGPfunctions/issues), or by contacting
me at ibecav at gmail.com by email.
I hope you've found this useful. I am always open to comments, corrections and
suggestions.
Chuck
### License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.