Drawing O3 (Overview Of Outliers) plots

Introduction

Outliers are a complicated business. It is difficult to define what they are and it is difficult to assess how they may affect analyses. They can be errors, missclassifications or fascinating special cases.

Univariate outliers are cases identified as outliers based on one variable. Multivariate outliers are cases identified based on several variables taken together. Datasets with many variables can give rise to a variety of different outliers depending on which combinations of variables (and which outlier identification methods) are used. An O3 plot shows which cases are identified as potential outliers for every combination of variables.

A first O3 plot

library(OutliersO3)
library(ggplot2)
data(Election2005)
data <- Election2005[, c(6, 10, 17, 28)]
O3s <- O3prep(data, method="PCS", tols=0.01, boxplotLimits=6)
O3s1 <- O3plotT(O3s, caseNames=Election2005$Name)
O3s1$gO3 + theme(plot.margin = unit(c(0, 2, 0, 0), "cm"))

Fig 1: An O3 plot using the FastPCS algorithm for a dataset comprising four structural variables for German parliamentary constituencies.

There is a row for each variable combination for which outliers were found and two blocks of columns. Each row of the block on the left shows which variable combination defines that row. As there are 4 variables, there are 4 columns, one for each variable, and a cell is coloured grey if that variable is part of the combination. The combinations (the rows) are sorted by numbers of outliers found within numbers of variables in the combination, and blue dotted lines separate the combinations with different numbers of variables. The columns in the left block are sorted by how often the variables occur. Combinations are only included if outliers are found with them. Two white columns separate this block from the block on the right that records the outliers found. There is one column for each case that is found to be an outlier at least once and these columns are sorted by the numbers of times the cases are outliers.

German words and names can be long and that is why the plot labels take up so much space. The full name of the constituency most often found to be an outlier is Berlin-Friedrichshain - Kreuzberg - Prenzlauer Berg Ost. Without a recognised abbreviation it can be better to use dataset row numbers, the default labelling, as is done in the other figures, and use that to find the name in the dataset.

Ten of the 299 cases are found to be outliers for some variable combination or other. Three Berlin constituencies are identified by 8 combinations, always when population density is in the combination.

Setting tol to 0.01 is not particularly stringent. The next figure compares outliers identified at different levels of the tolerance level.

Comparing different outlier tolerance levels

O3x <- O3prep(data, method="PCS", tols=c(0.01, 0.005, 0.001), boxplotLimits=c(3, 6, 10))
O3x1 <- O3plotT(O3x)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.6.1

grid.arrange(O3x1$gO3, O3x1$gpcp, ncol=1)

Fig 2: An O3 plot for the same dataset drawn for different tol levels (above) and a parallel coordinate plot (below) with outliers highlighted in red that are identified by at least one variable combination for any level. The FastPCS algorithm was used.

More cases are now identified as possible outliers and the variable population density finds many at a level of 0.1, but none at lower levels.

For every O3 plot there is a complementary parallel coordinate plot with outliers coloured red that are identified by any of the variable combinations.

Comparing two methods

Different outlier methods may identify different sets of outliers and the next figure compares methods covMcd and FastPCS, using the package’s default tolerance levels of 0.05 for HDoutliers and 0.01 for FastPCS. For any method like FastPCS that does not identify univariate outliers, boxplot limits are used for single variables defined by the argument boxplotLimits.

O3m <- O3prep(data, method=c("MCD", "PCS"))
O3m1 <- O3plotM(O3m)
grid.arrange(O3m1$gO3, O3m1$gpcp, ncol=1)

Fig 3: An O3 plot displaying outliers found by either or both of methods covMcd and FastPCS (above) and the complementary parallel coordinate plot.

covMcd finds more outliers than FastPCS One outlier found by FastPCS is not identified as an outlier by covMcd for any combination of variables. Nine outliers are found by covMcd that are not identified by FastPCS for any combination of variables.

Using five methods simultaneously

The OutliersO3 package offers the choice of 5 outlier identification methods and all 5 can be compared in one plot. If all 5 methods identify an outlying case for a particular variable combination, it is shown in red. If four methods agree, the cell is coloured orange and if X agree (X≤3) then slategrayX shading is used. The tolerance levels used for each method in this plot are the defaults ones set in the package.

It is useful to know how many outliers have been found before drawing the displays—there can be very many— and so a table of the O3plot output, nOut (number of outliers), is useful.

O3y <- O3prep(data, method=c("PCS", "BAC", "adjOut", "DDC", "MCD"))
O3y1 <- O3plotM(O3y)
cx <- data.frame(outlier_method=names(O3y1$nOut), number_of_outliers=O3y1$nOut)
knitr::kable(cx, row.names=FALSE)

outlier_method	number_of_outliers
PCS	10
BAC	5
adjOut	1
DDC	4
MCD	18

Drawing the plot and the complementary parallel coordinate plot gives:

grid.arrange(O3y1$gO3, O3y1$gpcp, ncol=1)

Fig 4: An O3 plot displaying outliers found by any of the 5 methods and the parallel coordinate plot with cases identified by any method for any variable combination highlighted in red.

(A large number of potential outliers identified can make a mess of the labelling, even though only row numbers are used as labels.)

Although all methods identify at least some potential outliers, no case is identified by all five methods for the same combination of variables.

Using the same tolerance level for each method is not ideal, as they each use the parameter in a different way. The problem then becomes deciding how to set individual parameter values that are in some sense equivalent for the methods you want to compare. This is discussed in the second vignette, “Different outlier tolerance levels for different O3 plots”.