---
title: "specification"
author: "Josef Dolejs"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{gilmour-spec}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(gilmour)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Several methods exist for selecting a subset of regressors from a set of k candidate variables in multiple linear regression. One possibility is to evaluate all possible regression models and compare them using Mallows's Cp statistic (Cp):
1. Gilmour, S.G. (1996). The interpretation of Mallows's Cp-statistic. The Statistician, 45(1), 49-56. doi:10.2307/2348411.
2. Boisbunon, A., Canu, S., Fourdrinier, D., Strawderman, W., Wells, M.T. (2013). AIC, Cp and estimators of loss for elliptically symmetric distributions. arXiv:1308.2766.
Generally, the presented package is suitable when the number of observations (n) is small and the number of regressors (k) is relatively low. However, the condition n \> k + 3 should be satisfied.
Short theoretical background.
For a particular submodel with p parameters, the Cp statistic is defined as $C_p = \frac{SSE_p}{\hat{\sigma}^2}-n+2p$, where $SSE_p$ is the sum of squared errors for the submodel and $\hat{\sigma}^2$ is the mean square error (MSE) from the full model with all candidate regressors, used as an estimate of the error variance $\sigma^2$. Gilmour proposed an adjusted version of the Cp statistic to improve model comparison, given by $\bar{C}_p = C_p - \frac{2(k - p + 1)}{(n - k - 3)}$, where k is the number of regressors in the full model, and p - 1 is the number of regressors in the submodel (thus, p includes the intercept term). Further details and theoretical background can be found in Gilmour's original paper.

For example, in the case of the trivial model (i.e., a model with only an intercept, so that p = 1), the adjusted statistic is $\bar{C}_1 = C_1 - \frac{2k}{(n - k - 3)} = \frac{var(y)(n-1)}{MSE}-n+2-\frac{2k}{(n - k - 3)}$, where the sample variance var(y) is used to calculate the sum of squares of the trivial model and MSE is the estimate of $\sigma^2$.

The package contains two primary functions: submodels(d) and final_model(d).
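Before describing them, here is a minimal hand computation of the two statistics above in base R (a sketch only, without the package; the submodel mpg ~ wt + qsec is chosen arbitrarily for illustration):

```{r eval=FALSE}
## Hand computation of Cp and the adjusted Cp-bar for one submodel of
## mtcars, following the formulas above (illustration only).
n <- nrow(mtcars)
k <- ncol(mtcars) - 1                        # regressors in the full model
full <- lm(mpg ~ ., data = mtcars)
MSE  <- sum(resid(full)^2) / (n - k - 1)     # estimate of sigma^2
sub  <- lm(mpg ~ wt + qsec, data = mtcars)   # an arbitrary submodel
p    <- length(coef(sub))                    # parameters incl. intercept
Cp     <- sum(resid(sub)^2) / MSE - n + 2 * p
Cp_bar <- Cp - 2 * (k - p + 1) / (n - k - 3)
c(Cp = Cp, Cp_bar = Cp_bar)
```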
The functions submodels(d) and final_model(d) share the same input argument d, which must be a table in data.frame format. The first column of this data frame must contain the numerical response (dependent) variable (y), while the remaining columns contain the numerical explanatory variables (regressors). Each row represents a single observation.
[ The number of observations (n) should be higher than the number of regressors (k) plus three (n\>k+3). ]{style="color:red"}
The functions ignore column names (colnames) and row names (rownames); however, [the position of the first column is critical: it must always contain the response variable.]{style="color:red"}
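A short sketch of assembling a valid input table and checking the n > k + 3 condition (the column selection here is illustrative only):

```{r eval=FALSE}
## Build an input table with the response in the first column and check
## that n > k + 3 holds (column choice is illustrative only).
d <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]  # response first
n <- nrow(d); k <- ncol(d) - 1
stopifnot(n > k + 3)
```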
The package includes eight illustrative datasets:
Gilmour9p, T1, T2, Trivial, Modified_Gilmour9p, Parks5p, Patents5p and EU2019.
Additionally, the standard R dataset "mtcars" may also be used.
The procedure consists of two main steps:
Initial Selection:
Adjusted $\bar{C}_p$ values are calculated for all possible combinations of regressors. For example, there are 1023 submodels for the "mtcars" data; see the two following statements and the check below:
[ submodels(mtcars)\$submodels_number
final_model(mtcars)\$submodels_number ]{style="color:red"}
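
The count itself can be verified directly, since k candidate regressors yield $2^k - 1$ non-empty subsets (a quick base-R check, independent of the package):

```{r eval=FALSE}
## Sanity check of the submodel count for mtcars (k = 10 regressors):
k <- ncol(mtcars) - 1
2^k - 1
#> [1] 1023
```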

Subsequently, the submodel with the minimum $\bar{C}_p$ value is selected and labeled as ModelMin.
See the two following statements:
[ submodels(mtcars)\$model_min
final_model(mtcars)\$model_min ]{style="color:red"}

Reduction and Testing in the function "final_model(d)":
All submodels nested within ModelMin are considered. Among them, the submodel with the lowest $\bar{C}_p$ is selected as a new candidate. A hypothesis test is then performed where the null hypothesis states that ModelMin provides a better fit than the candidate submodel. If the null hypothesis is not rejected, ModelMin is accepted as the final model. If the null hypothesis is rejected, the candidate submodel becomes the new ModelMin, and the process is repeated with submodels nested within this new model. The procedure continues iteratively until a null hypothesis is not rejected or until the so-called trivial model (i.e., a model using only the arithmetic mean) is reached. See the two diagrams below.
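The following sketch shows the idea of this loop in base R; it is a simplified illustration, not the package's internal code. For brevity it considers only one-term deletions (among submodels of equal size, the smallest SSE also gives the smallest $\bar{C}_p$) and uses an ordinary partial F-test via anova(), whereas final_model(d) applies the test from Gilmour's paper.

```{r eval=FALSE}
## Conceptual sketch of "Reduction and Testing" (illustration only).
model_min <- lm(mpg ~ wt + qsec + am, data = mtcars)  # a hypothetical ModelMin
repeat {
  terms_now <- attr(terms(model_min), "term.labels")
  if (length(terms_now) == 0) break                   # trivial model reached
  # candidate = best submodel obtained by dropping one regressor
  drops <- lapply(terms_now,
                  function(t) update(model_min, as.formula(paste(". ~ . -", t))))
  candidate <- drops[[which.min(sapply(drops, function(m) sum(resid(m)^2)))]]
  # if the larger model is significantly better, keep it as the final model;
  # otherwise the candidate becomes the new ModelMin and the loop repeats
  if (anova(candidate, model_min)$`Pr(>F)`[2] < 0.05) break
  model_min <- candidate
}
model_min
```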
The function "final_model(d)" returns full regression results for:
the Full Model, the selected ModelMin, and the resulting Final Model.

Additionally, it outputs $\bar{C}_p$ values for all submodels, enabling users to easily generate an adjusted $\bar{C}_p \sim p$ plot (where p is the number of regressors + 1), as recommended by Gilmour.
The statements that draw the plots are shown below and in the package help.
"Initial Selection" is in the two functions: submodels() and final_model()
(diagram of the Initial Selection step)


"Reduction and Testing" is in the function final_model()
(diagram of the Reduction and Testing step)

Examples:
The package includes eight illustrative datasets. Additionally, the standard R dataset "mtcars" may be used as a ninth example in the help documentation (see: mtcars dataset). In this dataset, the first column (mpg, representing car fuel consumption) is used as the response variable. Some example datasets have no substantive meaning and are included purely to demonstrate the technical capabilities of the functions. However, three of the datasets were used in previously published studies and carry practical relevance.
References for the three datasets with practical meaning:
Gilmour9p:
Gilmour, S.G. (1996). The interpretation of Mallows's Cp-statistic. The Statistician, 45(1), 49-56. doi:10.2307/2348411.
Parks5p:
Stemberk, J., Dolejs, J., Maresova, P., Kuca, K. (2018). Factors affecting the number of Visitors in National Parks in the Czech Republic, Germany and Austria. ISPRS Int. J. Geo-Inf., 7(3), 124.
Patents5p:
Perspective and Suitable Research Area in Public Research—Patent Analysis of the Czech Public Universities. Education and Urban Society, 54(7).
The First Example: mtcars

```{r results='asis'}
knitr::kable(head(mtcars, 12))
```

The "mtcars" data may be used as a simple illustration; no other treatment is necessary.
The following simple statement may be used to obtain all results.
[ MyResults=final_model(mtcars) ]{style="color:red"}
This is what the output looks like when applying the final_model(d) function:
[ [1] "Number of regressors in model_min 3" ]{style="color:blue"}
[ [1] "model_min: y~x5+x6+x8" ]{style="color:blue"}
[ [1] "Cq+1: -0.634206365802602" ]{style="color:blue"}
[ [1] "The starting value of q for the test of submodel is: 3" ]{style="color:blue"}
[ [1] "For q = 3: Ho is rejected, new submodel goes to the next test." ]{style="color:blue"}
[ [1] "For q = 2: the End, Ho is not rejected, the resulting model is: y~x5+x6+const. F= 11.79722 Cq+1 = 0.98767" ]{style="color:blue"}
[ [1] "Final value of q is : 2" ]{style="color:blue"}
This is how to display the basic graph of the Gilmour method, in which the first step searches for the model with the minimum value of the adjusted statistic $\bar{C}_p$.
[ yCp= as.numeric(MyResults\$submodels[,3]) ; xp= as.numeric(MyResults\$submodels[,2]) ]{style="color:red"}
[ ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp)) ]{style="color:red"}
[ YRange=c( ymin ,1.5*max(xp)) ; FinalCp= MyResults\$FinalCp ]{style="color:red"}
[ plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange , ]{style="color:red"}
[ col=ifelse( ( round(yCp,4)== round(min(yCp),4)| round(yCp,4)== round(FinalCp,4) ), "red", "darkblue") ) ]{style="color:red"}
[ lines(xp, xp, col="red") ]{style="color:red"}
[ mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2) ]{style="color:red"}
[ mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2) ]{style="color:red"}
Relationship $\bar{C}_p \sim p$ for all submodels calculated for the "mtcars" data.
Notes: the red circles correspond to ModelMin and FinalModel. The following commands display additional outputs for the full model and for the final model.
[ MyResults\$full_model ]{style="color:red"}
[ summary(MyResults\$full_model_results) ]{style="color:red"}
[ MyResults\$final_model ]{style="color:red"}
[ summary(MyResults\$final_model_results) ]{style="color:red"}
Information about the model "ModelMin" with the minimum value of the adjusted $\bar{C}_p$ statistic can be obtained as follows:
[ MyResults\$model_min ]{style="color:red"}
[ summary(MyResults\$model_min_results) ]{style="color:red"}
This submodel can be identical to the final model and, in principle, to the full model.
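As an informal cross-check (using the name mapping assumed above), the final model can be refitted with base R's lm(); its coefficients should match summary(MyResults\$final_model_results):

```{r eval=FALSE}
## Refit the selected final model with base R as a cross-check.
summary(lm(mpg ~ wt + qsec, data = mtcars))
```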
The following eight illustrative tables are also available in the package:
[Gilmour9p; T1; T2; Trivial; Modified_Gilmour9p; Parks5p; Patents5p; EU2019 ]{style="color:red"}
Below, the individual datasets are evaluated in the same illustrative manner (the outputs of the functions submodels(d) and final_model(d) can also be used in other ways).
Data "Gilmour9p" were taken from the original study by Gilmour.
It illustrates the entire method and is also of practical importance.
The data contains 9 predictors and 24 observations (House price data).
The outputs of the functions "submodels()" and "final_model()"are identical to the results in the original work.
[ d=Gilmour9p ]{style="color:red"}
[ MyResults=final_model(d) ]{style="color:red"}
[ [1] "Number of regressors in model_min 2" ]{style="color:blue"}
[ [1] "model_min: y~x1+x2" ]{style="color:blue"}
[ [1] "Cq+1: -0.248045011348888" ]{style="color:blue"}
[ [1] "The starting value of q for the test of submodel is: 2" ]{style="color:blue"}
[ [1] "For q = 2: Ho is rejected, new submodel goes to the next test." ]{style="color:blue"}
[ [1] "For q = 1: the End, Ho is not rejected, y~x1+constant, F = 72.17679 Cq+1 = 0.89705" ]{style="color:blue"}
[ [1] "Final value of q is : 1" ]{style="color:blue"}
This is how to display the basic graph of the Gilmour method, in which the first step searches for the model with the minimum value of the adjusted statistic $\bar{C}_p$.
[ yCp= as.numeric(MyResults\$submodels[,3]) ; xp= as.numeric(MyResults\$submodels[,2]) ]{style="color:red"}
[ ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp)) ]{style="color:red"}
[ YRange=c( ymin ,1.5*max(xp)) ; FinalCp= MyResults\$FinalCp ]{style="color:red"}
[ plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange , ]{style="color:red"}
[ col=ifelse( ( round(yCp,4)== round(min(yCp),4)| round(yCp,4)== round(FinalCp,4) ), "red", "darkblue") ) ]{style="color:red"}
[ lines(xp, xp, col="red") ]{style="color:red"}
[ mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2) ]{style="color:red"}
[ mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2) ]{style="color:red"}
Relationship $\bar{C}_p \sim p$ for all submodels calculated for the "Gilmour9p" data.
Notes: the red circles correspond to ModelMin and FinalModel.
Similar commands to those used for the mtcars data may be used for all the other tables:
[Gilmour9p; T1; T2; Trivial; Modified_Gilmour9p; Parks5p; Patents5p; EU2019 ]{style="color:red"}
For example, if [d=T1]{style="color:red"} is used, the table "d" and the commands shown above may be applied in the same way; a short sketch follows.
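A minimal sketch of this workflow for T1 (the accessors used here are the ones shown earlier in this vignette):

```{r eval=FALSE}
## The same workflow applied to the bundled table T1:
d <- T1
MyResults <- final_model(d)
MyResults$final_model
summary(MyResults$final_model_results)
```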
All commands are also available in the "Examples" section of the help page for the function "final_model(d)".