---
title: "Introduction"
output: 
  volker::html_report
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r include=FALSE}
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  echo = TRUE,
  message = FALSE,
  knitr.table.format = "html"
)

options(
  vlkr.fig.settings=list(
    html = list(
      dpi = 96, scale = 1, width = 910, pxperline = 12
    )
  )
)
```


## How to use the volkeR package?

First, load the package, set the plot theme and get some data.

```{r, warning=FALSE}
# Load the package
library(volker)

# Set the basic plot theme
theme_set(theme_vlkr())

# Load an example dataset ds from the package
ds <- volker::chatgpt

```

## How to generate tables and plots?

Decide whether your data is categorical or metric
and choose the appropriate function:  

- `report_counts()` shows frequency tables and generates simple and stacked bar charts.
- `report_metrics()` creates tables with distribution parameters, 
  visualises distributions in density plots, box plots or scatter plots.


Report functions, under the hood, call functions that generate plots, tables or calculate effects.
If you only need one of those outputs, you can call the functions directly:

- `tab_counts()`, `plot_counts()` or `effect_counts()` for categorical data.
- `tab_metrics()`, `plot_metrics()` or `effect_metrics()` for metric data.

All functions expect a dataset as their first parameter.
The second and third parameters await your column selections. 
The column selections determine whether to analyse single variables, item lists
or to compare and correlate multiple variables.

*Try out the following examples!*

### Categorical variables


```{r eval=FALSE}
# A single variable
report_counts(ds, use_private)
```

```{r eval=FALSE}
# A list of variables
report_counts(ds, c(use_private, use_work))
```

```{r eval=FALSE}
# Variables matched by a pattern
report_counts(ds, starts_with("use_"))
```

You can use all sorts of tidyverse style selections:
A single column, a list of columns or patterns such as
`starts_with()`, `ends_with()`, `contains()` or `matches()`.

### Metric variables

```{r eval=FALSE}
# One metric variable
report_metrics(ds, sd_age)
```

```{r eval=FALSE}
# Multiple metric items
report_metrics(ds, starts_with("cg_adoption_"))
```

### Cross tabulation and group comparison

Provide a grouping column in the third parameter to compare different groups.

```{r eval=FALSE}
report_counts(ds, adopter, sd_gender)
```

For metric variables, you can compare the mean values.

```{r eval=FALSE}
report_metrics(ds, sd_age, sd_gender)
```

By default, the crossing variable is treated as categorical. 
You can change this behavior using the `metric` parameter to calculate correlations:

```{r eval=FALSE}
report_metrics(ds, sd_age, use_work, metric = TRUE)
```

The `ci` parameter, where possible, adds confidence intervals to the outputs.

```{r eval=FALSE}
ds |> 
  filter(sd_gender != "diverse") |> 
  report_metrics(sd_age, sd_gender, ci = TRUE)
```

Conduct statistical tests with the `effect` parameter.

```{r eval=FALSE}
ds |> 
  filter(sd_gender != "diverse") |> 
  report_counts(adopter, sd_gender, effect = TRUE)
```


See the function help (F1 key) to learn more options. 
For example, you can use the `prop` parameter to grow bars to 100%. 
The `numbers` parameter prints frequencies and percentages onto the bars.

```{r eval=FALSE}
ds |> 
  filter(sd_gender != "diverse") |> 
  report_counts(adopter, sd_gender, prop="rows", numbers= "n")
```


# Theming

The `theme_vlkr()`-function lets you customise colors:

```{r}
theme_set(theme_vlkr(
  base_fill = c("#F0983A","#3ABEF0","#95EF39","#E35FF5","#7A9B59"),
  base_gradient = c("#FAE2C4","#F0983A")
))
```

# Labeling

Labels used in plots and tables are stored in the comment attribute of the variable. 
You can inspect all labels using the `codebook()`-function:

```{r}
codebook(ds)
```


Set specific column labels by providing a named list to the items-parameter of `labs_apply()`:

```{r eval = FALSE}
ds %>%
  labs_apply(
    items = list(
      "cg_adoption_advantage_01" = "General advantages",
      "cg_adoption_advantage_02" = "Financial advantages",
      "cg_adoption_advantage_03" = "Work-related advantages",
      "cg_adoption_advantage_04" = "More fun"
    )
  ) %>% 
  report_metrics(starts_with("cg_adoption_advantage_"))
```


Labels for values inside a column can be adjusted by providing a named list to the values-parameter of `labs_apply()`. In addition, select the columns where value labels should be changed:


```{r eval=FALSE}
ds %>%
  labs_apply(
    cols = starts_with("cg_adoption"),  
    values = list(
      "1" = "Strongly disagree",
      "2" = "Disagree",
      "3" = "Neutral",
      "4" = "Agree",
      "5" = "Strongly agree"
    ) 
  ) %>% 
  report_metrics(starts_with("cg_adoption"))
```

To conveniently manage all labels of a dataset, 
save the result of `codebook()` to an Excel file,
change the labels manually in a copy of the Excel file, 
and finally call `labs_apply()` with your revised codebook.

```{r, eval=FALSE}
library(readxl)
library(writexl)

# Save codebook to a file
codes <- codebook(ds)
write_xlsx(codes,"codebook.xlsx")

# Load and apply a codebook from a file
codes <- read_xlsx("codebook_revised.xlsx")
ds <- labs_apply(ds, codes)
```


Be aware that some data operations such as `mutate()` from the tidyverse
loose labels on their way. In this case, store the labels (in the
codebook attribute of the data frame) before the operation and restore
them afterwards:

```{r eval=FALSE}
ds %>%
  labs_store() %>%
  mutate(sd_age = 2024 - sd_age) %>% 
  labs_restore() %>% 
  
  report_metrics(sd_age)
```


## The volker report template

Reports combine plots, tables and effect calculations in an RMarkdown document.
Optionally, for item batteries, an index, clusters or factors are calculated and reported.

To see an example or develop own reports, use the volker report template in RStudio:

- Create a new R Markdown document from the main menu  
- In the popup select the "From Template" option  
- Select the volker template.  
- The template contains a working example. 
  Just click knit to see the result.

Have fun with developing own reports!

Without the template, to generate a volker-report from any R-Markdown document, 
add `volker::html_report` to the 
output options of your Markdown document:

```
---
title: "How to create reports?"
output: 
  volker::html_report
---
```

Then, you can generate combined outputs using the report-functions.
One advantage of the report-functions is that plots are automatically scaled to fit the page.
See the function help for further options (F1 key). 


```{r}

#> ```{r echo=FALSE}
#> ds %>% 
#>   filter(sd_gender != "diverse") %>% 
#>   report_counts(adopter, sd_gender, 
#> ```

```


### Custom tab sheets
By default, a header and tabsheets are automatically created. 
You can mix in custom content.

- If you want to add content before the report outputs, 
  set the title parameter to `FALSE` and add your 
  own title.
- A good place for methodological details is a custom tabsheet 
  next to the "Plot" and the "Table" buttons. You can add a tab
  by setting the close-parameter to `FALSE` and adding a new header
  on the fifth level (5 x # followed by the tab name). 
  Close your custom new tabsheet with `#### {-}` (4 x #). 

Try out the following pattern in an RMarkdown document!


```{r}

#> ### Adoption types
#> 
#> ```{r echo=FALSE}
#> ds %>% 
#>   filter(sd_gender != "diverse") %>% 
#>   report_counts(adopter, sd_gender, prop="rows", title=FALSE, close=FALSE)
#> ```
#>
#> ##### Method
#> Basis: Only male and female respondents.
#> 
#> #### {-}

```


# Index calculation for item batteries

For quick inspections of an index from a bunch of items, set the index parameter to `TRUE`.
The index is calculated by the average value of all selected columns.

Cronbach's Alpha and the number of items are calculated with `psych::alpha()`
and stored as column attribute named "psych.alpha". The reliability values 
are printed by `report_metrics()`.

```{r eval=FALSE}
ds |> 
  report_metrics(starts_with("cg_adoption"), index = TRUE)
```

You can add an index as a new column using `add_index()`. 
A new column is created with the average value of all selected columns
for each case. Provide a custom name for the column using the `newcol` parameter.
The `report_metrics()` function still outputs reliability values for the column.


**Add a single index**
```{r eval=FALSE}
ds %>%
  add_index(starts_with("cg_adoption_"), newcol = "idx_cg_adoption") %>%
  report_metrics(idx_cg_adoption)
```

**Compare the index values by group**
```{r eval=FALSE}
ds %>%
  add_index(starts_with("cg_adoption_"), newcol = "idx_cg_adoption") %>%
  report_metrics(idx_cg_adoption, adopter)
```

**Add multiple indizes and summarize them**
```{r eval=FALSE}
ds %>%
  add_index(starts_with("cg_adoption_")) %>%
  add_index(starts_with("cg_adoption_advantage")) %>%
  add_index(starts_with("cg_adoption_fearofuse")) %>%
  add_index(starts_with("cg_adoption_social")) %>%
  tab_metrics(starts_with("idx_cg_adoption"))
```

To reverse items, provide a selection of columns to the `cols.reverse`-parameter of `add_index()`.

# Factor and cluster analysis 

The easiest way to conduct factor analysis or cluster analyses 
is to use the respective parameters in the `report_metrics()` function. 

```{r eval=FALSE}
ds |> 
  report_metrics(starts_with("cg_adoption"), factors = TRUE, clusters = TRUE)
```

Currently, cluster analysis is performed using kmeans and factor analysis is a principal component analysis.
Setting the parameters to true, automatically generates scree plots
and selects the number of factors or clusters. 
Alternatively, you can explicitly specify the numbers.

**Add factor or cluster analysis results to the original data** 

If you want to work with the results, use `add_factors()` and `add_clusters()` respectively. 
For factor analysis, new columns prefixed with "fct_" are created to store the factor loadings based on the specified number of factors. 
For clustering, an additional column prefixed with "cls_" is added that assigns each observation to a cluster number. 

```{r eval=FALSE}

ds |> 
  add_factors(starts_with("cg_adoption"), k = 3) |> 
  select(starts_with("fct_"))
```

Once you have added factor or cluster columns to your data set,
you can use them with the report functions:

```{r eval=FALSE}
ds |> 
  add_factors(starts_with("cg_adoption"), k = 3)  |>
  report_metrics(fct_cg_adoption_1, fct_cg_adoption_2, metric = TRUE)

```


```{r eval=FALSE}
ds |>
  add_clusters(starts_with("cg_adoption"), k = 3) |>
  report_counts(sd_gender, cls_cg_adoption, prop = "cols")
```

After explicitly adding factor or cluster columns,
you can inspect the analysis results using
`factor_tab()`, `factor_plot()` or `cluster_tab()`, `cluster_plot()`.

```{r eval=FALSE}
ds |> 
  add_factors(starts_with("cg_adoption"), k = 3)  |>
  factor_tab(starts_with("fct_"))
```


**Automatically determine the number of factors or clusters**  

To automatically determine the optimal number of factors or clusters based on diagnostics, set k = NULL.  

```{r eval=FALSE}
ds |> 
  add_factors(starts_with("cg_adoption"), k = NULL) |>
  factor_tab(starts_with("fct_cg_adoption"))
```


# Modeling: Regression and Analysis of Variance

Modeling in the statistical sense is predicting an outcome (dependent
variable) from one or multiple predictors (independent variables).

The report_metrics() function calculates a linear model if the `model`
parameter is set to `TRUE`. You provide the variables in the following
parameters:

-   Dependent metric variable: first parameter (a single column in the `cols` parameter).\
-   Independent categorical variables: second parameter (a tidy column
    selection in the `cross` parameter).\
-   Independent metric variables: third parameter (a tidy column
    selection in the `metric` parameter).\
-   Interaction effects: `interactions` parameter with a vector of
    multiplication terms (e.g. `c(sd_age * sd_gender)`). 

```{r, eval=FALSE}
ds |>
 filter(sd_gender != "diverse") |>
 report_metrics(
   use_work, 
   cross = c(sd_gender, adopter), 
   metric = sd_age,
   model = TRUE,
   diagnostics = TRUE
 )
```

If you are used to formulas, you can provide your model in formula notation 
instead of using column selections. This automatically enables the `model` parameter.
The following example is equivalent to the preceding example:

```{r, eval=FALSE}
ds |>
 filter(sd_gender != "diverse") |>
 report_metrics(use_work ~ sd_gender + adopter + sd_age)
```


Four selected diagnostic plots are generated if the `diagnostics`
parameter is set to `TRUE`:

-   Residual vs. fitted: Residuals should be evenly distributed
    vertically. Horizontally, they should follow the straight line.
    Otherwise this could be an indicator for heteroscedasticity,
    non-linearity, or autocorrelation.
-   Scale-location plot: Points should be evenly distributed without any
    visible pattern, similar to the residual plot.
-   Q-Q-Plot of fitted values: All points should lie along a straight
    line. Deviations may suggest non-linearity or that residuals are not
    normally distributed.
-   Cook's distance plot: High values indicate that single cases
    influence the model disproportionately. Rule of thumb: Cook's
    distance \> 1 is a problem.

To work with the predicted values, use `add_model()` instead of the
report function. This will add a new variable prefixes with `prd_`,
holding the target scores.

```{r, eval=FALSE}
ds <- ds |> 
  add_model(
   use_work,
   categorical = c(sd_gender, adopter), 
   metric = sd_age
 )

report_metrics(ds, use_work, prd_use_work, metric = T)
```

There are two functions to get the regression table or plot from the new
column:

```{r, eval=FALSE}
model_tab(ds, prd_use_work)
model_plot(ds, prd_use_work)

```

By default, p-values are adjusted to the number of tests by controlling
the false discovery rate (fdr). Set the `adjust` parameter to `FALSE` to
disable p-value correction.

# Reliability Scores and Classification Performance Indicators

In content analysis, reliability is usually checked by coding the cases
with different persons and then calculating the overlap. To calculate
reliability scores, prepare one data frame for each person:

-   All column names in the different data frames should be identical.\
    Codings must be either binary (`TRUE`/`FALSE`) or contain a fixed
    number of values such as "sports", "politics", "weather".\
-   Add a column holding initials or the name of the coder.\
-   One column must contain unique IDs for each case (e.g., a running
    case number).

Next, you row-bind the data frames. The coder and ID columns ensure that
each coding can be uniquely related to both the coder and the case.

```{r, eval=FALSE}

data_coded <- bind_rows(
  data_coder1,
  data_coder2
)

```

The final data, for example, looks like:

| case | coder | topic_sports | topic_weather |
|------|-------|--------------|---------------|
| 1    | anne  | TRUE         | FALSE         |
| 2    | anne  | TRUE         | FALSE         |
| 3    | anne  | FALSE        | TRUE          |
| 1    | ben   | TRUE         | TRUE          |
| 2    | ben   | TRUE         | FALSE         |
| 3    | ben   | FALSE        | TRUE          |

Calculating reliability is straightforward with `report_counts()`:

-   Provide the data to the first parameter.\
-   Add the column with codings (or a selection of multiple columns,
    e.g. using `starts_with()`) to the second parameter.\
-   Set the the third parameter to the column name with coder names or
    initials.\
-   Set the `ids` parameter to the column that contains case IDs or case
    numbers (this tells the volker-package which cases belong
    together).\
-   Set the `agree` parameter to "reliability" to request reliability
    scores.

Example:

```{r, eval=FALSE}

report_counts(data_coded, starts_with("topic_"), coder, ids = case, prop = "cols", agree = "reliability")

```

If you are only interested in the scores (without a plot), use
`agree_tab`. Tip: You may abbreviate the method name (e.g., "reli"
instead of "reliability").

```{r, eval=FALSE}

agree_tab(data_coded, starts_with("topic_"), coder, ids = case, method = "reli")

```

You can also request classification performance indicators (accuracy,
precision, recall, F1) with the same function by setting the `method`
parameter to "classification" (may be abbreviated). Use this option when
comparing manual codings with automated codings (e.g., classifiers or
large language models). By default, you get macro statistics (average
precision, recall and f1 across categories).

If you have multiple values in a column, you can focus on one category
to get micro statistics:

```{r, eval=FALSE}

agree_tab(data_coded, starts_with("topic_"), coder, ids = case, method = "class", category = "catcontent")

```

# The mystery of missing values

Cases with missing values, by default, are omitted in all methods. 
Thus, the calculations are only based on cases with complete values
in the selected columns.

Furthermore, each function first cleans the values:  

- Residual levels defined in the `VLKR_NA_LEVELS` constant are recoded to missing values ("[NA] nicht beantwortet", "[NA] keine Angabe", "[no answer]" and "keine Angabe").  
- Residual numeric values defined in the `VLKR_NA_NUMBERS` constant  are recoded to missing values (-9, -2, and -1).  


```{r}

print(volker:::VLKR_NA_LEVELS)

print(volker:::VLKR_NA_NUMBERS)

```

The output always contains information about how many cases were removed due to missing values.
You have three options to treat missings:  

1. Disable recoding by setting the `clean`-parameter of the functions to `FALSE`.  
2. Override the values in VLKR_NA_LEVELS and VLKR_NA_NUMBERS by calls such as `options(vlkr.na.levels=c("Not answered"))` or `options(vlkr.na.numbers=c(-2,-9))`. If you set the value to `FALSE`, no values are recoded.  
3. When analysing items, use pairwise complete data by calling `options(vlkr.na.omit=FALSE)` (maximal information from all items). 


# What's behind the scenes?

The volker-package is based on standard methods for data handling and visualisation. 
You could produce all outputs on your own. The package just makes your
code dry - don't repeat yourself - and wraps often used snippets into a simple interface.

Report functions call subsidiary tab, plot and effect functions, which in turn call functions
specifically designed for the provided column selection. Open the package help to see,
to which specific functions the report functions are redirected.

Console and markdown output is pimped by specific print- and knit-functions. 
To make this work, the cleaned data, produced plots, tables and markdown snippets 
gain new classes (`vlkr_df`, `vlkr_plt`, `vlkr_tbl`, `vlkr_list`, `vlkr_rprt`).

The volker-package makes use of common tidyverse functions.
Basically, most outputs are generated by three functions:

- `count()` is used to produce counts  
- `skim()` is used to produce metrics  
- `ggplot()` is used to assemble plots.

Statistical tests, clustering and factor analysis are largely based on the 
stats, psych, car and effectsize packages.

Thanks to all the maintainers, authors and contributors of the packages that 
make the world of data a magical place.