--- title: "Regression Diagnostics by Period using REPS" author: "" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Regression Diagnostics by Period using REPS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ./REFERENCES.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include=FALSE} library(REPS) data("data_constraxion") ``` ## Introduction The `calculate_regression_diagnostics()` function in **REPS** provides *regression diagnostics by period*. It is designed for panel or repeated cross-section data (e.g. property transactions over time) to evaluate the quality of **period-specific log-linear regressions**. For each period, it: - Fits a log-linear regression model: `log(price) ~ covariates` - Computes diagnostics: - **Shapiro-Wilk p-value** (normality) - **Adjusted R-squared** (linearity) - **Durbin-Watson test** (autocorrelation) - **Breusch-Pagan test** (heteroscedasticity) These diagnostics help assess **model quality over time**, identifying periods with issues like non-normality, low fit, heteroscedasticity, or autocorrelation. ## Required Data Your dataset should include: - A **period variable** (e.g. quarterly/annual codes) - A **dependent variable** (typically price) - One or more **numerical independent variables** (e.g. floor area) - Optionally, **categorical independent variables** (e.g. neighbourhood codes) ```{r} # Example dataset (you should already have this loaded) head(data_constraxion) # We log transform the floor_area again (see vignette on calculating price index as why) dataset <- data_constraxion dataset$floor_area <- log(dataset$floor_area) ``` ## Using `calculate_regression_diagnostics()` Example: ```{r} diagnostics <- calculate_regression_diagnostics( dataset = dataset, period_variable = "period", dependent_variable = "price", numerical_variables = c("floor_area", "dist_trainstation"), categorical_variables = c("dummy_large_city", "neighbourhood_code") ) head(diagnostics) ``` ## Visualizing Diagnostics For convenient visualization: ```r plot_regression_diagnostics(diagnostics) ``` This generates a **3x2 grid** of plots: - Normality (p-value Shapiro-Wilk) - Linearity (Adjusted R-squared) - Autocorrelation (Durbin-Watson statistic) - Autocorrelation (p-value Durbin-Watson) - Heteroscedasticity (p-value Breusch-Pagan) Example: ```{r echo=FALSE, out.width="100%", fig.align="center"} knitr::include_graphics("diagnostics_plot.png") ``` ## Interpreting the Output The hedonic price index relies on a log-linear regression model, which assumes that certain statistical conditions hold. The diagnostics plot provides an overview of how well these assumptions are met across different periods. Each subplot corresponds to a specific model assumption: ### Row 1: Normality and Linearity - **Shapiro-Wilk test (left plot)** - Shows p-values for the normality of residuals. - A p-value below 0.05 (dashed red line) indicates a potential violation of the normality assumption. - **Adjusted R-squared (right plot)** - Reflects the explanatory power of the regression model. - Values below 0.6 (dashed red line) may indicate a weak linear relationship. ### Row 2: Independence - **Durbin-Watson statistic (left plot)** - Tests for autocorrelation in residuals. - Ideal value is around 2. - Values outside the 1.75–2.25 range (dashed lines) suggest potential autocorrelation. - **Durbin-Watson p-value (right plot)** - Indicates whether autocorrelation is statistically significant. - p > 0.05: no significant evidence of autocorrelation. - p ≤ 0.05: residuals may not be independent. ### Row 3: Homoscedasticity - **Breusch-Pagan p-value** - Tests whether residuals have constant variance. - A p-value below 0.05 (dashed red line) suggests heteroscedasticity (non-constant variance). ## Summary The `calculate_regression_diagnostics()` and `plot_regression_diagnostics()` functions in **REPS** enable: - **Period-by-period regression checking** - **Easy comparison of assumptions over time** - **Detection of problematic periods** They support **robust, high-quality** hedonic price index modeling by systematically checking regression assumptions.