---
title: "Getting started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

`insurancerating` provides actuarial building blocks for insurance pricing in R.

A common GLM-based pricing exercise often combines several tasks:

1. portfolio analysis
2. model estimation
3. interpretation of fitted coefficients
4. refinement of tariff structure

This vignette illustrates one way to combine the main building blocks:

- analyse risk factors with `factor_analysis()`
- estimate pricing models with `glm()`
- interpret coefficients with `rating_table()`
- assess model stability with `model_performance()` and `bootstrap_performance()`

The focus is on the transition from portfolio data to an interpretable tariff structure.

## Data

We use the example dataset `MTPL2`, which contains a motor portfolio with:

- number of claims (`nclaims`),
- exposure (`exposure`),
- premium (`premium`),
- claim amounts (`amount`),
- several rating factors

```{r, warning = FALSE, message = FALSE}
library(insurancerating)
library(dplyr)
head(MTPL2)
```

## Step 1: Portfolio analysis

### Factor analysis

A pricing analysis often starts with an analysis of the portfolio. Before fitting a model, it is necessary to understand:

- how experience differs across factor levels
- whether differences are credible
- whether exposure is sufficient
- whether the observed pattern is plausible

This is done with `factor_analysis()`.

### Basic factor analysis

We start by analysing a single risk factor.

```{r}
fa <- factor_analysis(
  MTPL,
  risk_factors = "zip",
  claim_count = "nclaims",
  exposure = "exposure",
  claim_amount = "amount"
)
fa
```

The output provides commonly used portfolio metrics such as:

- frequency = claims / exposure
- average severity = loss / claims
- risk premium = loss / exposure
- loss ratio = loss / premium
- average premium = premium / exposure

### Visualising factor behaviour

```{r}
autoplot(fa, metrics = c("exposure", "frequency", "risk_premium"))
```

This provides a direct view of:

- the distribution of exposure
- the variation in claim frequency
- the variation in risk premium

At this stage, the purpose is not yet to fit a model, but to understand whether the factor behaves in a way that is suitable for pricing.

## Step 2: Continuous variables

### Why continuous variables are treated separately

Continuous variables are typically not used directly in a tariff. In pricing practice, they are usually:

1. analysed as continuous variables
2. translated into tariff segments
3. used in a GLM as categorical rating factors

This ensures that the final tariff remains interpretable and implementable.

### Analysing the shape with a GAM

```{r}
age_freq <- risk_factor_gam(
  data = MTPL,
  risk_factor = "age_policyholder",
  claim_count = "nclaims",
  exposure = "exposure"
)
autoplot(age_freq, show_observations = TRUE)
```

This step is used to inspect:

- non-linear patterns
- local volatility
- areas with low exposure
- plausible breakpoints for tariff segments

### Deriving tariff segments

```{r}
age_segments <- derive_tariff_segments(age_freq)
autoplot(age_segments)
```

This converts the continuous variable into risk-homogeneous tariff segments. The resulting segments should reflect differences in risk, while remaining suitable for use in a tariff.

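To make the idea of a tariff segment concrete, the following base R sketch bins a continuous variable with hand-picked breakpoints and computes the claim frequency per segment. It is an illustration only: the data frame `toy_age`, its columns, and the breakpoints are simulated or hypothetical, and this is not the procedure that `derive_tariff_segments()` applies.

```{r}
# Illustration only: simulated data and hand-picked breakpoints (assumptions)
set.seed(1)
toy_age <- data.frame(
  age = sample(18:90, 1000, replace = TRUE),
  exposure = runif(1000, 0.2, 1)
)

# Simulated claim counts with a higher frequency for ages below 25
toy_age$nclaims <- rpois(
  1000,
  lambda = toy_age$exposure * ifelse(toy_age$age < 25, 0.25, 0.10)
)

# Hypothetical breakpoints; intervals are closed on the right, e.g. (24, 39]
breaks <- c(17, 24, 39, 64, 90)
toy_age$age_cat <- cut(toy_age$age, breaks = breaks)

# Frequency per segment = claims / exposure
seg <- aggregate(cbind(nclaims, exposure) ~ age_cat, data = toy_age, FUN = sum)
seg$frequency <- seg$nclaims / seg$exposure
seg
```

In this sketch the breakpoints are fixed in advance. In the workflow above they are derived from the fitted GAM, so that segments follow the observed risk pattern.
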
### Adding tariff segments to the data

```{r}
dat <- MTPL |>
  add_tariff_segments(age_segments, name = "age_cat") |>
  mutate(across(where(is.character), as.factor)) |>
  mutate(across(where(is.factor), ~ set_reference_level(., exposure)))
```

`set_reference_level()` sets the reference level to the level with the highest exposure. In pricing models, this is often the most stable and interpretable baseline.

## Step 3: Model estimation

### Why GLMs are used

Generalized linear models are widely used in insurance pricing because they:

- accommodate non-normal response distributions
- produce interpretable multiplicative effects
- can be translated into tariff relativities

A common decomposition is:

- frequency --> Poisson GLM
- severity --> Gamma GLM

### Frequency model

```{r}
mod_freq <- glm(
  nclaims ~ age_cat,
  offset = log(exposure),
  family = poisson(),
  data = dat
)
```

### Severity model

```{r}
mod_sev <- glm(
  amount ~ age_cat,
  weights = nclaims,
  family = Gamma(link = "log"),
  data = dat |> filter(amount > 0)
)
```

Frequency and severity are modelled separately because they capture different aspects of the loss process.

### Constructing a premium proxy

```{r}
premium_df <- dat |>
  add_prediction(mod_freq, mod_sev) |>
  mutate(premium = pred_nclaims_mod_freq * pred_amount_mod_sev)

head(premium_df)
```

This produces a pure premium estimate, i.e. expected loss per unit of exposure.

## Step 4: Premium model

### Fitting a premium model

```{r}
burn_unrestricted <- glm(
  premium ~ age_cat + zip,
  weights = exposure,
  family = Gamma(link = "log"),
  data = premium_df
)
```

This model combines the rating factors into a single premium structure. In practice, this is often the model that is closest to the final tariff logic, because it reflects the premium level rather than only individual model components such as frequency or severity.

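The models above use a log link, which is what makes the fitted effects multiplicative. The following base R sketch, on simulated data, shows how exponentiated coefficients can be read as relativities against the reference level. The data frame `toy_seg`, the factor `segment`, and the values in `true_rel` are assumptions made for illustration.

```{r}
# Illustration only: simulated portfolio with three hypothetical segments
set.seed(42)
n <- 2000
toy_seg <- data.frame(
  segment = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  exposure = runif(n, 0.5, 1)
)

# Assumed true relativities; "A" is the reference level
true_rel <- c(A = 1, B = 1.5, C = 0.8)
toy_seg$nclaims <- rpois(
  n,
  lambda = 0.1 * toy_seg$exposure * true_rel[as.character(toy_seg$segment)]
)

# Poisson frequency model with exposure as offset
fit <- glm(
  nclaims ~ segment + offset(log(exposure)),
  family = poisson(),
  data = toy_seg
)

# With a log link, exp(coefficient) is a multiplicative factor:
# the intercept is the base frequency for "A", the others are relativities
rel <- exp(coef(fit))
rel
```

The expected frequency for segment "B" at one unit of exposure is then the product `rel["(Intercept)"] * rel["segmentB"]`, which matches `predict(fit, type = "response")` for such a policy.
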
## Step 5: Interpreting coefficients

### Rating table

```{r}
rt <- rating_table(burn_unrestricted)
rt
```

`rating_table()` expresses fitted coefficients in terms of the original factor levels, including the reference level. This output is commonly used to inspect tariff relativities.

### Visualising coefficients

```{r}
rating_table(burn_unrestricted) |>
  autoplot()
```

This plot is typically used to assess:

- the relative size of coefficients
- the structure across levels
- the exposure behind each level
- whether additional refinement may be needed

At this stage, the relevant questions are:

- are coefficients sufficiently stable?
- do they follow the expected pattern?
- are some levels driven by limited exposure?

## Step 6: Model evaluation

### Model performance

```{r}
model_performance(mod_freq)
```

This provides summary measures of model fit, such as RMSE.

### Bootstrap performance

```{r}
bp <- bootstrap_performance(mod_freq, dat, n_resamples = 50, show_progress = FALSE)
autoplot(bp)
```

This provides a view of predictive stability by evaluating how performance changes across bootstrap samples.

A single fit statistic is usually not sufficient. In pricing practice, it is also relevant to assess whether the model behaves consistently under small data perturbations.

## Step 7: From model to tariff

At this point, the example has produced:

- portfolio-level insight
- fitted pricing models
- interpretable factor relativities
- basic performance diagnostics

In many cases, a further step is required before the model output can be used as a tariff. Typical reasons include:

- irregular coefficient patterns
- monotonicity requirements
- externally imposed restrictions
- expert-driven adjustments

This can be handled with the refinement tools described in [Refinement building blocks](refinement-workflow.html).

## Summary

A possible sequence in `insurancerating` is:

```{r, eval = FALSE}
factor_analysis()          # analyse portfolio behaviour
risk_factor_gam()          # analyse continuous variables
derive_tariff_segments()   # derive tariff segments
glm()                      # estimate pricing models
rating_table()             # interpret fitted coefficients
bootstrap_performance()    # assess stability
prepare_refinement()       # refine tariff structure if needed
```

The aim is to move from raw portfolio data to a tariff structure that is:

- interpretable
- reproducible
- and suitable for practical pricing use

## Next steps

The following vignette covers the refinement step in more detail:

- [Refinement building blocks](refinement-workflow.html)

For the conceptual background to exposure, risk premium, and tariff design, see:

- [Pricing workflow building blocks](pricing-workflow-building-blocks.html)