--- title: "Solving a finite-horizon semi-MDP" author: "Lars Relund " date: "`r Sys.Date()`" bibliography: litt.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{finite-mdp} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} library(knitr) options(rmarkdown.html_vignette.check_title = FALSE, # formatR.arrow = TRUE, # scipen=999, # digits=5, width=90) #thm <- knit_theme$get("edit-kwrite") # whitengrey, bright, print, edit-flashdevelop, edit-kwrite #knit_theme$set(thm) knit_hooks$set( par = function(before, options, envir) { if (before && options$fig.show != 'none') par(mar = c(0, 0, 0, 0), # bottom, left, top, and right oma = c(0, 0, 0, 0))} ) knitr::opts_chunk$set( # collapse = TRUE, comment = "#>", fig.align = 'center', fig.width = 9, fig.height = 5, fig.show = 'hold', out.extra = 'style="max-width:100%;"', # tidy = TRUE, # prompt=T, # comment=NA, cache = F # background = "red" ) library(magrittr) library(dplyr) ``` The `MDP2` package in R is a package for solving Markov decision processes (MDPs) with discrete time-steps, states and actions. Both traditional MDPs [@Puterman94], semi-Markov decision processes (semi-MDPs) [@Tijms03] and hierarchical-MDPs (HMDPs) [@Kristensen00] can be solved under a finite and infinite time-horizon. The package implement well-known algorithms such as policy iteration and value iteration under different criteria e.g. average reward per time unit and expected total discounted reward. The model is stored using an underlying data structure based on the *state-expanded directed hypergraph* of the MDP (@Relund06) implemented in `C++` for fast running times. Building and solving an MDP is done in two steps. First, the MDP is built and saved in a set of binary files. Next, you load the MDP into memory from the binary files and apply various algorithms to the model. For building the MDP models see `vignette("building")`. In this vignette we focus on the second step, i.e. finding the optimal policy. Here we consider a finite-horizon semi-MDP. ```{r} library(MDP2) ``` ## A finite-horizon semi-MDP A *finite-horizon semi-MDP* considers a sequential decision problem over $N$ *stages*. Let $I_{n}$ denote the finite set of system states at stage $n$. When *state* $i \in I_{n}$ is observed, an *action* $a$ from the finite set of allowable actions $A_n(i)$ must be chosen, and this decision generates *reward* $r_{n}(i,a)$. Moreover, let $\tau_n(i,a)$ denote the *stage length* of action $a$, i.e. the expected time until the next decision epoch (stage $n+1$) given action $a$ and state $i$. Finally, let $p_{ij}(a,n)$ denote the *transition probability* of obtaining state $j\in I_{n+1}$ at stage $n+1$ given that action $a$ is chosen in state $i$ at stage $n$. ## Example Consider a small machine repair problem used as an example in @Relund06 where the machine is always replaced after 4 years. The state of the machine may be: good, average, and not working. Given the machine's state we may maintain the machine. In this case the machine's state will be good at the next decision epoch. Otherwise, the machine's state will not be better at next decision epoch. When the machine is bought it may be either in state good or average. Moreover, if the machine is not working it must be replaced. The problem of when to replace the machine can be modeled using a Markov decision process with $N=5$ decision epochs. We use system states `good`, `average`, `not working` and dummy state `replaced` together with actions buy (`buy`), maintain (`mt`), no maintenance (`nmt`), and replace (`rep`). The set of states at stage zero $S_{0}$ contains a single dummy state `dummy` representing the machine before knowing its initial state. The only possible action is `buy`. The cost of buying the machine is 100 with transition probability of 0.7 to state `good` and 0.3 to state `average`. The reward (scrap value) of replacing a machine is 30, 10, and 5 in state `good`, `average` and `not working`, respectively. The reward of the machine given action `mt` are 55, 40, and 30 in state `good`, `average` and `not working`, respectively. Moreover, the system enters state 0 with probability 1 at the next stage. Finally, the reward, transition states and probabilities given action $a=$`nmt` are given by: $n:s$ $1:$ `good` $1:$ `average` $2:$ `good` $2:$ `average` $3:$ `good` $3:$ `average` ------------- --------------- ------------------ ------------------ ------------------- ------------------ ------------------ $r_n(i,a)$ 70 50 70 50 70 50 $j$ $\{0,1\}$ $\{1,2\}$ $\{0,1\}$ $\{1,2\}$ $\{0,1\}$ $\{1,2\}$ $p_{ij}(a,n)$ $\{0.6,0.4\}$ $\{0.6,0.4\}$ $\{0.5,0.5\}$ $\{0.5,0.5\}$ $\{0.2,0.8\}$ $\{0.2,0.8\}$ Let us try to load the model and get some info: ```{r, par=TRUE} prefix <- paste0(system.file("models", package = "MDP2"), "/machine1_") mdp <- loadMDP(prefix) getInfo(mdp, withList = F, dfLevel = "action", asStringsActions = TRUE) ``` The state-expanded hypergraph representing the semi-MDP with finite time-horizon can be plotted using ```{r, par=TRUE} plot(mdp, actionColor = "label", radx = 0.06, marX = 0.065, marY = 0.055) ``` Each node corresponds to a specific state and a directed hyperarc is defined for each possible action. For instance, action `mt` (maintain) corresponds to a deterministic transition to state `good` and action `nmt` (not maintain) corresponds to a transition to a condition/state not better than the current condition/state. We buy the machine in stage 1 and may choose to replace the machine. Let us use value iteration to find the optimal policy maximizing the expected total reward: ```{r solve3} scrapValues <- c(30, 10, 5, 0) # scrap values (the values of the 4 states at the last stage) runValueIte(mdp, "Net reward", termValues = scrapValues) ``` The optimal policy is: ```{r, par=TRUE} pol <- getPolicy(mdp) tail(pol) plot(mdp, actionsVisible = "policy", stateLabel = "weight", radx = 0.06, marX = 0.065, marY = 0.055) ``` Note given the optimal policy the total expected reward is `r pol %>% filter(stateStr == "0,0") %>% pull(weight)` and the machine will never make a transition to states `not working` and `replaced`. We may evaluate a certain policy, e.g. the policy always to maintain the machine: ```{r Set policy (machine rep),echo=TRUE,eval=TRUE} policy<-data.frame(sId=c(8,11), aIdx=c(0,0)) # set the policy for sId 8 and 11 to mt setPolicy(mdp, policy) getPolicy(mdp) ``` If the policy specified in `setPolicy` does not contain all states then the actions from the previous optimal policy are used. In the output above we can see that the policy now is to maintain always. However, the reward of the policy has not been updated. Let us calculate the expected reward: ```{r Calc reward (machine rep),echo=TRUE} runCalcWeights(mdp, "Net reward", termValues = scrapValues) tail(getPolicy(mdp)) ``` That is, the expected reward is 90.5 compared to 102.2 which was the reward of the optimal policy. ## References