Reinforcement Learning for Clinical Decision Support in Critical Care: Comprehensive Review

Background: Decision support systems based on reinforcement learning (RL) have been implemented to facilitate the delivery of personalized care. This paper aimed to provide a comprehensive review of RL applications in the critical care setting.

Objective: This review aimed to survey the literature on RL applications for clinical decision support in critical care and to provide insight into the challenges of applying various RL models.

Methods: We performed an extensive search of the following databases: PubMed, Google Scholar, Institute of Electrical and Electronics Engineers (IEEE), ScienceDirect, Web of Science, Medical Literature Analysis and Retrieval System Online (MEDLINE), and Excerpta Medica Database (EMBASE). Studies published over the past 10 years (2010-2019) that applied RL to critical care were included.

Results: We included 21 papers and found that RL has been used to optimize the choice of medications, drug dosing, and timing of interventions and to target personalized laboratory values. We further compared and contrasted the design of the RL models and the evaluation metrics for each application.

Conclusions: RL has great potential for enhancing decision making in critical care. Challenges exist regarding RL system design, evaluation metrics, and model choice. More importantly, further work is required to validate RL in authentic clinical environments.


An Introduction to Reinforcement Learning
Reinforcement learning (RL) is one of the three main types of machine learning, alongside the two types we may be more familiar with: supervised learning and unsupervised learning. Regression and classification are common supervised learning tasks and have been widely applied in the healthcare domain. Supervised learning is task-driven, because there are always data labels to oversee or "supervise" the learning process. In contrast, unsupervised learning is data-driven, where the learning process depends entirely on the inter- or intra-relations among data clusters. Unlike supervised and unsupervised learning, which learn patterns or relations from data directly, RL tries to understand the data by interacting with the environment from which the data came. It is a goal-oriented learning algorithm in which an agent, or decision maker, performs a task by taking actions in the environment and receives evaluative feedback on each action, allowing it to improve the performance of subsequent actions.
RL has been applied in various types of applications, such as robot movement, Atari games, autonomous vehicles, recommendation systems, and financial investment. All these applications share a common goal: to select the actions that maximize the total reward accumulated over all steps of the interaction with an environment.
To achieve this goal, we need to take note of the unique characteristics of RL:
1. Actions may have long-term consequences.
2. Reward may be delayed.
3. It may be better to sacrifice immediate reward to gain more long-term reward.
Taking a chess game as an example, very early moves that block the opponent might help win the game in the end.
All RL applications share the same set of building blocks: an agent, an environment, states, actions, and rewards. Mathematically, RL can be described by a Markov decision process (MDP), in which the agent observes a state, takes an action, and receives a reward from the environment at each step.
In the setting of a grid-world game (the environment), the agent needs to walk from the start position to the goal position through the white grids. We design the rules as follows:
• Reward: -1 per time-step (the agent needs to take the shortest path to the goal position to minimize the accumulated negative reward)
• Actions: North, East, South, West at each step
• States: the agent's location in the grid-world
The agent would be able to conquer the game in either of two ways:
1) The agent knows which direction to take in each of the white grids. In other words, the agent has a policy function, π(s), that gives the direction to move (the action) for every state.
2) The agent knows the value function, V(s), of each white grid, from which it can compare the values of all adjacent states and take the action that leads to the state with the largest value in the subsequent step. The value function evaluates the goodness of states, and it can be written in another form with an action term: V(s) = E_{a∈A}[Q(s, a)]; that is, the value of a state s is the average value over all possible state-action pairs in that state. The value of a state-action pair, Q(s, a), is called the Q-function.
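As a concrete illustration of the second approach, the following sketch (a hypothetical 4x4 grid-world written in Python, not taken from any of the reviewed studies) computes V(s) under the -1-per-step reward by repeatedly applying the Bellman update and then derives a greedy policy from the resulting values.

```python
import numpy as np

# Hypothetical 4x4 grid-world: the goal sits in the bottom-right corner.
# Reward is -1 per time-step, so shorter paths to the goal have higher value.
N = 4
GOAL = (N - 1, N - 1)
ACTIONS = {"North": (-1, 0), "East": (0, 1), "South": (1, 0), "West": (0, -1)}

def step(state, action):
    """Deterministic transition: move one cell, staying inside the grid."""
    if state == GOAL:
        return state, 0.0                       # the episode has ended
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return (nr, nc), -1.0                       # -1 reward per time-step

# Compute the value function V(s) by sweeping the Bellman update until it settles.
V = np.zeros((N, N))
for _ in range(100):
    for r in range(N):
        for c in range(N):
            if (r, c) == GOAL:
                continue
            V[r, c] = max(reward + V[nxt]
                          for nxt, reward in (step((r, c), a) for a in ACTIONS))

# Greedy policy: in each state, move toward the neighbour with the larger value.
policy = {
    (r, c): max(ACTIONS, key=lambda a: step((r, c), a)[1] + V[step((r, c), a)[0]])
    for r in range(N) for c in range(N) if (r, c) != GOAL
}
print(V)               # V(s) equals minus the length of the shortest path to the goal
print(policy[(0, 0)])  # e.g. 'East' or 'South' from the top-left corner
```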
To summarize, an agent can solve an RL problem by learning either a policy function or a value/Q-function. RL algorithms that learn the policy function are called policy-based RL, whereas those that learn the value/Q-function are called value-based RL.
In the next section, we introduce some common types of policy-based RL and value-based RL, as well as some other extensions.

Policy-based Reinforcement Learning
Policy Gradient
Policy gradient [1] methods aim to model and optimize the policy directly. The policy π_θ(a|s) is usually modelled by a neural network with parameters θ. When an agent acts according to the policy from a random initial state s_1, the policy tells the agent to choose action a_1 in state s_1. The environment responds to this action with a reward r_1 and leads the agent to the next state s_2. The state-action sequence continues to roll out into a trajectory τ, with a reward at each step, by following the policy π_θ. The objective of the agent is to continuously update the policy π_θ so that the average accumulated reward R(τ) over the trajectory is maximized.
The policy is updated by adjusting the parameters θ along the gradient of the reward R(τ) with respect to θ, computed through the neural network. This type of RL is therefore named policy gradient.
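As a minimal sketch of this idea, the snippet below implements the REINFORCE form of the policy gradient in PyTorch; the network sizes and the trajectory format are illustrative assumptions rather than details from the reviewed studies.

```python
import torch
import torch.nn as nn

# A small policy network pi_theta(a|s): state in, action probabilities out.
# state_dim and n_actions are placeholders for whatever environment is used.
state_dim, n_actions = 4, 2
policy_net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                           nn.Linear(32, n_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

def reinforce_update(trajectory, gamma=0.99):
    """One policy-gradient (REINFORCE) step from a single roll-out.

    trajectory: list of (state, action, reward) tuples collected by
    following the current policy in the environment.
    """
    # Discounted return G_t for every time-step, computed backwards.
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Loss = -sum_t log pi_theta(a_t|s_t) * G_t ; its gradient is the policy gradient.
    loss = 0.0
    for (state, action, _), g in zip(trajectory, returns):
        probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        loss = loss - torch.log(probs[action]) * g

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```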

Value-based Reinforcement Learning
Q-learning
The main idea of Q-learning [2] is to construct a reference map between values and state-action pairs, so that given a random initial state s_1, an agent can consult the reference map to find the action a_1 that maximizes the value of state s_1, Q(s_1, a_1). The Q-function can be rewritten in terms of the reward: Q(s_t, a_t) = r(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}), where the value of the current state-action pair (s_t, a_t) is defined as the reward r(s_t, a_t) received from the current action plus the highest estimated Q-value of the next possible state-action pair, discounted by the factor γ. The discount factor γ takes a value between 0 and 1 and represents the decay of future reward over time: rewards in the near future have a greater impact than rewards further away.
The reference map of state-action values can be constructed as a table, named the Q-table, where each entry represents the value of one state-action pair. However, as the state/action space grows, the number of entries in the Q-table must grow geometrically to store all the values, which makes Q-learning infeasible for problems with continuous state/action spaces.
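The tabular update described above can be sketched as follows; the `env.step` interface is an assumed placeholder for whatever environment provides the transitions.

```python
import random
from collections import defaultdict

# Q-table: a value for every (state, action) pair, initialised to zero.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate
actions = ["North", "East", "South", "West"]

def q_learning_step(env, state):
    """One Q-learning update; `env.step(state, action)` is an assumed interface
    that returns (next_state, reward, done)."""
    # Epsilon-greedy action selection from the current Q-table.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])

    next_state, reward, done = env.step(state, action)

    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state, done
```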

Deep Q Network (DQN)[3]
DQN is a derivative of Q-learning in which the Q-table is replaced by a deep neural network (DNN), parametrized by θ, that represents the value of the state-action pairs. The parameters θ are updated to minimize the mean squared error (loss function) between the predicted Q-value and the bootstrapped target, L(θ) = E[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))²]. Double DQN further stabilizes learning by decoupling the selection of the next action from the evaluation of its value. Another modification on top of Double DQN is to separate the Q-value into two streams, a value stream and an advantage stream, Q(s, a) = V(s) + A(s, a). This network is called Double DQN with Dueling [5]. The key motivation behind this architecture is that in some cases it is unnecessary to know the value of each action at every time-step. By explicitly separating the two estimators, the dueling architecture can learn which states are (or are not) valuable without having to learn the effect of each action in each state.
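A minimal sketch of the DQN loss in PyTorch is shown below; the network sizes, the batch format, and the use of a separate target network are standard choices assumed here for illustration rather than details taken from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, n_actions = 8, 4               # placeholder sizes

# Q-network and a separate target network (a standard DQN stabilisation trick).
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(batch, gamma=0.99):
    """Mean-squared-error DQN loss on a batch of (s, a, r, s', done) transitions.

    states/next_states: float tensors [B, state_dim]; actions: long tensor [B];
    rewards/dones: float tensors [B].
    """
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                             # target uses frozen parameters
        best_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * best_next
    return F.mse_loss(q_sa, target)
```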

Policy Iteration and Value Iteration (Fitted Q iteration)
Policy iteration [6] updates a policy in three steps. The first step is the initialization of a random policy π. The second step is policy evaluation: following this policy, what is the value of each state? As mentioned above, given a policy π, the value of a state s is the expected reward when the agent starts from s and follows π thereafter: V_π(s_t) = Σ_{s_{t+1}, r} p(s_{t+1}, r | s_t, π(s_t)) [r + γ V_π(s_{t+1})], where p(s_{t+1}, r | s_t, π(s_t)) is the probability of entering state s_{t+1} with reward r by following the policy π from state s_t. The third step is policy improvement: the policy π is updated to π′ if the new policy produces a higher value V(s). Policy evaluation and policy improvement are run iteratively until the policy becomes stable, that is, until the action-maximization step no longer changes the policy in any state.
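The three steps can be sketched as follows for a finite MDP; the transition structure `P[s][a]`, given as a list of (probability, next state, reward) triples, is an assumed representation used only for illustration.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Policy iteration on a finite MDP.

    P[s][a] is an assumed structure: a list of (prob, next_state, reward) triples.
    """
    policy = np.zeros(n_states, dtype=int)         # step 1: start from an arbitrary policy
    V = np.zeros(n_states)
    while True:
        # Step 2: policy evaluation -- compute V(s) under the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Step 3: policy improvement -- act greedily with respect to V.
        stable = True
        for s in range(n_states):
            best = max(range(n_actions),
                       key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                  # stop when no state changes its action
            return policy, V
```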
Value iteration [7] and Fitted Q Iteration [8] follow essentially the same steps as policy iteration. The main difference is that the explicit policy is replaced by an estimate derived from a value function, π(s) = argmax_a Q(s, a).
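A batch version of this idea, Fitted Q Iteration with a supervised regressor standing in for the Q-function, might look as follows; the extra-trees regressor and the transition format are illustrative assumptions rather than choices prescribed by the cited work.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.9):
    """Fitted Q Iteration from a fixed batch of (state, action, reward, next_state) tuples.

    States are assumed to be feature vectors and actions small integers. The
    Q-function is approximated with a supervised regressor (extra trees here, an
    arbitrary choice); each iteration regresses onto the bootstrapped targets
    r + gamma * max_a' Q(s', a').
    """
    states = np.array([t[0] for t in transitions])
    actions = np.array([t[1] for t in transitions]).reshape(-1, 1)
    rewards = np.array([t[2] for t in transitions])
    next_states = np.array([t[3] for t in transitions])

    X = np.hstack([states, actions])                 # regressor input: (state, action)
    model = ExtraTreesRegressor(n_estimators=50)
    model.fit(X, rewards)                            # first iteration: Q ~ immediate reward

    for _ in range(n_iters):
        # max over actions of the current Q-estimate at the next state
        q_next = np.column_stack([
            model.predict(np.hstack([next_states, np.full((len(next_states), 1), a)]))
            for a in range(n_actions)
        ]).max(axis=1)
        model.fit(X, rewards + gamma * q_next)       # regress onto the updated targets
    return model
```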

Actor-Critic Reinforcement Learning
Actor-Critic RL [9] is a combination of policy-based RL and value-based RL. It has two networks, both parameterized with DNNs: one is called the actor network and the other the critic network. The critic network is similar to value-based RL, in that it estimates the value function or Q-function. The actor network updates the policy as in policy-based RL, improving the policy in the direction suggested by the critic network. Actor-Critic RL has two main advantages over pure policy-based and value-based RL: 1) convergence is guaranteed even for nonlinear approximations of the value function (which is not the case for Q-learning), and 2) it reduces variance with respect to pure policy-search methods.
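A one-step actor-critic update might be sketched as follows in PyTorch; the use of the temporal-difference error as the critic's "suggestion" and the network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                    # placeholder sizes

# Actor outputs action probabilities; critic estimates the state value V(s).
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                      nn.Linear(32, n_actions), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(state, action, reward, next_state, done, gamma=0.99):
    """One-step actor-critic update for a single transition."""
    s = torch.as_tensor(state, dtype=torch.float32)
    s2 = torch.as_tensor(next_state, dtype=torch.float32)

    # TD error from the critic: how much better or worse the outcome was than expected.
    v_s = critic(s).squeeze()
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else critic(s2).squeeze())
    td_error = target - v_s

    # Critic learns to predict the target; actor moves in the direction the critic suggests.
    critic_loss = td_error.pow(2)
    actor_loss = -torch.log(actor(s)[action]) * td_error.detach()

    opt.zero_grad()
    (critic_loss + actor_loss).backward()
    opt.step()
```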

Model-based Reinforcement Learning
All the RL algorithms discussed above are model-free RL, in which the transition function p(s_{t+1} | s_t, a_t) is assumed to be unknown. Therefore, given the current state-action pair, the RL agent cannot tell what the true next state will be. In fact, model-free RL does not attempt to learn the transition function explicitly; it bypasses the transition function by sampling from the environment. In model-based RL [10], by contrast, the agent aims to learn the transition function from the environment, so that given the current state and an action, a model-based RL algorithm can estimate the probability of every possible next state. With model-based RL, one can easily generate new samples from the learned model of the environment.
In model-based RL, we first act in the environment to collect a few trajectories of state-action pairs. We then fit a model, for example with a DNN or Monte Carlo tree search, and use this model to generate new trajectories. In the next step, we update the value function or the policy function from the generated trajectories and use the updated value/policy function to select actions back in the real environment. This process repeats over and over, gradually improving the model so that it faithfully represents the feedback from the real environment.
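A tabular sketch of this loop is shown below: real transitions are counted to estimate the transition model, and imagined roll-outs are then generated from that model; all interfaces are illustrative placeholders rather than details from the reviewed studies.

```python
import random
from collections import defaultdict

# Count the transitions observed in real trajectories, turn the counts into an
# estimated transition model P_hat(s'|s,a), and generate imagined roll-outs from it.
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times the transition was seen
reward_sums = defaultdict(lambda: [0.0, 0])      # reward_sums[(s, a)] = [total reward, n observations]

def observe(state, action, reward, next_state):
    """Record one real transition from the environment."""
    counts[(state, action)][next_state] += 1
    reward_sums[(state, action)][0] += reward
    reward_sums[(state, action)][1] += 1

def sample_model(state, action):
    """Sample a next state and the average reward from the learned model."""
    successors = counts[(state, action)]
    next_state = random.choices(list(successors), weights=list(successors.values()))[0]
    total, n = reward_sums[(state, action)]
    return next_state, total / n

def imagine_rollout(start_state, policy, horizon=10):
    """Generate an imagined trajectory from the model, without touching the real environment."""
    s, trajectory = start_state, []
    for _ in range(horizon):
        a = policy(s)
        s2, r = sample_model(s, a)
        trajectory.append((s, a, r, s2))
        s = s2
    return trajectory
```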

Inverse Reinforcement Learning
In most RL algorithms, the reward function is hand-crafted without knowing the true reward, and this type of reward design is very vulnerable to misspecification. Inverse RL [11] is an alternative to the hand-designed reward function: it learns the reward directly from expert demonstrations. Inverse RL aims to learn an optimal policy from (possibly sub-optimal) demonstrations, and its goal is to recover the right reward function.
The general idea behind inverse RL with sampled trajectories is to iteratively improve a reward function by comparing the value of the approximately optimal expert policy with the values of a set of generated policies. The key steps are as follows:
1. Estimate the value of the expert policy at the initial state, V̂^{π*}(s_1), as well as the value of every generated policy, V̂^{π_i}(s_1), by taking the average cumulative reward of many randomly sampled trajectories.
2. Generate an estimate of the reward function R by solving a linear programming problem. Specifically, set the weights w_i in the reward function R(s) = w_1 φ_1(s) + w_2 φ_2(s) + ⋯ + w_d φ_d(s) to maximize the difference between the value of the expert policy and that of each of the k generated policies.
3. Add the newly generated policy to the set of k candidate policies and repeat steps 1 and 2 for multiple iterations.
Another way to learn the reward function is through a DNN, where the input to the DNN is the state-action pairs (trajectories) produced by a sub-optimal policy π, and the output of the DNN is a reward R_φ(s, a), where φ denotes the parameters of the DNN, learned through backpropagation.
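A sketch of such a reward network is shown below in PyTorch; because the text does not specify the training signal, the snippet assumes a common discriminator-style surrogate objective in which expert state-action pairs are pushed toward higher predicted reward than pairs generated by the current policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 4                     # placeholder sizes

# Reward network R_phi(s, a): a state-action pair in, a scalar reward out.
reward_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def reward_update(expert_sa, policy_sa):
    """One gradient step on phi.

    expert_sa: tensor of state-action pairs from expert demonstrations.
    policy_sa: tensor of state-action pairs from the current (sub-optimal) policy.
    The (assumed) training signal pushes the predicted reward of expert pairs above
    that of policy-generated pairs; a full inverse RL method would alternate this
    step with re-optimizing the policy under the current reward estimate.
    """
    expert_r = reward_net(expert_sa)
    policy_r = reward_net(policy_sa)
    # Binary cross-entropy: label expert pairs 1, policy pairs 0.
    loss = F.binary_cross_entropy_with_logits(expert_r, torch.ones_like(expert_r)) + \
           F.binary_cross_entropy_with_logits(policy_r, torch.zeros_like(policy_r))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```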