Optimal Value Function in Markov Decision Process (MDP)
- Python Automation and Machine Learning for ICs -
- An Online Book -
http://www.globalsino.com/ICs/


=================================================================================

The optimal value function in a Markov Decision Process (MDP) assigns to each state the expected cumulative reward obtained by following the optimal policy from that state onward. It is usually denoted V*(s), where s is a state.

For a given state s, V*(s) is the maximum expected discounted cumulative reward achievable over all policies when starting from that state. Mathematically, it is given by,

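$$V^*(s) = \max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \,\middle|\, S_0 = s\right]$$ ------------------------- [3663a]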

where,

π is a policy.

Eπ denotes the expected value under policy π.

γ is the discount factor.

St and At are the state and action at time t.

R(St, At) is the immediate reward at time t.
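
As a small illustration of the quantity inside the expectation, the sketch below computes the discounted cumulative reward of a single, finite reward sequence. The function name `discounted_return` and the reward values are hypothetical. V*(s) itself is the maximum over policies of the expected value of this quantity, which this snippet does not compute.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backward through the sequence: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards R(S_t, A_t) observed at t = 0, 1, 2
rewards = [1.0, 0.0, 2.0]
gamma = 0.5
g = discounted_return(rewards, gamma)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

The backward accumulation avoids computing powers of γ explicitly and mirrors the recursive structure that the Bellman equations exploit.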

The optimal value function satisfies the Bellman optimality equation, which is given by,

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$ ------------------------- [3663b]

where,

a is an action.

P(s'∣s, a) is the transition probability to the next state.

R(s, a, s′) is the immediate reward.

V*(s′) is the optimal value of the next state.
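
The Bellman optimality equation can be turned into an iterative procedure, commonly called value iteration, by applying its right-hand side repeatedly until the values stop changing. The sketch below is a minimal version under stated assumptions: the two-state MDP, the nested-dictionary layout `P[s][a][s2]` and `R[s][a][s2]`, and the function name `value_iteration` are all hypothetical choices made for illustration.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-10, max_iter=10_000):
    """Repeatedly apply the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        V_new = {}
        for s in states:
            # max over actions of the expected (reward + discounted next value)
            V_new[s] = max(
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            break
    return V


# Hypothetical deterministic MDP: "stay" keeps the state, "move" switches it.
states = ["A", "B"]
actions = ["stay", "move"]
P = {s: {a: {s2: 0.0 for s2 in states} for a in actions} for s in states}
P["A"]["stay"]["A"] = 1.0
P["A"]["move"]["B"] = 1.0
P["B"]["stay"]["B"] = 1.0
P["B"]["move"]["A"] = 1.0

# Reward of 1 only for remaining in state B; every other transition gives 0.
R = {
    s: {a: {s2: (1.0 if s == "B" and s2 == "B" else 0.0) for s2 in states}
        for a in actions}
    for s in states
}

V_star = value_iteration(states, actions, P, R, gamma=0.9)
print(V_star)
```

For this toy problem the fixed point can be checked by hand: V*(B) = 1 + 0.9 V*(B) gives V*(B) = 10, and V*(A) = 0.9 V*(B) = 9.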

In a non-stationary setting, where the transition probabilities (and possibly the rewards) change over time, the value of a state depends on the time step as well as on the state. The value function and the Bellman equations therefore carry a time index t.

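The non-stationary Bellman expectation equation relates the value of a state at time t under a policy π to the expected immediate reward plus the discounted value of the next state at time t + 1, using the time-varying transition probabilities. Writing πt(a∣s) for the probability that policy π selects action a in state s at time t,

$$V_t^{\pi}(s) = \sum_{a} \pi_t(a \mid s) \sum_{s'} P_t(s' \mid s, a)\left[R_t(s, a, s') + \gamma V_{t+1}^{\pi}(s')\right]$$ ------------------------- [3663c]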

where,

t represents the current time step.

Pt(s′∣s, a) is the transition probability from state s to state s′ at time t given action a.

Rt(s, a, s′) is the immediate reward at time t.
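
A Bellman expectation recursion with time-varying transitions can be sketched for a fixed policy over a finite horizon. Everything specific in this example is an assumption made for illustration: the finite horizon T with zero terminal value, the callables `P_t`, `R_t` and `policy`, the two-state toy problem, and the function name `evaluate_policy`.

```python
def evaluate_policy(states, policy, P_t, R_t, gamma, T):
    """Evaluate a fixed policy backward in time for t = T-1, ..., 0.

    policy(t, s)     -> dict mapping action -> probability
    P_t(t, s, a)     -> dict mapping next state -> probability
    R_t(t, s, a, s2) -> immediate reward
    """
    V = [dict() for _ in range(T + 1)]
    V[T] = {s: 0.0 for s in states}  # assumed terminal condition
    for t in range(T - 1, -1, -1):
        for s in states:
            V[t][s] = sum(
                pi_a * sum(p * (R_t(t, s, a, s2) + gamma * V[t + 1][s2])
                           for s2, p in P_t(t, s, a).items())
                for a, pi_a in policy(t, s).items()
            )
    return V


# Hypothetical toy problem: "move" succeeds with probability 0.5 at t = 0
# and with probability 1.0 afterwards, so the transitions depend on time.
def P_t(t, s, a):
    if a == "stay":
        return {s: 1.0}
    q = 0.5 if t == 0 else 1.0
    return {1 - s: q, s: 1.0 - q}


def R_t(t, s, a, s2):
    return 1.0 if s2 == 1 else 0.0  # reward for landing in state 1


def policy(t, s):
    return {"stay": 0.5, "move": 0.5}  # uniformly random policy


V_pi = evaluate_policy([0, 1], policy, P_t, R_t, gamma=1.0, T=2)
print(V_pi[0])
```

Working the recursion by hand for this toy problem gives values of 0.5 for both states at t = 1, and 0.75 and 1.25 for states 0 and 1 at t = 0.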

The non-stationary Bellman optimality equation provides a recursive relationship for the optimal value function,

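$$V_t^*(s) = \max_{a} \sum_{s'} P_t(s' \mid s, a)\left[R_t(s, a, s') + \gamma V_{t+1}^*(s')\right]$$ ------------------------- [3663d]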

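This equation is applied iteratively, one time step at a time. For a finite time horizon T, the optimal values are typically computed backward in time, starting from the final step and working toward t = 0.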
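
The iterative use of the non-stationary optimality equation can be sketched as backward induction over a finite horizon. This is a minimal sketch, not a definitive implementation: the finite horizon T with zero terminal value, the callables `P_t` and `R_t`, the two-state toy problem, and the function name `backward_induction` are all assumptions introduced for illustration.

```python
def backward_induction(states, actions, P_t, R_t, gamma, T):
    """Compute time-indexed optimal values V[t][s] for t = T, T-1, ..., 0.

    P_t(t, s, a)     -> dict mapping next state -> probability
    R_t(t, s, a, s2) -> immediate reward
    """
    V = [dict() for _ in range(T + 1)]
    V[T] = {s: 0.0 for s in states}  # assumed terminal condition
    for t in range(T - 1, -1, -1):
        for s in states:
            # max over actions of expected (reward + discounted value at t + 1)
            V[t][s] = max(
                sum(p * (R_t(t, s, a, s2) + gamma * V[t + 1][s2])
                    for s2, p in P_t(t, s, a).items())
                for a in actions
            )
    return V


# Hypothetical toy problem: "move" succeeds with probability 0.5 at t = 0
# and with probability 1.0 afterwards, so the transitions depend on time.
def P_t(t, s, a):
    if a == "stay":
        return {s: 1.0}
    q = 0.5 if t == 0 else 1.0
    return {1 - s: q, s: 1.0 - q}


def R_t(t, s, a, s2):
    return 1.0 if s2 == 1 else 0.0  # reward for landing in state 1


V_opt = backward_induction([0, 1], ["stay", "move"], P_t, R_t, gamma=1.0, T=2)
print(V_opt[0])
```

Because the values at time t depend only on the values at time t + 1, a single backward sweep is enough; no repeated sweeps to convergence are needed in the finite-horizon case.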

The expected cumulative reward at time t in a non-stationary Markov Decision Process (MDP) is given by,

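$$V_t^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{T} \gamma^{k-t} R(S_k, A_k) \,\middle|\, S_t = s\right]$$ ------------------------- [3663e]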

----- [3663f]

where,

Eπ denotes the expected value under the optimal policy.

T is the time horizon.

R(Sk, Ak) is the immediate reward at time step k resulting from taking action Ak in state Sk.

A remarkable property of the linear quadratic regulator (LQR), a special case of the MDP with linear dynamics and a quadratic reward (or cost), is that the optimal value function is quadratic in the state. Only the coefficients of that quadratic need to be tracked, which makes the optimal value function, often denoted V*, efficient to compute.
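
The quadratic form of the value function can be illustrated with a scalar, finite-horizon LQR. This sketch rests on assumptions that are not stated in the text: deterministic dynamics x_{t+1} = a x_t + b u_t, a per-step cost q x² + r u² (the negative of a reward), a terminal cost q x², and the hypothetical names `scalar_lqr` and `simulate_cost`. Under these assumptions the optimal cost-to-go is p_t x², and the coefficients p_t follow a backward Riccati recursion.

```python
def scalar_lqr(a, b, q, r, T):
    """Backward Riccati recursion for a scalar finite-horizon LQR.

    Returns coefficients p[t] with optimal cost-to-go p[t] * x**2,
    and feedback gains k[t] with optimal control u_t = -k[t] * x_t.
    """
    p = [0.0] * (T + 1)
    k = [0.0] * T
    p[T] = q  # terminal cost q * x**2
    for t in range(T - 1, -1, -1):
        k[t] = a * b * p[t + 1] / (r + b * b * p[t + 1])
        p[t] = q + a * a * p[t + 1] - a * b * p[t + 1] * k[t]
    return p, k


def simulate_cost(a, b, q, r, k, x0):
    """Total cost of applying u_t = -k[t] * x_t starting from state x0."""
    x, total = x0, 0.0
    for k_t in k:
        u = -k_t * x
        total += q * x * x + r * u * u
        x = a * x + b * u
    return total + q * x * x  # add the terminal cost


p, k = scalar_lqr(a=1.1, b=0.5, q=1.0, r=0.1, T=20)
x0 = 2.0
# The simulated cost of the feedback policy matches the quadratic p[0] * x0**2.
print(simulate_cost(1.1, 0.5, 1.0, 0.1, k, x0), p[0] * x0 ** 2)
```

In terms of rewards, the optimal value function at time t is then -p_t x², so storing one number per time step is enough to represent it over the whole state space.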

=================================================================================