Optimal Value Function in Markov Decision Process (MDP)
Python Automation and Machine Learning for ICs: An Online Book

http://www.globalsino.com/ICs/


=================================================================================

The optimal value function in a Markov Decision Process (MDP) assigns to each state the expected cumulative reward obtained by following the optimal policy from that state onward. It is often denoted V^{*}(s), where s is a state. For a given state s, V^{*}(s) is defined as the maximum expected cumulative reward achievable by following the best possible policy from that state. Mathematically,

    V^{*}(s) = \max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \,\middle|\, S_0 = s \right]        [3663a]

where,
    π is a policy.
    E_{π} denotes the expected value under policy π.
    γ is the discount factor.
    S_{t} and A_{t} are the state and action at time t.
    R(S_{t}, A_{t}) is the immediate reward at time t.

The optimal value function satisfies the Bellman optimality equation, which is given by,

    V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]        [3663b]

where,
    a is an action.
    P(s'∣s, a) is the transition probability to the next state.
    V^{*}(s') is the optimal value of the next state.
    R(s, a, s') is the immediate reward.

In a nonstationary setting, where transition probabilities can change over time, the optimal value function must account for this variability, and the equations become time-dependent. The nonstationary Bellman expectation equation relates the value of a state under a policy π to the expected value of future rewards, using the time-varying transition probabilities:

    V_{t}^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P_{t}(s' \mid s, a) \left[ R_{t}(s, a, s') + \gamma V_{t+1}^{\pi}(s') \right]        [3663c]

where,
    t represents the current time step.
    P_{t}(s'∣s, a) is the transition probability from state s to state s' at time t given action a.
    R_{t}(s, a, s') is the immediate reward at time t.

The nonstationary Bellman optimality equation provides the corresponding recursive relationship for the optimal value function:

    V_{t}^{*}(s) = \max_{a} \sum_{s'} P_{t}(s' \mid s, a) \left[ R_{t}(s, a, s') + \gamma V_{t+1}^{*}(s') \right]        [3663d]

This equation is applied iteratively, backward in time, and the optimal value function is updated at each step.
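The Bellman optimality equation [3663b] can be solved numerically by value iteration: starting from V = 0, repeatedly apply the right-hand side until the values stop changing. Below is a minimal sketch; the 2-state, 2-action MDP (the arrays P, R and the discount gamma) is a made-up toy example, not one from this book.

```python
import numpy as np

# Toy 2-state, 2-action MDP (made-up numbers for illustration only)
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# P[a, s, s'] = transition probability P(s' | s, a); each row sums to 1
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.5, 0.5], [0.3, 0.7]],   # action 1
])
# R[a, s, s'] = immediate reward R(s, a, s')
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],   # action 0
    [[0.5, 0.5], [1.0, 0.0]],   # action 1
])

# Value iteration: V <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
V = np.zeros(n_states)
for _ in range(1000):
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    V_new = Q.max(axis=0)                  # maximize over actions
    if np.max(np.abs(V_new - V)) < 1e-10:  # converged to a fixed point
        V = V_new
        break
    V = V_new

print("V*(s) =", V)
```

Because the Bellman optimality operator is a gamma-contraction, this iteration converges to the unique fixed point V^{*} regardless of the starting values.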
The expected cumulative reward (return) from time t onward in a nonstationary Markov Decision Process (MDP) over a finite horizon T is

    G_{t} = \sum_{k=t}^{T} \gamma^{k-t} R(S_k, A_k)        [3663e]

so that the optimal value of state s at time t is

    V_{t}^{*}(s) = \mathbb{E}_{\pi}^{*}\left[ \sum_{k=t}^{T} \gamma^{k-t} R(S_k, A_k) \,\middle|\, S_t = s \right]        [3663f]

where,
    E_{π}^{∗} denotes the expected value under the optimal policy.
    T is the time horizon.
    R(S_{k}, A_{k}) is the immediate reward at time step k resulting from taking action A_{k} in state S_{k}.

A remarkable property of the linear quadratic regulator (LQR) is that, under its assumptions (linear dynamics and quadratic costs), the optimal value function of the MDP is quadratic in the state. This property makes it computationally efficient to compute the optimal value function, which is often denoted as V^{*}.
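In the finite-horizon, nonstationary case, the optimal value function can be computed by backward induction: start from the terminal condition V_{T} = 0 and apply the nonstationary Bellman optimality equation for t = T-1, ..., 0. The sketch below assumes randomly generated time-varying transitions P_t and rewards R_t (simplified to R_t(s, a), independent of the next state) purely for illustration; the sizes and values are not from this book.

```python
import numpy as np

# Made-up nonstationary finite-horizon MDP for illustration
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5
gamma = 0.95  # discount factor

# P_t[t, a, s, s'] = P_t(s' | s, a); normalize so each row is a distribution
P_t = rng.random((T, n_actions, n_states, n_states))
P_t /= P_t.sum(axis=3, keepdims=True)
# R_t[t, a, s] = R_t(s, a); rewards simplified to not depend on s'
R_t = rng.random((T, n_actions, n_states))

# V[t, s] = optimal value-to-go from state s at time t; terminal V[T] = 0
V = np.zeros((T + 1, n_states))
for t in reversed(range(T)):
    # Q[a, s] = R_t(s,a) + gamma * sum_{s'} P_t(s'|s,a) V_{t+1}(s')
    Q = R_t[t] + gamma * P_t[t] @ V[t + 1]
    V[t] = Q.max(axis=0)  # maximize over actions at time t

print("V_0*(s) =", V[0])
```

Unlike the stationary case, no fixed-point iteration is needed: a single backward sweep over t yields the exact optimal values, because each V_t depends only on V_{t+1}.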


=================================================================================  

