Finite-Horizon MDP (Markov Decision Process)

Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/  


=================================================================================

A finite-horizon MDP uses a fixed time horizon for decision-making: the agent makes decisions over a finite number of time steps, or stages. The decision process is divided into discrete time periods, and the agent's goal is typically to maximize the cumulative sum of rewards over this finite horizon.

The formulation of a finite-horizon MDP adds a time-horizon parameter T to the standard MDP tuple, giving (S, A, P, R, T). The agent takes an action at each time step, the environment responds with a transition and a reward, and the objective is to find a policy that maximizes the sum of rewards accumulated by the end of the horizon.

Solving finite-horizon MDPs often involves dynamic programming methods, such as backward induction or forward dynamic programming, which compute the optimal policy and the corresponding value function over the finite time steps (see the sketch below). The optimal policy dictates the best action to take at each time step to maximize the expected cumulative reward by the end of the time horizon.

The expected sum of rewards over time under a given policy can be written as,

          E[\sum_{t=0}^{T} R(S_{t}, a_{t})] -------------------------------- [3665a]

where E represents the expectation operator, indicating that the expression refers to an expected value; R(S_{t}, a_{t}) is the immediate reward associated with taking action a_{t} in state S_{t}; and S_{t} and a_{t} are the state and action at time step t, respectively. The sum in Equation 3665a is the total reward accumulated over the finite time horizon T.

Since the objective is to maximize this expected cumulative reward over a specified time horizon, the corresponding optimal policy π_{t}* is nonstationary: it depends on the time step t, rather than being a single stationary policy.
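As a concrete illustration of backward induction, the sketch below solves a tiny tabular finite-horizon MDP. The two-state, two-action model, its transition probabilities P, reward table R, and horizon T = 5 are all made-up numbers for demonstration, not an example from this book:

import numpy as np

# A tiny hypothetical finite-horizon MDP (S, A, P, R, T); every number
# here is illustrative only.
n_states, n_actions, T = 2, 2, 5

# P[s, a, s'] = transition probability P(s' | s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# R[s, a] = immediate reward R(s, a)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Backward induction: start from V_{T+1} = 0 after the final step and
# work backward to t = 0, recording the best action at every (t, s).
V = np.zeros((T + 2, n_states))                  # V[t][s], with V[T+1] = 0
policy = np.zeros((T + 1, n_states), dtype=int)  # nonstationary pi_t*(s)

for t in range(T, -1, -1):
    Q = R + P @ V[t + 1]              # Q[s, a] = R(s, a) + E[V_{t+1}(s')]
    policy[t] = np.argmax(Q, axis=1)  # best action at time t in each state
    V[t] = np.max(Q, axis=1)          # optimal value-to-go at time t

print("Optimal nonstationary policy pi_t*(s) (rows t = 0..T):")
print(policy)
print("Optimal expected cumulative reward from t = 0:", V[0])

Because the recursion indexes the value function and policy by t, the resulting policy is naturally time-dependent, which is exactly the nonstationary π_{t}* discussed above.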

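The expectation in Equation 3665a can also be computed exactly for a given nonstationary policy by the same backward recursion, just without the maximization over actions. The sketch below evaluates a randomly chosen policy on a randomly generated MDP; all sizes, rewards, and the policy itself are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical random tabular MDP; the values below are not from the book.
n_states, n_actions, T = 3, 2, 4
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R(s, a)

# A fixed nonstationary policy pi_t(s): one action per (t, s), t = 0..T.
pi = rng.integers(n_actions, size=(T + 1, n_states))

# Backward recursion for the policy's value, with V_{T+1} = 0:
# V_t(s) = R(s, pi_t(s)) + sum_{s'} P(s' | s, pi_t(s)) V_{t+1}(s')
V = np.zeros(n_states)
idx = np.arange(n_states)
for t in range(T, -1, -1):
    a = pi[t]                       # actions the policy picks at step t
    V = R[idx, a] + P[idx, a] @ V   # one-step reward + expected value-to-go

# V[s] now equals E[ sum_{t=0}^{T} R(S_t, a_t) | S_0 = s ] under pi.
print("Expected cumulative reward (Eq. 3665a) from each start state:", V)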

=================================================================================  

