Policy Search Algorithms - Python Automation and Machine Learning for ICs - An Online Book

Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/  


=================================================================================

Policy search algorithms are a class of reinforcement learning algorithms used in machine learning. In reinforcement learning, an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or punishments based on the actions it takes, and its goal is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time.

Policy search algorithms specifically focus on directly optimizing the policy without necessarily modeling the underlying dynamics of the environment. Instead of learning a value function that estimates the expected cumulative reward for each state or state-action pair, policy search methods aim to find an optimal policy directly. There are various approaches to policy search, and they can be broadly categorized into two types:

i) Stochastic policy search: The policy is modeled as a probability distribution over actions given a state. The parameters of the policy (such as the weights of a neural network) are iteratively updated to maximize the expected cumulative reward. Examples include REINFORCE, TRPO (Trust Region Policy Optimization), and PPO (Proximal Policy Optimization).

ii) Deterministic policy search: The policy directly maps states to specific actions without incorporating randomness. Optimization techniques such as gradient ascent are used to find the optimal deterministic policy. An example is DDPG (Deep Deterministic Policy Gradient); SAC (Soft Actor-Critic) is a closely related actor-critic method, although it in fact learns a stochastic policy.

Policy search methods are particularly useful in problems where the dynamics of the environment are complex, making it challenging to model the transition probabilities accurately. They are often applied in robotics, game playing, and other real-world scenarios where the system's dynamics are not well known or are difficult to model explicitly.
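The two policy types above can be contrasted in a short sketch for a binary action space (the state vector, parameter values, and left/right action encoding below are made-up illustration choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_policy(theta, s):
    """Stochastic policy: a probability distribution over the two actions."""
    p_right = 1.0 / (1.0 + np.exp(-np.dot(theta, s)))   # sigmoid of theta.s
    return np.array([1.0 - p_right, p_right])           # [P(left), P(right)]

def deterministic_policy(theta, s):
    """Deterministic policy: maps the state directly to one action."""
    return int(np.dot(theta, s) > 0.0)                  # 0 = left, 1 = right

theta = np.array([0.5, -0.2])   # illustration-only parameter values
s = np.array([1.0, 2.0])        # illustration-only state

probs = stochastic_policy(theta, s)   # distribution over actions
action = rng.choice(2, p=probs)       # the stochastic policy is sampled
print(probs, action, deterministic_policy(theta, s))
```

The stochastic policy must be sampled, so repeated calls in the same state can return different actions; the deterministic policy always returns the same action for the same state.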
However, policy search methods may require more data and computational resources compared to some model-based approaches.

In policy search algorithms, particularly in the realm of RL and ML, a stochastic policy parameterized by θ can be given, for a binary action space, by:

          π_{θ}(a = 1 | s) = 1/(1 + e^{-θ^{T}s}) -------------------------------- [3659a]

where,
          s represents the state of the environment.
          θ represents the parameters of the policy.
          a represents the action selected by the policy.
          π_{θ} is the policy, which assigns a probability distribution over actions given the state s, parameterized by θ.

The function on the right-hand side of Equation 3659a is the sigmoid function, which maps the dot product of θ and s through a logistic function. In Equation 3659a, the policy is stochastic, meaning it outputs a probability distribution over actions rather than a deterministic action. The sigmoid function ensures that the probabilities are between 0 and 1, and it is commonly used when the action space is binary (e.g., left or right, 0 or 1).

In RL and policy search algorithms, a stochastic policy is often represented as a function π: S×A → ℝ, where π(s, a) represents the probability of taking action a in state s. Here,
          S is the set of all possible states in the environment.
          A is the set of all possible actions that the agent can take.
          π(s, a) is the probability of taking action a in state s.

And, we have,

          Σ_{a∈A} π(s, a) = 1 -------------------------------- [3659b]

The objective is to find the policy parameters θ that maximize the expected sum of rewards that the agent receives over time when following the policy π_{θ}:

          max_{θ} E[R(s_{0}) + R(s_{1}) + ... + R(s_{T})] -------------------------------- [3659bla]

          = max_{θ} Σ_{(s_{0}, a_{0}, ..., s_{T}, a_{T})} P(s_{0}, a_{0}, ..., s_{T}, a_{T}) [R(s_{0}) + ... + R(s_{T})] -------------------------------- [3659blb]

where the probability of a trajectory factorizes into the initial-state distribution, the policy, and the state transition model:

          P(s_{0}, a_{0}, ..., s_{T}, a_{T}) = P(s_{0}) π_{θ}(s_{0}, a_{0}) P_{s_{0}a_{0}}(s_{1}) π_{θ}(s_{1}, a_{1}) ... P_{s_{T-1}a_{T-1}}(s_{T}) π_{θ}(s_{T}, a_{T}) -------------------------------- [3659blc]

so that,

          = max_{θ} Σ_{(s_{0}, a_{0}, ..., s_{T}, a_{T})} P(s_{0}) π_{θ}(s_{0}, a_{0}) P_{s_{0}a_{0}}(s_{1}) ... π_{θ}(s_{T}, a_{T}) [R(s_{0}) + ... + R(s_{T})] -------------------------------- [3659bld]

where the payoff of a trajectory is,

          Payoff = R(s_{0}) + R(s_{1}) + ... + R(s_{T}) -------------------------------- [3659bt]

What we then need to do is to use gradient ascent to maximize the term in Equation 3659bld. To do this, we compute the gradient of this expression with respect to the parameters θ and then update θ in the direction of the gradient. This expression represents the expected sum of rewards under a policy π_{θ} and a given state transition model P.
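The trajectory sum in Equation 3659bld can be checked numerically on a tiny example. The two-state, two-action MDP below is hypothetical (states {0, 1}, reward R(s) = s, actions that reach their target state with probability 0.8, start state 0, one transition step); the sketch compares the exact sum over all trajectories against a Monte Carlo estimate from sampled trajectories:

```python
import numpy as np
from itertools import product

# Hypothetical two-state, two-action MDP used only for illustration:
# R(s) = s, the agent starts in state 0, and action a moves the agent
# to state a with probability 0.8 (otherwise to the other state).
R = np.array([0.0, 1.0])
P = np.zeros((2, 2, 2))                    # P[s, a, s'] = P_{sa}(s')
for s in range(2):
    for a in range(2):
        P[s, a, a] = 0.8
        P[s, a, 1 - a] = 0.2

def pi(theta, s):
    """Sigmoid policy pi_theta(a | s) with features x(s) = [1, s]."""
    p1 = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * s)))
    return np.array([1.0 - p1, p1])

def exact_expected_payoff(theta, s0=0):
    """Sum over every trajectory (a0, s1, a1), as in Equation 3659bld."""
    total = 0.0
    for a0, s1, a1 in product(range(2), repeat=3):
        prob = pi(theta, s0)[a0] * P[s0, a0, s1] * pi(theta, s1)[a1]
        total += prob * (R[s0] + R[s1])
    return total

def mc_expected_payoff(theta, s0=0, n=20_000, seed=0):
    """Monte Carlo estimate of the same expectation from sampled trajectories."""
    rng = np.random.default_rng(seed)
    payoffs = np.empty(n)
    for i in range(n):
        a0 = rng.choice(2, p=pi(theta, s0))
        s1 = rng.choice(2, p=P[s0, a0])
        payoffs[i] = R[s0] + R[s1]
    return payoffs.mean()

theta = np.array([0.3, -0.1])   # illustration-only parameter values
exact = exact_expected_payoff(theta)
estimate = mc_expected_payoff(theta)
print(exact, estimate)          # the two values agree closely
```

Summing over all trajectories is only feasible for tiny state and action spaces; the sampled estimate is what the policy gradient methods below actually rely on.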
We can rewrite the expression inside the maximization as J(θ):

          J(θ) = Σ_{(s_{0}, a_{0}, ..., s_{T}, a_{T})} P(s_{0}) π_{θ}(s_{0}, a_{0}) P_{s_{0}a_{0}}(s_{1}) ... π_{θ}(s_{T}, a_{T}) [R(s_{0}) + ... + R(s_{T})] -------------------------------- [3659bu]

As an example of the application of a stochastic policy in policy search algorithms, suppose we have a discrete action space with two possible actions, "left" and "right," and a continuous state space. The steps of the application are:

i) Stochastic policy representation. We can represent our stochastic policy using the sigmoid function:

          π_{θ}(a = "right" | s) = 1/(1 + e^{-θ^{T}s}) -------------------------------- [3659ca]

          π_{θ}(a = "left" | s) = 1 - 1/(1 + e^{-θ^{T}s}) -------------------------------- [3659cb]

where, θ are the parameters of the policy, and s is the state.

ii) Policy search algorithm. We can consider a simple policy gradient algorithm, where we update the policy parameters θ based on the gradient of the expected reward. For instance, in the case described by Equation 3659bu, the policy gradient ∇_{θ}J(θ) can be calculated using the likelihood ratio trick and the chain rule. The gradient ascent update rule is then:

          θ ← θ + α∇_{θ}J(θ) -------------------------------- [3659d]

where, J(θ) represents the expected cumulative reward, and α is the learning rate.

iii) Policy evaluation. The agent interacts with the environment using the current policy to collect trajectories (sequences of states and actions). For each state-action pair in a trajectory, we compute the log probability of the selected action according to the policy:

          log π_{θ}(a_{t} | s_{t}) -------------------------------- [3659e]

iv) Policy update. The policy parameters are then updated based on the gradient of the expected reward:

          θ ← θ + α Σ_{t} ∇_{θ} log π_{θ}(a_{t} | s_{t}) R_{t} -------------------------------- [3659f]

where R_{t} is the reward at time step t, and the sum is taken over the entire trajectory. For the case in Equation 3659bu, we can compute the gradient of J(θ) with respect to the policy parameters θ, and then use the likelihood ratio trick to express the gradient in terms of the log probability of the actions taken under the policy:

          ∇_{θ}J(θ) = E[(Σ_{t} ∇_{θ} log π_{θ}(a_{t} | s_{t})) (R(s_{0}) + ... + R(s_{T}))] -------------------------------- [3659g]

Then, we can do:

a) Generate trajectories (s_{0}, a_{0}, s_{1}, a_{1}, ..., s_{T}, a_{T}) by interacting with the environment under the policy π_{θ}.

b) Compute the gradient estimate: approximate the expectation in the gradient using the sampled trajectories.
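The likelihood ratio trick used in step iv) can be verified on a hypothetical one-state, single-step example (not from the text): with reward 1 for action 1 and 0 for action 0, J(θ) = π_{θ}(a = 1) = sigmoid(θ) exactly, so the sampled likelihood-ratio estimate of the gradient can be compared against the analytic derivative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-state, two-action example: reward 1 for action 1 and 0
# for action 0, so J(theta) = pi_theta(a = 1) = sigmoid(theta) exactly.
theta = 0.4
p1 = sigmoid(theta)

def grad_log_pi(a, theta):
    """d/dtheta of log pi_theta(a) for the scalar sigmoid policy."""
    p = sigmoid(theta)
    return (1.0 - p) if a == 1 else -p

# Likelihood-ratio (REINFORCE) estimate of dJ/dtheta: average
# grad-log-prob times reward over sampled actions.
n = 100_000
actions = (rng.random(n) < p1).astype(int)   # sample a ~ pi_theta
rewards = actions.astype(float)              # R = 1 iff a == 1
grads = np.array([grad_log_pi(a, theta) for a in actions])
lr_estimate = float(np.mean(grads * rewards))

analytic = p1 * (1.0 - p1)   # exact dJ/dtheta = sigmoid'(theta)
print(lr_estimate, analytic)
```

The estimate converges to the analytic value as n grows; this is exactly why the sampled trajectories in steps a) and b) give an unbiased gradient without requiring the transition model P.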
c) Update policy parameters: update the policy parameters θ using the gradient ascent update rule in Formula 3659d.

v) Repeat. The process is repeated, and the policy parameters are updated iteratively based on collected trajectories until the policy converges to an optimal or near-optimal solution.

In programming, what we will do is:

Loop {
    Sample a trajectory (s_{0}, a_{0}, s_{1}, a_{1}, ..., s_{T}, a_{T})
    Compute the payoff with Equation 3659bt
    Perform an update using Formula 3659d
}
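The loop above can be sketched end to end as a minimal REINFORCE implementation. The two-state environment, the learning rate α = 0.05, and the 3000 episodes below are made-up illustration choices, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical episodic task for illustration: the agent starts in state 0,
# action a moves it to state a with probability 0.8, the reward on entering
# state s is R(s) = s, and each episode lasts T = 2 steps.
def step(s, a):
    s_next = a if rng.random() < 0.8 else 1 - a
    return s_next, float(s_next)

def policy_probs(theta, s):
    """pi_theta(a | s) with features x(s) = [1, s] (sigmoid policy)."""
    p1 = sigmoid(theta[0] + theta[1] * s)
    return np.array([1.0 - p1, p1])

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) for the sigmoid policy."""
    p1 = sigmoid(theta[0] + theta[1] * s)
    x = np.array([1.0, float(s)])
    return x * ((1.0 - p1) if a == 1 else -p1)

def run_episode(theta, T=2):
    s, traj, payoff = 0, [], 0.0
    for _ in range(T):
        a = rng.choice(2, p=policy_probs(theta, s))
        traj.append((s, a))
        s, r = step(s, a)
        payoff += r                      # payoff as in Equation 3659bt
    return traj, payoff

# Loop { sample a trajectory; compute its payoff; gradient-ascent update }
theta = np.zeros(2)
alpha = 0.05                             # learning rate
for _ in range(3000):
    traj, payoff = run_episode(theta)
    grad = sum(grad_log_pi(theta, s, a) for s, a in traj) * payoff
    theta += alpha * grad                # update rule, Formula 3659d

# The trained policy should strongly prefer action 1 ("move to state 1").
print(policy_probs(theta, 0)[1], policy_probs(theta, 1)[1])
```

This plain REINFORCE estimator is high-variance; practical implementations usually subtract a baseline from the payoff and average the gradient over a batch of trajectories, but the loop structure is the same.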


=================================================================================  

