Reinforcement Learning - Python for Integrated Circuits - An Online Book
=================================================================================
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent's goal is to learn a policy, or strategy, that maximizes a cumulative reward signal over time. The learning process involves the agent taking actions in the environment, receiving feedback in the form of rewards, and adjusting its policy to improve its performance. By exploring different actions, the agent learns which ones lead to favorable outcomes and higher rewards.

There are different approaches to solving reinforcement learning problems, including value-based methods, policy-based methods, and combinations of the two. Common algorithms include Q-learning, Deep Q-Networks (DQN), policy gradient methods, and actor-critic methods.

Reinforcement learning often uses the framework of Markov Decision Processes (MDPs) to formalize and solve decision-making problems. An MDP is a mathematical model of a decision process in a stochastic (probabilistic) environment, and it is a key framework for studying problems in which an agent makes sequential decisions to maximize cumulative reward. The agent's objective in an MDP is to find an optimal policy (π*) that maximizes the expected cumulative reward over time. Reinforcement learning algorithms operate within the MDP framework to learn such optimal policies through interaction with the environment, using methods such as Q-learning, policy gradient methods, and actor-critic architectures to update the agent's policy based on observed rewards and transitions.

In RL, we must decide whether to optimize the optimal policy (π*) directly or the optimal value function (V*). This question arises from the need to design effective and efficient RL solutions in the face of inherent trade-offs, including model complexity and expressiveness, computational efficiency, sample complexity, task characteristics (problem structure and the exploration-exploitation balance), and algorithmic considerations (e.g., algorithm suitability).

Rewards in reinforcement learning can be negative, zero, or positive. These rewards allow the agent to learn from the consequences of its actions in an environment: they serve as a signal that guides the agent toward behaviors leading to desirable outcomes and away from behaviors leading to undesirable outcomes. Negative rewards penalize undesirable actions, zero rewards provide neutral feedback, and positive rewards reinforce desirable behavior.
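As a concrete illustration of the MDP ingredients described above (states, actions, transition probabilities, rewards, and a discount factor), the following minimal Python sketch defines a toy MDP. The state and action names, the transition table, and the reward values are hypothetical and chosen only for illustration.

    # A minimal, hypothetical MDP specification: states, actions,
    # transition probabilities P(s'|s,a), rewards R(s), and discount factor gamma.
    states = ["s1", "s2", "s3"]
    actions = ["left", "right"]

    # P[(s, a)] is a dict mapping next state s' -> probability P(s'|s,a).
    P = {
        ("s1", "left"):  {"s1": 0.8, "s2": 0.2},
        ("s1", "right"): {"s2": 0.9, "s3": 0.1},
        ("s2", "left"):  {"s1": 1.0},
        ("s2", "right"): {"s3": 1.0},
        ("s3", "left"):  {"s3": 1.0},
        ("s3", "right"): {"s3": 1.0},
    }

    # R[s] is the immediate reward received in state s (negative, zero, or positive).
    R = {"s1": 0.0, "s2": -1.0, "s3": +1.0}

    gamma = 0.9  # discount factor, between 0 and 1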
Table 4321 lists the relationship between reinforcement learning and Markov decision processes.

Table 4321. Relationship between reinforcement learning and Markov decision processes.
The choice of reward function for playing chess, or any other reinforcement learning problem, depends on the specific objectives and requirements of the task. Note that the credit assignment problem is a significant challenge in reinforcement learning. Some example reinforcement learning problems are:
1) Flying a helicopter using ML can be formulated as a reinforcement learning problem (see page3703).
2) Playing chess can involve reinforcement learning, but it is not the only approach used for chess-playing programs. Chess engines typically employ a combination of different techniques, and reinforcement learning is just one possible component. A reward function of +1 for a win, -1 for a loss, and 0 for a tie is a common and reasonable choice for many chess-playing algorithms.

The value function associated with a policy π in reinforcement learning is defined by,

         V^π(s) = E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯ | s_0 = s, π ] ------------------------ [4321a]

where,
         V^π(s) is the expected cumulative discounted reward when starting in state s and following policy π,
         R(s_t) is the reward received in the state visited at time step t,
         γ is the discount factor, and
         the expectation is taken over the sequence of states visited when actions are chosen according to π.
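To make Equation 4321a concrete, the short sketch below estimates V^π(s) by averaging discounted returns over simulated episodes. The environment step function, the policy, the episode count, and the horizon are hypothetical placeholders; this is only a sketch of the definition, not a specific library API.

    def discounted_return(rewards, gamma=0.9):
        """Compute R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ... for one episode."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def estimate_value(start_state, policy, env_step, gamma=0.9,
                       n_episodes=1000, horizon=50):
        """Monte Carlo estimate of V^pi(start_state): average of discounted returns."""
        total = 0.0
        for _ in range(n_episodes):
            s, rewards = start_state, []
            for _ in range(horizon):
                a = policy(s)            # sample an action from the policy
                s, r = env_step(s, a)    # environment returns next state and reward
                rewards.append(r)
            total += discounted_return(rewards, gamma)
        return total / n_episodes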
Bellman's equation is a fundamental concept in reinforcement learning and dynamic programming. It provides a recursive relationship for the value function, connecting the value of a state to the values of its neighboring states. There are two primary forms of Bellman's equation, named after Richard E. Bellman, who made significant contributions to the development of dynamic programming and reinforcement learning:

i) Bellman Expectation Equation

         V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ V^π(s') ] -------------- [4321b]

where,
         p(s', r | s, a) = Pr( S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a ) -------------- [4321bf]

where,
         A_t represents the action taken by the agent at time t,
         S_t represents the state of the environment at time t,
         S_{t+1} represents the next state of the environment at time t+1,
         R_{t+1} represents the immediate reward received at time t+1,

         π(a|s) = Pr( A_t = a | S_t = s ) -------------- [4321bg]

where,
         π(a|s) is the probability of taking action a when the environment is in state s according to policy π.

Equation 4321b represents a linear system of equations in terms of the state values V^π(s). Assume a finite set of states {s_1, s_2, ..., s_n}. For each state s_i, we can then set up a linear equation using the Bellman Expectation Equation. This results in a system of linear equations, where each equation corresponds to a different state, and the unknowns are the values V^π(s_i). For instance, if we have three states s_1, s_2, s_3, the system of linear equations is,

         V^π(s_1) = Σ_a π(a|s_1) Σ_{s',r} p(s', r | s_1, a) [ r + γ V^π(s') ] -------------- [4321bgl]
         V^π(s_2) = Σ_a π(a|s_2) Σ_{s',r} p(s', r | s_2, a) [ r + γ V^π(s') ] -------------- [4321bgm]
         V^π(s_3) = Σ_a π(a|s_3) Σ_{s',r} p(s', r | s_3, a) [ r + γ V^π(s') ] -------------- [4321bgn]

where s' ranges over {s_1, s_2, s_3}. We can represent these equations in matrix form Ax = b, where A is the matrix of coefficients, x is the vector of unknowns (V^π(s_1), V^π(s_2), V^π(s_3)), and b is the vector on the right-hand side. Solving this system of linear equations gives the values of V^π(s_i) for each state s_i.

By considering a specific initial state s_0 and assuming that the probability of transitioning to the next state s' is determined solely by the policy, the Bellman Expectation Equation 4321b can be simplified to,

         V^π(s_0) = R(s_0) + γ Σ_{s'} P^π(s' | s_0) V^π(s') -------------- [4321bgo]

In Equation 4321b, if the policy is deterministic, meaning that for each state s there is only one action with a probability of 1, the inner sum over actions in the Bellman equation collapses, simplifying the expression. Equation 4321bgo also assumes a specific initial state s_0: if the system has a single initial state, then s_0 is fixed, and the transition probabilities P^π(s' | s_0) represent the probability of transitioning from state s_0 to s' under policy π. Equation 4321bgo further assumes that the environment has stationary dynamics, meaning that the transition probabilities do not change over time, which is a common assumption in many reinforcement learning models. On the other hand, the transition probabilities p(s', r | s, a) in Equation 4321b depend only on the current state s and action a, not on the entire history of states and actions, while P^π(s' | s) represents the probability of transitioning to state s' from the current state s under policy π.

An example of the steps taken in reinforcement learning is Grid World navigation: the action taken at state s is determined by the policy π(a|s), as shown in Equation 4321bg. In reinforcement learning, a policy defines the agent's strategy, or set of rules, for making decisions in the environment. The policy specifies a probability distribution over actions given the current state, and the actual action taken at state s is sampled from this distribution. If, for example, π(up|s) = 0.6, π(down|s) = 0.3, and π(left|s) = 0.1, the agent takes the "up" action with 60% probability, the "down" action with 30% probability, and the "left" action with 10% probability. The choice of action is stochastic and influenced by the policy.

Equation 4321b expresses the expected value of a state s under a policy π. It states that the value of a state is the sum of the expected immediate reward and the discounted expected value of the next state, considering all possible actions and next states according to the policy. The transition probabilities p(s', r | s, a) capture the likelihood of transitioning from state s to state s' with reward r under action a. The expected immediate reward refers to the average or expected value of the immediate reward that an agent receives upon taking a specific action in a particular state.
Mathematically, it can be expressed using the expected value operator E[·]. The expected immediate reward for taking action a in state s is denoted r(s, a), and it is computed by summing over all possible immediate rewards r with their corresponding probabilities:

         r(s, a) = E[ R_{t+1} | S_t = s, A_t = a ] = Σ_r r · p(r | s, a) -------------- [4321bh]

where,
         p(r | s, a) is the probability of receiving immediate reward r after taking action a in state s.
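The linear-system view of Equation 4321b (Equations 4321bgl through 4321bgn) can be implemented directly. The sketch below evaluates a fixed policy on a small, hypothetical three-state MDP by solving (I - γP^π)v = r^π with NumPy; the transition matrix, rewards, and discount value are made-up illustrative numbers.

    import numpy as np

    # Hypothetical 3-state MDP under a fixed policy pi:
    # P_pi[i, j] = probability of moving from state s_i to s_j when following pi.
    P_pi = np.array([[0.5, 0.5, 0.0],
                     [0.2, 0.3, 0.5],
                     [0.0, 0.0, 1.0]])

    # r_pi[i] = expected immediate reward in state s_i under pi.
    r_pi = np.array([0.0, -1.0, 1.0])

    gamma = 0.9

    # Bellman expectation in matrix form: v = r_pi + gamma * P_pi @ v,
    # rearranged as a linear system: (I - gamma * P_pi) v = r_pi.
    v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print("V^pi(s1), V^pi(s2), V^pi(s3) =", v)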
ii) Bellman Optimality Equation

         V*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ V*(s') ] -------------- [4321c]

This equation defines the optimal value function V*(s), representing the maximum expected cumulative reward achievable from state s when following an optimal policy. It states that the optimal value of a state is the maximum, over all possible actions, of the sum of the expected immediate reward and the discounted expected value of the next state. Equation 4321c can be approximately written as the equation below under certain conditions,

         V*(s) ≈ max_a Q*(s, a) -------------------------------------------------- [4321d]

The approximation conditions are:
i) Optimal Policy Existence: an optimal policy π* exists for the MDP.
ii) Deterministic Policy: π* can be taken to be deterministic, selecting a single action in each state.
iii) Greedy Policy: π* acts greedily with respect to the optimal value function V*.
Under these conditions, the optimal value function V*(s) represents the maximum expected cumulative reward achievable from state s, and the action that achieves this maximum is the one prescribed by the optimal policy π*. Therefore, the maximization operation over policies (max_π) in the Bellman Optimality Equation can be approximated by considering the greedy policy with respect to the current optimal value function V*.

If the immediate reward is included, we can then consider the case where the optimal policy is deterministic and chooses the action that maximizes the expected cumulative reward. Under this condition, the equation becomes,

         V*(s) = R(s) + max_a γ Σ_{s'} P_sa(s') V*(s') ------------------------------------ [4321e]
         V*(s) = R(s) + γ max_a E_{s'~P_sa}[ V*(s') ] ------------------------------------ [4321el]

Here, P_sa(s') represents the probability of transitioning from state s to state s' under action a. The second term on the right-hand side of Equation 4321el is the maximum, over all possible actions a, of the expected value of the next state s', where s' is sampled from the transition probability distribution P_sa. The term E_{s'~P_sa}[V*(s')] in Equation 4321el can be written as Q(a),

         V*(s) = R(s) + γ max_a Q(a) ------------------------------------ [4321em]

The optimal policy is the policy that maximizes the expected cumulative reward for each state. In the case of a deterministic policy, π*(s) is the action selected to achieve the maximum expected value. The equation for a deterministic optimal policy is,

         π*(s) = argmax_a Q*(s, a) ------------------------------------ [4321f]

where Q*(s, a) is the optimal action-value function. The optimal policy π* selects, for each state s, the action that maximizes the expected cumulative reward, according to the optimal action-value function Q*(s, a).
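The greedy-action relationship in Equations 4321el through 4321f can be sketched in a few lines of Python. The transition table, value estimates, and rewards below are hypothetical; the point is only to show how Q(a) = E_{s'~P_sa}[V*(s')] and π*(s) = argmax_a Q*(s, a) are computed once P_sa and V* are available.

    # Hypothetical transition model P, value estimates V, and rewards R for three states.
    P = {
        ("s1", "up"):   {"s2": 1.0},
        ("s1", "down"): {"s3": 1.0},
        ("s2", "up"):   {"s2": 1.0},
        ("s2", "down"): {"s3": 1.0},
        ("s3", "up"):   {"s3": 1.0},
        ("s3", "down"): {"s3": 1.0},
    }
    V = {"s1": 0.2, "s2": 0.5, "s3": 1.0}
    R = {"s1": 0.0, "s2": 0.0, "s3": 1.0}
    actions = ["up", "down"]

    def greedy_action(s, actions, P, V, R, gamma=0.9):
        """pi*(s) = argmax_a [ R(s) + gamma * sum_s' P_sa(s') V(s') ]."""
        def q(a):
            # Q(a) = E_{s' ~ P_sa}[ V(s') ]
            return sum(prob * V[s_next] for s_next, prob in P[(s, a)].items())
        return max(actions, key=lambda a: R[s] + gamma * q(a))

    print(greedy_action("s1", actions, P, V, R))  # -> "down" with these made-up numbers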
Equations 4321c and 4321e underlie fitted value iteration (VI), which gives an approximation to V* that implicitly defines π* through Equation 4321f; by choosing a set of features and learning a value function that approximates the expected payoff of, for example, a robot starting in different states, we can then obtain reasonable controllers. In terms of the transition probabilities P_sa and the optimal value function V*, π*(s) can be given by,

         π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s') ------------------------------------ [4321g]

For all states s and all policies π, we have,

         V*(s) = V^{π*}(s) ≥ V^π(s) ------------------------------------ [4321h]

where,
         V*(s) is the optimal value function, representing the maximum expected cumulative reward achievable from state s under the optimal policy,
         V^π(s) is the value function for any policy π, representing the expected cumulative reward achievable from state s under policy π, and
         V^{π*}(s) is the value function of the optimal policy π*.
Equation 4321h shows that the optimal value function is defined to be greater than or equal to the value function under any policy, for all states s.

Finding the optimal policy in reinforcement learning involves exploring and evaluating different policies to determine which one maximizes the expected cumulative reward over time. The strategy for finding the optimal policy is:

i) Find V*. The first step is to find the optimal state-value function V*, which represents the maximum expected cumulative reward for each state under the optimal policy. Value iteration is a dynamic programming algorithm that can be used to find the optimal state-value function in a reinforcement learning problem; it is specifically designed for solving MDPs when the transition probabilities and rewards are known. The basic idea behind value iteration is to iteratively update the value function until it converges to the optimal values. The algorithm repeatedly applies the Bellman Optimality Equation, which relates the value of a state to the values of its neighboring states under the optimal policy. The update rule for value iteration is given by,

         V(s) := R(s) + max_a γ Σ_{s'} P_sa(s') V(s') ------------------------------------ [4321i]

where R(s) is the immediate reward in state s, γ is the discount factor, and P_sa(s') is the probability of transitioning to s' after taking action a in state s.
Here, we initialize V(s) := 0 for every state s. The update is applied for each state in the state space, and the algorithm continues iteratively until the change in the value function becomes very small, indicating convergence. At this point, the value function has been approximated, and the optimal policy can be derived from it by selecting actions that maximize the expected cumulative reward. Two different updates can be used in this step:
a) Synchronous update: all states are updated using the value estimates from the previous iteration, and the new values replace the old ones only after a full sweep over the state space.
b) Asynchronous update: states are updated one at a time, and each update immediately uses the most recently computed values of the other states.
Both synchronous and asynchronous updates are valid approaches, and the choice between them may depend on factors such as computational efficiency, memory requirements, and the specific characteristics of the problem. Synchronous updates are simpler to implement and ensure a consistent update across all states, while asynchronous updates can sometimes speed up convergence by prioritizing certain states.

ii) Use the argmax Equation to Find π*. Once V* is obtained, the optimal policy can be derived by selecting actions that maximize the expected cumulative reward in each state. The argmax operation (e.g., Equation 4321f or 4321g) is applied to the action set in each state with respect to the optimal state-value function.

Value iteration is an iterative algorithm used in reinforcement learning to find the optimal state-value function for an MDP. However, due to practical considerations and the nature of the algorithm, it typically does not reach the exact value of V*: iteration is stopped once the change in the value function falls below a small tolerance, so the result is an approximation.
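As an illustration of the update rule in Equation 4321i and the argmax step in Equation 4321g, the sketch below runs synchronous value iteration on a small, hypothetical MDP in Python; the states, transition probabilities, rewards, and stopping tolerance are made-up illustrative choices.

    import numpy as np

    # Hypothetical MDP: 3 states, 2 actions.
    # P[a][i, j] = probability of moving from state i to state j under action a.
    P = {
        0: np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.8, 0.1],
                     [0.0, 0.0, 1.0]]),
        1: np.array([[0.2, 0.8, 0.0],
                     [0.0, 0.2, 0.8],
                     [0.0, 0.0, 1.0]]),
    }
    R = np.array([0.0, 0.0, 1.0])   # immediate reward R(s)
    gamma, tol = 0.9, 1e-6

    V = np.zeros(3)                  # initialize V(s) := 0 for every state s
    while True:
        # Synchronous Bellman backup: V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s')
        V_new = R + gamma * np.max(np.stack([P[a] @ V for a in P]), axis=0)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the change is very small
            V = V_new
            break
        V = V_new

    # Greedy policy: pi*(s) = argmax_a sum_s' P_sa(s') V*(s')   (Equation 4321g)
    pi_star = np.argmax(np.stack([P[a] @ V for a in P]), axis=0)
    print("V* =", V, "pi* =", pi_star)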
In a reinforcement learning problem, if we are dealing with an environment where the outcomes of actions are uncertain or not precisely known, the transition probabilities cannot be known. In these cases, model-free methods, such as Q-learning and Monte Carlo methods (see page3676), are advantageous because they do not require access to the dynamics of the environment: they rely on the observed data (state-action pairs and associated rewards) to iteratively improve their estimates of the optimal policy or value function. Reinforcement learning (RL) algorithms typically involve a series of steps to learn an optimal policy through interaction with an environment, as shown in Figure 4321a.
Figure 4321a. Basic RL model. [2]

While the specific details may vary depending on the algorithm, a typical RL algorithm involves a sequence of common steps. In model-based RL, for example, the system dynamics are first modeled; a simple linear model of the state transition is,
         s_{t+1} = A s_t + B a_t ------------------------------ [4321kl]

Here, A is an ℝ^{n×n} matrix and B is an ℝ^{n×d} matrix (ℝ^{n×1} when the action is a scalar). Equation 4321kl actually works well for helicopters if they are not flying too fast. Equation 4321kl is deterministic, meaning that with this model the next state s_{t+1} is entirely determined by the current state s_t and the action a_t. In a stochastic model, the next state is not entirely determined by the current state and action; instead, there is an element of randomness involved, which can be represented using probability distributions. A common way to model a stochastic transition in reinforcement learning is to use a probability distribution over possible next states. For instance, the probability distribution of the next state s_{t+1} given the current state s_t and action a_t, together with a noisy linear model, can be written as,

         P(s_{t+1} | s_t, a_t) -------------------- [4321klk]
         s_{t+1} = A s_t + B a_t + ε_t ------------------------------ [4321klkl]

where ε_t is used to introduce randomness (e.g., noise), either in the stochasticity of the environment or, with a different meaning, in the agent's exploration strategy. In ε-greedy exploration, ε_t is a parameter that determines the probability of choosing a random action (exploration) as opposed to the action with the maximum estimated value (exploitation); setting ε_t = 0 corresponds to a completely greedy strategy in which the agent always exploits its current knowledge and never explores randomly. This strategy is known as "greedy" because the agent always chooses the action it currently believes to be best according to its learned values. In many applications, once the simulator is deployed, we can get rid of the noise by setting ε_t to zero. In a simple case, the noise might follow a Gaussian distribution,

         ε_t ~ N(0, Σ) -------------------- [4321klkm]

To fit the model, we find matrices A and B that minimize the discrepancy between the predicted next state A s_t + B a_t and the actual next state s_{t+1} across multiple samples i and time steps t,

         min_{A,B} Σ_i Σ_t || s_{t+1}^{(i)} - (A s_t^{(i)} + B a_t^{(i)}) ||² -------------------- [4321kll]

where,
         s_t^{(i)} is the state at time t for the i-th sample,
         A is a matrix representing the transformation of the current state,
         B a_t^{(i)} depends on the action a_t^{(i)} taken at time t for the i-th sample, and
         s_{t+1}^{(i)} is the actual next state at time t+1 for the i-th sample.

The RL above is called model-based RL, in which we can have either a linear model or a non-linear model; for a non-linear model, we can apply linearization in the RL process. On the other hand, there is another kind of RL, called model-free RL. In programming, we can have the loop below:

{
  For i = 1, 2, ..., m
    For each action a ∈ A
      Sample s'_1, s'_2, ..., s'_k
      Set
         q(a) = (1/k) Σ_{j=1}^{k} [ R(s^{(i)}) + γ V(s'_j) ] ----------------------------- [4321l]
}
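Equation 4321kll is an ordinary least-squares problem, so A and B can be estimated from logged transitions with a single linear solve. The sketch below stacks [s_t, a_t] into one regressor and fits the combined matrix [A B] with NumPy; the trajectory data are randomly generated placeholders, not real helicopter logs.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, T = 4, 2, 500          # state dim, action dim, number of transitions

    # Placeholder logged data: states s_t, actions a_t, next states s_{t+1}.
    S = rng.normal(size=(T, n))
    Act = rng.normal(size=(T, d))
    A_true = 0.9 * np.eye(n)
    B_true = rng.normal(size=(n, d))
    S_next = S @ A_true.T + Act @ B_true.T + 0.01 * rng.normal(size=(T, n))

    # Solve min_{A,B} sum_t || s_{t+1} - (A s_t + B a_t) ||^2 as least squares:
    # stack regressors X = [s_t, a_t] and fit W = [A B]^T.
    X = np.hstack([S, Act])                      # shape (T, n + d)
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    A_hat, B_hat = W[:n].T, W[n:].T              # recover A (n x n) and B (n x d)
    print(np.round(A_hat, 2))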
These steps are common in various RL algorithms, including model-free methods like Q-learning, policy gradient methods, and model-based methods. Depending on the specific algorithm, there might be additional steps or variations in their sequence. Model-free methods, like Q-learning, often directly estimate the optimal policy and value function, while model-based methods may involve estimating the dynamics of the environment. Note that the optimal value in reinforcement learning, V*(s) = max_π V^π(s), is usually found through an iterative process of taking actions, receiving feedback, and adjusting the agent's strategy. This involves learning from the consequences of actions to improve decision-making over time, so the optimization step is spread across the entire learning process rather than being confined to a specific step.

When the dimension of the state space is high, it is better to approximate V* directly, without using discretization. This involves employing function approximation methods, with neural networks being a popular choice due to their ability to handle high-dimensional input spaces:
i) Define the Neural Network Architecture: Choose a neural network architecture that suits the problem at hand. For value function approximation, a common choice is a feedforward neural network. The input layer should match the dimensionality of the state space, and the output layer should have a single output for the estimated value. For instance, linear regression can be applied to approximate V*,

         θ := argmin_θ (1/2) Σ_i ( θ^T φ(s^{(i)}) - y^{(i)} )² ----------------------------- [4321n]

Here, y^{(i)} ≈ V*(s^{(i)}). In this step, we need to choose the features φ(s) of state s in order to approximate V* using a function V(s) = θ^T φ(s).
ii) Collect Data: Collect a dataset of state-action pairs along with their associated rewards and next states. This dataset is used to train the neural network to approximate the optimal value function.
iii) Loss Function: Define a suitable loss function to train the neural network. For value function approximation, a common choice is the mean squared error (MSE) loss, which measures the squared difference between the predicted values and the true values.
iv) Optimization: Use an optimization algorithm, such as stochastic gradient descent (SGD) or more advanced variants like Adam, to minimize the loss function. This involves adjusting the weights of the neural network to improve its ability to predict the optimal values.
v) Training: Iterate through the dataset multiple times (epochs) to train the neural network. During each iteration, randomly sample batches from the dataset to update the network weights.
vi) Exploration and Exploitation: Depending on the reinforcement learning algorithm being used, incorporate a strategy for exploration and exploitation. Techniques such as epsilon-greedy exploration or more sophisticated exploration strategies can help balance the need to explore new states and exploit current knowledge.
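The regression step in Equation 4321n can be sketched as ordinary least squares on hand-chosen features. In the sketch below, the feature map φ(s), the sampled states, and the targets y^{(i)} (which would normally come from a backup such as Equation 4321l) are all hypothetical placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def phi(s):
        """Hypothetical feature map phi(s) for a 2-dimensional state s."""
        return np.array([1.0, s[0], s[1], s[0] * s[1]])

    # Placeholder training set: sampled states s^(i) and targets y^(i) ~ V*(s^(i)).
    states = rng.uniform(-1, 1, size=(200, 2))
    targets = 1.0 + 0.5 * states[:, 0] - 0.3 * states[:, 1] + 0.1 * rng.normal(size=200)

    # theta := argmin_theta (1/2) * sum_i (theta^T phi(s^(i)) - y^(i))^2   (Equation 4321n)
    Phi = np.stack([phi(s) for s in states])
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

    def V_hat(s):
        """Approximate value function V(s) = theta^T phi(s)."""
        return theta @ phi(s)

    print(V_hat(np.array([0.2, -0.4])))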
Main symbols and notations used in RL are:
         a: Action, the decision or move taken by the agent in a given state.
         Fitted VI: Fitted value iteration.
         k: Often used as an iteration index or counter, e.g., the current iteration or time step in an iterative algorithm, the current episode number, the number of interactions the agent has had with the environment, the index of a specific batch or subset of data used for training, or a specific setting or configuration (such as a hyperparameter value in a hyperparameter search or cross-validation).
         P: Transition model, the probability distribution over next states given the current state and action.
         P(s'|s, a): Transition probability, the probability of transitioning to state s' from state s by taking action a.
         Q(s, a): Action-value function, the expected cumulative reward from being in state s, taking action a, and following a particular policy.
         Q*(s, a): Optimal action-value function, the maximum expected cumulative reward achievable from being in state s, taking action a, and thereafter following the optimal policy.
         r: Immediate reward, the numerical value received by the agent as feedback for taking an action in a specific state.
         R: Return, the total cumulative reward obtained by the agent over a sequence of time steps.
         s: State, the current situation or configuration of the environment.
         S_t, A_t: State and action at time step t in a trajectory.
         T: Time step, the length of a trajectory, or the final horizon.
         V(s): State-value function, the expected cumulative reward from being in state s and following a particular policy.
         V*(s): Optimal state-value function, the maximum expected cumulative reward achievable from state s onward when following the optimal policy.
         V^: The approximation to the optimal state-value function.
         ε: Exploration rate, a parameter used in epsilon-greedy strategies to balance exploration and exploitation.
         γ: Discount factor, a parameter between 0 and 1 that determines the importance of future rewards.
         π: Policy, a mapping from states to probabilities of selecting each possible action.
         π_t*: Optimal non-stationary policy (at time step t).
         π*: Optimal stationary policy.
         π(a|s): Probability of taking action a in state s according to policy π.
         θ: Parameters of the function approximator.
         τ: Trajectory, a sequence of states and actions experienced by the agent during interaction with the environment.

RL algorithms are often not efficient, and they may require a significant number of iterations (e.g., 10 million iterations) to learn effective policies for a task. The efficiency can also vary widely depending on several factors, including the complexity of the task, the quality of the reward signal, the choice of algorithm, and the characteristics of the environment. This poor efficiency is a disadvantage of RL. Examples of reinforcement learning applications include the problems discussed above, such as flying a helicopter and playing chess.
============================================
[1] Daniel Kurian, Fereshteh Sattari, Lianne Lefsrud, and Yongsheng Ma, Using machine learning and keyword analysis to analyze incidents and reduce risk in oil sands operations, Safety Science, 130 (2020), 104873.
[2] https://doi.org/10.1371/journal.pone.0217408.g002
=================================================================================