Reinforcement Learning - Python for Integrated Circuits - An Online Book
=================================================================================
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent's goal is to learn a policy, or strategy, that maximizes a cumulative reward signal over time. The learning process involves the agent taking actions in the environment, receiving feedback in the form of rewards, and adjusting its policy to improve its performance. By exploring different actions, the agent learns which ones lead to favorable outcomes and higher rewards.

There are different approaches to solving reinforcement learning problems, including value-based methods, policy-based methods, and combinations of the two. Common algorithms include Q-learning, Deep Q-Networks (DQN), policy gradient methods, and actor-critic methods.

Reinforcement learning often uses the framework of Markov Decision Processes (MDPs) to formalize and solve decision-making problems. An MDP is a mathematical model of a decision process in a stochastic (probabilistic) environment, and it is a key framework for studying problems in which an agent makes sequential decisions to maximize cumulative reward. The agent's objective in an MDP is to find an optimal policy (π*) that maximizes the expected cumulative reward over time. Reinforcement learning algorithms operate within the MDP framework to learn such optimal policies through interaction with the environment, using methods such as Q-learning, policy gradient methods, and actor-critic architectures to update the agent's policy based on observed rewards and transitions.

In RL, we must decide whether to optimize the optimal policy (π*) directly or the optimal value function (V*). This question arises from the need to design effective and efficient RL solutions in the face of inherent trade-offs, including model complexity and expressiveness, computational efficiency, sample complexity, task characteristics (problem structure and the exploration-exploitation balance), and algorithmic considerations (e.g., algorithm suitability).

Rewards in reinforcement learning can be negative, zero, or positive. These rewards allow the agent to learn from the consequences of its actions in an environment: they serve as a signal that guides the agent toward behaviors leading to desirable outcomes and away from behaviors leading to undesirable outcomes. Negative rewards penalize undesirable actions, zero rewards provide neutral feedback, and positive rewards reinforce desirable behavior.
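As a concrete illustration of the MDP ingredients described above (states, actions, transition probabilities, rewards, and a discount factor), the following minimal Python sketch defines a toy MDP. The state and action names, the transition table, and the reward values are hypothetical and chosen only for illustration.

    # A minimal, hypothetical MDP specification: states, actions,
    # transition probabilities P(s'|s,a), rewards R(s), and discount factor gamma.
    states = ["s1", "s2", "s3"]
    actions = ["left", "right"]

    # P[(s, a)] is a dict mapping next state s' -> probability P(s'|s,a).
    P = {
        ("s1", "left"):  {"s1": 0.8, "s2": 0.2},
        ("s1", "right"): {"s2": 0.9, "s3": 0.1},
        ("s2", "left"):  {"s1": 1.0},
        ("s2", "right"): {"s3": 1.0},
        ("s3", "left"):  {"s3": 1.0},
        ("s3", "right"): {"s3": 1.0},
    }

    # R[s] is the immediate reward received in state s (negative, zero, or positive).
    R = {"s1": 0.0, "s2": -1.0, "s3": +1.0}

    gamma = 0.9  # discount factor, between 0 and 1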
Table 4321 lists the relationship between reinforcement learning and Markov decision processes.

Table 4321. Relationship between reinforcement learning and Markov decision processes.
The choice of reward function for playing chess, or any other reinforcement learning problem, depends on the specific objectives and requirements of the task. Note that the credit assignment problem is a significant challenge in reinforcement learning. Some example reinforcement learning problems are:
1) Flying a helicopter using ML can be formulated as a reinforcement learning problem (see page3703).
2) Playing chess can involve reinforcement learning, but it is not the only approach used for chess-playing programs. Chess engines typically employ a combination of different techniques, and reinforcement learning is just one possible component. A reward function of +1 for a win, -1 for a loss, and 0 for a tie is a common and reasonable choice for many chess-playing algorithms.

The value function associated with a policy π in reinforcement learning is defined by,

         V^π(s) = E[ R(s_0) + γ R(s_1) + γ² R(s_2) + ⋯ | s_0 = s, π ] ------------------------ [4321a]

where,
         V^π(s) is the expected cumulative discounted reward when starting in state s and following policy π,
         R(s_t) is the reward received in the state visited at time step t,
         γ is the discount factor, and
         the expectation is taken over the sequence of states visited when actions are chosen according to π.
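To make Equation 4321a concrete, the short sketch below estimates V^π(s) by averaging discounted returns over simulated episodes. The environment step function, the policy, the episode count, and the horizon are hypothetical placeholders; this is only a sketch of the definition, not a specific library API.

    def discounted_return(rewards, gamma=0.9):
        """Compute R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ... for one episode."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def estimate_value(start_state, policy, env_step, gamma=0.9,
                       n_episodes=1000, horizon=50):
        """Monte Carlo estimate of V^pi(start_state): average of discounted returns."""
        total = 0.0
        for _ in range(n_episodes):
            s, rewards = start_state, []
            for _ in range(horizon):
                a = policy(s)            # sample an action from the policy
                s, r = env_step(s, a)    # environment returns next state and reward
                rewards.append(r)
            total += discounted_return(rewards, gamma)
        return total / n_episodes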
Bellman's equation is a fundamental concept in reinforcement learning and dynamic programming. It provides a recursive relationship for the value function, connecting the value of a state to the values of its neighboring states. There are two primary forms of Bellman's equation, named after Richard E. Bellman, who made significant contributions to the development of dynamic programming and reinforcement learning:

i) Bellman Expectation Equation

         V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ V^π(s') ] -------------- [4321b]

where,
         p(s', r | s, a) = Pr( S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a ) -------------- [4321bf]

where,
         A_t represents the action taken by the agent at time t,
         S_t represents the state of the environment at time t,
         S_{t+1} represents the next state of the environment at time t+1,
         R_{t+1} represents the immediate reward received at time t+1,

         π(a|s) = Pr( A_t = a | S_t = s ) -------------- [4321bg]

where,
         π(a|s) is the probability of taking action a when the environment is in state s according to policy π.

Equation 4321b represents a linear system of equations in terms of the state values V^π(s). Assume a finite set of states {s_1, s_2, ..., s_n}. For each state s_i, we can then set up a linear equation using the Bellman Expectation Equation. This results in a system of linear equations, where each equation corresponds to a different state, and the unknowns are the values V^π(s_i). For instance, if we have three states s_1, s_2, s_3, the system of linear equations is,

         V^π(s_1) = Σ_a π(a|s_1) Σ_{s',r} p(s', r | s_1, a) [ r + γ V^π(s') ] -------------- [4321bgl]
         V^π(s_2) = Σ_a π(a|s_2) Σ_{s',r} p(s', r | s_2, a) [ r + γ V^π(s') ] -------------- [4321bgm]
         V^π(s_3) = Σ_a π(a|s_3) Σ_{s',r} p(s', r | s_3, a) [ r + γ V^π(s') ] -------------- [4321bgn]

where s' ranges over {s_1, s_2, s_3}. We can represent these equations in matrix form Ax = b, where A is the matrix of coefficients, x is the vector of unknowns (V^π(s_1), V^π(s_2), V^π(s_3)), and b is the vector on the right-hand side. Solving this system of linear equations gives the values of V^π(s_i) for each state s_i.

By considering a specific initial state s_0 and assuming that the probability of transitioning to the next state s' is determined solely by the policy, the Bellman Expectation Equation 4321b can be simplified to,

         V^π(s_0) = R(s_0) + γ Σ_{s'} P^π(s' | s_0) V^π(s') -------------- [4321bgo]

In Equation 4321b, if the policy is deterministic, meaning that for each state s there is only one action with a probability of 1, the inner sum over actions in the Bellman equation collapses, simplifying the expression. Equation 4321bgo also assumes a specific initial state s_0: if the system has a single initial state, then s_0 is fixed, and the transition probabilities P^π(s' | s_0) represent the probability of transitioning from state s_0 to s' under policy π. Equation 4321bgo further assumes that the environment has stationary dynamics, meaning that the transition probabilities do not change over time, which is a common assumption in many reinforcement learning models. On the other hand, the transition probabilities p(s', r | s, a) in Equation 4321b depend only on the current state s and action a, not on the entire history of states and actions, while P^π(s' | s) represents the probability of transitioning to state s' from the current state s under policy π.

An example of the steps taken in reinforcement learning is Grid World navigation: the action taken at state s is determined by the policy π(a|s), as shown in Equation 4321bg. In reinforcement learning, a policy defines the agent's strategy, or set of rules, for making decisions in the environment. The policy specifies a probability distribution over actions given the current state, and the actual action taken at state s is sampled from this distribution. If, for example, π(up|s) = 0.6, π(down|s) = 0.3, and π(left|s) = 0.1, the agent takes the "up" action with 60% probability, the "down" action with 30% probability, and the "left" action with 10% probability. The choice of action is stochastic and influenced by the policy.

Equation 4321b expresses the expected value of a state s under a policy π. It states that the value of a state is the sum of the expected immediate reward and the discounted expected value of the next state, considering all possible actions and next states according to the policy. The transition probabilities p(s', r | s, a) capture the likelihood of transitioning from state s to state s' with reward r under action a. The expected immediate reward refers to the average or expected value of the immediate reward that an agent receives upon taking a specific action in a particular state.
Mathematically, it can be expressed using the expected value operator E[·]. The expected immediate reward for taking action a in state s is denoted r(s, a), and it is computed by summing over all possible immediate rewards r with their corresponding probabilities:

         r(s, a) = E[ R_{t+1} | S_t = s, A_t = a ] = Σ_r r · p(r | s, a) -------------- [4321bh]

where,
         p(r | s, a) is the probability of receiving immediate reward r after taking action a in state s.
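The linear-system view of Equation 4321b (Equations 4321bgl through 4321bgn) can be implemented directly. The sketch below evaluates a fixed policy on a small, hypothetical three-state MDP by solving (I - γP^π)v = r^π with NumPy; the transition matrix, rewards, and discount value are made-up illustrative numbers.

    import numpy as np

    # Hypothetical 3-state MDP under a fixed policy pi:
    # P_pi[i, j] = probability of moving from state s_i to s_j when following pi.
    P_pi = np.array([[0.5, 0.5, 0.0],
                     [0.2, 0.3, 0.5],
                     [0.0, 0.0, 1.0]])

    # r_pi[i] = expected immediate reward in state s_i under pi.
    r_pi = np.array([0.0, -1.0, 1.0])

    gamma = 0.9

    # Bellman expectation in matrix form: v = r_pi + gamma * P_pi @ v,
    # rearranged as a linear system: (I - gamma * P_pi) v = r_pi.
    v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print("V^pi(s1), V^pi(s2), V^pi(s3) =", v)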
ii) Bellman Optimality Equation

         V*(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ V*(s') ] -------------- [4321c]

This equation defines the optimal value function V*(s), representing the maximum expected cumulative reward achievable from state s when following an optimal policy. It states that the optimal value of a state is the maximum, over all possible actions, of the sum of the expected immediate reward and the discounted expected value of the next state. Equation 4321c can be approximately written as the equation below under certain conditions,

         V*(s) ≈ max_a Q*(s, a) -------------------------------------------------- [4321d]

The approximation conditions are:
i) Optimal Policy Existence: an optimal policy π* exists for the MDP.
ii) Deterministic Policy: π* can be taken to be deterministic, selecting a single action in each state.
iii) Greedy Policy: π* acts greedily with respect to the optimal value function V*.
Under these conditions, the optimal value function V*(s) represents the maximum expected cumulative reward achievable from state s, and the action that achieves this maximum is the one prescribed by the optimal policy π*. Therefore, the maximization operation over policies (max_π) in the Bellman Optimality Equation can be approximated by considering the greedy policy with respect to the current optimal value function V*.

If the immediate reward is included, we can then consider the case where the optimal policy is deterministic and chooses the action that maximizes the expected cumulative reward. Under this condition, the equation becomes,

         V*(s) = R(s) + max_a γ Σ_{s'} P_sa(s') V*(s') ------------------------------------ [4321e]
         V*(s) = R(s) + γ max_a E_{s'~P_sa}[ V*(s') ] ------------------------------------ [4321el]

Here, P_sa(s') represents the probability of transitioning from state s to state s' under action a. The second term on the right-hand side of Equation 4321el is the maximum, over all possible actions a, of the expected value of the next state s', where s' is sampled from the transition probability distribution P_sa. The term E_{s'~P_sa}[V*(s')] in Equation 4321el can be written as Q(a),

         V*(s) = R(s) + γ max_a Q(a) ------------------------------------ [4321em]

The optimal policy is the policy that maximizes the expected cumulative reward for each state. In the case of a deterministic policy, π*(s) is the action selected to achieve the maximum expected value. The equation for a deterministic optimal policy is,

         π*(s) = argmax_a Q*(s, a) ------------------------------------ [4321f]

where Q*(s, a) is the optimal action-value function. The optimal policy π* selects, for each state s, the action that maximizes the expected cumulative reward, according to the optimal action-value function Q*(s, a).
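The greedy-action relationship in Equations 4321el through 4321f can be sketched in a few lines of Python. The transition table, value estimates, and rewards below are hypothetical; the point is only to show how Q(a) = E_{s'~P_sa}[V*(s')] and π*(s) = argmax_a Q*(s, a) are computed once P_sa and V* are available.

    # Hypothetical transition model P, value estimates V, and rewards R for three states.
    P = {
        ("s1", "up"):   {"s2": 1.0},
        ("s1", "down"): {"s3": 1.0},
        ("s2", "up"):   {"s2": 1.0},
        ("s2", "down"): {"s3": 1.0},
        ("s3", "up"):   {"s3": 1.0},
        ("s3", "down"): {"s3": 1.0},
    }
    V = {"s1": 0.2, "s2": 0.5, "s3": 1.0}
    R = {"s1": 0.0, "s2": 0.0, "s3": 1.0}
    actions = ["up", "down"]

    def greedy_action(s, actions, P, V, R, gamma=0.9):
        """pi*(s) = argmax_a [ R(s) + gamma * sum_s' P_sa(s') V(s') ]."""
        def q(a):
            # Q(a) = E_{s' ~ P_sa}[ V(s') ]
            return sum(prob * V[s_next] for s_next, prob in P[(s, a)].items())
        return max(actions, key=lambda a: R[s] + gamma * q(a))

    print(greedy_action("s1", actions, P, V, R))  # -> "down" with these made-up numbers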
Equations 4321c and 4321e underlie fitted value iteration (VI), which gives an approximation to V* that implicitly defines π* through Equation 4321f; by choosing a set of features and learning a value function that approximates the expected payoff of, for example, a robot starting in different states, we can then obtain reasonable controllers. In terms of the transition probabilities P_sa and the optimal value function V*, π*(s) can be given by,

         π*(s) = argmax_a Σ_{s'} P_sa(s') V*(s') ------------------------------------ [4321g]

For all states s and all policies π, we have,

         V*(s) = V^{π*}(s) ≥ V^π(s) ------------------------------------ [4321h]

where,
         V*(s) is the optimal value function, representing the maximum expected cumulative reward achievable from state s under the optimal policy,
         V^π(s) is the value function for any policy π, representing the expected cumulative reward achievable from state s under policy π, and
         V^{π*}(s) is the value function of the optimal policy π*.
Equation 4321h shows that the optimal value function is defined to be greater than or equal to the value function under any policy, for all states s.

Finding the optimal policy in reinforcement learning involves exploring and evaluating different policies to determine which one maximizes the expected cumulative reward over time. The strategy for finding the optimal policy is:

i) Find V*. The first step is to find the optimal state-value function V*, which represents the maximum expected cumulative reward for each state under the optimal policy. Value iteration is a dynamic programming algorithm that can be used to find the optimal state-value function in a reinforcement learning problem; it is specifically designed for solving MDPs when the transition probabilities and rewards are known. The basic idea behind value iteration is to iteratively update the value function until it converges to the optimal values. The algorithm repeatedly applies the Bellman Optimality Equation, which relates the value of a state to the values of its neighboring states under the optimal policy. The update rule for value iteration is given by,

         V(s) := R(s) + max_a γ Σ_{s'} P_sa(s') V(s') ------------------------------------ [4321i]

where R(s) is the immediate reward in state s, γ is the discount factor, and P_sa(s') is the probability of transitioning to s' after taking action a in state s.
Here, we initialize V(s) := 0 for every state s. The update is applied for each state in the state space, and the algorithm continues iteratively until the change in the value function becomes very small, indicating convergence. At this point, the value function has been approximated, and the optimal policy can be derived from it by selecting actions that maximize the expected cumulative reward. Two different updates can be used in this step:
a) Synchronous update: all states are updated using the value estimates from the previous iteration, and the new values replace the old ones only after a full sweep over the state space.
b) Asynchronous update: states are updated one at a time, and each update immediately uses the most recently computed values of the other states.
Both synchronous and asynchronous updates are valid approaches, and the choice between them may depend on factors such as computational efficiency, memory requirements, and the specific characteristics of the problem. Synchronous updates are simpler to implement and ensure a consistent update across all states, while asynchronous updates can sometimes speed up convergence by prioritizing certain states.

ii) Use the argmax Equation to Find π*. Once V* is obtained, the optimal policy can be derived by selecting actions that maximize the expected cumulative reward in each state. The argmax operation (e.g., Equation 4321f or 4321g) is applied to the action set in each state with respect to the optimal state-value function.

Value iteration is an iterative algorithm used in reinforcement learning to find the optimal state-value function for an MDP. However, due to practical considerations and the nature of the algorithm, it typically does not reach the exact value of V*: iteration is stopped once the change in the value function falls below a small tolerance, so the result is an approximation.
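As an illustration of the update rule in Equation 4321i and the argmax step in Equation 4321g, the sketch below runs synchronous value iteration on a small, hypothetical MDP in Python; the states, transition probabilities, rewards, and stopping tolerance are made-up illustrative choices.

    import numpy as np

    # Hypothetical MDP: 3 states, 2 actions.
    # P[a][i, j] = probability of moving from state i to state j under action a.
    P = {
        0: np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.8, 0.1],
                     [0.0, 0.0, 1.0]]),
        1: np.array([[0.2, 0.8, 0.0],
                     [0.0, 0.2, 0.8],
                     [0.0, 0.0, 1.0]]),
    }
    R = np.array([0.0, 0.0, 1.0])   # immediate reward R(s)
    gamma, tol = 0.9, 1e-6

    V = np.zeros(3)                  # initialize V(s) := 0 for every state s
    while True:
        # Synchronous Bellman backup: V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s')
        V_new = R + gamma * np.max(np.stack([P[a] @ V for a in P]), axis=0)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the change is very small
            V = V_new
            break
        V = V_new

    # Greedy policy: pi*(s) = argmax_a sum_s' P_sa(s') V*(s')   (Equation 4321g)
    pi_star = np.argmax(np.stack([P[a] @ V for a in P]), axis=0)
    print("V* =", V, "pi* =", pi_star)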
In a reinforcement learning problem, if we are dealing with an environment where the outcomes of actions are uncertain or not precisely known, the transition probabilities cannot be known. In these cases, model-free methods, such as Q-learning and Monte Carlo methods (see page3676), are advantageous because they do not require access to the dynamics of the environment: they rely on the observed data (state-action pairs and associated rewards) to iteratively improve their estimates of the optimal policy or value function. Reinforcement learning (RL) algorithms typically involve a series of steps to learn an optimal policy through interaction with an environment, as shown in Figure 4321a.
Figure 4321a. Basic RL model. [2]

While the specific details may vary depending on the algorithm, a typical RL algorithm involves a sequence of common steps. In model-based RL, for example, the system dynamics are first modeled; a simple linear model of the state transition is,
         s_{t+1} = A s_t + B a_t ------------------------------ [4321kl]

Here, A is an ℝ^{n×n} matrix and B is an ℝ^{n×d} matrix (ℝ^{n×1} when the action is a scalar). Equation 4321kl actually works well for helicopters if they are not flying too fast. Equation 4321kl is deterministic, meaning that with this model the next state s_{t+1} is entirely determined by the current state s_t and the action a_t. In a stochastic model, the next state is not entirely determined by the current state and action; instead, there is an element of randomness involved, which can be represented using probability distributions. A common way to model a stochastic transition in reinforcement learning is to use a probability distribution over possible next states. For instance, the probability distribution of the next state s_{t+1} given the current state s_t and action a_t, together with a noisy linear model, can be written as,

         P(s_{t+1} | s_t, a_t) -------------------- [4321klk]
         s_{t+1} = A s_t + B a_t + ε_t ------------------------------ [4321klkl]

where ε_t is used to introduce randomness (e.g., noise), either in the stochasticity of the environment or, with a different meaning, in the agent's exploration strategy. In ε-greedy exploration, ε_t is a parameter that determines the probability of choosing a random action (exploration) as opposed to the action with the maximum estimated value (exploitation); setting ε_t = 0 corresponds to a completely greedy strategy in which the agent always exploits its current knowledge and never explores randomly. This strategy is known as "greedy" because the agent always chooses the action it currently believes to be best according to its learned values. In many applications, once the simulator is deployed, we can get rid of the noise by setting ε_t to zero. In a simple case, the noise might follow a Gaussian distribution,

         ε_t ~ N(0, Σ) -------------------- [4321klkm]

To fit the model, we find matrices A and B that minimize the discrepancy between the predicted next state A s_t + B a_t and the actual next state s_{t+1} across multiple samples i and time steps t,

         min_{A,B} Σ_i Σ_t || s_{t+1}^{(i)} - (A s_t^{(i)} + B a_t^{(i)}) ||² -------------------- [4321kll]

where,
         s_t^{(i)} is the state at time t for the i-th sample,
         A is a matrix representing the transformation of the current state,
         B a_t^{(i)} depends on the action a_t^{(i)} taken at time t for the i-th sample, and
         s_{t+1}^{(i)} is the actual next state at time t+1 for the i-th sample.

The RL above is called model-based RL, in which we can have either a linear model or a non-linear model; for a non-linear model, we can apply linearization in the RL process. On the other hand, there is another kind of RL, called model-free RL. In programming, we can have the loop below:

{
  For i = 1, 2, ..., m
    For each action a ∈ A
      Sample s'_1, s'_2, ..., s'_k
      Set
         q(a) = (1/k) Σ_{j=1}^{k} [ R(s^{(i)}) + γ V(s'_j) ] ----------------------------- [4321l]
}
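Equation 4321kll is an ordinary least-squares problem, so A and B can be estimated from logged transitions with a single linear solve. The sketch below stacks [s_t, a_t] into one regressor and fits the combined matrix [A B] with NumPy; the trajectory data are randomly generated placeholders, not real helicopter logs.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, T = 4, 2, 500          # state dim, action dim, number of transitions

    # Placeholder logged data: states s_t, actions a_t, next states s_{t+1}.
    S = rng.normal(size=(T, n))
    Act = rng.normal(size=(T, d))
    A_true = 0.9 * np.eye(n)
    B_true = rng.normal(size=(n, d))
    S_next = S @ A_true.T + Act @ B_true.T + 0.01 * rng.normal(size=(T, n))

    # Solve min_{A,B} sum_t || s_{t+1} - (A s_t + B a_t) ||^2 as least squares:
    # stack regressors X = [s_t, a_t] and fit W = [A B]^T.
    X = np.hstack([S, Act])                      # shape (T, n + d)
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    A_hat, B_hat = W[:n].T, W[n:].T              # recover A (n x n) and B (n x d)
    print(np.round(A_hat, 2))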
These steps are common in various RL algorithms, including model-free methods like Q-learning, policy gradient methods, and model-based methods. Depending on the specific algorithm, there might be additional steps or variations in their sequence. Model-free methods, like Q-learning, often directly estimate the optimal policy and value function, while model-based methods may involve estimating the dynamics of the environment. Note that the optimal value in reinforcement learning, V*(s) = max_π V^π(s), is usually found through an iterative process of taking actions, receiving feedback, and adjusting the agent's strategy. This involves learning from the consequences of actions to improve decision-making over time, so the optimization step is spread across the entire learning process rather than being confined to a specific step.

When the dimension of the state space is high, it is better to approximate V* directly, without using discretization. This involves employing function approximation methods, with neural networks being a popular choice due to their ability to handle high-dimensional input spaces:
i) Define the Neural Network Architecture: Choose a neural network architecture that suits the problem at hand. For value function approximation, a common choice is a feedforward neural network. The input layer should match the dimensionality of the state space, and the output layer should have a single output for the estimated value. For instance, linear regression can be applied to approximate V*,

         θ := argmin_θ (1/2) Σ_i ( θ^T φ(s^{(i)}) - y^{(i)} )² ----------------------------- [4321n]

Here, y^{(i)} ≈ V*(s^{(i)}). In this step, we need to choose the features φ(s) of state s in order to approximate V* using a function V(s) = θ^T φ(s).
ii) Collect Data: Collect a dataset of state-action pairs along with their associated rewards and next states. This dataset is used to train the neural network to approximate the optimal value function.
iii) Loss Function: Define a suitable loss function to train the neural network. For value function approximation, a common choice is the mean squared error (MSE) loss, which measures the squared difference between the predicted values and the true values.
iv) Optimization: Use an optimization algorithm, such as stochastic gradient descent (SGD) or more advanced variants like Adam, to minimize the loss function. This involves adjusting the weights of the neural network to improve its ability to predict the optimal values.
v) Training: Iterate through the dataset multiple times (epochs) to train the neural network. During each iteration, randomly sample batches from the dataset to update the network weights.
vi) Exploration and Exploitation: Depending on the reinforcement learning algorithm being used, incorporate a strategy for exploration and exploitation. Techniques such as epsilon-greedy exploration or more sophisticated exploration strategies can help balance the need to explore new states and exploit current knowledge.
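The regression step in Equation 4321n can be sketched as ordinary least squares on hand-chosen features. In the sketch below, the feature map φ(s), the sampled states, and the targets y^{(i)} (which would normally come from a backup such as Equation 4321l) are all hypothetical placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def phi(s):
        """Hypothetical feature map phi(s) for a 2-dimensional state s."""
        return np.array([1.0, s[0], s[1], s[0] * s[1]])

    # Placeholder training set: sampled states s^(i) and targets y^(i) ~ V*(s^(i)).
    states = rng.uniform(-1, 1, size=(200, 2))
    targets = 1.0 + 0.5 * states[:, 0] - 0.3 * states[:, 1] + 0.1 * rng.normal(size=200)

    # theta := argmin_theta (1/2) * sum_i (theta^T phi(s^(i)) - y^(i))^2   (Equation 4321n)
    Phi = np.stack([phi(s) for s in states])
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

    def V_hat(s):
        """Approximate value function V(s) = theta^T phi(s)."""
        return theta @ phi(s)

    print(V_hat(np.array([0.2, -0.4])))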
Main symbols and notations used in RL are:
         a: Action, the decision or move taken by the agent in a given state.
         Fitted VI: Fitted value iteration.
         k: Often used as an iteration index or counter, e.g., the current iteration or time step in an iterative algorithm, the current episode number, the number of interactions the agent has had with the environment, the index of a specific batch or subset of data used for training, or a specific setting or configuration (such as a hyperparameter value in a hyperparameter search or cross-validation).
         P: Transition model, the probability distribution over next states given the current state and action.
         P(s'|s, a): Transition probability, the probability of transitioning to state s' from state s by taking action a.
         Q(s, a): Action-value function, the expected cumulative reward from being in state s, taking action a, and following a particular policy.
         Q*(s, a): Optimal action-value function, the maximum expected cumulative reward achievable from being in state s, taking action a, and thereafter following the optimal policy.
         r: Immediate reward, the numerical value received by the agent as feedback for taking an action in a specific state.
         R: Return, the total cumulative reward obtained by the agent over a sequence of time steps.
         s: State, the current situation or configuration of the environment.
         S_t, A_t: State and action at time step t in a trajectory.
         T: Time step, the length of a trajectory, or the final horizon.
         V(s): State-value function, the expected cumulative reward from being in state s and following a particular policy.
         V*(s): Optimal state-value function, the maximum expected cumulative reward achievable from state s onward when following the optimal policy.
         V^: The approximation to the optimal state-value function.
         ε: Exploration rate, a parameter used in epsilon-greedy strategies to balance exploration and exploitation.
         γ: Discount factor, a parameter between 0 and 1 that determines the importance of future rewards.
         π: Policy, a mapping from states to probabilities of selecting each possible action.
         π_t*: Optimal non-stationary policy (at time step t).
         π*: Optimal stationary policy.
         π(a|s): Probability of taking action a in state s according to policy π.
         θ: Parameters of the function approximator.
         τ: Trajectory, a sequence of states and actions experienced by the agent during interaction with the environment.

RL algorithms are often not efficient, and they may require a significant number of iterations (e.g., 10 million iterations) to learn effective policies for a task. The efficiency can also vary widely depending on several factors, including the complexity of the task, the quality of the reward signal, the choice of algorithm, and the characteristics of the environment. This poor efficiency is a disadvantage of RL. Examples of reinforcement learning applications include the problems discussed above, such as flying a helicopter and playing chess.
============================================
[1] Daniel Kurian, Fereshteh Sattari, Lianne Lefsrud, and Yongsheng Ma, Using machine learning and keyword analysis to analyze incidents and reduce risk in oil sands operations, Safety Science, 130 (2020), 104873.
[2] https://doi.org/10.1371/journal.pone.0217408.g002
=================================================================================