Example of Building Robot Systems with Automated ML: Helicopter - Python Automation and Machine Learning for ICs - An Online Book
Python Automation and Machine Learning for ICs: http://www.globalsino.com/ICs/
=================================================================================

Example 1: Build a self-driving helicopter. [1]

Flying a helicopter using ML can be formulated as a reinforcement learning problem:
The goal of the reinforcement learning agent is to learn a policy that allows the helicopter to navigate the environment successfully and achieve specific objectives, such as reaching a destination or avoiding obstacles. Reinforcement learning techniques, including algorithms such as deep reinforcement learning, can be employed to train the helicopter's control system. The learning process involves the agent exploring different actions, receiving feedback in the form of rewards or penalties, and adjusting its policy to maximize cumulative reward over time. This approach allows the helicopter to learn and adapt to the complexities of flying in a dynamic environment. In reinforcement learning for flying a helicopter, the responsibilities of a design and/or AI engineer can include problem formulation, modeling, algorithm selection, training and testing, safety considerations, integration, monitoring and maintenance, and collaboration. Writing the cost function or reward function is a crucial aspect of designing a reinforcement learning system for a helicopter or any other application: the reward function plays a central role in shaping the behavior of the learning agent by providing feedback on the desirability of its actions in different states. This project can become very complicated if you want the built helicopter to have good control functions, and it can be the scale of a PhD thesis. The basic steps of the project are as follows (a minimal Python sketch of the cost, reward, and payoff computations is given after this checklist):

i) Build a helicopter simulator, for example on top of a video-game simulator. The reason for using a simulator is that it is cheap: when the simulated helicopter crashes, nothing is lost, and certainly no lives.

ii) Choose a cost function, e.g.,

          J(θ) = ||x - x_desired||² ------------------------------------ [3703a]

or, equivalently, choose a reward function,

          R(s) = -||s - s_desired||² ------------------------------------ [3703b]

where x and s denote the helicopter state (e.g., its position). Equations 3703a and 3703b are a very simple squared-error cost function and reward function, respectively: both measure the squared Euclidean distance between the current state and the desired state. Note that both equations can make the helicopter fly, but they cannot make it fly aggressively in difficult situations.

iii) Run a reinforcement learning algorithm to fly the helicopter in simulation, trying to minimize the cost function (Equation 3703ca below) and/or to maximize the expected sum of rewards (Equation 3703cb below):

          θ_RL = arg min_θ J(θ) ------------------------------------ [3703ca]

The expected total reward over a horizon T is given by,

          E[R(s0) + R(s1) + ... + R(sT)] ------------------------------------ [3703cb]

With Equations 3703ca and 3703cb, we are then able to obtain a learned policy π_RL.

iv) Use the finite-horizon MDP formulation and maximize the sum of rewards over the horizon T.

v) Suppose we find that the resulting controller does much worse than the human pilot; the question then is "What to do next?" There are a few things you can try and consider:

v.a) Is the simulator accurate? Should it be improved? Maybe the simulator is already good enough. If π_RL flies well in simulation but the helicopter does not fly well in real life, then the problem is in the simulator. One could spend 5-6 years improving the simulator; however, whether such long-term work would actually produce good results is still a question.

v.b) Modify the cost function J and minimize it again?

v.c) Modify the RL algorithm?

v.d) Optimize the functions and the algorithm to match the mechanical requirements?

v.e) Is the controller given by the parameters θ_RL performing poorly? The RL algorithm needs to correctly control the helicopter in simulation so as to maximize the expected payoff (reward) V^{π_RL}(s0) = E[R(s0) + R(s1) + ... + R(sT) | π_RL, s0].

v.f) Does the logical reasoning behind trying these different fixes make sense?
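The following is a minimal Python sketch of how Equations 3703a through 3703cb enter the training loop, assuming a hypothetical toy simulator with reset()/step() methods, a state that is just a position vector, and a simple proportional controller parameterized by θ. The names ToyHelicopterSim, rollout_payoff, and fit_theta are illustrative only; a real project would replace the crude point-mass dynamics with a proper flight-dynamics simulator and the grid search with an actual RL algorithm.

import numpy as np

def cost(x, x_desired):
    """Equation [3703a]: squared Euclidean distance to the desired state."""
    return float(np.sum((x - x_desired) ** 2))

def reward(s, s_desired):
    """Equation [3703b]: negative squared distance to the desired state."""
    return -float(np.sum((s - s_desired) ** 2))

class ToyHelicopterSim:
    """Stand-in 'video game' simulator: a point mass nudged by actions plus noise.
    A real helicopter simulator would model the full flight dynamics."""
    def __init__(self, dim=3, noise=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim, self.noise = dim, noise

    def reset(self):
        self.s = self.rng.normal(size=self.dim)   # initial state s0
        return self.s

    def step(self, action):
        # Very crude dynamics: the action pushes the state, plus wind-like noise.
        self.s = self.s + 0.1 * action + self.noise * self.rng.normal(size=self.dim)
        return self.s

def rollout_payoff(theta, sim, s_desired, T=100):
    """Monte-Carlo sample of R(s0) + R(s1) + ... + R(sT) (Equation [3703cb])
    for a linear policy a = -theta * (s - s_desired)."""
    s = sim.reset()
    total = reward(s, s_desired)
    for _ in range(T):
        action = -theta * (s - s_desired)          # simple proportional controller
        s = sim.step(action)
        total += reward(s, s_desired)
    return total

def fit_theta(sim, s_desired, candidates, n_rollouts=20, T=100):
    """Toy stand-in for theta_RL = arg min_theta J(theta) (Equation [3703ca]),
    implemented as arg max of the average payoff over a grid of candidate gains."""
    def avg_payoff(theta):
        return np.mean([rollout_payoff(theta, sim, s_desired, T) for _ in range(n_rollouts)])
    return max(candidates, key=avg_payoff)

if __name__ == "__main__":
    sim = ToyHelicopterSim()
    s_desired = np.zeros(3)                        # "hover at the origin"
    theta_RL = fit_theta(sim, s_desired, candidates=np.linspace(0.0, 2.0, 21))
    print("learned gain theta_RL =", theta_RL)
    print("estimated payoff:", rollout_payoff(theta_RL, sim, s_desired))

The grid search here is only a placeholder for step iii): any RL algorithm (e.g., a policy-gradient method) could be dropped in at the point where fit_theta maximizes the estimated payoff.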
One thing we can also try is to isolate this big, complicated system and identify the single component that causes the failure:

i) If π_RL flies well in simulation but the helicopter does not fly well in real life, then the simulator has a problem.

ii) Let π_human be the human control policy. If V^{π_RL}(s0) < V^{π_human}(s0) (where V represents the value function, π_RL represents the policy learned by the RL agent, and π_human represents the policy of the human operator), then the problem is in the RL algorithm, e.g., it is failing to maximize the expected payoff. The inequality indicates that the RL agent's policy currently leads to lower expected cumulative reward (value) in the initial state s0 than the human operator's policy. Several potential reasons for this discrepancy are:

ii.a) Suboptimal Policy: The RL agent may have learned a suboptimal policy and has not yet discovered or converged to a policy that performs as well as the human's policy. This could be due to issues with the learning algorithm, insufficient training, or inadequate exploration.

ii.b) Exploration: The RL agent may not have explored the state-action space effectively, leading to a failure to discover optimal actions in certain situations. Improving exploration strategies or adjusting exploration parameters may help the agent discover better policies.

ii.c) Reward Function Design: The reward function used to train the RL agent might not be well designed or may not accurately reflect the objectives of the task. Adjusting the reward function or using reward-shaping techniques could be beneficial.

ii.d) Human Expertise: Human operators often bring domain expertise and intuition to complex tasks. The RL agent may not be leveraging this expertise effectively. Incorporating expert demonstrations, using imitation learning, or fine-tuning the RL agent based on human demonstrations could improve its performance.

ii.e) Model Complexity: The helicopter control task may be complex, and the RL agent's model might not be capturing the intricacies of the helicopter dynamics and control. Consideration should be given to using more sophisticated RL algorithms or neural network architectures.

ii.f) Hyperparameter Tuning: RL algorithms have various hyperparameters, and suboptimal choices may affect learning. Tuning hyperparameters, such as learning rates and discount factors, could improve the RL agent's policy.

ii.g) Safety Constraints: Helicopter control often involves safety-critical considerations. If the RL agent's policy violates safety constraints, it may receive lower values in terms of the value function. Ensuring that the RL agent adheres to safety constraints is essential.

ii.h) Task Complexity: The task itself may be challenging, and the RL agent may require more time or data to learn an effective policy. Patience and continued training may be necessary.
iii) If, on the other hand, V^{π_RL}(s0) > V^{π_human}(s0), then the RL agent's policy currently leads to higher expected cumulative reward (value) in the initial state s0 than the human's policy. While this might seem desirable, it is important to consider potential issues or reasons behind this scenario:

iii.a) Overfitting: The RL agent may have overfit the training data or the specific situations encountered during training, leading to a policy that performs well in those situations but might not generalize well to new, unseen scenarios.

iii.b) Randomness or Luck: The RL agent might have encountered a lucky sequence of states or actions during training, contributing to higher estimated values in the initial state. This could be due to the stochastic nature of the environment or of the learning process.

iii.c) Limited Exploration: The RL agent may not have explored the state-action space sufficiently, and its policy might be exploiting a subset of actions that appear to yield high rewards in the short term. This could lead to a lack of robustness in the learned policy.

iii.d) Reward Hacking: The RL agent might have found a way to exploit the reward signal in unintended ways, leading to artificially inflated values. Careful inspection of the reward function and of potential reward-hacking scenarios is important.

iii.e) Task Mismatch: The RL agent may have learned a policy that is optimized for a different set of objectives than those of the human operator. The reward function or task definition might not align with the true goals of the task.

iii.f) Inadequate Human Demonstration: If human demonstrations were used during RL training, the quality and relevance of the demonstrations could affect the learned policy. Inaccurate or suboptimal demonstrations might make the learned policy appear to outperform the human policy.

iii.g) Discount Factor Sensitivity: The choice of the discount factor in the RL algorithm affects the importance given to future rewards. A high discount factor may prioritize long-term rewards, potentially leading to differences in performance.

iii.h) Simulation Fidelity: If the RL agent was trained in a simulation environment that does not accurately reflect the dynamics of a real helicopter, the learned policy might not transfer well to the real-world scenario.

iv) If J(θ_human) < J(θ_RL) (here, θ_human denotes the human control policy), then the problem is in the reinforcement learning algorithm, which fails to minimize the cost function J: the algorithm has not learned a policy θ_RL that performs as well as the human control policy θ_human according to the specified cost function.

v) If J(θ_human) > J(θ_RL), then, according to the cost function, the RL policy θ_RL performs better (incurs lower cost) than the human control policy θ_human. There are several reasons why achieving a low cost, as measured by J(θ_RL), may not guarantee that the RL agent will fly well or perform optimally in a given environment; in that case the most likely culprit is the cost function itself, because minimizing J does not correspond to good autonomous flight, and the cost (or reward) function should be revised. This diagnostic logic is summarized in the sketch below.
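Below is a minimal Python sketch of the diagnostic checklist above, assuming the toy simulator and reward function from the earlier sketch, policies given as callables mapping a state to an action, and precomputed cost values J(θ_RL) and J(θ_human). The helper names estimate_value and diagnose are hypothetical and not part of any library; the sketch only mirrors the isolate-the-faulty-component reasoning, it is not a complete debugging tool.

import numpy as np

def estimate_value(policy, sim, reward_fn, s_desired, T=100, n_rollouts=50):
    """Monte-Carlo estimate of V^pi(s0) = E[R(s0) + R(s1) + ... + R(sT) | pi, s0],
    evaluated in the simulator."""
    totals = []
    for _ in range(n_rollouts):
        s = sim.reset()
        total = reward_fn(s, s_desired)
        for _ in range(T):
            s = sim.step(policy(s))
            total += reward_fn(s, s_desired)
        totals.append(total)
    return float(np.mean(totals))

def diagnose(pi_RL, pi_human, sim, reward_fn, s_desired,
             J_RL, J_human, flies_well_in_sim, flies_well_in_real_life):
    """Point to the component (simulator, RL algorithm, or cost/reward function)
    that is the most likely culprit, following the checklist above."""
    # i) Works in simulation but not in the real world -> blame the simulator.
    if flies_well_in_sim and not flies_well_in_real_life:
        return "Simulator problem: pi_RL flies well in simulation but not in real life."

    # ii) The human earns a higher expected payoff under our own reward
    #     -> the RL algorithm is failing to maximize the expected payoff.
    V_RL = estimate_value(pi_RL, sim, reward_fn, s_desired)
    V_human = estimate_value(pi_human, sim, reward_fn, s_desired)
    if V_RL < V_human:
        return "RL algorithm problem: V^pi_RL(s0) < V^pi_human(s0)."

    # iv) / v) Compare the cost-function values.
    if J_human < J_RL:
        return "RL algorithm problem: it fails to minimize J (J(theta_human) < J(theta_RL))."
    if J_human > J_RL and not flies_well_in_real_life:
        # Low cost yet poor flight: minimizing J does not correspond to flying
        # well, so the cost/reward function should be revised.
        return "Cost/reward function problem: J does not capture good autonomous flight."

    return "No single component is clearly at fault; gather more evidence."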
============================================
[1] Andrew Ng, 2018.