Linear Quadratic Regulation (LQR) - Python Automation and Machine Learning for ICs - An Online Book
Python Automation and Machine Learning for ICs   http://www.globalsino.com/ICs/
=================================================================================

Linear Quadratic Regulation (LQR) is a control algorithm from the field of control theory rather than from machine learning specifically; however, it is related to machine learning through reinforcement learning and optimal control. In control theory, LQR is a method for designing optimal linear state-feedback controllers for linear dynamic systems with quadratic performance measures. The goal is to find a control policy that minimizes a cost function, typically a quadratic function of the state and control inputs.

LQR is most effective, and most commonly used, for problems that can be well approximated by linear dynamics and quadratic cost functions. It is particularly suited to problems where the system dynamics can be accurately modeled as linear and where the cost can be represented as a quadratic function of the state and control inputs. LQR may not be suitable for problems with highly nonlinear dynamics or non-quadratic cost functions; in such cases, more advanced control techniques or reinforcement learning approaches may be more appropriate. Model Predictive Control (MPC) is an extension of LQR that can handle some degree of nonlinearity and has been applied to a broader range of problems.

In reinforcement learning, R(s, a) represents the immediate reward or cost associated with taking action a in state s. The form of R(s, a) depends on the specific problem and the goals of the reinforcement learning task, but a common structure is a scalar value associated with the state-action pair (s, a). The function is typically designed around the objectives of the task; in a control task, for instance, R(s, a) might penalize deviations from a desired state or penalize large control efforts. A simple linear quadratic form used in control problems and reinforcement learning is

          R(s, a) = -(s^T Q s + a^T R a) ---------------------------------- [3662a]

so that maximizing the cumulative reward is equivalent to minimizing the quadratic cost s^T Q s + a^T R a, where:

          s is the state vector, a column vector representing the state of the system at a given time. Its dimensionality depends on the number of state variables in the system; with n state variables, s is an n×1 column vector,

               s = [s1, s2, ..., sn]^T

          a is the action (control input) vector; with m control inputs, a is an m×1 column vector.

          Q and R are positive semi-definite matrices that determine the penalties for state and action deviations, respectively.

Q defines the cost associated with state deviations. Its dimension is determined by the dimension of the state vector s: if s is an n×1 column vector, Q is an n×n matrix,

          Q = [ q11  q12  ...  q1n
                q21  q22  ...  q2n
                ...
                qn1  qn2  ...  qnn ]

Each element qij of Q determines the penalty associated with deviations of the i-th and j-th state variables from their desired trajectory; the diagonal entries qii penalize the individual state variables. The dimension of R is determined by the dimension of the control input vector a: if a is an m×1 column vector, R is an m×m matrix,

          R = [ r11  r12  ...  r1m
                r21  r22  ...  r2m
                ...
                rm1  rm2  ...  rmm ]

Each element rij of R determines the penalty associated with the control inputs; the diagonal entries rii penalize the individual control inputs. The quadratic form in Equation 3662a is common in linear quadratic control problems and is used in methods such as the Linear Quadratic Regulator (LQR).
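As a quick illustration, the quadratic form in Equation 3662a can be evaluated directly with NumPy. The short sketch below uses a hypothetical two-state, one-input system; the particular values of Q, R, s, and a are arbitrary choices for illustration only.

import numpy as np

# Evaluate the quadratic form in Equation 3662a for one state-action pair.
# Q, R, s, and a below are hypothetical illustrative values (n = 2 states, m = 1 input).
Q = np.diag([1.0, 0.5])           # n x n positive semi-definite state penalty
R = np.array([[0.1]])             # m x m positive semi-definite control penalty

s = np.array([[1.0], [-2.0]])     # n x 1 state column vector
a = np.array([[0.3]])             # m x 1 action (control input) column vector

cost = (s.T @ Q @ s + a.T @ R @ a).item()    # s^T Q s + a^T R a
reward = -cost                               # R(s, a) = -(s^T Q s + a^T R a)
print("quadratic cost:", cost, " immediate reward:", reward)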
For a discrete-time finite-horizon Markov decision process (MDP), let st denote the state vector and at the action vector at time t, with state transition dynamics P(st+1 | st, at). The cost function J over a finite time horizon T is then

          J = E[ Σ_{t=0}^{T-1} (st^T Q st + at^T R at) + sT^T QT sT ]

where:

          st is the state vector at time t.

          at is the action vector at time t.

          QT is the terminal cost matrix.

          P(st+1 | st, at) represents the state transition dynamics, over which the expectation is taken.

The optimal control policy in the context of reinforcement learning is typically denoted π*(at | st), where π* represents the optimal policy. In the linear quadratic case, the policy is linear in the state,

          at = Kt st

where Kt is a matrix (the feedback gain) that minimizes the cost-to-go. If both Q and R are identity matrices (denoted I), the linear quadratic cost simplifies to the sum of squared state variables and control inputs,

          st^T st + at^T at = ||st||^2 + ||at||^2

The squared norm of a vector is always non-negative, and it is zero if and only if the vector itself is the zero vector. In the linear quadratic cost function, a squared norm of zero would mean that the state or the action is exactly at the origin (zero point) of the vector space. In practical applications, however, it is rare for the state or control input to be exactly zero, especially in dynamic systems where non-trivial behavior is expected.

The key assumptions of LQR are:

          i) Linear dynamics: the dynamics of the system can be accurately described by linear equations, so the next state is a linear function of the previous state and action plus some noise, st+1 = A st + B at + wt. The state transition dynamics are thus a linear function of the current state and control input.

          ii) Quadratic cost function: the reward (or cost) is quadratic. LQR is specifically designed for problems where the cost function is quadratic in both the state and the control input.

          iii) Stationarity: LQR typically assumes stationarity, meaning that the system dynamics and the cost function parameters remain constant over time.

          iv) Known dynamics: LQR assumes that the system dynamics are known, i.e., the matrices defining the state transition dynamics are accurately known and can be modeled.

          v) Quadratic stabilizability: for continuous-time systems, a key assumption is quadratic stabilizability, which ensures that the system can be stabilized under a quadratic cost.

          vi) Full state observability: LQR assumes that the full state of the system is observable; all state variables can be measured or estimated accurately.

          vii) Finite horizon: LQR is often used for finite-horizon problems where the objective is to minimize the cost over a specific time horizon.

The remarkable property of LQR is that, under the assumptions above, the optimal value function of the Markov decision process is quadratic in the state. This property makes it computationally efficient to compute the optimal value function, often denoted V*. The value function at time t+1 in quadratic form is

          V*t+1(st+1) = st+1^T Φt+1 st+1 + Ψt+1

where Φt+1 ∈ ℝ^(n×n) and Ψt+1 ∈ ℝ. Then V*t can be written as

          V*t(st) = max_{at} [ R(st, at) + E_{st+1 ~ P(st+1 | st, at)} [V*t+1(st+1)] ] ---------------------------------- [3662h]

This equation expresses the optimal value function at time t in terms of the maximum expected immediate reward (the first term on the right-hand side) and the expected value of the optimal value function at the next time step (the second term on the right-hand side). The first term inside the maximization in Equation 3662h is the immediate reward of Equation 3662a. Carrying out the maximization, the optimal control input at time t in a linear quadratic control problem is

          at = -(B^T Φt+1 B - R)^(-1) B^T Φt+1 A st = Kt st ---------------------------------- [3662i]

where:

          at is the optimal control input at time t.

          Φt+1 is the matrix that characterizes the quadratic part of the value function at time t+1.

          B is the matrix associated with the control input in the system dynamics equation st+1 = A st + B at + wt, and A is the matrix associated with the state.

          st is the state at time t.

          Ψt+1, the scalar term representing the constant offset in the value function, does not affect the choice of at.
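Equations 3662h and 3662i suggest a simple backward pass: starting from the terminal cost, the quadratic value matrix Φt and the feedback gains Kt can be computed recursively from t = T-1 down to t = 0. The sketch below is a minimal NumPy implementation of that recursion under the conventions used on this page; the system matrices A and B, the penalties Q, R, and QT, and the horizon T are hypothetical values chosen only for illustration, not quantities from the text.

import numpy as np

# Finite-horizon LQR backward pass, following the convention used above:
# reward R(s, a) = -(s^T Q s + a^T R a) and quadratic value V*_t(s) = s^T Phi_t s + Psi_t.
# The matrices A, B, Q, R, Q_T and the horizon T are hypothetical illustrative values.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state matrix in s_{t+1} = A s_t + B a_t + w_t
B = np.array([[0.0],
              [1.0]])             # control-input matrix
Q = np.eye(2)                     # state penalty
R = np.array([[0.1]])             # control penalty
Q_T = np.eye(2)                   # terminal cost matrix
T = 50                            # time horizon

Phi = -Q_T                        # Phi_T = -Q_T: the terminal value is the negative terminal cost
gains = [None] * T
for t in reversed(range(T)):
    M = B.T @ Phi @ B - R                           # B^T Phi_{t+1} B - R (negative definite)
    K = -np.linalg.solve(M, B.T @ Phi @ A)          # gain in a_t = K_t s_t (Equation 3662i)
    gains[t] = K
    # Backward update of the quadratic part of the value function (a Riccati-type recursion).
    Phi = A.T @ (Phi - Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - Q

print("K_0     =", gains[0])      # the feedback gains are time-varying over the finite horizon
print("K_{T-1} =", gains[T - 1])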
The optimal value function at time t in a stochastic control problem with a quadratic cost and Gaussian noise in the state transition can be written as

          V*t(st) = max_{at} [ -(st^T Q st + at^T R at) + E_{st+1 ~ N(A st + B at, Σw)} [V*t+1(st+1)] ] ---------------------------------- [3662j]

where Σw is the covariance of the Gaussian noise wt in the state transition st+1 = A st + B at + wt. Applying Equation 3662i to Equation 3662j and evaluating the expectation, Equation 3662j can be re-written as

          V*t(st) = st^T Φt st + Ψt ---------------------------------- [3662k]

with the backward recursions

          Φt = A^T ( Φt+1 - Φt+1 B (B^T Φt+1 B - R)^(-1) B^T Φt+1 ) A - Q

          Ψt = tr(Σw Φt+1) + Ψt+1

where:

          V*t(st) is the optimal value function at time t given the state st.

          st^T Φt st is the quadratic part of the value (the cost-to-go) associated with the state st; the matrix Φt characterizes this quadratic part.

          Ψt is a scalar term representing an additional constant offset in the value function, accumulated from the Gaussian noise.

In linear quadratic control, the value function has a quadratic form, and the optimal policy and value function can be expressed in terms of the matrices representing the system dynamics, control inputs, and quadratic costs. The quadratic form of the value function is a consequence of the structure of the problem, particularly the combination of quadratic costs and linear dynamics. Both Equations 3662j and 3662k exhibit a quadratic form in the state st; the first term in Equation 3662k is the quadratic term of the value function.
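One way to sanity-check Equations 3662j and 3662k is a Monte-Carlo comparison: roll the closed-loop system forward many times under the optimal gains and compare the average total reward with the prediction s0^T Φ0 s0 + Ψ0. The sketch below does this for a hypothetical scalar system with Gaussian process noise; every numerical value is an illustrative assumption.

import numpy as np

# Monte-Carlo check of Equations 3662j and 3662k for a hypothetical scalar system
# s_{t+1} = a_dyn*s_t + b_dyn*u_t + w_t, with w_t ~ N(0, sigma_w2).
# All numbers below are illustrative assumptions, not values from the text.
rng = np.random.default_rng(0)
a_dyn, b_dyn = 1.0, 1.0       # scalar A and B
q, r = 1.0, 1.0               # scalar state and control penalties (Q and R)
sigma_w2 = 0.04               # variance of the Gaussian noise w_t
T = 20                        # time horizon

# Backward pass: the scalar version of the Phi/Psi recursion above.
phi = [0.0] * (T + 1)
psi = [0.0] * (T + 1)
gain = [0.0] * T
phi[T] = -q                   # terminal value: -q * s_T^2
for t in reversed(range(T)):
    m = b_dyn * phi[t + 1] * b_dyn - r
    gain[t] = -(b_dyn * phi[t + 1] * a_dyn) / m                    # Equation 3662i, scalar case
    psi[t] = sigma_w2 * phi[t + 1] + psi[t + 1]                    # offset contributed by the noise
    phi[t] = a_dyn * (phi[t + 1] - phi[t + 1] * b_dyn * (1.0 / m) * b_dyn * phi[t + 1]) * a_dyn - q

# Forward rollouts: the average total reward under the optimal policy should
# approach V*_0(s_0) = phi[0]*s_0^2 + psi[0], i.e. Equation 3662k at t = 0.
s0 = 2.0
predicted = phi[0] * s0**2 + psi[0]
totals = []
for _ in range(2000):
    s, total = s0, 0.0
    for t in range(T):
        u = gain[t] * s
        total += -(q * s**2 + r * u**2)                            # immediate reward, Equation 3662a
        s = a_dyn * s + b_dyn * u + rng.normal(0.0, np.sqrt(sigma_w2))
    total += -q * s**2                                             # terminal reward
    totals.append(total)
print("predicted V*_0(s_0):", predicted)
print("Monte-Carlo average:", float(np.mean(totals)))

Because Φ0 is negative semi-definite, the noise contributes a negative offset Ψ0, so the achievable value decreases as the noise variance grows, while the optimal gains themselves do not depend on Σw.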
=================================================================================