
CN113485099B - Online learning control method of nonlinear discrete time system - Google Patents

Online learning control method of nonlinear discrete time system

Info

Publication number
CN113485099B
CN113485099B (application number CN202011635930.6A)
Authority
CN
China
Prior art keywords
network
execution
evaluation
optimal
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011635930.6A
Other languages
Chinese (zh)
Other versions
CN113485099A (en)
Inventor
李新兴
查文中
王雪源
王蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Science Research Institute of CETC
Original Assignee
Information Science Research Institute of CETC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Science Research Institute of CETC filed Critical Information Science Research Institute of CETC
Priority to CN202011635930.6A priority Critical patent/CN113485099B/en
Publication of CN113485099A publication Critical patent/CN113485099A/en
Application granted granted Critical
Publication of CN113485099B publication Critical patent/CN113485099B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/021 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a variable is automatically adjusted to optimise the performance
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 — Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 — Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an online learning control method for nonlinear discrete-time systems, comprising a behavior policy selection step, an optimal Q-function definition step, an evaluation network and execution network introduction step, an estimation error calculation step, and a final optimal weight calculation step; after the weights of the evaluation network and the execution network converge, the output of the execution network is the approximation of the optimal controller. The invention does not need to iterate repeatedly between policy evaluation and policy improvement and can realize real-time online learning of the optimal controller; it adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space of the direct heuristic dynamic programming method, and the execution network and the evaluation network may use activation functions of any form. The invention realizes online learning of the optimal controller without a system model, requiring only the state data generated by the behavior policy.

Description

An online learning control method for nonlinear discrete-time systems

Technical Field

The present invention relates to the field of industrial production control, and in particular to an online learning control method for nonlinear discrete-time systems.

Background

In industrial production, engineers often need to optimize the controllers of controlled objects such as robots, unmanned aerial vehicles, and unmanned ground vehicles so that given control specifications are met. Because these controlled objects usually exhibit strong nonlinearity, controller optimization is difficult. From the viewpoint of optimal control, obtaining the optimal controller requires solving the Hamilton-Jacobi-Bellman (HJB) equation; however, the HJB equation is a nonlinear partial differential equation and is extremely difficult to solve. Traditional approaches such as dynamic programming, the calculus of variations, and spectral methods have very high computational complexity and therefore face severe limitations in practical applications.

Adaptive dynamic programming is a new type of intelligent control algorithm that has emerged in recent years. By combining reinforcement learning, neural-network approximation, dynamic programming, and adaptive control, it can learn the optimal controller online and effectively overcomes the high computational complexity of traditional methods. For the optimal control of nonlinear discrete-time systems, Jennie Si and Yu-Tsung Wang first proposed the direct heuristic dynamic programming algorithm in the paper "Online learning control by association and reinforcement". Based on the idea of generalized policy iteration, the algorithm introduces two neural networks (an execution network and an evaluation network) and can learn the optimal controller and the optimal value function online in real time. After continuous development in recent years, the convergence and stability analysis of the algorithm now has a certain theoretical foundation. Although direct heuristic dynamic programming achieves online adaptive optimal control, it still has the following shortcomings: 1) it uses an on-policy learning mechanism, explores the state-policy space insufficiently, and easily falls into local optima; 2) both the execution network and the evaluation network use the hyperbolic tangent activation function, and all existing convergence and stability results are based on the hyperbolic tangent function, so they do not apply to other types of activation functions.

Therefore, how to overcome the above shortcomings of the direct heuristic dynamic programming method, so that the convergence and stability results are no longer restricted to the hyperbolic tangent function, is a technical problem that urgently needs to be solved.

Summary of the Invention

The purpose of the present invention is to provide an online learning control method for nonlinear discrete-time systems that explores the state-policy space better, allows the activation functions of the execution network and the evaluation network to be of any type rather than being restricted to the hyperbolic tangent function, and, in contrast to iterative methods such as policy iteration or value iteration, learns the optimal controller online without a system model, requiring only the state data generated by the behavior policy.

To this end, the present invention adopts the following technical solution:

An online learning control method for nonlinear discrete-time systems, comprising the following steps:

Behavior policy selection step S110:

According to the characteristics of the controlled object, a behavior policy u is selected based on existing experience. The behavior policy is the control policy actually applied to the controlled object during learning; its main role is to generate the system state data needed in the learning process;

Optimal Q-function definition step S120:

The following optimal Q-function is defined:

Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. From the definition of the optimal Q-function, the above expression can equivalently be written as:

The optimal control u*(x_k) can then be expressed as the minimizer of Q*(x_k, ·);

For nonlinear systems, Q*(x_k,u_k) and u*(x_k) are nonlinear functions of (x_k,u_k) and x_k, respectively;

Evaluation network and execution network introduction step S130:

An evaluation network and an execution network are introduced to approximate Q*(x_k,u_k) and u*(x_k) online, respectively; both networks are neural networks;

The evaluation network is used to learn the optimal Q-function Q*(x_k,u_k), and the execution network is used to learn the optimal controller u*. Let the number of activation functions in the evaluation network be N_c, and denote the best least-squares approximation of Q*(x_k,u_k) by the evaluation network accordingly; it can be expressed as:

where W_c is the weight from the hidden layer to the output layer, φ_c(·) is the set of all activation functions of the hidden layer of the evaluation network, the input-layer-to-hidden-layer weights of the evaluation network collect the weights corresponding to the individual activation functions, and the arguments of φ_c(·) are the input values of the activation functions corresponding to (x_k,u_k), the i-th entry being the input value of the i-th activation function;

Let the number of activation functions in the execution network be N_a, and denote the best least-squares approximation of u*(x_k) by the execution network accordingly; it can be expressed as:

The input of the execution network is the system state, where W_a is the weight from the hidden layer to the output layer, φ_a(·) is the set of activation functions of the hidden layer of the execution network, the input-layer-to-hidden-layer weights collect the weights corresponding to the individual activation functions, and the arguments of φ_a(·) are the input values of the activation functions corresponding to x_k, the i-th entry being the input value of the i-th activation function; an analogous expression holds for x_{k+1};

Estimation error calculation step S140:

Replacing the exact values Q*(x_k,u_k) and u*(x_k) by their best approximations yields the following estimation error:

where the corresponding term denotes the input values of the activation functions of the evaluation network when the input is formed from x_{k+1} and the approximated optimal control at x_{k+1};

Optimal weight calculation step S150:

The optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k the estimates of W_c and W_a produced by the evaluation network and the execution network are given, where l ≤ k, i.e., the learning process is carried out only after the behavior policy has started to generate state data; the output of the execution network at time k can then be expressed as:

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot yet provide an estimate of W_a for time k+1; therefore, the estimate of W_a at time k+1 still uses the estimate at time k, and the output of the execution network at time k+1 is:

Similarly, when the input is (x_k,u_k), the output of the evaluation network is:

When the input is formed from x_{k+1} and the execution-network output at x_{k+1}, the output of the evaluation network is:

where, likewise, before the state x_{k+1} is generated the evaluation network cannot provide an estimate of W_c for time k+1, so the estimate of W_c at time k+1 also uses the estimate at time k; therefore:

Replacing the true values by the estimated values yields the following estimation error:

The weights of the evaluation network are adjusted by a gradient descent method,

The weights of the execution network are trained by an importance-weighting method, and an improved gradient descent method is used to adjust them online,

After the weights of the evaluation network and of the execution network have converged, the output of the execution network is the approximation of the optimal controller.

Optionally, in the evaluation network and execution network introduction step S130,

for the evaluation network, the input-layer-to-hidden-layer weights are set to constant values, and only the weights from the hidden layer to the output layer need to be adjusted;

for the execution network, the input-layer-to-hidden-layer weights are likewise set to constant values, and only the weights from the hidden layer to the output layer are adjusted.

Optionally, in the optimal weight calculation step S150:

the weights of the evaluation network are adjusted by the following gradient descent method:

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(ζ_2(k+1)) − φ_c(ζ_1(k)) is the regression vector, and Φ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is the normalization term;

the weights of the execution network are trained by an importance-weighting method, and the improved gradient descent method is used to adjust them online as follows:

Optionally, in the behavior policy selection step S110, the behavior policy is u_k = u′_k + n_k, where u′ is any feasible control policy, selected according to the characteristics of the controlled system and prior experience, and n_k is an exploration noise; n_k may be a combination of sine and cosine signals containing sufficiently many frequencies, or a random signal with bounded amplitude.

Optionally, the evaluation network and the execution network are feedforward neural networks with a single hidden layer; the inputs of the evaluation network, which approximates the Q-function, are the state and the control input, the input of the execution network is the system state, and its output is an m-dimensional vector.

Optionally, the evaluation network and the execution network adjust only the weights from the hidden layer to the output layer; the weights from the input layer to the hidden layer are randomly generated before the learning process starts and remain unchanged during learning.

Optionally, the activation functions of the evaluation network and of the execution network are one of the hyperbolic tangent function, the sigmoid function, the rectified linear unit, and polynomial functions.

The present invention further discloses a storage medium for storing computer-executable instructions, characterized in that:

the computer-executable instructions, when executed by a processor, perform the online learning control method for nonlinear discrete-time systems described above.

The present invention has the following advantages:

1. The present invention proposes an online learning control method applicable to general nonlinear discrete-time systems; the method does not need to iterate repeatedly between policy evaluation and policy improvement and achieves real-time online learning of the optimal controller;

2. The present invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space of the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.

3. Compared with the classical direct heuristic dynamic programming method, the online learning method proposed in this patent explores the state-policy space better, and the activation functions of the execution network and the evaluation network can be chosen arbitrarily instead of being restricted to the hyperbolic tangent function; compared with iterative methods such as policy iteration or value iteration, the method learns the optimal controller online without a system model, requiring only the state data generated by the behavior policy.

Description of the Drawings

Figure 1 is a flow chart of the online learning control method for nonlinear discrete-time systems according to a specific embodiment of the present invention;

Figure 2 is a schematic diagram of the evaluation network of the online learning control method for nonlinear discrete-time systems according to a specific embodiment of the present invention;

Figure 3 is a schematic diagram of the execution network of the online learning control method for nonlinear discrete-time systems according to a specific embodiment of the present invention;

Figure 4 is a schematic diagram of the algorithm of the online learning control method for nonlinear discrete-time systems according to a specific embodiment of the present invention.

Detailed Description

The present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it. It should also be noted that, for convenience of description, only the parts related to the present invention, rather than the entire structure, are shown in the drawings.

The present invention first considers the following optimal control problem for nonlinear discrete-time systems. Consider the discrete-time system:

x_{k+1} = F(x_k, u_k),  x_0 = x(0)

where x_k is the system state and u_k is the system input. The system function F(x_k,u_k) is Lipschitz continuous on a compact set Ω and satisfies F(0,0) = 0. The system is assumed to be stabilizable on Ω, i.e., there exists a control sequence u_1, ..., u_k, ... such that x_k → 0. In addition, the system function F(x_k,u_k) is assumed to be unknown. The goal of optimal control of the nonlinear system is to find a feasible control policy that stabilizes the system while minimizing the following value function:
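
For orientation, a value function of the standard form used in the adaptive dynamic programming literature is sketched below; the stage cost U and its quadratic choice are illustrative assumptions of this reconstruction, not text of the patent:

J(x_k) = \sum_{i=k}^{\infty} U(x_i, u_i), \qquad \text{e.g. } U(x_i,u_i) = x_i^{\mathsf T} Q x_i + u_i^{\mathsf T} R u_i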

According to Bellman's principle of optimality, the optimal control policy u* satisfies the following Bellman equation:

s.t. x_{k+1} = F(x_k, u_k)

Therefore, the optimal controller u* has the following expression:

Substituting the above expression into the Bellman equation yields the following HJB equation:
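
In this notation (J* denoting the optimal value function and U the assumed stage cost introduced above), the Bellman equation, the resulting optimal controller, and the HJB equation would typically take the following form; this is a sketch consistent with the surrounding text, not a verbatim reproduction of the patent's display equations:

J^*(x_k) = \min_{u_k} \left\{ U(x_k,u_k) + J^*(x_{k+1}) \right\}, \qquad \text{s.t. } x_{k+1} = F(x_k,u_k)

u^*(x_k) = \arg\min_{u_k} \left\{ U(x_k,u_k) + J^*\big(F(x_k,u_k)\big) \right\}

J^*(x_k) = U\big(x_k,u^*(x_k)\big) + J^*\big(F(x_k,u^*(x_k))\big)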

Therefore, referring to Figure 1, the online learning control method for nonlinear discrete-time systems according to the present invention comprises the following steps:

Behavior policy selection step S110:

According to the characteristics of the controlled object, a behavior policy u is selected based on existing experience. The behavior policy is the control policy actually applied to the controlled object during learning; its main role is to generate the system state data needed in the learning process.

After the behavior policy has been selected, the optimal controller is learned online.

Optimal Q-function definition step S120:

The following optimal Q-function is defined:

Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. From the definition of the optimal Q-function, the above expression can equivalently be written as:

The optimal control u*(x_k) can then be expressed as the minimizer of Q*(x_k, ·).

For nonlinear systems, Q*(x_k,u_k) and u*(x_k) are nonlinear functions of (x_k,u_k) and x_k, respectively.
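
A sketch of the optimal Q-function matching the physical meaning stated above (behavior action at time k, target policy thereafter); J* and U are the optimal value function and the assumed stage cost from the problem formulation:

Q^*(x_k,u_k) = U(x_k,u_k) + J^*(x_{k+1}) = U(x_k,u_k) + Q^*\big(x_{k+1}, u^*(x_{k+1})\big)

u^*(x_k) = \arg\min_{u} Q^*(x_k,u), \qquad J^*(x_k) = Q^*\big(x_k, u^*(x_k)\big)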

Evaluation network and execution network introduction step S130:

Considering that a feedforward neural network with a sufficiently large number of activation functions can approximate smooth or continuous nonlinear functions to arbitrary accuracy, an evaluation network and an execution network are introduced to approximate Q*(x_k,u_k) and u*(x_k) online, respectively; both are neural networks.

The evaluation network is used to learn the optimal Q-function Q*(x_k,u_k), and the execution network is used to learn the optimal controller u*. Let the number of activation functions in the evaluation network be N_c, and denote the best least-squares approximation of Q*(x_k,u_k) by the evaluation network accordingly; it can be expressed as:

where W_c is the weight from the hidden layer to the output layer, φ_c(·) is the set of all activation functions of the hidden layer of the evaluation network, the input-layer-to-hidden-layer weights of the evaluation network collect the weights corresponding to the individual activation functions, and the arguments of φ_c(·) are the input values of the activation functions corresponding to (x_k,u_k), the i-th entry being the input value of the i-th activation function.
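
A sketch of the evaluation-network output in the notation of the text (W_c, φ_c, N_c); the stacked input z_k and the matrix V_c collecting the fixed input-to-hidden weights v_{c,i} are notational assumptions introduced here for readability:

\hat{Q}^*(x_k,u_k) = W_c^{\mathsf T}\, \phi_c\!\big(V_c^{\mathsf T} z_k\big), \qquad z_k = \big[x_k^{\mathsf T}, u_k^{\mathsf T}\big]^{\mathsf T}, \quad V_c = [v_{c,1}, \dots, v_{c,N_c}]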

In the present invention these input-layer-to-hidden-layer weights are set to constant values, so that only the weights from the hidden layer to the output layer need to be adjusted.

Similarly, let the number of activation functions in the execution network be N_a, and denote the best least-squares approximation of u*(x_k) by the execution network accordingly; it can be expressed as:

The input of the execution network is the system state, where W_a is the weight from the hidden layer to the output layer, φ_a(·) is the set of activation functions of the hidden layer of the execution network, the input-layer-to-hidden-layer weights collect the weights corresponding to the individual activation functions, and the arguments of φ_a(·) are the input values of the activation functions corresponding to x_k, the i-th entry being the input value of the i-th activation function; an analogous expression holds for x_{k+1}.
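
The corresponding sketch for the execution network, with V_a = [v_{a,1}, ..., v_{a,N_a}] again an assumed symbol for the fixed input-to-hidden weights:

\hat{u}^*(x_k) = W_a^{\mathsf T}\, \phi_a\!\big(V_a^{\mathsf T} x_k\big)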

The present invention likewise sets these input-layer-to-hidden-layer weights of the execution network to constant values and adjusts only the weights from the hidden layer to the output layer.

Estimation error calculation step S140:

Replacing the exact values Q*(x_k,u_k) and u*(x_k) by their best approximations yields the following estimation error:

where the corresponding term denotes the input values of the activation functions of the evaluation network when the input is formed from x_{k+1} and the approximated optimal control at x_{k+1}.
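
Substituting the two approximations into the Q-function recursion sketched above leaves a residual; in the assumed notation this estimation error would read:

e_k = U(x_k,u_k) + \hat{Q}^*\big(x_{k+1}, \hat{u}^*(x_{k+1})\big) - \hat{Q}^*(x_k,u_k)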

Optimal weight calculation step S150:

The optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k the estimates of W_c and W_a produced by the evaluation network and the execution network are given, where l ≤ k, i.e., the learning process is carried out only after the behavior policy has started to generate state data; the output of the execution network at time k can then be expressed as:

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot yet provide an estimate of W_a for time k+1; therefore, the estimate of W_a at time k+1 still uses the estimate at time k, and the output of the execution network at time k+1 is:

Similarly, when the input is (x_k,u_k), the output of the evaluation network is:

When the input is formed from x_{k+1} and the execution-network output at x_{k+1}, the output of the evaluation network is:

where, likewise, before the state x_{k+1} is generated the evaluation network cannot provide an estimate of W_c for time k+1, so the estimate of W_c at time k+1 also uses the estimate at time k; therefore:

Replacing the true values by the estimated values yields the following estimation error:

For the evaluation network, the goal is to drive the estimation error e_k to 0 through online learning; therefore, the weights of the evaluation network are adjusted by the following gradient descent method:

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(ζ_2(k+1)) − φ_c(ζ_1(k)) is the regression vector, and Φ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is the normalization term.
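
With the quantities named in the text (learning rate α, regression vector Δφ_c(k), normalization Φ_c(k), and the estimation error e_k formed with the current weight estimates), a normalized gradient-descent update consistent with this description would be the following; the exact indexing of the weight estimate is an assumption of this sketch:

\hat{W}_c(k+1) = \hat{W}_c(k) - \frac{\alpha\, \Delta\phi_c(k)\, e_k}{\Phi_c(k)}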

The weights of the execution network are trained by an importance-weighting method. The objective function of the execution network is defined in terms of the prediction error e_a(k) of the execution network and the desired final objective U_c; in the present invention U_c = 0, i.e., during learning the execution network is to minimize the objective as far as possible. Likewise, the following improved gradient descent method is used to adjust the weights online:

where β > 0 is the learning rate of the execution network and Φ_a(k) = (1 + φ_a(ζ_4(k))^T φ_a(ζ_4(k)))^2 is the normalization term.
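
The patent's importance-weighted update rule for the execution network is not reproduced here; for orientation only, a direct-HDP-style modified gradient step that matches the stated objective (minimize ½ e_a(k)^2 with e_a(k) the evaluation-network output at (x_k, û(x_k)) and U_c = 0) and the normalization Φ_a(k) would read as follows, where the chain-rule factor ∂Q̂/∂û is an assumption of this sketch:

\hat{W}_a(k+1) = \hat{W}_a(k) - \frac{\beta}{\Phi_a(k)}\, \phi_a\big(\zeta_4(k)\big)\, e_a(k) \left[\frac{\partial \hat{Q}}{\partial \hat{u}}\right]^{\mathsf T}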

As can be seen from the training process of the evaluation network and the execution network, all state data used during learning are generated by the behavior policy u; after the weights of the evaluation network and of the execution network have converged, the output of the execution network is the approximation of the optimal controller.

For the behavior policy:

In a specific embodiment, all state data used in the online learning of the optimal controller are generated by the behavior policy u. To ensure that the algorithm has a certain ability to explore the policy space, the state data generated by the behavior policy must be sufficiently rich and satisfy a persistent-excitation-type condition, so that convergence of the algorithm is guaranteed. In the present invention the behavior policy is u_k = u′_k + n_k, where u′ is any feasible control policy, usually selected according to the characteristics of the controlled system and prior experience, and n_k is an exploration noise; n_k may be a combination of sine and cosine signals containing sufficiently many frequencies, or a random signal with bounded amplitude.
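
As an illustration of this behavior policy, the following minimal sketch (Python/NumPy) combines a hypothetical base feedback u′_k = −K x_k with a multi-frequency sinusoidal exploration noise; the gain K, the frequencies, and the amplitude are illustrative choices, not values prescribed by the patent.

import numpy as np

def behavior_policy(x, k, K, freqs, amp=0.1):
    """u_k = u'_k + n_k: a feasible base policy plus exploration noise.

    u'_k is taken here as a simple state feedback -K @ x (illustrative only);
    n_k mixes several sinusoids so that the generated state data are rich
    enough to satisfy a persistent-excitation-type condition.
    """
    u_base = -K @ x                                    # u'_k: any feasible policy
    n_k = amp * sum(np.sin(w * k) for w in freqs)      # multi-frequency exploration noise
    return u_base + n_k

# Example: 2-dimensional state, scalar input
K = np.array([[0.5, 0.3]])
x = np.array([1.0, -0.5])
u = behavior_policy(x, k=0, K=K, freqs=[0.1, 0.7, 1.3])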

For the evaluation network and the execution network:

Both the evaluation network and the execution network of the present invention are feedforward neural networks with a single hidden layer. The inputs of the evaluation network, which approximates the Q-function, are the state and the control input, and its output is a scalar; the input of the execution network is the system state, and its output is an m-dimensional vector. During learning, both networks adjust only the weights from the hidden layer to the output layer; the weights from the input layer to the hidden layer are randomly generated before the learning process starts and remain unchanged during learning. The activation functions of the hidden layers can be chosen as the commonly used hyperbolic tangent function, sigmoid function, rectified linear unit, polynomial functions, and so on.
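
To make this architecture concrete, the sketch below (Python/NumPy) builds single-hidden-layer networks with randomly generated, fixed input-to-hidden weights and adjustable hidden-to-output weights, as just described. The hyperbolic tangent is only one admissible activation choice, and the dimensions, the symbols V_c and V_a, and the zero initialization are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(42)

n_x, n_u = 2, 1           # state and input dimensions (illustrative)
N_c, N_a = 12, 10         # numbers of hidden-layer activation functions

# Input-to-hidden weights: generated randomly once, then kept fixed during learning
V_c = rng.standard_normal((n_x + n_u, N_c))   # evaluation (critic) network
V_a = rng.standard_normal((n_x, N_a))         # execution (actor) network

# Hidden-to-output weights: the only weights adjusted online
W_c = np.zeros(N_c)             # evaluation-network output is a scalar Q-value
W_a = np.zeros((N_a, n_u))      # execution-network output is the m-dimensional control

phi = np.tanh                   # any admissible activation (tanh, sigmoid, ReLU, ...)

def evaluation_network(x, u):
    """Approximates the optimal Q-function Q*(x, u)."""
    z = np.concatenate([x, u])
    return W_c @ phi(V_c.T @ z)

def execution_network(x):
    """Approximates the optimal controller u*(x)."""
    return W_a.T @ phi(V_a.T @ x)

x = np.array([0.5, -0.2])
u = execution_network(x)        # shape (n_u,)
q = evaluation_network(x, u)    # scalar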

Figures 2 and 3 show schematic diagrams of the evaluation network and the execution network, respectively.

Of course, the evaluation network and the execution network of the present invention may also be chosen as feedforward neural networks with multiple hidden layers, in which case the weights of all connection layers may be adjusted during learning.

Figure 4 shows a schematic diagram of the principle of the online learning control method of the present invention.
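
Pulling steps S110-S150 together, the following self-contained sketch (Python/NumPy) illustrates one possible realization of the learning loop of Figure 4: the behavior policy drives a hypothetical plant F, the evaluation-network weights are updated by normalized gradient descent on the Bellman-type residual, and the execution-network weights are pushed toward minimizing the evaluation-network output. The plant F, the quadratic stage cost, the learning rates, and the chain-rule actor update (used here in place of the patent's importance-weighted rule) are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def F(x, u):
    """Hypothetical nonlinear plant; the learning itself never uses F explicitly."""
    return np.array([0.9 * x[0] + 0.1 * x[1] + 0.05 * np.sin(x[1]),
                     -0.2 * x[0] + 0.8 * x[1] + 0.1 * u[0]])

def U(x, u):
    """Assumed quadratic stage cost x^T Q x + u^T R u with Q = I, R = 0.1 I."""
    return x @ x + 0.1 * (u @ u)

n_x, n_u, N_c, N_a = 2, 1, 16, 12
V_c = rng.standard_normal((n_x + n_u, N_c))   # fixed input weights, evaluation network
V_a = rng.standard_normal((n_x, N_a))         # fixed input weights, execution network
W_c = np.zeros(N_c)                           # adjustable output weights (critic)
W_a = np.zeros((N_a, n_u))                    # adjustable output weights (actor)
alpha, beta = 0.05, 0.02                      # learning rates (illustrative)

def feat_c(x, u):
    return np.tanh(V_c.T @ np.concatenate([x, u]))

def actor(x):
    return W_a.T @ np.tanh(V_a.T @ x)

x = np.array([1.0, -0.5])
for k in range(2000):
    # S110: behavior policy = base feedback + multi-frequency exploration noise
    u = -0.3 * x[:n_u] + 0.2 * np.array([np.sin(0.1 * k) + np.sin(0.7 * k)])
    x_next = F(x, u)

    # S140/S150: Bellman-type residual and normalized gradient descent for the critic
    dphi = feat_c(x_next, actor(x_next)) - feat_c(x, u)   # regression vector Delta phi_c(k)
    e_c = U(x, u) + W_c @ dphi                            # estimation error with current weights
    Phi_c = (1.0 + dphi @ dphi) ** 2                      # normalization term
    W_c = W_c - alpha * e_c * dphi / Phi_c

    # Actor step: drive Q_hat(x, actor(x)) toward U_c = 0 through the critic (chain rule);
    # this stands in for the patent's importance-weighted update, which is not reproduced here.
    phi_a = np.tanh(V_a.T @ x)
    z = np.concatenate([x, W_a.T @ phi_a])
    h = V_c.T @ z
    e_a = W_c @ np.tanh(h)                                   # Q_hat(x, u_hat(x)) - U_c
    dQ_du = V_c[n_x:, :] @ (W_c * (1.0 - np.tanh(h) ** 2))   # dQ_hat/du for a tanh critic
    Phi_a = (1.0 + phi_a @ phi_a) ** 2
    W_a = W_a - (beta / Phi_a) * e_a * np.outer(phi_a, dQ_du)

    x = x_next

# After convergence of W_c and W_a, actor(x) approximates the optimal controller u*(x).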

The present invention further discloses a storage medium for storing computer-executable instructions, characterized in that:

the computer-executable instructions, when executed by a processor, perform the above online learning control method for nonlinear discrete-time systems.

The present invention has the following advantages:

1. The present invention proposes an online learning control method applicable to general nonlinear discrete-time systems; the method does not need to iterate repeatedly between policy evaluation and policy improvement and achieves real-time online learning of the optimal controller;

2. The present invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space of the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.

3. Compared with the classical direct heuristic dynamic programming method, the online learning method proposed in this patent explores the state-policy space better, and the activation functions of the execution network and the evaluation network can be chosen arbitrarily instead of being restricted to the hyperbolic tangent function; compared with iterative methods such as policy iteration or value iteration, the method learns the optimal controller online without a system model, requiring only the state data generated by the behavior policy.

Obviously, those skilled in the art should understand that the units or steps of the present invention described above can be implemented with a general-purpose computing device; they can be concentrated on a single computing device, or implemented with program code executable by a computing device so that they can be stored in a storage device and executed by the computing device, or they can be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and it cannot be concluded that the specific embodiments of the present invention are limited thereto. For a person of ordinary skill in the art to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of them shall be deemed to fall within the scope of protection of the present invention as determined by the submitted claims.

Claims (8)

1. An online learning control method of a nonlinear discrete time system, comprising the following steps:
a behavior policy selection step S110:
according to the characteristics of the controlled object, existing experience is used to select a behavior policy u, wherein the behavior policy is the control policy actually applied to the controlled object in the learning process and is mainly used for generating the system state data needed in the learning process;
an optimal Q-function definition step S120:
the following optimal Q-function is defined:
the physical meaning is as follows: at time k, the behavior policy u is taken, and at all times thereafter, the optimal control policy u*, i.e. the target policy, is taken; as defined by the optimal Q-function, the above equation can be equivalently expressed as:
the optimal control u* can be expressed as:
for nonlinear systems, Q*(x_k,u_k) and u*(x_k) are nonlinear functions of (x_k,u_k) and x_k, respectively;
an evaluation network and execution network introduction step S130:
an evaluation network and an execution network are introduced to approximate Q*(x_k,u_k) and u*(x_k) online, respectively, wherein the evaluation network and the execution network are neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k,u_k), and the execution network is used to learn the optimal controller u*; assuming that the number of neural-network activation functions in the evaluation network is N_c, and denoting the best least-squares approximation of Q*(x_k,u_k) by the evaluation network accordingly, it can be expressed as:
wherein W_c is the weight from the hidden layer to the output layer, φ_c(·) is the set of all activation functions in the hidden layer of the evaluation network, the input-layer-to-hidden-layer weights of the evaluation network collect the weights corresponding to the individual activation functions, and the arguments of φ_c(·) are the input values of the activation functions corresponding to (x_k,u_k), the i-th entry being the input value of the i-th activation function;
letting the number of execution-network activation functions be N_a, and denoting the best least-squares approximation of u*(x_k) by the execution network accordingly, it can be expressed as:
the input of the execution network is the system state, wherein W_a is the weight from the hidden layer to the output layer, φ_a(·) is the set of activation functions of the hidden layer of the execution network, the input-layer-to-hidden-layer weights collect the weights corresponding to the individual activation functions, and the arguments of φ_a(·) are the input values of the activation functions corresponding to x_k, the i-th entry being the input value of the i-th activation function, with an analogous expression for x_{k+1};
an estimation error calculation step S140:
replacing the exact values Q*(x_k,u_k) and u*(x_k) by their best approximations yields the following estimation error:
wherein the corresponding term denotes the input values of the activation functions of the evaluation network when the input is formed from x_{k+1} and the approximated optimal control at x_{k+1};
an optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online; assuming that at time k the estimates of W_c and W_a by the evaluation network and the execution network are given, wherein l ≤ k, i.e. the learning process is performed after the behavior policy starts to generate state data, the output of the execution network at time k can be expressed as:
before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot yet give an estimate of W_a for time k+1; therefore, the estimate of W_a at time k+1 still uses the estimate at time k, and the output of the execution network at time k+1 is:
similarly, when the input is (x_k,u_k), the output of the evaluation network is:
when the input is formed from x_{k+1} and the execution-network output at x_{k+1}, the output of the evaluation network is:
wherein, likewise, before the state x_{k+1} is generated the evaluation network cannot give an estimate of W_c for time k+1, so the estimate of W_c at time k+1 also uses the estimate at time k; therefore:
replacing the actual values with the estimated values yields the following estimation error:
the weights of the evaluation network are adjusted by a gradient descent method,
the weights of the execution network are trained by an importance-weighting method, and a modified gradient descent method is used to adjust them online,
and when the weights of the evaluation network and the weights of the execution network have converged, the output of the execution network is the approximation of the optimal controller.
2. The online learning control method according to claim 1, characterized in that:
in the evaluation network and execution network introduction step S130,
for the evaluation network, the input-layer-to-hidden-layer weights are set to constant values, and only the weights from the hidden layer to the output layer need to be adjusted;
for the execution network, the input-layer-to-hidden-layer weights are set to constant values, and only the weights from the hidden layer to the output layer are adjusted.
3. The online learning control method according to claim 2, characterized in that:
in the optimum weight calculation step S150:
the weights of the evaluation network are adjusted by the following gradient descent method:
wherein α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(ζ_2(k+1)) − φ_c(ζ_1(k)) is the regression vector, and Φ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is the normalization term;
the weights of the execution network are trained by an importance-weighting method, and the objective function of the execution network is defined in terms of the prediction error e_a(k) of the execution network and the desired final objective function U_c; in the present invention U_c = 0, i.e. the execution network is to minimize the objective as much as possible during learning, and the following improved gradient descent method is adopted to adjust the weights online:
β > 0 is the learning rate of the execution network, and Φ_a(k) = (1 + φ_a(ζ_4(k))^T φ_a(ζ_4(k)))^2 is the normalization term.
4. The online learning control method according to claim 3, characterized in that:
in the behavior policy selection step S110, the behavior policy is u_k = u′_k + n_k, wherein u′ is any feasible control policy, selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k is a combination of sine and cosine signals containing sufficiently many frequencies, or a random signal of limited amplitude.
5. The online learning control method according to claim 3, characterized in that:
the evaluation network and the execution network are feedforward neural networks with a single hidden layer, the inputs of the evaluation network for approximating the Q-function are the state and the control input, the input of the execution network is the system state, and its output is an m-dimensional vector.
6. The online learning control method of claim 5, wherein:
the evaluation network and the execution network only adjust weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process starts and remain unchanged in the learning process.
7. The online learning control method of claim 5, wherein:
the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, a polynomial function.
8. A storage medium storing computer-executable instructions, characterized by:
the computer executable instructions, when executed by a processor, perform the online learning control method of the nonlinear discrete time system in any one of claims 1-7.
CN202011635930.6A 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system Active CN113485099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Publications (2)

Publication Number Publication Date
CN113485099A CN113485099A (en) 2021-10-08
CN113485099B true CN113485099B (en) 2023-09-22

Family

ID=77933336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635930.6A Active CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Country Status (1)

Country Link
CN (1) CN113485099B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111620B (en) * 2023-10-23 2024-03-29 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009186317A (en) * 2008-02-06 2009-08-20 Mitsubishi Electric Corp Radar control system
CN110214264A (en) * 2016-12-23 2019-09-06 御眼视觉技术有限公司 Navigation system with imposed responsibility constraints
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 Trajectory selection for autonomous vehicles
CN107436424A (en) * 2017-09-08 2017-12-05 中国电子科技集团公司信息科学研究院 A kind of more radar dynamic regulating methods and device based on information gain
CN111142383A (en) * 2019-12-30 2020-05-12 中国电子科技集团公司信息科学研究院 Online learning method for optimal controller of nonlinear system
CN110826026A (en) * 2020-01-13 2020-02-21 江苏万链区块链技术研究院有限公司 Method and system for publication based on block chain technology and associated copyright protection
CN111812973A (en) * 2020-05-21 2020-10-23 天津大学 An event-triggered optimal control method for discrete-time nonlinear systems

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Off-Policy Q-Learning for Infinite Horizon LQR Problem with Unknown Dynamics; Li, XX et al.; 27th IEEE International Symposium on Industrial Electronics (ISIE); full text *
Online learning control by association and reinforcement; J. Si et al.; Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000: Neural Computing — New Challenges and Perspectives for the New Millennium; full text *
Real-Time Model Predictive Control Using a Self-Organizing Neural Network; Hong-Gui Han et al.; IEEE Transactions on Neural Networks and Learning Systems; full text *
Adaptive optimal attitude control for reentry vehicles; Zhang Zhenning et al.; Journal of Astronautics; full text *
Load balancing strategy based on D2D communication in heterogeneous networks; Xu Tengju et al.; Computer Engineering and Design; full text *

Also Published As

Publication number Publication date
CN113485099A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
Wang et al. Adaptive neural output-feedback control for a class of nonlower triangular nonlinear systems with unmodeled dynamics
CN112698572B (en) Structural vibration control method, medium and equipment based on reinforcement learning
Ren et al. Adaptive neural control for a class of uncertain nonlinear systems in pure-feedback form with hysteresis input
Kong et al. Adaptive fuzzy control for a marine vessel with time‐varying constraints
Song et al. Robust neural network tracking controller using simultaneous perturbation stochastic approximation
CN111142383B (en) Online learning method for optimal controller of nonlinear system
CN112099345B (en) A fuzzy tracking control method, system and medium based on input hysteresis
CN110286586A (en) A Hybrid Modeling Method for Magnetorheological Damper
CN112904726B (en) Neural network backstepping control method based on error reconstruction weight updating
CN114740710A (en) Random nonlinear multi-agent reinforcement learning optimization formation control method
CN117272570A (en) A vibration analysis method of flow pipeline based on Fourier characterization PINN
CN118192224A (en) A time-limited adaptive dynamic programming control method
CN118011775A (en) PID control method and system for high-order unknown nonlinear system based on DL
CN115826414B (en) Tracking control method based on measurement data under denial of service attack
CN113485099B (en) Online learning control method of nonlinear discrete time system
Chen et al. Neuroadaptive tracking control of affine nonlinear systems using echo state networks embedded with multiclustered structure and intrinsic plasticity
Liu et al. Adaptive containment control of heterogeneous high‐order fully actuated multi‐agent systems
CN116755339A (en) Self-adaptive non-backstepping control method and system for nonlinear strict feedback system
Dai et al. An intelligent fuzzy robustness ZNN model with fixed‐time convergence for time‐variant Stein matrix equation
CN119668259A (en) Ship heading control method and system
CN112947329B (en) Event-triggered control method and system for distributed parameter system
CN111880413B (en) Adaptive dynamic surface algorithm for ship course keeping
Chen et al. Command filter‐based adaptive finite‐time control of fractional‐order nonlinear constrained systems with input saturation
Zhao et al. Finite-time synchronization of fractional-order fuzzy Cohen-Grossberg neural networks with time delay
Guan et al. Robust adaptive recurrent cerebellar model neural network for non-linear system based on GPSO

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant