CN102063640B - Robot Behavior Learning Model Based on Utility Difference Network - Google Patents
Robot Behavior Learning Model Based on Utility Difference Network
- Publication number: CN102063640B (application CN201010564142A)
- Authority: CN (China)
- Prior art keywords: action, network unit, layer, input, function
- Classification: Feedback Control In General (AREA)
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Description
Technical Field
The present invention relates to a robot behavior learning model based on a utility difference network, and belongs to one of the new applications in the field of artificial intelligence.
Background Art
Intelligent robot behavior generally refers to the process in which a robot reasons and makes decisions on the basis of perceiving its surrounding environment, thereby reaching intelligent behavioral decisions. Building an intelligent behavior decision-making model requires acquiring, representing, and reasoning over knowledge, and the model must be able to evaluate the quality of the robot's behavior automatically. At present, cognitive behavior models based on reinforcement learning are the first choice for intelligent behavior modeling because of their advantages in knowledge acquisition, adaptability to the decision-making environment, and reusability.
The reinforcement learning process requires exploration of the environment. It can be described as follows: in a given state, the decision maker selects and executes an action, and then perceives the next environment state and the corresponding reward. The decision maker is not told directly which action to take in which situation; instead, it adjusts its behavior according to the rewards it receives in order to obtain more reward. Simply put, reinforcement learning allows the decision maker to find the best sequence of actions through repeated trials.
At present, behavioral decision-making in robot reinforcement learning mostly uses reactive approaches based on specific knowledge or rules. The disadvantages of such approaches are, first, that knowledge acquisition is limited; second, that the acquired knowledge is often empirical and new knowledge cannot be learned in time; and third, that the reasoning process has poor real-time performance.
Summary of the Invention
Aiming at the shortcomings of current behavioral decision-making in robot reinforcement learning, the present invention establishes a robot behavior learning model based on a utility difference network. The model is an evaluation-based learning system: through interaction with the environment it automatically generates the control law of the system, which is then used to select actions. The robot behavior learning model based on the utility difference network solves the problems of limited knowledge acquisition and excessive reliance on experience in general behavior decision-making models, and the offline learning process and online decision-making process it implements solve the problem of poor real-time performance in reasoning.
A robot behavior learning model based on a utility difference network comprises: a utility fitting network unit, a differential signal calculation network unit, a confidence evaluation network unit, an action decision network unit, an action correction network unit, and an action execution unit. The utility fitting network unit computes the utility fitting value of the state space vector st produced after the action at at time t is executed by the action execution unit, and outputs it to the differential signal calculation network unit. The differential signal calculation network unit computes the differential signal ΔTDt from the input utility fitting value and from the immediate reward function computed from the state space vector st, and outputs ΔTDt to the utility fitting network unit, the confidence evaluation network unit, and the action decision network unit. The utility fitting network unit uses the differential signal ΔTDt to update the weights of its neural network. The confidence evaluation network unit uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit, together with the differential signal, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit. The action decision network unit performs action selection learning according to the input differential signal ΔTDt and the state space vector st, and outputs the action selection functions to the action correction network unit, where j and k are integers greater than 0. The action correction network unit uses the input confidence to correct the input action selection functions, computes the selection probability of each corrected action, and outputs the action with the highest probability to the action execution unit for execution; the state space vector obtained after this action is executed is fed back to the utility fitting network unit, the differential signal calculation network unit, and the action decision network unit.
The learning model has two processes: an offline learning process and an online decision-making process. All of the above units participate in the offline learning process, whereas the online decision-making process involves only the action decision network unit finally obtained from offline learning and the action execution unit. In the online decision-making process, the action decision network unit computes the output action selection functions from the state space vector st obtained after the action executed at time t, outputs the finally selected action through an action selector to the action execution unit for execution, and the state space vector obtained after the action is executed is fed back to the action decision network unit.
The advantages and beneficial effects of the present invention are as follows:
(1) The robot learning model of the present invention does not need to compute correct actions directly; instead, it overcomes the difficulty of robot knowledge acquisition through a learning loop of action, environment interaction, and evaluation. Because the learning model does not require an explicit environment model, the causal structure of the environment is captured implicitly in the differential feedback network, which better ensures the completeness of the environmental knowledge acquired by the robot;
(2) The offline learning process designed in this model completes the learning of environmental knowledge before the robot makes decisions, and the online decision-making process further completes the robot's acquisition of environmental knowledge. Decision-making at run time no longer involves exploration or learning activities; it only requires computation and summation with the reconstructed network. This offline/online design ensures that the robot's behavioral decision-making has good real-time performance and better guarantees the timeliness and effectiveness of the robot's behavioral decisions.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of the offline learning process in the first embodiment of the learning model of the present invention;
Fig. 2 is a schematic flow diagram of the action decision network in the first embodiment of the learning model of the present invention;
Fig. 3 is a schematic diagram of the coding structure used by the genetic operators in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 4 is a schematic diagram of the crossover operation of the genetic operators in the action decision network in the first embodiment of the learning model of the present invention;
Fig. 5 is a schematic diagram of the online decision-making process in the second embodiment of the learning model of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. The first embodiment describes the offline learning process of the learning model in detail; the second embodiment describes the online decision-making process.
As shown in Fig. 1, the learning model of the present invention includes five parts: a utility fitting network unit 11, a differential signal calculation network unit 12, a confidence evaluation network unit 13, an action decision network unit 14, and an action correction network unit 15. All five parts participate in the offline learning process of the learning model.
The utility fitting network unit 11 computes the utility fitting value for each state space vector st produced after the action at selected at time t is executed by the action execution unit 16, and outputs the utility fitting value to the differential signal calculation network unit 12; the differential signal calculation network unit 12 outputs the differential signal ΔTDt to the confidence evaluation network unit 13 and the utility fitting network unit 11. The utility fitting network unit 11 then uses the differential signal ΔTDt supplied by the differential signal calculation network unit 12 to update itself continuously, so that the fitted utility approaches the true utility.
The differential signal calculation network unit 12 computes the differential signal ΔTDt from the input utility fitting value and from the immediate reward function computed from the state space vector st, and outputs ΔTDt to the utility fitting network unit 11, the confidence evaluation network unit 13, and the action decision network unit 14.
The confidence evaluation network unit 13 uses the input vector of the input layer and the output vector of the hidden layer of the neural network in the utility fitting network unit 11, together with the differential signal ΔTDt, to compute the confidence of the action decision result, and outputs this confidence to the action correction network unit 15 for adjusting the action selection.
The action decision network unit 14 uses a hierarchical genetic algorithm to optimize its neural network according to the input differential signal ΔTDt and the state space vector st, thereby realizing action selection learning, and outputs the action selection functions to the action correction network unit 15, where j and k are integers greater than 0.
The action correction network unit 15 uses the input confidence to correct the input action selection functions and outputs the action with the highest probability. The state space vector obtained after the action is executed is fed back to the utility fitting network unit 11, the differential signal calculation network unit 12, and the action decision network unit 14.
The utility fitting network unit 11 evaluates the utility of the state change caused by a particular action and produces the utility fitting value. It is built from a two-layer feedback neural network, as shown in Fig. 1. The input of the neural network is the state space vector st, the hidden-layer activation function is the Sigmoid function, the network output is the utility fitting value of the state after the action is executed, and the weight coefficients of the network are A, B, and C. The network contains n input units and h hidden units; each hidden unit receives the n inputs and has n connection weights, and the output unit receives n+h inputs and has n+h weights. The value of h can be set by the user; it is usually set to 3 and is set to 2 in this embodiment of the present invention.
The input vector of the neural network is xi(t), i = 1, 2, ..., n, where xi(t) is obtained by normalizing st. With g(·) denoting the Sigmoid function and aij(t) the input-to-hidden weights (the elements of A), the output of the j-th hidden unit is

yj(t) = g( Σi aij(t)·xi(t) ), j = 1, 2, ..., h.

The output of the utility fitting network 11 is the utility fitting value, written here as P̂(st); it is a linear combination of the input-layer and hidden-layer outputs:

P̂(st) = Σi bi(t)·xi(t) + Σj cj(t)·yj(t),

where bi(t) denotes the weights B between the input layer and the output layer, and cj(t) denotes the weights C between the hidden layer and the output layer.
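As an illustration, the forward pass just described can be sketched in Python/NumPy as follows. The array names follow the weight labels A, B, and C in the text; the concrete dimensions and the normalization of st are not specified in the description and are therefore assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def utility_forward(x, A, B, C):
    """Forward pass of the two-layer utility fitting network.

    x : normalized state vector x(t), shape (n,)
    A : input-to-hidden weights a_ij, shape (n, h)
    B : input-to-output weights b_i, shape (n,)
    C : hidden-to-output weights c_j, shape (h,)
    Returns the fitted utility P_hat and the hidden-layer output y,
    which is reused by the weight updates and the confidence network.
    """
    y = sigmoid(x @ A)          # hidden layer with Sigmoid activation
    P_hat = x @ B + y @ C       # linear combination of input- and hidden-layer outputs
    return P_hat, y
```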
The weights A, B, and C of the network are updated with the differential signal ΔTDt. If ΔTDt is positive, the previous action produced a positive effect, and the chance of that action being selected should therefore be reinforced. The weights B between the input layer and the output layer and the weights C between the hidden layer and the output layer are updated as follows:
bi(t+1) = bi(t) + λ·ΔTDt+1·xi(t), i = 1, 2, ..., n
cj(t+1) = cj(t) + λ·ΔTDt+1·yj(t), j = 1, 2, ..., h
where λ is a constant greater than zero that can be set by the user. The weights A between the input layer and the hidden layer are updated as follows:
aij(t+1) = aij(t) + λh·ΔTDt+1·yj(t)·sgn(cj(t))·xi(t)
where λh is a number greater than zero that can be set by the user, ΔTDt+1 denotes the differential signal corresponding to the state space vector produced after the action executed at time t+1, and sgn(·) is the sign function.
As shown in Fig. 1, the differential signal calculation network unit 12 computes the differential signal ΔTDt from the fitted utility output by the utility fitting network unit 11 and the immediate reward function R(st) of the state. According to the temporal-difference algorithm, ΔTDt is computed iteratively as

ΔTDt = R(st) + γ·P̂(st+1) - P̂(st),

where R(st) is the immediate evaluation of the state st, i.e., the output of the immediate reward function, and γ is the discount coefficient, which can be set by the user. P̂(st+1) denotes the utility fitting value of the state space vector st+1 produced after the action executed at time t+1, and P̂(st) denotes the utility fitting value of the state space vector st produced after the action executed at time t.
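A minimal sketch of the differential signal and the resulting weight updates, continuing the previous sketch. The discount γ and the learning rates λ and λh are user-set constants; the values below are placeholders.

```python
import numpy as np

def td_error(reward, P_hat_next, P_hat, gamma=0.9):
    """Differential signal: dTD_t = R(s_t) + gamma * P_hat(s_{t+1}) - P_hat(s_t)."""
    return reward + gamma * P_hat_next - P_hat

def update_critic(A, B, C, x, y, dtd, lam=0.05, lam_h=0.05):
    """Weight updates of the utility fitting network driven by the differential signal dtd."""
    A_new = A + lam_h * dtd * np.outer(x, y * np.sign(C))   # a_ij update uses sgn(c_j(t))
    B_new = B + lam * dtd * x                                # b_i update
    C_new = C + lam * dtd * y                                # c_j update
    return A_new, B_new, C_new
```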
The computed differential signal ΔTDt is used to train and update the weight coefficients of the utility fitting network unit 11 and the confidence evaluation network unit 13. If the differential signal ΔTDt indicates a positive effect, the action should be reinforced and its confidence should be increased as well, i.e., the model should be more certain that this action should be selected. In addition, ΔTDt is used to update the weights of the action selection functions in the action decision network unit 14, so as to guarantee that the optimal action is selected.
As shown in Fig. 1, when the action decision network unit 14 outputs the action decision functions, the confidence evaluation network unit 13 computes the confidence of the output action; this confidence is used to adjust the action selection. The inputs of the confidence evaluation network unit 13 are the vectors xi(t) and yj(t), which are taken from the input layer and the hidden layer of the utility fitting network unit 11.
The confidence p0(t) is computed as

p0(t) = Σi αi(t)·xi(t) + Σj βj(t)·yj(t),
where the weights αi(t) and βj(t) are updated as follows:
αi(t+1) = αi(t) + λp·ΔTDt+1·xi(t), i = 1, 2, ..., n
βj(t+1) = βj(t) + λp·ΔTDt+1·yj(t), j = 1, 2, ..., h
where λp is the learning rate, a value between 0 and 1; its empirical value is 0.618, and the user may set it according to experience. The formula above does not guarantee that p0(t) lies in the interval [0, 1], so a Sigmoid function is introduced to transform p0(t) into p(t), so that the output confidence is consistent with a random-function probability.
The confidence correction factor a smooths the learning process. Changing a changes the range over which learning adjusts to the environment; if a is too large, the learning system loses its adjusting effect, so a suitable value of a should be set from prior knowledge, with a > 0. In the present invention, a is taken in the range [1, 10].
The adjustment of action selection by the confidence reflects the uncertainty of the decision. It can be seen that as the utility of the state gradually approaches its true value, i.e., as ΔTDt grows, the confidence p(t) also increases and the action selection becomes more and more certain. The output confidence p(t) is then used to correct each action selection function output by the action decision network unit 14; this correction is performed in the action correction network unit 15.
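A sketch of the confidence evaluation unit under the same conventions. The linear form of p0(t) and the update rules for αi and βj follow the description; the exact Sigmoid transform involving the correction factor a is not reproduced in the text, so the scaling 1/(1 + exp(-p0(t)/a)) used here is an assumption.

```python
import numpy as np

def confidence(x, y, alpha, beta, a=2.0):
    """Confidence p(t) of the current action decision.

    x, y        : input-layer and hidden-layer vectors taken from the utility fitting network
    alpha, beta : confidence weights alpha_i(t) and beta_j(t)
    a           : confidence correction factor (assumed to scale the Sigmoid)
    """
    p0 = x @ alpha + y @ beta               # p0(t): linear combination of x and y
    return 1.0 / (1.0 + np.exp(-p0 / a))    # squash into [0, 1]

def update_confidence_weights(alpha, beta, x, y, dtd, lam_p=0.618):
    """alpha_i and beta_j updates driven by the differential signal dtd."""
    return alpha + lam_p * dtd * x, beta + lam_p * dtd * y
```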
The action decision network unit 14 is implemented as a neural network with four layers, as shown in Fig. 1. From the first to the fourth layer these are: the input layer, the fuzzy subset layer, the variable node layer, and the function output layer; the variable node layer is also called the function fitting layer. Let h = 1, 2, 3, 4 index the four layers, and let the input and output of the i-th node of layer h be denoted Ii^h and Oi^h, where i indexes the nodes of each layer. The first layer has I nodes, the second layer has I*J nodes, the third layer has L nodes, and the fourth layer has K nodes, where I, J, K, and L are positive integers. The mean mij and the variance σij are, respectively, the position parameter and the width of the Gaussian membership function of the j-th second-layer node corresponding to the input xi(t).
The input of the input layer of the neural network in the action decision network unit 14 is xi(t), obtained by normalizing the state space vector st; it represents the robot's situation information at the input time. The input of the i-th node of the input layer is Ii^1 = xi(t).
The fuzzy subset layer fuzzifies the input variables of the input layer, and its output is the membership degree of each input vector. Each xi(t) of the input layer corresponds to J nodes in the fuzzy subset layer (J = 2 in Fig. 1); each of these nodes represents one fuzzy subset of xi(t), and its output is the membership degree of xi(t) in that fuzzy subset. The activation function of each node is a Gaussian membership function, so the output is

Oij^2 = exp( -(xi(t) - mij)² / σij² ),

where Oij^2 is the j-th second-layer output corresponding to the input xi(t), and exp is the exponential function with the natural base e.
To fit the action functions, the neural network must be able to adjust its output to a certain extent, and the variable node layer implements this adjustment. The variable node layer realizes it by changing the number of nodes and the connection weights; the number of nodes and the connection weights are optimized with a hierarchical genetic algorithm, which dynamically determines their number and values so that the network fits the action functions (details are given later). The activation function of the variable node layer is a Gaussian function with position parameter ml and width σl. The number of connections between the second and third layers is likewise not fixed and must be adjusted dynamically during optimization; these connection weights are all 1. The output Ol^3 of a third-layer node is obtained by applying its Gaussian activation function to the inputs it receives from the second layer.
The number of fourth-layer nodes equals the number of selectable actions, and the function output layer outputs the fitted value of each action function, from which the selection probability of each action is computed. Every third-layer node is connected to the fourth layer; ωlk is the connection weight between the l-th node of the third layer and the k-th node of the fourth layer, and ωlk also needs to be adjusted dynamically during optimization. The output of the k-th fourth-layer node is

Ok^4 = Σl ωlk·Ol^3,

and this output Ok^4 is the action selection function of the k-th action.
Suppose the first layer of the network has I inputs and the i-th input has ki fuzzy partitions in the second layer; then the second layer has k1 + k2 + ... + kI nodes in total, and each node function is the membership function of one input with respect to one of its fuzzy subsets. To summarize, the network structure that must be adjusted and optimized dynamically consists of the number of third-layer nodes and the number of connections between the second and third layers. The network parameters that must be adjusted and optimized are the positions mij and widths σij of the second-layer input membership functions, the position parameters ml and widths σl of the Gaussian activation functions of the third (hidden) layer, and the connection weights ωlk between the third and fourth layers.
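The forward pass through the four-layer decision network can be sketched as follows. The Gaussian membership functions, the unit weights between the second and third layers, and the third-to-fourth-layer weights ωlk follow the description; how a third-layer node aggregates its connected second-layer outputs (a sum, below) and all concrete dimensions are assumptions.

```python
import numpy as np

def action_network_forward(x, m2, s2, conn23, m3, s3, w34):
    """Forward pass of the four-layer fuzzy action decision network.

    x      : normalized state vector, shape (I,)
    m2, s2 : second-layer membership positions m_ij and widths sigma_ij, shape (I, J)
    conn23 : 0/1 connections between second- and third-layer nodes, shape (I*J, L)
    m3, s3 : third-layer Gaussian positions m_l and widths sigma_l, shape (L,)
    w34    : third-to-fourth-layer weights omega_lk, shape (L, K)
    Returns the K action selection function values O^4_k.
    """
    # Layer 2: Gaussian membership of each input in each of its fuzzy subsets
    o2 = np.exp(-((x[:, None] - m2) ** 2) / (s2 ** 2)).reshape(-1)
    # Layer 3: each variable node aggregates its connected memberships (sum assumed),
    # then applies its own Gaussian activation
    z3 = o2 @ conn23
    o3 = np.exp(-((z3 - m3) ** 2) / (s3 ** 2))
    # Layer 4: O^4_k = sum_l omega_lk * O^3_l
    return o3 @ w34
```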
Here, a hybrid hierarchical genetic algorithm is used to optimize the structure and parameters of the neural network in the action decision network. The structural optimization determines the number of third-layer nodes and the number of connections between the second and third layers. The parameter optimization covers the position parameters mij and widths σij of the input membership functions, the position parameters ml and widths σl of the Gaussian functions of the third-layer hidden nodes, and the connection weights ωlk between the third and fourth layers. The hierarchical genetic algorithm optimizes and adjusts the neural network so that, in every round of decision-making, the network is continually optimized according to changes in the input differential signal to yield the action selection functions, thereby realizing action selection.
The action correction network unit 15 uses the evaluation value output by the confidence evaluation network unit 13, i.e., the action confidence p(t), to correct the action selection functions output by the action decision network unit 14, then computes the selection probability of each action and outputs the action with the highest probability.
The correction process generates a random function with the network's action selection output as its mean and p(t) as its probability, and takes it as the new action selection function Aj(st). The smaller p(t) is, the farther Aj(st) lies from the network output; conversely, the larger p(t) is, the closer it lies. The new Aj(st) replaces the original action selection function.
The larger the value of the action selection function Aj(st), the greater the probability that the corresponding action aj is selected. The selection probabilities are computed from the values of the action selection functions, and the action with the highest probability value is output.
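A sketch of the correction and selection step. The text specifies that the corrected function Aj(st) is drawn around the network output with a spread governed by p(t) and that the most probable action is output; the Gaussian noise model and the softmax normalization used to turn the corrected values into probabilities are assumptions.

```python
import numpy as np

def correct_and_select(a_hat, p, rng=None):
    """Correct the action selection values with confidence p and pick an action.

    a_hat : action selection function values from the decision network, shape (K,)
    p     : confidence p(t) in [0, 1]; small p pushes A_j(s_t) further from a_hat
    """
    rng = rng or np.random.default_rng()
    # Assumed noise model: the spread shrinks as the confidence grows
    a_corrected = rng.normal(loc=a_hat, scale=(1.0 - p) * (np.abs(a_hat) + 1e-6))
    # Assumed normalization: softmax turns the corrected values into selection probabilities
    probs = np.exp(a_corrected - a_corrected.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs   # index of the most probable action, plus all probabilities
```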
In the robot behavior learning model, the action decision network unit 14 further comprises four sub-units: an encoding unit 141, a population initialization unit 142, a fitness function determination unit 143, and a genetic operation unit 144, as shown in Fig. 2.
The encoding unit 141 determines the chromosome structure used by the genetic algorithm. The hierarchical genetic algorithm is modeled on the hierarchical structure of biological chromosomes: the genes in a chromosome can be divided into regulatory genes and structural genes, and the role of a regulatory gene is to control whether structural genes are activated. Borrowing this characteristic of biological chromosomes, the optimization problem above is encoded as follows. Each individual in the population consists of two parts, which determine the structure and the parameters of the network. The gene structure of an individual uses a two-level hierarchical encoding, i.e., two layers that follow the gene hierarchy of biological chromosomes. The upper-layer genes encode the number of third-layer nodes and the second-layer input membership functions, that is, the number of third-layer nodes together with the parameters mij and σij of the second-layer input membership functions. As shown in Fig. 3, the part that controls the number of third-layer (hidden-layer) nodes is called the control genes. The lower layer consists of the parameter genes, which encode the membership functions of the third-layer (hidden-layer) nodes and the network connections, including the third-layer node membership-function parameters ml and σl, the connections between the second and third layers, and the connection weights ωlk between the third and fourth layers.
The control genes for the number of hidden nodes and the parameter genes that represent network connections are binary-coded, with "0" and "1" denoting "absent" and "present", respectively. The other genes, which represent membership-function parameters and connection weights, are real-valued, i.e., represented by real numbers. The third-layer structure is encoded as a binary string in which each bit represents one third-layer node and acts as a control gene: "1" means the node is active and "0" means it is not. The number of "1"s in the control gene string is therefore the actual number of active hidden-layer nodes of the neural network. Among the parameter genes, the genes for the connections between the second and third layers are binary-coded: "1" means the corresponding second-layer and third-layer nodes are connected, and "0" means they are not. The weight genes between the third and fourth layers are real-valued and represent the connection weights between the third and fourth layers.
It follows that the control genes control the number of nodes: if the control gene of a node is "0", the node has no connection to the preceding or following layer, and its corresponding parameter genes accordingly do not exist. The parameter genes are thus controlled by the control genes: if a node in the upper-layer control genes does not exist, the corresponding lower-layer parameter genes are not activated. This embodies the controlling role of the control genes, and this control corresponds to the topology of the network. The encoded chromosomes form a population, which is used to carry out the evolution.
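One possible in-code representation of the two-level chromosome described above. The field names and the array layout are illustrative; what is taken from the text is the division into binary control genes (one bit per candidate third-layer node), binary second-to-third-layer connection genes, and real-valued parameter genes for mij, σij, ml, σl, and ωlk.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Chromosome:
    control: np.ndarray   # binary control genes, one bit per candidate third-layer node ("1" = active)
    conn23: np.ndarray    # binary connection genes between the second and third layers
    m2: np.ndarray        # real-valued membership positions m_ij
    s2: np.ndarray        # real-valued membership widths sigma_ij
    m3: np.ndarray        # real-valued third-layer Gaussian positions m_l
    s3: np.ndarray        # real-valued third-layer Gaussian widths sigma_l
    w34: np.ndarray       # real-valued third-to-fourth-layer weights omega_lk

    def active_nodes(self) -> int:
        # the number of "1"s in the control genes is the actual hidden-node count
        return int(self.control.sum())

def random_chromosome(I, J, L_max, K, rng):
    """Random individual, as produced by the population initialization unit 142."""
    return Chromosome(
        control=rng.integers(0, 2, L_max),
        conn23=rng.integers(0, 2, (I * J, L_max)),
        m2=rng.normal(size=(I, J)), s2=rng.random((I, J)) + 0.1,
        m3=rng.normal(size=L_max), s3=rng.random(L_max) + 0.1,
        w34=rng.normal(size=(L_max, K)),
    )
```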
Further, the population initialization unit 142 initializes the chromosome population. For the genetic algorithm to run smoothly, a certain number of chromosome individuals must be generated beforehand; these individuals should be generated randomly and represent many possible network structures, i.e., the solution space should be sufficiently large. A suitable population size is important for the convergence of the genetic algorithm: if the population is too small it is difficult to obtain satisfactory results, and if it is too large the computation becomes complex. The population size is generally 10 to 160.
Further, the fitness function determination unit 143 determines the fitness function of the chromosomes. The fitness of an individual is expressed in terms of the individual error and the structural complexity, so that the complexity of the network is controlled while the individual error is optimized, thereby obtaining the optimal network structure. The fitness of the i-th individual is formed from its individual error E(i) and its structural complexity H(i), weighted by constants α and β that are greater than zero and satisfy α + β = 1, where

H(i) = 1 + exp[ -c·Ni(0) ].

E(i) measures the error between the j-th outputs of the i-th individual and the corresponding expected outputs yij; the expected output yij is the selection function of the expected action, so if a particular action is expected to be output, its expected value is set accordingly and all other expected action functions are set to 0. Ni(0) is the number of hidden-layer nodes of the i-th individual that are zero, c is a parameter adjustment factor, and b and c are constants. Using such a fitness function ensures that a suitable neural network structure is obtained while the network weights are optimized.
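A sketch of a fitness evaluation consistent with the description above: the individual error E(i) is taken here as the squared error between the outputs and the expected outputs, H(i) = 1 + exp[-c·Ni(0)], and the two terms are combined with weights α and β. The exact combination formula is not reproduced in the text, so the reciprocal form used below (higher fitness for lower error and lower complexity) is an assumption, as are the constant values.

```python
import numpy as np

def fitness(outputs, targets, n_zero_hidden, alpha=0.7, beta=0.3, b=1.0, c=0.1):
    """Fitness of one individual from its error and its structural complexity.

    outputs, targets : the individual's outputs and the expected outputs y_ij
    n_zero_hidden    : N_i(0), the number of inactive (zero) hidden nodes
    """
    E = np.sum((np.asarray(outputs) - np.asarray(targets)) ** 2)  # individual error (assumed squared error)
    H = 1.0 + np.exp(-c * n_zero_hidden)                          # structural complexity H(i)
    # Assumed combination: reward low error and low complexity, weighted by alpha and beta
    return alpha / (b + E) + beta / H
```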
Further, the genetic operation unit 144 performs the genetic operations, which include selection, crossover, and mutation. After selection, crossover, and mutation, the initial population has undergone one round of genetic operations and completed one round of evolution, yielding a new generation of offspring; this process is repeated so that evolution continues and the offspring converge to the optimum.
Selection chooses, according to the fitness of individuals and certain rules or methods, superior individuals from the previous generation to be inherited by the next generation. The algorithm uses elitist selection: according to fitness, the best individual of each generation is retained in the next generation, which guarantees the asymptotic convergence of the algorithm. For individual i, the selection probability is

pi = fi / Σj fj,

where fi is the fitness of individual i and N is the number of individuals in the population (the sum runs over j = 1, ..., N).
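A short sketch of the selection step: the best individual is always retained (elitism), and the remaining parents are drawn with probability pi = fi / Σj fj.

```python
import numpy as np

def select_parents(population, fitnesses, n_parents, rng):
    """Elitist, fitness-proportional selection of n_parents individuals."""
    fitnesses = np.asarray(fitnesses, dtype=float)
    probs = fitnesses / fitnesses.sum()                    # p_i = f_i / sum_j f_j
    best = int(np.argmax(fitnesses))                       # the elite individual is always kept
    rest = rng.choice(len(population), size=n_parents - 1, p=probs)
    return [population[best]] + [population[int(i)] for i in rest]
```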
Crossover randomly exchanges corresponding gene positions between two individuals. This process reflects random information exchange and is intended to produce new gene combinations, i.e., new individuals. When evolution has progressed to a certain point, and especially when most individuals in the population have become identical, crossover can no longer produce new individuals and only mutation can. Mutation changes gene positions with a certain probability in order to open up new search space; that is, mutation adds a global-optimization quality. Randomness plays an important role in crossover and mutation: only random crossover and mutation operations guarantee that new individuals appear, and this randomness is expressed through the crossover and mutation probabilities.
During the genetic operations, the crossover probability and the mutation probability have a great influence on the performance of the genetic algorithm. If, in the early stage of the genetic algorithm (GA) run, a large crossover probability and a small mutation probability are chosen, the convergence of the algorithm can be accelerated, which helps the search for the optimal solution. As the search progresses, however, the crossover probability must be reduced and the mutation probability increased, so that the algorithm does not easily fall into local extrema and can search for new solutions.
At the same time, the mutation probability must not be too large, otherwise the algorithm will have difficulty converging and will destroy the genes of the optimal solution. For solutions with high fitness, lower crossover and mutation probabilities are used so that they have a greater chance of entering the next generation; for solutions with low fitness, higher crossover and mutation probabilities are used so that they are eliminated as soon as possible. When premature convergence occurs, the crossover and mutation probabilities should be increased to accelerate the generation of new individuals. Following these principles, an adaptive scheme for the crossover probability pc and the mutation probability pm is adopted, in which pc and pm are computed from fmax, the maximum fitness in the population, favg, the average fitness, f, the larger fitness of the two individuals being crossed, and f′, the fitness of the individual being mutated.
When the evolution space is large, this method can quickly find the optimal solution; near convergence to a local optimum, it increases the diversity of the population. It can be seen that the mutation probability of the fittest individual is zero and that the crossover and mutation probabilities of individuals with high fitness are small, which protects good individuals, while individuals with low fitness have large crossover and mutation probabilities and are continually disrupted.
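The adaptive formulas themselves are not reproduced in the text, so the sketch below uses a widely used adaptive scheme that matches the behaviour just described (zero mutation probability for the fittest individual, small probabilities for above-average individuals, larger fixed probabilities for below-average ones); the constants k1 to k4 are assumptions.

```python
def adaptive_probs(f_cross, f_mut, f_max, f_avg, k1=0.9, k2=0.5, k3=0.9, k4=0.5):
    """Adaptive crossover probability pc and mutation probability pm.

    f_cross : f, the larger fitness of the two individuals selected for crossover
    f_mut   : f', the fitness of the individual selected for mutation
    """
    denom = max(f_max - f_avg, 1e-12)
    # Above-average individuals get probabilities that shrink as they approach f_max;
    # below-average individuals get fixed, larger probabilities
    pc = k1 * (f_max - f_cross) / denom if f_cross >= f_avg else k3
    pm = k2 * (f_max - f_mut) / denom if f_mut >= f_avg else k4
    return pc, pm
```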
Crossover is performed between the two selected individuals according to the crossover probability; the crossover operation acts on the corresponding parts of both the control genes and the parameter genes, as shown in Fig. 4. Such a crossover lets the corresponding genes of the two chromosomes be exchanged and also guarantees that binary-coded and real-coded genes are crossed at corresponding positions. Single-point crossover is used at corresponding positions of the two chromosomes: the same position is chosen at random in the two individuals, and the genes are exchanged at that position.
Mutation operates on all genes. For the control genes and the binary-coded genes among the parameter genes, bit mutation with logical negation is used, i.e., "1" becomes "0" and "0" becomes "1". For real-valued genes, a Gaussian mutation based on a linear combination is used, where α is the evolution rate, f is the fitness of each individual, and N(0, 1) is a normally distributed random function with mean 0 and standard deviation 1.
In summary, the steps of the hierarchical genetic algorithm for optimizing the neural network are as follows:
1. Encode the network structure and parameters according to the hierarchical structure and generate chromosome individuals.
2. Randomly generate an initial population of 2N chromosomes and set the evolution generation to t = 0.
3. Compute the fitness value of each individual and the maximum and average fitness of the population according to the formulas.
4. Select N individuals from the population as parents according to the individual selection probabilities, and set t = t + 1.
5. Randomly select two individuals from the parents and perform crossover according to the crossover probability. If crossover takes place, first copy the two individuals and keep the originals, then perform the crossover on the copies to produce two new individuals. Repeat until the whole parent population has been crossed.
6. Perform mutation on all individuals according to the mutation probability.
7. When the fitness of the best individual and the population fitness reach given thresholds, or the maximum number of generations is reached, the iteration of the algorithm has converged and the algorithm ends; otherwise return to step 3 and continue until the termination condition is met.
After the optimization, the network structure and parameters of the best individual are taken as the decision network and used to compute the action decisions.
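Putting the seven steps together, the optimization loop can be sketched as below. The operators are passed in as callables because their concrete form depends on the chromosome layout; the population bookkeeping (parents plus their offspring form the next generation) is an assumption about details the text leaves open.

```python
import numpy as np

def hga_optimize(init_individual, evaluate, crossover, mutate,
                 n=20, n_generations=100, fitness_threshold=None, seed=0):
    """Hierarchical-GA optimization loop following steps 1-7.

    init_individual(rng) -> individual      (steps 1-2: encoded random chromosome)
    evaluate(individual) -> fitness value   (step 3)
    crossover(a, b, rng) -> (child1, child2)
    mutate(individual, rng) -> individual
    """
    rng = np.random.default_rng(seed)
    population = [init_individual(rng) for _ in range(2 * n)]         # step 2: 2N individuals
    for _ in range(n_generations):
        fit = np.array([evaluate(ind) for ind in population])         # step 3
        if fitness_threshold is not None and fit.max() >= fitness_threshold:
            break                                                      # step 7: converged
        probs = fit / fit.sum()
        idx = rng.choice(len(population), size=n, p=probs)             # step 4: select N parents
        parents = [population[int(i)] for i in idx]
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):                 # step 5: crossover on copies
            offspring.extend(crossover(a, b, rng))
        pool = parents + offspring
        population = [mutate(ind, rng) for ind in pool]                # step 6: mutate all individuals
    return max(population, key=evaluate)                               # best network structure/parameters
```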
In the action decision network unit 14, the hierarchical genetic algorithm is used to optimize the structure and parameters of the network. After each new situation appears, the differential signal ΔTDt provided by the temporal-difference (TD) algorithm is first used to update the parameters of the action decision network, with the aim of obtaining more favorable candidate actions. Specifically, the differential signal ΔTDt is used to update, in every parameter gene of every chromosome in the population, the connection weights ωij between the i-th hidden node of the third layer and the j-th action selection function of the fourth layer, scaled by a weighting coefficient between 0 and 1 whose empirical value is 0.62; the genetic operations are then carried out. In this way the weight space corresponding to the action function is updated, and the new weights of the corresponding action obtained after the genetic operations should also be larger, reflecting the learning of this optimal action.
This embodiment trains the neural network with a hierarchical genetic algorithm to realize knowledge learning. It overcomes the reactive approaches based on specific knowledge or rules that dominate existing research on behavioral decision-making, and better solves the knowledge acquisition and reasoning problems of robot behavioral decision-making: the agent approaches completeness of knowledge by learning through interaction with the environment and has higher-level learning and reasoning abilities.
Fig. 5 is a schematic diagram of the online decision-making process in the second embodiment of the learning model of the present invention. After offline learning, the action decision network unit 14 finally obtained is optimal and is used for real-time online decision-making. The other units, namely the utility fitting network unit 11, the differential signal calculation network unit 12, the confidence evaluation network unit 13, and the action correction network unit 15, are removed in the online decision-making process and are no longer used. The action decision network unit 14 computes the output action selection functions from the state space vector st obtained after the selected action at is executed by the action execution unit 16, and outputs the finally selected action through the action selector; the state space vector obtained after this action is executed by the action execution unit 16 is then fed back to the action decision network unit 14.
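In the online phase only the optimized decision network remains, so each decision reduces to a single forward pass of that network followed by the action selector. A minimal sketch, assuming the selector simply picks the action with the largest selection-function value; the `decision_network` callable stands for the network obtained from offline learning.

```python
import numpy as np

def online_decide(state, decision_network):
    """One online decision step of the second embodiment.

    state            : normalized state space vector s_t
    decision_network : callable mapping a state vector to the K action selection values
    """
    action_values = decision_network(np.asarray(state, dtype=float))
    return int(np.argmax(action_values))   # index of the finally selected action
```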
This embodiment uses the trained neural network for real-time behavioral decision-making of the robot. The separation of the learning process from the decision-making process guarantees the efficiency of online decision-making and meets the needs of real-time operation.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN 201010564142 CN102063640B (en) | 2010-11-29 | 2010-11-29 | Robot Behavior Learning Model Based on Utility Difference Network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN102063640A CN102063640A (en) | 2011-05-18 |
| CN102063640B true CN102063640B (en) | 2013-01-30 |
Family
ID=43998910
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN 201010564142 Expired - Fee Related CN102063640B (en) | 2010-11-29 | 2010-11-29 | Robot Behavior Learning Model Based on Utility Difference Network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN102063640B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102402712B (en) * | 2011-08-31 | 2014-03-05 | 山东大学 | A Neural Network-Based Initialization Method for Robot Reinforcement Learning |
| CN107972026B (en) * | 2016-10-25 | 2021-05-04 | 河北亿超机械制造股份有限公司 | Robot, mechanical arm and control method and device thereof |
| CN108229640B (en) * | 2016-12-22 | 2021-08-20 | 山西翼天下智能科技有限公司 | Emotional expression method, device and robot |
| CN110705682B (en) * | 2019-09-30 | 2023-01-17 | 北京工业大学 | A system and method for predicting robot behavior based on multi-layer neural network |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5129039A (en) * | 1988-09-17 | 1992-07-07 | Sony Corporation | Recurrent neural network with variable size intermediate layer |
| JP3412700B2 (en) * | 1993-06-28 | 2003-06-03 | 日本電信電話株式会社 | Neural network type pattern learning method and pattern processing device |
| CN1372506A (en) * | 2000-03-24 | 2002-10-02 | 索尼公司 | Robotic device behavior determination method and robotic device |
Non-Patent Citations (1)
| Title |
|---|
| JP Patent 3412700B2, 2003.03.28 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN102063640A (en) | 2011-05-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Khan | Neural fuzzy based intelligent systems and applications | |
| CN104133372B (en) | Room temperature control algolithm based on fuzzy neural network | |
| CN113687654A (en) | A neural network training method and path planning method based on evolutionary algorithm | |
| CN101599138A (en) | Land evaluation method based on artificial neural network | |
| CN111203887A (en) | Robot control system optimization method based on NSGA-II fuzzy logic reasoning | |
| Khan et al. | Neufuz: Neural network based fuzzy logic design algorithms | |
| CN108594793A (en) | A kind of improved RBF flight control systems fault diagnosis network training method | |
| CN102063640B (en) | Robot Behavior Learning Model Based on Utility Difference Network | |
| CN113780664A (en) | Time sequence prediction method based on TDT-SSA-BP | |
| CN113112021A (en) | Inference algorithm of human-like behavior decision model | |
| Ansari et al. | Parameter tuning of MLP, RBF, and ANFIS models using genetic algorithm in modeling and classification applications | |
| Lee et al. | A genetic algorithm based robust learning credit assignment cerebellar model articulation controller | |
| CN115586801B (en) | Gas blending concentration control method based on improved fuzzy neural network PID | |
| CN114912589B (en) | Image identification method based on full-connection neural network optimization | |
| CN113485099B (en) | Online learning control method of nonlinear discrete time system | |
| CN110598835B (en) | Automatic path-finding method for trolley based on Gaussian variation genetic algorithm optimization neural network | |
| CN114925190A (en) | Mixed inference method based on rule inference and GRU neural network inference | |
| CN105512754A (en) | Conjugate prior-based single-mode distribution estimation optimization method | |
| Grosan et al. | Hybrid intelligent systems | |
| Ponnambalam et al. | Regulation of Great lakes reservoir systems by a neuro-fuzzy optimization model | |
| Lin et al. | System identification based on dynamical training for recurrent interval type-2 fuzzy neural network | |
| Almeida et al. | Automatically searching near-optimal artificial neural networks. | |
| Wei et al. | The optimizing of fuzzy control rule based on particle swarm optimization algorithms | |
| Kalaiselvi et al. | Impact of intelligent controller in a multiprocess system using artificial neural network—BPN |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C17 | Cessation of patent right | ||
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20130130 Termination date: 20131129 |