
CN119511739A - Training method and device for quadruped robot controller based on reinforcement learning - Google Patents

Training method and device for quadruped robot controller based on reinforcement learning Download PDF

Info

Publication number
CN119511739A
CN119511739A
Authority
CN
China
Prior art keywords
target
controller
control signal
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510089324.5A
Other languages
Chinese (zh)
Other versions
CN119511739B (en)
Inventor
付鑫
韩默渊
朱西硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
General Coal Research Institute Co Ltd
CCTEG Beijing Tianma Intelligent Control Technology Co Ltd
Original Assignee
General Coal Research Institute Co Ltd
CCTEG Beijing Tianma Intelligent Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by General Coal Research Institute Co Ltd, CCTEG Beijing Tianma Intelligent Control Technology Co Ltd filed Critical General Coal Research Institute Co Ltd
Priority to CN202510089324.5A priority Critical patent/CN119511739B/en
Publication of CN119511739A publication Critical patent/CN119511739A/en
Application granted granted Critical
Publication of CN119511739B publication Critical patent/CN119511739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 — Adaptive control systems as above, electric
    • G05B13/04 — Adaptive control systems as above, electric, involving the use of models or simulators
    • G05B13/042 — Adaptive control systems as above, in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract


The present application provides a training method, device, and electronic device for a quadruped robot controller based on reinforcement learning. The method comprises: training a first controller according to first sensor data, first elevation map data, and a first target control signal to obtain a trained first target controller; controlling the quadruped robot to execute a second target control signal in a simulation environment and determining second sensor data, second elevation map data, and RGB image data; constructing training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and training a second controller according to the training data to obtain a trained second target controller. By using RGB image data as the visual perception information, the controller adapts better to complex lighting and obstacle environments, so the quadruped robot can be controlled more accurately and efficiently.

Description

Training method and device for a quadruped robot controller based on reinforcement learning
Technical Field
The invention relates to the technical field of deep learning, in particular to a training method and device for a quadruped robot controller based on reinforcement learning, and to an electronic device.
Background
Quadruped robots have huge application potential in tasks such as inspection, rescue, and post-disaster exploration. To cope with the complex terrain of real environments, visual perception information usually has to be fused into the quadruped robot's control algorithm. In the related art, this information is commonly acquired with a depth camera or a lidar, but these sensors cannot always acquire it accurately, so the quadruped robot cannot be controlled accurately and flexibly.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art to some extent.
According to a first aspect of the application, a training method for a quadruped robot controller based on reinforcement learning is provided. The method comprises: controlling the quadruped robot to execute a first target control signal in a simulation environment and determining first sensor data and first elevation map data; training a first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network, and a first target controller network; controlling the quadruped robot to execute a second target control signal in the simulation environment and determining second sensor data, second elevation map data, and RGB image data; inputting the second sensor data, the second elevation map data, and the second target control signal into the first target controller to obtain the target position information of each joint of the quadruped robot; constructing training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and training a second controller according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network.
According to a second aspect of the application, a training device for a quadruped robot controller based on reinforcement learning is provided. The device comprises: a first determining module, configured to control the quadruped robot to execute a first target control signal in a simulation environment and determine first sensor data and first elevation map data; a first training module, configured to train a first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network, and a first target controller network; a second determining module, configured to control the quadruped robot to execute a second target control signal in the simulation environment and determine second sensor data, second elevation map data, and RGB image data; a first acquiring module, configured to input the second sensor data, the second elevation map data, and the second target control signal into the first target controller and acquire the target position information of each joint of the quadruped robot; a second acquiring module, configured to construct training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and a second training module, configured to train a second controller according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network.
A third aspect of the application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the training method of the reinforcement-learning-based quadruped robot controller according to the first aspect.
A fourth aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the reinforcement-learning-based quadruped robot controller of the first aspect.
A fifth aspect of the application provides a computer program product comprising a computer program which, when executed by a processor, implements the training method of the reinforcement-learning-based quadruped robot controller according to the first aspect.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
The training method of the quadruped robot controller based on reinforcement learning provided by the application uses RGB image data as the visual perception information, making the controller better suited to complex lighting and obstacle environments and able to acquire more ground information. The quadruped robot can therefore be controlled more accurately and efficiently, laying a solid foundation for the robot to execute subsequent tasks safely and smoothly.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic flowchart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application;
FIG. 2 is a flowchart of another training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application;
FIG. 3 is a flowchart of yet another training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of obtaining the predicted position information of each joint according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a training device of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The training method of the quadruped robot controller based on reinforcement learning of the present application will now be described in detail with reference to the following embodiments.
FIG. 1 is a schematic flowchart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in FIG. 1, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S101, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
The first target control signal includes a target speed and a target angular velocity.
In each simulation step t, the motion of the quadruped robot and its contact dynamics with the ground are simulated in the simulation environment, and the first sensor data and the first elevation map data produced while the quadruped robot executes the first target control signal are obtained.
The first sensor data include, but are not limited to, inertial measurement unit (IMU) data and joint encoder data.
It should be noted that the IMU data include, but are not limited to, the body velocity, the body angular velocity, and the body pose, where the body pose is represented as the projection of the gravity vector in the robot's own coordinate frame.
It should be noted that the joint encoder data include, but are not limited to, the joint angles and the joint angular velocities.
It should be noted that the specific manner of acquiring the first elevation map data is not limited in the present application, and alternatively, the first elevation map data may be acquired by a depth camera or a laser radar.
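As a rough sketch of this per-step data collection, the snippet below steps a stand-in simulator and records the sensor and elevation-map observations at each step. The `StubEnv` class, the observation dimensions (33 sensor values, 187 elevation cells), and the field names are all assumptions made for illustration, not details from the patent.

```python
class StubEnv:
    """Minimal stand-in for a physics simulator such as Isaac Lab.
    Real contact dynamics are replaced by zeroed readings so the sketch runs."""
    def step(self, cmd):
        sensor = [0.0] * 33   # IMU + joint-encoder readings (dimension assumed)
        hmap = [0.0] * 187    # elevation-map samples around the robot (assumed)
        return sensor, hmap

def collect_rollout(env, cmd, n_steps):
    """At each simulation step t, execute the target control signal and
    record the resulting sensor data and elevation-map data."""
    rollout = []
    for _ in range(n_steps):
        sensor, hmap = env.step(cmd)
        rollout.append({"sensor": sensor, "hmap": hmap, "cmd": cmd})
    return rollout

# Collect 100 steps under a fixed command (vx, vy, yaw rate).
rollout = collect_rollout(StubEnv(), cmd=(0.5, 0.0, 0.1), n_steps=100)
```

In a real setup, `StubEnv.step` would be replaced by the simulator's own stepping API and the recorded fields would feed the controller training described next.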
S102, training the first controller according to the first sensor data, the first elevation chart data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network.
In this embodiment, the first sensor data may be input to a first internal perception network in the first controller to obtain a first encoding vector, and the first elevation map data may be input to a first external perception network in the first controller to obtain a second encoding vector. The first target vector corresponding to the first target control signal, the first encoding vector, and the second encoding vector are concatenated to obtain a first spliced vector, which is input to the first controller network in the first controller to obtain the first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot. The first controller is then trained according to the actual control signal and the first target control signal to obtain the first target controller.
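The encode-concatenate-decode pipeline just described can be sketched as follows. The single linear layers, layer widths, and input dimensions are illustrative assumptions rather than the patent's actual network architecture.

```python
import numpy as np

class TeacherController:
    """Sketch of the first controller: an internal perception network encodes
    proprioceptive sensor data, an external perception network encodes the
    elevation map, and a controller network maps the concatenation of the
    command and both encodings to 12 joint position targets."""

    def __init__(self, sensor_dim=33, hmap_dim=187, cmd_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_int = rng.standard_normal((sensor_dim, 32)) * 0.1        # internal perception net
        self.w_ext = rng.standard_normal((hmap_dim, 64)) * 0.1          # external perception net
        self.w_ctrl = rng.standard_normal((cmd_dim + 32 + 64, 12)) * 0.1  # controller net

    def forward(self, sensor, hmap, cmd):
        z_int = np.tanh(sensor @ self.w_int)     # first encoding vector
        z_ext = np.tanh(hmap @ self.w_ext)       # second encoding vector
        z = np.concatenate([cmd, z_int, z_ext])  # first spliced vector
        return z @ self.w_ctrl                   # target positions for 12 joints

ctrl = TeacherController()
q = ctrl.forward(np.zeros(33), np.zeros(187), np.array([0.5, 0.0, 0.1]))
```

The 12-dimensional output matches the 12-degree-of-freedom example used later in the text; a trained network would of course use learned rather than random weights.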
S103, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
In the related art, visual perception information is obtained with a depth camera or a lidar, and both have limitations. A depth camera easily fails under strong or low light, especially in highly reflective or shadowed regions, where its output is often inaccurate; smooth, transparent, or highly reflective materials prevent it from acquiring surface depth accurately, producing large errors; and in environments filled with dust, smoke, or water mist, such as mines or fire-rescue scenes, its signal is easily disturbed, causing blurred images or incomplete depth data. A lidar has a limited detection range and easily produces blind zones in cluttered or complex environments, and its beam is affected when passing obstacles, causing ranging errors. Moreover, while a depth camera and a lidar can capture the geometric structure of the terrain, they cannot evaluate its physical and mechanical properties: when the quadruped robot walks on grass or slippery ground, parameters such as the friction coefficient and hardness of the ground directly affect its motion and stability, and for soft or unstable ground such as sand or snow, the existing sensing means cannot identify the risk of terrain collapse, so the robot may mistakenly enter a dangerous area and fail its task. To compensate for the limitations of a single sensor, information from multiple sensors such as depth cameras and lidars usually has to be fused, which increases the complexity and computational burden of the system; in real-time applications in particular, data-processing delays can make the quadruped robot respond too slowly. Finally, the hardware cost of depth cameras and especially high-precision lidars is high, making large-scale deployment difficult in some application scenarios.
In this embodiment, a camera on the quadruped robot can be used to collect Red-Green-Blue (RGB) image data, directly imitating human visual perception so as to adapt to more complex lighting and obstacle environments while acquiring more ground information.
Optionally, because RGB image data need to be acquired, a physics engine with rendering capability may be used as the basis for building the simulation environment.
For example, Isaac Lab may be used, or a physics engine combined with a rendering engine, such as MuJoCo (Multi-Joint dynamics with Contact) plus Unreal Engine.
Wherein the second elevation map data and the RGB image data correspond to the same time.
S104, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
In this embodiment, the second sensor data may be input to the first target internal perception network to obtain an internal encoding vector, and the second elevation map data may be input to the first target external perception network to obtain an external encoding vector. The target vector corresponding to the second target control signal, the internal encoding vector, and the external encoding vector are concatenated into a spliced vector, which is input to the first target controller network to output the target position information of each joint of the quadruped robot.
For example, taking a quadruped robot with 12 degrees of freedom as an example, the target position information of each of its 12 joints is output.
S105, training data is constructed according to the target position information, RGB image data, second sensor data and second target control signals of each joint.
In this embodiment, the first target controller takes as input the elevation map data computed from depth-camera or lidar data, whereas the second controller takes as input the RGB image data acquired by an RGB camera. The output of the first target controller is used as the label, and the second controller with RGB image data as input is trained by imitation learning.
In this embodiment, the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal may be associated to construct the training data.
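A minimal sketch of this association step is shown below: each student input tuple (RGB image, sensor data, control signal) is paired with the teacher's joint-position label. All field names and shapes are illustrative assumptions.

```python
import numpy as np

def build_training_data(rollout):
    """Associate each step's (RGB image, sensor data, command) with the
    first target controller's joint-position output, yielding supervised
    pairs for training the second controller."""
    inputs, labels = [], []
    for step in rollout:
        inputs.append((step["rgb"], step["sensor"], step["cmd"]))
        labels.append(step["q_teacher"])  # teacher output used as the label
    return inputs, np.stack(labels)

# Tiny synthetic rollout: two steps, 12 joint targets each (shapes assumed).
demo = [{"rgb": np.zeros((64, 64, 3)), "sensor": np.zeros(33),
         "cmd": np.array([0.5, 0.0, 0.1]), "q_teacher": np.zeros(12)}
        for _ in range(2)]
inputs, labels = build_training_data(demo)
```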
S106, training the second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network.
In the embodiment of the application, after the training data is obtained, the second controller to be trained can be trained according to the training data, so as to obtain the second target controller after the training is completed.
In the training method of the quadruped robot controller based on reinforcement learning provided by this embodiment, the quadruped robot is controlled to execute a first target control signal in a simulation environment, and first sensor data and first elevation map data are determined; the first controller is trained according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller comprising a first target internal perception network, a first target external perception network, and a first target controller network; the quadruped robot is then controlled to execute a second target control signal in the simulation environment, and second sensor data, second elevation map data, and RGB image data are determined; the second sensor data, the second elevation map data, and the second target control signal are input to the first target controller to obtain the target position information of each joint of the quadruped robot; training data are constructed from the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and the second controller is trained according to the training data to obtain a trained second target controller. Because RGB image data serve as the visual perception information, the controller adapts better to complex lighting and obstacle environments and acquires more ground information, so the quadruped robot can be controlled more accurately and efficiently.
FIG. 2 is a flowchart of another training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in FIG. 2, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S201, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
Any implementation manner of the embodiments of the present application may be adopted for this step S201, and will not be described herein.
S202, inputting first sensor data to a first internal sensing network in a first controller to obtain a first coding vector, and inputting first elevation map data to a first external sensing network in the first controller to obtain a second coding vector.
In this embodiment, after the first sensor data and the first elevation map data are acquired, the first sensor data are input to the first internal perception network, which encodes them to obtain the first encoding vector, and the first elevation map data are input to the first external perception network, which encodes them to obtain the second encoding vector.
S203, splicing the first target vector, the first coding vector and the second coding vector corresponding to the first target control signal to obtain a first spliced vector, and inputting the first spliced vector into a first controller network in a first controller to obtain first prediction position information of each joint of the quadruped robot and actual control signals of the quadruped robot.
S204, training the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
In this embodiment, the reward function of the first controller can be obtained from the actual control signal and the first target control signal; the parameters of the first controller are adjusted according to the reward function until it meets the training-end condition, and the first controller after the last parameter adjustment is determined to be the first target controller.
Optionally, when the reward value of the first controller reaches a preset threshold, the reward function is determined to meet the training-end condition.
In this embodiment, the target speed and the target angular velocity can be obtained from the first target control signal, and the actual speed, the actual angular velocity, and the joint torques can be obtained from the actual control signal. A speed reward function is determined from the target speed and the actual speed, and its speed reward weight is obtained; an angular-velocity reward function is determined from the target angular velocity and the actual angular velocity, and its angular-velocity reward weight is obtained; a torque reward function is determined from the actual torques, and its torque reward weight is obtained. The reward function of the first controller is then obtained from the speed reward function and its weight, the angular-velocity reward function and its weight, and the torque reward function and its weight.
For example, the reward function of the first controller may be determined according to the following formula:
r = w_v * r_v + w_w * r_w + w_tau * r_tau
where r is the reward function of the first controller, r_v is the speed reward function and w_v the speed reward weight, r_w is the angular-velocity reward function and w_w the angular-velocity reward weight, and r_tau is the torque reward function and w_tau the torque reward weight.
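One possible concrete form of this weighted-sum reward is sketched below. The exponential tracking kernels and the example weight values are assumptions borrowed from common legged-robot reinforcement-learning practice; the patent does not specify the functional forms.

```python
import numpy as np

def controller_reward(v_cmd, v, w_cmd, w, tau,
                      k_v=1.0, k_w=0.5, k_tau=-2e-4):
    """Reward = k_v * r_v + k_w * r_w + k_tau * r_tau: a speed-tracking
    term, an angular-velocity-tracking term, and a torque term, each
    with its own weight (all kernel shapes and weights assumed)."""
    r_v = np.exp(-np.sum((v_cmd - v) ** 2) / 0.25)  # speed reward function
    r_w = np.exp(-((w_cmd - w) ** 2) / 0.25)        # angular-velocity reward function
    r_tau = float(np.sum(tau ** 2))                 # torque magnitude term
    return k_v * r_v + k_w * r_w + k_tau * r_tau

# Perfect tracking with zero torque yields k_v + k_w = 1.5.
r = controller_reward(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                      0.2, 0.2, np.zeros(12))
```

With a negative torque weight, larger joint torques lower the reward, which is a common way to encourage energy-efficient gaits.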
S205, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
S206, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
S207, training data is constructed according to the target position information, RGB image data, second sensor data and second target control signals of each joint.
Any implementation manner of each embodiment of the present application may be adopted for the steps S205-S207, and will not be described herein.
S208, inputting the RGB image data, the second sensor data and the second target control signal into a second controller to obtain the predicted position information of each joint of the quadruped robot.
In this embodiment, as shown in FIG. 3, the second sensor data may be input to a second internal perception network in the second controller to obtain a third encoding vector, and the RGB image data may be input to a second external perception network in the second controller to obtain a fourth encoding vector. The second target vector corresponding to the second target control signal, the third encoding vector, and the fourth encoding vector are concatenated into a second spliced vector, which is input to a second controller network in the second controller to obtain the predicted position information of each joint of the quadruped robot.
S209, training the second controller according to the predicted position information of each joint and the target position information of each joint to obtain a trained second target controller.
In this embodiment, a loss function can be obtained from the predicted position information of each joint and the target position information of each joint; the parameters of the second controller are adjusted according to the loss function until it meets the training-end condition, and the second controller after the last parameter adjustment is determined to be the second target controller.
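The loss-driven parameter adjustment can be sketched as a small imitation-learning loop. The linear student model, the feature and label sizes, and the learning rate are assumptions chosen so the sketch stays self-contained; a real second controller would be the deep network described above.

```python
import numpy as np

def train_student(X, Y, lr=0.1, tol=1e-4, max_iter=5000):
    """Fit a linear student policy to the teacher's joint-position labels
    by gradient descent on the mean-squared error, stopping once the
    loss meets the training-end threshold."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((X.shape[1], Y.shape[1])) * 0.01
    loss = float("inf")
    for _ in range(max_iter):
        err = X @ W - Y                     # predicted minus target joint positions
        loss = float(np.mean(err ** 2))     # MSE loss function
        if loss < tol:                      # training-end condition
            break
        W -= lr * (2 / len(X)) * X.T @ err  # adjust the parameters
    return W, loss

# Synthetic demo: labels generated by a hidden linear teacher.
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 8))            # encoded (image, sensor, command) features
Y = X @ rng.standard_normal((8, 12))        # teacher's 12 joint-position labels
W, final_loss = train_student(X, Y)
```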
For example, when the loss function value reaches a preset threshold, it is determined that the loss function satisfies the training end condition.
After the trained second target controller is obtained, the quadruped robot may be controlled based on the second target controller.
For example, as shown in FIG. 4, the sensor data, RGB image data, and target speed of the quadruped robot are input to the second target controller, which outputs the position information of each joint, and the quadruped robot is controlled based on this position information.
In the training method of the quadruped robot controller based on reinforcement learning provided by this embodiment, the quadruped robot is controlled to execute a first target control signal in a simulation environment, and first sensor data and first elevation map data are determined. The first sensor data are input to a first internal perception network in the first controller to obtain a first encoding vector, and the first elevation map data are input to a first external perception network in the first controller to obtain a second encoding vector; the first target vector corresponding to the first target control signal, the first encoding vector, and the second encoding vector are concatenated into a first spliced vector, which is input to the first controller network to obtain the first predicted position information of each joint and the actual control signal of the quadruped robot. The first controller is trained according to the actual control signal and the first target control signal to obtain the first target controller, which then produces the target position information of each joint from the second sensor data, the second elevation map data, and the second target control signal. The RGB image data, the second sensor data, and the second target control signal are input to the second controller to obtain the predicted position information of each joint, and the second controller is trained according to the predicted position information and the target position information of each joint to obtain the trained second target controller. By using RGB image data as the visual perception information, the controller adapts better to complex lighting and obstacle environments and acquires more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the robot to execute subsequent tasks safely and smoothly.
To implement the above embodiments, the present application provides a training device for a reinforcement-learning-based quadruped robot controller. Fig. 5 is a schematic structural diagram of the training device of the quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 5, the training device 1000 of the reinforcement learning-based quadruped robot controller includes a first determining module 110, a first training module 120, a second determining module 130, a first acquiring module 140, a second acquiring module 150, and a second training module 160.
A first determining module 110, configured to control the quadruped robot to execute a first target control signal in a simulation environment, and determine first sensor data and first elevation map data;
The first training module 120 is configured to train the first controller according to the first sensor data, the first elevation map data, and the first target control signal, so as to obtain a trained first target controller, where the first target controller includes a first target internal perception network, a first target external perception network, and a first target controller network;
A second determining module 130, configured to control the quadruped robot to execute a second target control signal in the simulation environment, and determine second sensor data, second elevation map data, and RGB image data;
A first obtaining module 140, configured to input the second sensor data, the second elevation map data, and the second target control signal to a first target controller, and obtain target position information of each joint of the quadruped robot;
a second obtaining module 150, configured to construct training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal;
And the second training module 160 is configured to train the second controller to be trained according to the training data, so as to obtain a trained second target controller, where the second target controller includes a second target internal perception network, a second target external perception network, and a second target controller network.
According to an embodiment of the present application, the first training module 120 is further configured to input the first sensor data into a first internal perception network in the first controller to obtain a first encoding vector, input the first elevation map data into a first external perception network in the first controller to obtain a second encoding vector, concatenate a first target vector corresponding to the first target control signal, the first encoding vector, and the second encoding vector to obtain a first concatenated vector, input the first concatenated vector into a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and an actual control signal of the quadruped robot, and train the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
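The encode-concatenate-decode structure of the first (teacher) controller can be sketched as below. All layer sizes and the single-layer `tanh` networks are hypothetical; the patent does not disclose the architectures.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; the patent does not fix them.
N_SENSOR, N_ELEV, D_ENC, N_CMD, N_JOINTS = 33, 187, 32, 3, 12

W_int = rng.normal(0.0, 0.1, (D_ENC, N_SENSOR))            # first internal perception network
W_ext = rng.normal(0.0, 0.1, (D_ENC, N_ELEV))              # first external perception network
W_ctl = rng.normal(0.0, 0.1, (N_JOINTS, N_CMD + 2 * D_ENC))  # first controller network

def teacher_forward(sensor_data, elevation_map, target_vector):
    """Teacher forward pass: two perception encodings are concatenated
    with the target vector, then decoded into joint positions."""
    e1 = np.tanh(W_int @ sensor_data)         # first encoding vector
    e2 = np.tanh(W_ext @ elevation_map)       # second encoding vector
    concat = np.concatenate([target_vector, e1, e2])  # first concatenated vector
    return np.tanh(W_ctl @ concat)            # predicted joint positions

joints = teacher_forward(rng.normal(size=N_SENSOR),
                         rng.normal(size=N_ELEV),
                         np.array([0.5, 0.0, 0.0]))
print(joints.shape)                           # (12,)
```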
In one embodiment of the present application, the first training module 120 is further configured to obtain a reward function of the first controller according to the actual control signal and the first target control signal, adjust a parameter of the first controller according to the reward function until the reward function meets a training end condition, and determine the first controller after the last adjustment of the parameter as the first target controller.
In one embodiment of the present application, the first training module 120 is further configured to obtain a target speed and a target angular speed from the first target control signal, obtain an actual speed, an actual angular speed, and an actual torque from the actual control signal, determine a speed reward function according to the target speed and the actual speed, and obtain a speed reward weight of the speed reward function, determine an angular speed reward function according to the target angular speed and the actual angular speed, and obtain an angular speed reward weight of the angular speed reward function, determine a torque reward function according to the actual torque, and obtain a torque reward weight of the torque reward function, and obtain a reward function of the first controller according to the speed reward function and the speed reward weight, the angular speed reward function and the angular speed reward weight, the torque reward function, and the torque reward weight.
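A weighted combination of the three reward terms might look like the sketch below. The exponential tracking terms and the torque penalty are common choices in legged-robot reinforcement learning; the patent does not disclose the exact functional forms or weights, so all of them here are assumptions.

```python
import numpy as np

def reward(actual, target, w_v=1.0, w_w=0.5, w_tau=-0.001):
    """Weighted sum of speed, angular-velocity, and torque reward terms.
    Tracking terms peak at 1.0 for perfect tracking; the negative torque
    weight penalizes large joint torques."""
    r_v = np.exp(-np.sum((target["v"] - actual["v"]) ** 2))   # speed reward
    r_w = np.exp(-(target["w"] - actual["w"]) ** 2)           # angular-velocity reward
    r_tau = np.sum(actual["tau"] ** 2)                        # torque magnitude term
    return w_v * r_v + w_w * r_w + w_tau * r_tau

target = {"v": np.array([0.5, 0.0]), "w": 0.0}
actual = {"v": np.array([0.5, 0.0]), "w": 0.0, "tau": np.zeros(12)}
print(reward(actual, target))   # perfect tracking, zero torque -> 1.5
```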
In one embodiment of the present application, the second obtaining module 150 is further configured to correlate the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal to construct training data.
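The association step can be read as aligning the per-step records into supervised tuples: the RGB image, sensor data, and target control signal form the input, and the teacher's joint position information is the label. A minimal sketch, with hypothetical field names:

```python
def build_training_data(joint_targets, rgb_images, sensor_data, commands):
    """Associate per-step records into supervised training tuples:
    inputs (RGB, sensors, command) labeled with the teacher's joint targets."""
    assert len(joint_targets) == len(rgb_images) == len(sensor_data) == len(commands)
    return [
        {"rgb": img, "sensors": s, "command": c, "label": q}
        for q, img, s, c in zip(joint_targets, rgb_images, sensor_data, commands)
    ]

data = build_training_data([[0.1] * 12] * 3,          # teacher joint targets
                           ["img0", "img1", "img2"],  # RGB frames (placeholders)
                           [[0.0] * 33] * 3,          # second sensor data
                           [[0.5, 0.0, 0.0]] * 3)     # second target control signals
print(len(data))        # 3
print(data[0]["rgb"])   # img0
```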
In one embodiment of the present application, the second training module 160 is further configured to input the RGB image data, the second sensor data, and the second target control signal to the second controller, obtain the predicted position information of each joint of the quadruped robot, and train the second controller according to the predicted position information of each joint and the target position information of each joint, to obtain a trained second target controller.
In one embodiment of the present application, the second training module 160 is further configured to obtain a loss function according to the predicted position information of each joint and the target position information of each joint, adjust the parameters of the second controller according to the loss function until the loss function meets the training ending condition, and determine the second controller after the last adjustment of the parameters as the second target controller.
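The loss-driven training loop described here is a supervised (student-distillation) fit of the second controller to the teacher's joint targets. The sketch below uses a toy linear student, synthetic labels, a mean-squared-error loss, and plain gradient descent; the real model, loss form, optimizer, and stopping threshold are not specified by the patent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear student: features -> joint positions (hypothetical sizes).
N_FEAT, N_JOINTS = 20, 12
W = np.zeros((N_JOINTS, N_FEAT))

# Synthetic labels from a hidden linear map, standing in for the
# first target controller's joint-position outputs.
W_true = rng.normal(size=(N_JOINTS, N_FEAT))
X = rng.normal(size=(256, N_FEAT))
Y = X @ W_true.T

lr, tol = 0.01, 1e-4
for step in range(10000):
    pred = X @ W.T
    err = pred - Y
    loss = np.mean(err ** 2)              # MSE between predicted and target joints
    if loss < tol:                        # training end condition
        break
    W -= lr * (2 / len(X)) * err.T @ X    # gradient step on the loss

print(loss < tol)                         # True once the end condition is met
```

The controller whose parameters produced the final loss below the threshold plays the role of the "second controller after the last parameter adjustment".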
The second training module 160 is further configured to input the second sensor data into a second internal perception network in the second controller to obtain a third encoding vector, input the RGB image data into a second external perception network in the second controller to obtain a fourth encoding vector, concatenate a second target vector corresponding to the second target control signal, the third encoding vector, and the fourth encoding vector to obtain a second concatenated vector, and input the second concatenated vector into a second controller network in the second controller to obtain the predicted position information of each joint of the quadruped robot.
The training device of the reinforcement-learning-based quadruped robot controller provided by the application controls the quadruped robot to execute a first target control signal in a simulation environment and determines first sensor data and first elevation map data; trains the first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller, where the first target controller comprises a first target internal perception network, a first target external perception network, and a first target controller network; controls the quadruped robot to execute a second target control signal in the simulation environment and determines second sensor data, second elevation map data, and RGB image data; inputs the second sensor data, the second elevation map data, and the second target control signal into the first target controller to acquire the target position information of each joint of the quadruped robot; constructs training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and trains the second controller to be trained according to the training data to obtain a trained second target controller, where the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network. By using RGB image data as the visual perception information, the device is better suited to environments with complex illumination and obstacles, acquires more ground information, controls the quadruped robot more accurately and efficiently, reduces the complexity and cost of acquiring visual perception information, and lays a solid foundation for the quadruped robot to subsequently execute tasks safely and smoothly.
To implement the above embodiments, the present application also proposes an electronic device 2000, as shown in fig. 6, including a memory 210, a processor 220, and a computer program stored in the memory 210 and executable on the processor 220, where the processor, when executing the program, implements the training method of the reinforcement-learning-based quadruped robot controller according to the first aspect.
To implement the above embodiments, the present application proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the training method of the reinforcement-learning-based quadruped robot controller according to the first aspect.
To implement the above embodiments, the present application also proposes a computer program product comprising a computer program which, when executed by a processor, implements the training method of the reinforcement-learning-based quadruped robot controller according to the first aspect.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (10)

1. A reinforcement-learning-based training method for a quadruped robot controller, characterized in that the method comprises: controlling a quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data; training a first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network, and a first target controller network; controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data, and RGB image data; inputting the second sensor data, the second elevation map data, and the second target control signal into the first target controller to obtain target position information of each joint of the quadruped robot; constructing training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and training a second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network.

2. The method according to claim 1, characterized in that training the first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain the trained first target controller comprises: inputting the first sensor data into a first internal perception network in the first controller to obtain a first encoding vector, and inputting the first elevation map data into a first external perception network in the first controller to obtain a second encoding vector; concatenating a first target vector corresponding to the first target control signal, the first encoding vector, and the second encoding vector to obtain a first concatenated vector, and inputting the first concatenated vector into a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and an actual control signal of the quadruped robot; and training the first controller according to the actual control signal and the first target control signal to obtain the first target controller.

3. The method according to claim 2, characterized in that training the first controller according to the actual control signal and the first target control signal to obtain the first target controller comprises: obtaining a reward function of the first controller according to the actual control signal and the first target control signal; and adjusting parameters of the first controller according to the reward function until the reward function meets a training end condition, and determining the first controller after the last parameter adjustment as the first target controller.

4. The method according to claim 3, characterized in that obtaining the reward function of the first controller according to the actual control signal and the first target control signal comprises: obtaining a target speed and a target angular velocity from the first target control signal, and obtaining an actual speed, an actual angular velocity, and an actual torque from the actual control signal; determining a speed reward function according to the target speed and the actual speed, and obtaining a speed reward weight of the speed reward function; determining an angular-velocity reward function according to the target angular velocity and the actual angular velocity, and obtaining an angular-velocity reward weight of the angular-velocity reward function; determining a torque reward function according to the actual torque, and obtaining a torque reward weight of the torque reward function; and obtaining the reward function of the first controller according to the speed reward function and the speed reward weight, the angular-velocity reward function and the angular-velocity reward weight, and the torque reward function and the torque reward weight.

5. The method according to claim 1, characterized in that constructing the training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal comprises: associating the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal to construct the training data.

6. The method according to claim 5, characterized in that training the second controller to be trained according to the training data to obtain the trained second target controller comprises: inputting the RGB image data, the second sensor data, and the second target control signal into the second controller to obtain predicted position information of each joint of the quadruped robot; and training the second controller according to the predicted position information of each joint and the target position information of each joint to obtain the trained second target controller.

7. The method according to claim 6, characterized in that training the second controller according to the predicted position information of each joint and the target position information of each joint to obtain the trained second target controller comprises: obtaining a loss function according to the predicted position information of each joint and the target position information of each joint; and adjusting parameters of the second controller according to the loss function until the loss function meets a training end condition, and determining the second controller after the last parameter adjustment as the second target controller.

8. The method according to claim 6, characterized in that inputting the RGB image data, the second sensor data, and the second target control signal into the second controller to obtain the predicted position information of each joint of the quadruped robot comprises: inputting the second sensor data into a second internal perception network in the second controller to obtain a third encoding vector, and inputting the RGB image data into a second external perception network in the second controller to obtain a fourth encoding vector; and concatenating a second target vector corresponding to the second target control signal, the third encoding vector, and the fourth encoding vector to obtain a second concatenated vector, and inputting the second concatenated vector into a second controller network in the second controller to obtain the predicted position information of each joint of the quadruped robot.

9. A reinforcement-learning-based training device for a quadruped robot controller, characterized in that the device comprises: a first determination module, configured to control a quadruped robot to execute a first target control signal in a simulation environment, and determine first sensor data and first elevation map data; a first training module, configured to train a first controller according to the first sensor data, the first elevation map data, and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network, and a first target controller network; a second determination module, configured to control the quadruped robot to execute a second target control signal in the simulation environment, and determine second sensor data, second elevation map data, and RGB image data; a first acquisition module, configured to input the second sensor data, the second elevation map data, and the second target control signal into the first target controller to acquire target position information of each joint of the quadruped robot; a second acquisition module, configured to construct training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal; and a second training module, configured to train a second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network, and a second target controller network.

10. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to any one of claims 1-8.
CN202510089324.5A 2025-01-21 2025-01-21 Training method and device of four-foot robot controller based on reinforcement learning Active CN119511739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510089324.5A CN119511739B (en) 2025-01-21 2025-01-21 Training method and device of four-foot robot controller based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510089324.5A CN119511739B (en) 2025-01-21 2025-01-21 Training method and device of four-foot robot controller based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN119511739A true CN119511739A (en) 2025-02-25
CN119511739B CN119511739B (en) 2025-04-25

Family

ID=94666726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510089324.5A Active CN119511739B (en) 2025-01-21 2025-01-21 Training method and device of four-foot robot controller based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN119511739B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9008840B1 (en) * 2013-04-19 2015-04-14 Brain Corporation Apparatus and methods for reinforcement-guided supervised learning
KR20220065232A (en) * 2020-11-13 2022-05-20 주식회사 플라잎 Apparatus and method for controlling robot based on reinforcement learning
CN115546547A (en) * 2022-10-11 2022-12-30 南京理工大学 A Quadruped Robot Motion Control Method Based on Spiking Neural Network in Complex Environment
CN116627041A (en) * 2023-07-19 2023-08-22 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN118818968A (en) * 2024-05-27 2024-10-22 浙江大学 A quadruped robot motion control method based on deep reinforcement learning
WO2025011165A1 (en) * 2023-07-12 2025-01-16 腾讯科技(深圳)有限公司 Control method and apparatus for legged robot, and legged robot and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Qin; LI Yueyang; ZHAO Qinjun: "Research Progress on Key Technologies for Field Navigation of Quadruped Robots", Journal of University of Jinan (Natural Science Edition), no. 05, 6 June 2016 (2016-06-06) *

Also Published As

Publication number Publication date
CN119511739B (en) 2025-04-25

Similar Documents

Publication Publication Date Title
CN115200588B (en) SLAM autonomous navigation method and device for mobile robot
JP7351079B2 (en) Control device, control method and program
CN111958591A (en) Autonomous inspection method and system for semantic intelligent substation inspection robot
KR101956447B1 (en) Method and apparatus for position estimation of unmanned vehicle based on graph structure
CN112097769B (en) Homing pigeon brain-hippocampus-imitated unmanned aerial vehicle simultaneous positioning and mapping navigation system and method
CN111338383B (en) GAAS-based autonomous flight method and system, and storage medium
CN113010958B (en) Simulation system of self-propelled ship and operation method thereof
CN113325837A (en) Control system and method for multi-information fusion acquisition robot
KR101423139B1 (en) Method for localization and mapping using 3D line, and mobile body thereof
CN109855616B (en) A Multi-sensor Robot Navigation Method Based on Virtual Environment and Reinforcement Learning
CN115182747B (en) Automatic tunnel crack repairing method, device and system and readable storage medium
JP2016024598A (en) Control method of autonomous mobile device
CN111198513A (en) Equipment control system
CN113703462A (en) Unknown space autonomous exploration system based on quadruped robot
CN114995468B (en) Intelligent control method of underwater robot based on Bayesian depth reinforcement learning
KR101406176B1 (en) Apparatus and Method for Estimating the Position of Underwater Robot
CN114815851A (en) Robot following method, device, electronic device and storage medium
CN114964268A (en) Unmanned aerial vehicle navigation method and device
CN114571460A (en) Robot control method, device and storage medium
CN119958563A (en) Multi-UAV navigation method, device, equipment and medium based on artificial intelligence
CN119511739B (en) Training method and device of four-foot robot controller based on reinforcement learning
CN115290090A (en) SLAM map construction method based on multi-sensor information fusion
Shang et al. Indoor testing and simulation platform for close-distance visual inspection of complex structures using micro quadrotor UAV
CN114820595B (en) Method and related components for quadruped robot collaborative unmanned aerial vehicle detection area damage
TWI679511B (en) Method and system for planning trajectory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant