Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The training method of the quadruped robot controller based on reinforcement learning of the present application will be described in detail with reference to the following examples.
Fig. 1 is a flow chart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 1, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S101, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
Wherein the first target control signal includes a target speed and a target angular velocity.
In each simulation step t, the motion of the quadruped robot and its contact dynamics with the ground are simulated in the simulation environment, and the first sensor data and the first elevation map data obtained by the quadruped robot executing the first target control signal in the simulation environment are acquired.
Wherein the first sensor data includes, but is not limited to, Inertial Measurement Unit (IMU) data and joint encoder data.
It should be noted that the IMU data includes, but is not limited to, the body velocity, the body angular velocity, and the body pose, where the body pose is represented using the projection of the gravity vector in the robot's own coordinate frame.
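The gravity-projection representation of body pose mentioned above can be sketched as follows. This is a minimal illustration assuming a Z-up world frame and roll/pitch Euler angles; the function name and conventions are assumptions, not from the application:

```python
import numpy as np

def projected_gravity(roll: float, pitch: float) -> np.ndarray:
    """Project the world gravity direction into the robot's body frame.

    Hypothetical helper: assumes a Z-up world with unit gravity direction
    (0, 0, -1) and roll/pitch in radians (yaw does not change the
    projection of a vertical vector).
    """
    g_world = np.array([0.0, 0.0, -1.0])
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    # Body-to-world rotation R = Ry(pitch) @ Rx(roll); the body-frame
    # gravity is R^T @ g_world.
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    return (ry @ rx).T @ g_world

# For a level pose, gravity points along -z in the body frame too.
print(projected_gravity(0.0, 0.0))
```

Because the vector is unit length, only two of its components are independent, which makes it a compact, singularity-free pose feature for the perception network.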
It should be noted that the joint encoder data includes, but is not limited to, the joint angles and the joint angular velocities.
It should be noted that the specific manner of acquiring the first elevation map data is not limited in the present application, and alternatively, the first elevation map data may be acquired by a depth camera or a laser radar.
S102, training the first controller according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network.
In the embodiment of the application, the first sensor data can be input to a first internal perception network in the first controller to obtain a first coding vector, and the first elevation map data can be input to a first external perception network in the first controller to obtain a second coding vector. A first target vector corresponding to the first target control signal, the first coding vector and the second coding vector are spliced to obtain a first spliced vector, which is input to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot. The first controller is then trained according to the actual control signal and the first target control signal to obtain the first target controller.
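This encode-splice-decode pass can be sketched numerically as follows. The input dimensions, layer sizes and random weights are illustrative assumptions, not from the application; the `mlp` helper merely stands in for the trained perception and controller networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    """Random-weight tanh MLP, a stand-in for a trained network."""
    for n_out in sizes:
        w = rng.standard_normal((x.shape[-1], n_out)) * 0.1
        x = np.tanh(x @ w)
    return x

# Hypothetical dimensions: 33-D proprioception (IMU + 12 joint angles +
# 12 joint velocities), a 187-cell elevation-map patch, and a 3-D command
# (target forward/lateral speed and target yaw rate).
sensor_data = rng.standard_normal(33)     # first sensor data
elevation_map = rng.standard_normal(187)  # first elevation map data
command = np.array([0.5, 0.0, 0.1])       # first target control signal

code_1 = mlp(sensor_data, [64, 32])       # first coding vector
code_2 = mlp(elevation_map, [128, 32])    # second coding vector
spliced = np.concatenate([command, code_1, code_2])  # first spliced vector

# The controller network maps the spliced vector to 12 joint targets.
joint_targets = mlp(spliced, [128, 12])
print(joint_targets.shape)  # (12,)
```

The key design point illustrated here is that internal (proprioceptive) and external (terrain) observations are encoded separately and only fused, together with the command, at the controller-network input.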
S103, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
In the related art, visual perception information is obtained based on a depth camera and a laser radar, both of which have notable limitations. The depth camera is prone to failure in strong-light or low-light environments, especially in strongly reflective or shadowed areas, where the visual perception information is often inaccurate; smooth, transparent or highly reflective materials can prevent the depth camera from accurately acquiring surface depth information, producing large errors. In complex environments filled with dust, smoke or water mist, the depth camera's signals are easily interfered with, causing image blurring or incomplete depth data, particularly in scenes such as mines and fire rescue. The detection range of the laser radar is limited, and blind areas readily arise in multi-obstacle or complex environments; in addition, the laser beam is affected when passing obstacles, causing ranging errors. Moreover, although the depth camera and the laser radar can provide the geometric structure of the terrain, they cannot evaluate the physical and mechanical properties of the ground. For example, when the quadruped robot walks on grass or slippery ground, parameters such as the friction coefficient and hardness of the ground directly affect its motion and stability, and for soft or unstable ground such as sand or snow, existing sensing means lack the ability to identify terrain collapse risks, which can cause the quadruped robot to mistakenly enter a dangerous area and affect task execution. To compensate for the limitations of a single sensor, information from multiple sensors such as a depth camera and a laser radar generally needs to be fused, which increases the complexity and computational burden of the system; especially in real-time applications, data-processing delays can cause the quadruped robot to respond too slowly. Finally, the hardware cost of the depth camera and the laser radar is high, especially for high-precision laser radar, making large-scale deployment difficult in some application scenarios.
In the embodiment of the application, a camera on the quadruped robot can be used to collect red-green-blue (RGB) image data. The camera directly mimics human visual perception, making the system suitable for more complex illumination and obstacle environments while obtaining more ground information.
Alternatively, since RGB image data needs to be acquired, a physics engine with rendering capability may be used as the basis for building the simulation environment.
For example, Isaac Lab may be used, or a physics engine plus rendering engine combination, such as Multi-Joint Dynamics with Contact (MuJoCo) + Unreal Engine.
Wherein the second elevation map data and the RGB image data correspond to the same time.
S104, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
In the embodiment of the application, the second sensor data can be input to the first target internal perception network to obtain an internal coding vector, and the second elevation map data can be input to the first target external perception network to obtain an external coding vector. A target vector corresponding to the second target control signal, the internal coding vector and the external coding vector are spliced to obtain a spliced vector, which is input to the first target controller network to output the target position information of each joint of the quadruped robot.
For example, taking a quadruped robot with 12 degrees of freedom as an example, the output is the target position information of each of the 12 joints of the quadruped robot.
S105, constructing training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal.
In the embodiment of the application, the first target controller takes as input the elevation map data calculated from depth camera or laser radar data, whereas the second controller takes as input the RGB image data acquired by an RGB camera. The output of the first target controller is used as the label, and the second controller with RGB image data as input is trained by imitation learning.
In the embodiment of the present application, the target position information, RGB image data, second sensor data, and second target control signal of each joint may be correlated to construct training data.
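The association of these four quantities into one supervised sample might look like the following sketch; the field names and dimensions are illustrative assumptions, not from the application:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    rgb: np.ndarray            # RGB image data
    sensors: np.ndarray        # second sensor data
    command: np.ndarray        # second target control signal
    joint_targets: np.ndarray  # label from the first target controller

# Each simulated step contributes one correlated sample to the dataset.
dataset = []
for _ in range(3):  # three simulated steps as a toy example
    dataset.append(Sample(
        rgb=np.zeros((64, 64, 3), dtype=np.uint8),
        sensors=np.zeros(33),
        command=np.array([0.5, 0.0, 0.1]),
        joint_targets=np.zeros(12),
    ))
print(len(dataset), dataset[0].joint_targets.shape)  # 3 (12,)
```

Keeping the teacher's joint targets alongside the RGB observation from the same time step is what allows the second controller to be trained purely by supervised imitation in step S106.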
S106, training the second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network and a second target controller network.
In the embodiment of the application, after the training data is obtained, the second controller to be trained can be trained according to the training data, so as to obtain the second target controller after the training is completed.
According to the training method of the quadruped robot controller based on reinforcement learning provided by the embodiment of the application, the quadruped robot is controlled to execute a first target control signal in the simulation environment, and first sensor data and first elevation map data are determined; the first controller is trained according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network; the quadruped robot is controlled to execute a second target control signal in the simulation environment, and second sensor data, second elevation map data and RGB image data are determined; the second sensor data, the second elevation map data and the second target control signal are input to the first target controller to obtain the target position information of each joint of the quadruped robot; training data is constructed according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal; and the second controller to be trained is trained according to the training data to obtain a trained second target controller. By using RGB image data as the visual perception information, the method is better suited to complex illumination and obstacle environments and can obtain more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
Fig. 2 is a flow chart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 2, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S201, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
Any implementation manner of the embodiments of the present application may be adopted for this step S201, and will not be described herein.
S202, inputting first sensor data to a first internal sensing network in a first controller to obtain a first coding vector, and inputting first elevation map data to a first external sensing network in the first controller to obtain a second coding vector.
In the embodiment of the application, after the first sensor data and the first elevation map data are acquired, the first sensor data are input to the first internal perception network, which encodes them to obtain the first coding vector, and the first elevation map data are input to the first external perception network, which encodes them to obtain the second coding vector.
S203, splicing the first target vector corresponding to the first target control signal, the first coding vector and the second coding vector to obtain a first spliced vector, and inputting the first spliced vector to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot.
S204, training the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
In the embodiment of the application, the reward function of the first controller can be obtained according to the actual control signal and the first target control signal, the parameters of the first controller are adjusted according to the reward function until the reward function meets the training ending condition, and the first controller after the parameters are adjusted last time is determined to be the first target controller.
Optionally, when the reward function value of the first controller reaches a preset threshold, it is determined that the reward function meets the training end condition.
In the embodiment of the application, the target speed and the target angular velocity can be obtained from the first target control signal, and the actual speed, the actual angular velocity and the actual torque can be obtained from the actual control signal. A speed reward function is determined according to the target speed and the actual speed, and its speed reward weight is obtained; an angular velocity reward function is determined according to the target angular velocity and the actual angular velocity, and its angular velocity reward weight is obtained; a torque reward function is determined according to the actual torque, and its torque reward weight is obtained; and the reward function of the first controller is obtained according to the speed reward function and the speed reward weight, the angular velocity reward function and the angular velocity reward weight, and the torque reward function and the torque reward weight.
For example, the reward function of the first controller may be determined according to the following formula:
r = w_v · r_v + w_ω · r_ω + w_τ · r_τ
wherein r is the reward function of the first controller, r_v is the speed reward function, w_v is the speed reward weight, r_ω is the angular velocity reward function, w_ω is the angular velocity reward weight, r_τ is the torque reward function, and w_τ is the torque reward weight.
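A minimal numeric sketch of such a weighted-sum reward is given below. The exponential tracking form, the torque penalty, and the weight values are common choices in legged-robot reinforcement learning and are assumptions here; the application does not fix the functional forms:

```python
import numpy as np

def tracking_reward(target, actual, scale=0.25):
    """Exponential tracking reward (an assumed form, not from the
    application): 1.0 for perfect tracking, decaying with squared error."""
    return float(np.exp(-np.sum(np.square(np.asarray(target) - np.asarray(actual))) / scale))

def controller_reward(v_cmd, v, w_cmd, w, tau,
                      k_v=1.0, k_w=0.5, k_tau=-2e-4):
    """Weighted sum r = k_v*r_v + k_w*r_w + k_tau*r_tau; illustrative weights."""
    r_v = tracking_reward(v_cmd, v)        # speed reward function
    r_w = tracking_reward([w_cmd], [w])    # angular velocity reward function
    r_tau = float(np.sum(np.square(tau)))  # torque term (penalized via k_tau < 0)
    return k_v * r_v + k_w * r_w + k_tau * r_tau

r = controller_reward(v_cmd=[0.5, 0.0], v=[0.5, 0.0],
                      w_cmd=0.1, w=0.1, tau=np.zeros(12))
print(round(r, 3))  # 1.5 — perfect tracking with zero torque
```

With a negative torque weight, the maximum reward is attained by tracking the commanded velocities exactly while using as little actuation effort as possible.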
S205, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
S206, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
S207, constructing training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal.
Any implementation manner of each embodiment of the present application may be adopted for the steps S205-S207, and will not be described herein.
S208, inputting the RGB image data, the second sensor data and the second target control signal into a second controller to obtain the predicted position information of each joint of the quadruped robot.
In the embodiment of the present application, as shown in fig. 3, the second sensor data may be input to a second internal perception network in the second controller to obtain a third coding vector, and the RGB image data may be input to a second external perception network in the second controller to obtain a fourth coding vector. A second target vector corresponding to the second target control signal, the third coding vector and the fourth coding vector are spliced to obtain a second spliced vector, which is input to a second controller network in the second controller to obtain the predicted position information of each joint of the quadruped robot.
S209, training the second controller according to the predicted position information of each joint and the target position information of each joint to obtain a trained second target controller.
In the embodiment of the application, the loss function can be obtained according to the predicted position information of each joint and the target position information of each joint, the parameters of the second controller are adjusted according to the loss function until the loss function meets the training ending condition, and the second controller after the parameters are adjusted last time is determined to be the second target controller.
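The imitation objective can be illustrated with a toy example: a linear student trained by gradient descent to match fixed teacher labels under a mean-squared-error loss. The linear model, learning rate and step count are illustrative assumptions; the application does not specify the loss or optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy student: a linear map from a 10-D feature vector (standing in for
# the encoded RGB + sensor observation) to 12 joint target positions.
W = rng.standard_normal((10, 12)) * 0.1
features = rng.standard_normal((256, 10))
teacher = features @ rng.standard_normal((10, 12))  # teacher joint targets

lr = 0.05
for step in range(500):
    pred = features @ W                          # predicted joint positions
    err = pred - teacher
    loss = float(np.mean(err ** 2))              # MSE imitation loss
    W -= lr * features.T @ err / len(features)   # gradient descent step
print(loss < 1e-3)  # True — the student has matched the teacher labels
```

The same loop structure applies to the second controller: only the model (a deep network over RGB and sensor encodings) and the optimizer change, while the labels still come from the first target controller.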
For example, when the loss function value reaches a preset threshold, it is determined that the loss function satisfies the training end condition.
After the trained second target controller is obtained, the quadruped robot may be controlled based on the second target controller.
For example, as shown in fig. 4, the sensor data of the quadruped robot, the RGB image data and the target speed are input to the second target controller, the second target controller outputs the position information of each joint, and the quadruped robot is controlled based on the position information of each joint.
The application provides a training method of a quadruped robot controller based on reinforcement learning. The quadruped robot is controlled to execute a first target control signal in the simulation environment, and first sensor data and first elevation map data are determined. The first sensor data are input to a first internal perception network in the first controller to obtain a first coding vector, and the first elevation map data are input to a first external perception network in the first controller to obtain a second coding vector; a first target vector corresponding to the first target control signal, the first coding vector and the second coding vector are spliced to obtain a first spliced vector, which is input to a first controller network of the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot; and the first controller is trained according to the actual control signal and the first target control signal to obtain the first target controller. The quadruped robot is then controlled to execute a second target control signal in the simulation environment, and second sensor data, second elevation map data and RGB image data are determined; the second sensor data, the second elevation map data and the second target control signal are input to the first target controller to obtain the target position information of each joint of the quadruped robot; the RGB image data, the second sensor data and the second target control signal are input to the second controller to obtain the predicted position information of each joint; and the second controller is trained according to the predicted position information of each joint and the target position information of each joint to obtain a trained second target controller. By using RGB image data as the visual perception information, the trained second target controller is better suited to complex illumination and obstacle environments and can acquire more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
In order to implement the above embodiments, the present application provides a training device of a quadruped robot controller based on reinforcement learning. Fig. 5 is a schematic structural diagram of the training device of the quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 5, the training device 1000 of the reinforcement learning-based quadruped robot controller includes a first determining module 110, a first training module 120, a second determining module 130, a first acquiring module 140, a second acquiring module 150, and a second training module 160.
A first determining module 110, configured to control the quadruped robot to execute a first target control signal in a simulation environment, and determine first sensor data and first elevation map data;
The first training module 120 is configured to train the first controller according to the first sensor data, the first elevation map data, and the first target control signal, so as to obtain a trained first target controller, where the first target controller includes a first target internal perception network, a first target external perception network, and a first target controller network;
A second determining module 130, configured to control the quadruped robot to execute a second target control signal in the simulation environment, and determine second sensor data, second elevation map data, and RGB image data;
A first obtaining module 140, configured to input the second sensor data, the second elevation map data, and the second target control signal to a first target controller, and obtain target position information of each joint of the quadruped robot;
a second obtaining module 150, configured to construct training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal;
And the second training module 160 is configured to train the second controller to be trained according to the training data, so as to obtain a trained second target controller, where the second target controller includes a second target internal perception network, a second target external perception network, and a second target controller network.
According to an embodiment of the present application, the first training module 120 is further configured to input the first sensor data to a first internal sensing network in the first controller to obtain a first encoded vector, input the first elevation map data to a first external sensing network in the first controller to obtain a second encoded vector, splice a first target vector corresponding to a first target control signal, the first encoded vector and the second encoded vector to obtain a first spliced vector, input the first spliced vector to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and an actual control signal of the quadruped robot, and train the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
In one embodiment of the present application, the first training module 120 is further configured to obtain a reward function of the first controller according to the actual control signal and the first target control signal, adjust a parameter of the first controller according to the reward function until the reward function meets a training end condition, and determine the first controller after the last adjustment of the parameter as the first target controller.
In one embodiment of the present application, the first training module 120 is further configured to obtain a target speed and a target angular speed from the first target control signal, obtain an actual speed, an actual angular speed, and an actual torque from the actual control signal, determine a speed reward function according to the target speed and the actual speed, and obtain a speed reward weight of the speed reward function, determine an angular speed reward function according to the target angular speed and the actual angular speed, and obtain an angular speed reward weight of the angular speed reward function, determine a torque reward function according to the actual torque, and obtain a torque reward weight of the torque reward function, and obtain a reward function of the first controller according to the speed reward function and the speed reward weight, the angular speed reward function and the angular speed reward weight, the torque reward function, and the torque reward weight.
In one embodiment of the present application, the second obtaining module 150 is further configured to correlate the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal to construct training data.
In one embodiment of the present application, the second training module 160 is further configured to input the RGB image data, the second sensor data, and the second target control signal to the second controller, obtain the predicted position information of each joint of the quadruped robot, and train the second controller according to the predicted position information of each joint and the target position information of each joint, to obtain a trained second target controller.
In one embodiment of the present application, the second training module 160 is further configured to obtain a loss function according to the predicted position information of each joint and the target position information of each joint, adjust the parameters of the second controller according to the loss function until the loss function meets the training ending condition, and determine the second controller after the last adjustment of the parameters as the second target controller.
The second training module 160 is further configured to input the second sensor data to a second internal sensing network in the second controller to obtain a third encoded vector, input the RGB image data to a second external sensing network in the second controller to obtain a fourth encoded vector, splice a second target vector corresponding to a second target control signal, the third encoded vector and the fourth encoded vector to obtain a second spliced vector, and input the second spliced vector to a second controller network in the second controller to obtain predicted position information of each joint of the quadruped robot.
The training device of the quadruped robot controller based on reinforcement learning provided by the application controls the quadruped robot to execute a first target control signal in the simulation environment and determines first sensor data and first elevation map data; trains the first controller according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network; controls the quadruped robot to execute a second target control signal in the simulation environment and determines second sensor data, second elevation map data and RGB image data; inputs the second sensor data, the second elevation map data and the second target control signal to the first target controller to acquire the target position information of each joint of the quadruped robot; constructs training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal; and trains the second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network and a second target controller network. By using RGB image data as visual perception information, the device is better suited to complex illumination and obstacle environments and can acquire more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
In order to implement the above embodiments, the present application also proposes an electronic device 2000, as shown in fig. 6, including a memory 210, a processor 220, and a computer program stored in the memory 210 and executable on the processor 220, where the processor implements the training method of the quadruped robot controller based on reinforcement learning according to the first aspect when executing the program.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the training method of the quadruped robot controller based on reinforcement learning according to the first aspect.
In order to implement the above embodiments, the present application also proposes a computer program product comprising a computer program which, when executed by a processor, implements the training method of the quadruped robot controller based on reinforcement learning according to the first aspect.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially or in a different order, as long as the desired results of the embodiments disclosed in the present application are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.