Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The training method of the quadruped robot controller based on reinforcement learning of the present application will be described in detail with reference to the following examples.
Fig. 1 is a flow chart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 1, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S101, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
Wherein the first target control signal includes a target speed and a target angular velocity.
In each simulation step t, the motion of the quadruped robot and its contact dynamics with the ground are simulated in the simulation environment, and the first sensor data and the first elevation map data obtained by the quadruped robot executing the first target control signal in the simulation environment are acquired.
Wherein the first sensor data includes, but is not limited to, Inertial Measurement Unit (IMU) data and joint encoder data.
It should be noted that the IMU data includes, but is not limited to, the body velocity, the body angular velocity, and the body pose, where the body pose is represented using the projection of the gravity vector in the robot's own coordinate frame.
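The gravity-projection representation of body pose mentioned above can be sketched as follows. This is a minimal illustration assuming a Z-up world frame and roll/pitch Euler angles; the function name and conventions are assumptions, not from the application:

```python
import numpy as np

def projected_gravity(roll: float, pitch: float) -> np.ndarray:
    """Project the world gravity direction into the robot's body frame.

    Hypothetical helper: assumes a Z-up world with unit gravity direction
    (0, 0, -1) and roll/pitch in radians (yaw does not change the
    projection of a vertical vector).
    """
    g_world = np.array([0.0, 0.0, -1.0])
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    # Body-to-world rotation R = Ry(pitch) @ Rx(roll); the body-frame
    # gravity is R^T @ g_world.
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    return (ry @ rx).T @ g_world

# For a level pose, gravity points along -z in the body frame too.
print(projected_gravity(0.0, 0.0))
```

Because the vector is unit length, only two of its components are independent, which makes it a compact, singularity-free pose feature for the perception network.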
It should be noted that the joint encoder data includes, but is not limited to, the joint angles and the joint angular velocities.
It should be noted that the specific manner of acquiring the first elevation map data is not limited in the present application, and alternatively, the first elevation map data may be acquired by a depth camera or a laser radar.
S102, training the first controller according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network.
In the embodiment of the application, the first sensor data can be input to a first internal perception network in the first controller to obtain a first coding vector, and the first elevation map data can be input to a first external perception network in the first controller to obtain a second coding vector. A first target vector corresponding to the first target control signal, the first coding vector and the second coding vector are spliced to obtain a first spliced vector, which is input to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot. The first controller is then trained according to the actual control signal and the first target control signal to obtain the first target controller.
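This encode-splice-decode pass can be sketched numerically as follows. The input dimensions, layer sizes and random weights are illustrative assumptions, not from the application; the `mlp` helper merely stands in for the trained perception and controller networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    """Random-weight tanh MLP, a stand-in for a trained network."""
    for n_out in sizes:
        w = rng.standard_normal((x.shape[-1], n_out)) * 0.1
        x = np.tanh(x @ w)
    return x

# Hypothetical dimensions: 33-D proprioception (IMU + 12 joint angles +
# 12 joint velocities), a 187-cell elevation-map patch, and a 3-D command
# (target forward/lateral speed and target yaw rate).
sensor_data = rng.standard_normal(33)     # first sensor data
elevation_map = rng.standard_normal(187)  # first elevation map data
command = np.array([0.5, 0.0, 0.1])       # first target control signal

code_1 = mlp(sensor_data, [64, 32])       # first coding vector
code_2 = mlp(elevation_map, [128, 32])    # second coding vector
spliced = np.concatenate([command, code_1, code_2])  # first spliced vector

# The controller network maps the spliced vector to 12 joint targets.
joint_targets = mlp(spliced, [128, 12])
print(joint_targets.shape)  # (12,)
```

The key design point illustrated here is that internal (proprioceptive) and external (terrain) observations are encoded separately and only fused, together with the command, at the controller-network input.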
S103, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
In the related art, visual perception information is obtained based on a depth camera and a laser radar, both of which have notable limitations. The depth camera is prone to failure in strong-light or low-light environments, especially in strongly reflective or shadowed areas, where the visual perception information is often inaccurate; smooth, transparent or highly reflective materials can prevent the depth camera from accurately acquiring surface depth information, producing large errors. In complex environments filled with dust, smoke or water mist, the depth camera's signals are easily interfered with, causing image blurring or incomplete depth data, particularly in scenes such as mines and fire rescue. The detection range of the laser radar is limited, and blind areas readily arise in multi-obstacle or complex environments; in addition, the laser beam is affected when passing obstacles, causing ranging errors. Moreover, although the depth camera and the laser radar can provide the geometric structure of the terrain, they cannot evaluate the physical and mechanical properties of the ground. For example, when the quadruped robot walks on grass or slippery ground, parameters such as the friction coefficient and hardness of the ground directly affect its motion and stability, and for soft or unstable ground such as sand or snow, existing sensing means lack the ability to identify terrain collapse risks, which can cause the quadruped robot to mistakenly enter a dangerous area and affect task execution. To compensate for the limitations of a single sensor, information from multiple sensors such as a depth camera and a laser radar generally needs to be fused, which increases the complexity and computational burden of the system; especially in real-time applications, data-processing delays can cause the quadruped robot to respond too slowly. Finally, the hardware cost of the depth camera and the laser radar is high, especially for high-precision laser radar, making large-scale deployment difficult in some application scenarios.
In the embodiment of the application, a camera on the quadruped robot can be used to collect red-green-blue (RGB) image data. The camera directly mimics human visual perception, making the system suitable for more complex illumination and obstacle environments while obtaining more ground information.
Alternatively, since RGB image data needs to be acquired, a physics engine with rendering capability may be used as the basis for building the simulation environment.
For example, Isaac Lab may be used, or a physics engine plus rendering engine combination, such as Multi-Joint Dynamics with Contact (MuJoCo) + Unreal Engine.
Wherein the second elevation map data and the RGB image data correspond to the same time.
S104, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
In the embodiment of the application, the second sensor data can be input to the first target internal perception network to obtain an internal coding vector, and the second elevation map data can be input to the first target external perception network to obtain an external coding vector. A target vector corresponding to the second target control signal, the internal coding vector and the external coding vector are spliced to obtain a spliced vector, which is input to the first target controller network to output the target position information of each joint of the quadruped robot.
For example, taking a quadruped robot with 12 degrees of freedom as an example, the output is the target position information of each of the 12 joints of the quadruped robot.
S105, constructing training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal.
In the embodiment of the application, the first target controller takes as input the elevation map data calculated from depth camera or laser radar data, whereas the second controller takes as input the RGB image data acquired by an RGB camera. The output of the first target controller is used as the label, and the second controller with RGB image data as input is trained by imitation learning.
In the embodiment of the present application, the target position information, RGB image data, second sensor data, and second target control signal of each joint may be correlated to construct training data.
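The association of these four quantities into one supervised sample might look like the following sketch; the field names and dimensions are illustrative assumptions, not from the application:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    rgb: np.ndarray            # RGB image data
    sensors: np.ndarray        # second sensor data
    command: np.ndarray        # second target control signal
    joint_targets: np.ndarray  # label from the first target controller

# Each simulated step contributes one correlated sample to the dataset.
dataset = []
for _ in range(3):  # three simulated steps as a toy example
    dataset.append(Sample(
        rgb=np.zeros((64, 64, 3), dtype=np.uint8),
        sensors=np.zeros(33),
        command=np.array([0.5, 0.0, 0.1]),
        joint_targets=np.zeros(12),
    ))
print(len(dataset), dataset[0].joint_targets.shape)  # 3 (12,)
```

Keeping the teacher's joint targets alongside the RGB observation from the same time step is what allows the second controller to be trained purely by supervised imitation in step S106.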
S106, training the second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network and a second target controller network.
In the embodiment of the application, after the training data is obtained, the second controller to be trained can be trained according to the training data, so as to obtain the second target controller after the training is completed.
According to the training method of the quadruped robot controller based on reinforcement learning provided by the embodiment of the application, the quadruped robot is controlled to execute a first target control signal in the simulation environment, and first sensor data and first elevation map data are determined; the first controller is trained according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network; the quadruped robot is controlled to execute a second target control signal in the simulation environment, and second sensor data, second elevation map data and RGB image data are determined; the second sensor data, the second elevation map data and the second target control signal are input to the first target controller to obtain the target position information of each joint of the quadruped robot; training data is constructed according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal; and the second controller to be trained is trained according to the training data to obtain a trained second target controller. By using RGB image data as the visual perception information, the method is better suited to complex illumination and obstacle environments and can obtain more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
Fig. 2 is a flow chart of a training method of a quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 2, the training method of the quadruped robot controller based on reinforcement learning provided in this embodiment specifically includes the following steps:
S201, controlling the quadruped robot to execute a first target control signal in a simulation environment, and determining first sensor data and first elevation map data.
Any implementation manner of the embodiments of the present application may be adopted for this step S201, and will not be described herein.
S202, inputting first sensor data to a first internal sensing network in a first controller to obtain a first coding vector, and inputting first elevation map data to a first external sensing network in the first controller to obtain a second coding vector.
In the embodiment of the application, after the first sensor data and the first elevation map data are acquired, the first sensor data are input to the first internal perception network, which encodes them to obtain the first coding vector, and the first elevation map data are input to the first external perception network, which encodes them to obtain the second coding vector.
S203, splicing the first target vector corresponding to the first target control signal, the first coding vector and the second coding vector to obtain a first spliced vector, and inputting the first spliced vector to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot.
S204, training the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
In the embodiment of the application, the reward function of the first controller can be obtained according to the actual control signal and the first target control signal, the parameters of the first controller are adjusted according to the reward function until the reward function meets the training ending condition, and the first controller after the parameters are adjusted last time is determined to be the first target controller.
Optionally, when the reward function value of the first controller reaches a preset threshold, it is determined that the reward function meets the training end condition.
In the embodiment of the application, the target speed and the target angular velocity can be obtained from the first target control signal, and the actual speed, the actual angular velocity and the actual torque can be obtained from the actual control signal. A speed reward function is determined according to the target speed and the actual speed, and its speed reward weight is obtained; an angular velocity reward function is determined according to the target angular velocity and the actual angular velocity, and its angular velocity reward weight is obtained; a torque reward function is determined according to the actual torque, and its torque reward weight is obtained; and the reward function of the first controller is obtained according to the speed reward function and the speed reward weight, the angular velocity reward function and the angular velocity reward weight, and the torque reward function and the torque reward weight.
For example, the reward function of the first controller may be determined according to the following formula:
r = w_v · r_v + w_ω · r_ω + w_τ · r_τ
wherein r is the reward function of the first controller, r_v is the speed reward function, w_v is the speed reward weight, r_ω is the angular velocity reward function, w_ω is the angular velocity reward weight, r_τ is the torque reward function, and w_τ is the torque reward weight.
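A minimal numeric sketch of such a weighted-sum reward is given below. The exponential tracking form, the torque penalty, and the weight values are common choices in legged-robot reinforcement learning and are assumptions here; the application does not fix the functional forms:

```python
import numpy as np

def tracking_reward(target, actual, scale=0.25):
    """Exponential tracking reward (an assumed form, not from the
    application): 1.0 for perfect tracking, decaying with squared error."""
    return float(np.exp(-np.sum(np.square(np.asarray(target) - np.asarray(actual))) / scale))

def controller_reward(v_cmd, v, w_cmd, w, tau,
                      k_v=1.0, k_w=0.5, k_tau=-2e-4):
    """Weighted sum r = k_v*r_v + k_w*r_w + k_tau*r_tau; illustrative weights."""
    r_v = tracking_reward(v_cmd, v)        # speed reward function
    r_w = tracking_reward([w_cmd], [w])    # angular velocity reward function
    r_tau = float(np.sum(np.square(tau)))  # torque term (penalized via k_tau < 0)
    return k_v * r_v + k_w * r_w + k_tau * r_tau

r = controller_reward(v_cmd=[0.5, 0.0], v=[0.5, 0.0],
                      w_cmd=0.1, w=0.1, tau=np.zeros(12))
print(round(r, 3))  # 1.5 — perfect tracking with zero torque
```

With a negative torque weight, the maximum reward is attained by tracking the commanded velocities exactly while using as little actuation effort as possible.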
S205, controlling the quadruped robot to execute a second target control signal in the simulation environment, and determining second sensor data, second elevation map data and RGB image data.
S206, inputting the second sensor data, the second elevation map data and the second target control signal into the first target controller, and acquiring target position information of each joint of the quadruped robot.
S207, constructing training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal.
Any implementation manner of each embodiment of the present application may be adopted for the steps S205-S207, and will not be described herein.
S208, inputting the RGB image data, the second sensor data and the second target control signal into a second controller to obtain the predicted position information of each joint of the quadruped robot.
In the embodiment of the present application, as shown in fig. 3, the second sensor data may be input to a second internal perception network in the second controller to obtain a third coding vector, and the RGB image data may be input to a second external perception network in the second controller to obtain a fourth coding vector. A second target vector corresponding to the second target control signal, the third coding vector and the fourth coding vector are spliced to obtain a second spliced vector, which is input to a second controller network in the second controller to obtain the predicted position information of each joint of the quadruped robot.
S209, training the second controller according to the predicted position information of each joint and the target position information of each joint to obtain a trained second target controller.
In the embodiment of the application, the loss function can be obtained according to the predicted position information of each joint and the target position information of each joint, the parameters of the second controller are adjusted according to the loss function until the loss function meets the training ending condition, and the second controller after the parameters are adjusted last time is determined to be the second target controller.
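The imitation objective can be illustrated with a toy example: a linear student trained by gradient descent to match fixed teacher labels under a mean-squared-error loss. The linear model, learning rate and step count are illustrative assumptions; the application does not specify the loss or optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy student: a linear map from a 10-D feature vector (standing in for
# the encoded RGB + sensor observation) to 12 joint target positions.
W = rng.standard_normal((10, 12)) * 0.1
features = rng.standard_normal((256, 10))
teacher = features @ rng.standard_normal((10, 12))  # teacher joint targets

lr = 0.05
for step in range(500):
    pred = features @ W                          # predicted joint positions
    err = pred - teacher
    loss = float(np.mean(err ** 2))              # MSE imitation loss
    W -= lr * features.T @ err / len(features)   # gradient descent step
print(loss < 1e-3)  # True — the student has matched the teacher labels
```

The same loop structure applies to the second controller: only the model (a deep network over RGB and sensor encodings) and the optimizer change, while the labels still come from the first target controller.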
For example, when the loss function value reaches a preset threshold, it is determined that the loss function satisfies the training end condition.
After the trained second target controller is obtained, the quadruped robot may be controlled based on the second target controller.
For example, as shown in fig. 4, the sensor data of the quadruped robot, the RGB image data and the target speed are input to the second target controller, the second target controller outputs the position information of each joint, and the quadruped robot is controlled based on the position information of each joint.
The application provides a training method of a quadruped robot controller based on reinforcement learning. The quadruped robot is controlled to execute a first target control signal in the simulation environment, and first sensor data and first elevation map data are determined. The first sensor data are input to a first internal perception network in the first controller to obtain a first coding vector, and the first elevation map data are input to a first external perception network in the first controller to obtain a second coding vector; a first target vector corresponding to the first target control signal, the first coding vector and the second coding vector are spliced to obtain a first spliced vector, which is input to a first controller network of the first controller to obtain first predicted position information of each joint of the quadruped robot and the actual control signal of the quadruped robot; and the first controller is trained according to the actual control signal and the first target control signal to obtain the first target controller. The quadruped robot is then controlled to execute a second target control signal in the simulation environment, and second sensor data, second elevation map data and RGB image data are determined; the second sensor data, the second elevation map data and the second target control signal are input to the first target controller to obtain the target position information of each joint of the quadruped robot; the RGB image data, the second sensor data and the second target control signal are input to the second controller to obtain the predicted position information of each joint; and the second controller is trained according to the predicted position information of each joint and the target position information of each joint to obtain a trained second target controller. By using RGB image data as the visual perception information, the trained second target controller is better suited to complex illumination and obstacle environments and can acquire more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
In order to implement the above embodiments, the present application provides a training device of a quadruped robot controller based on reinforcement learning. Fig. 5 is a schematic structural diagram of the training device of the quadruped robot controller based on reinforcement learning according to an embodiment of the present application.
As shown in fig. 5, the training device 1000 of the reinforcement learning-based quadruped robot controller includes a first determining module 110, a first training module 120, a second determining module 130, a first acquiring module 140, a second acquiring module 150, and a second training module 160.
A first determining module 110, configured to control the quadruped robot to execute a first target control signal in a simulation environment, and determine first sensor data and first elevation map data;
The first training module 120 is configured to train the first controller according to the first sensor data, the first elevation map data, and the first target control signal, so as to obtain a trained first target controller, where the first target controller includes a first target internal perception network, a first target external perception network, and a first target controller network;
A second determining module 130, configured to control the quadruped robot to execute a second target control signal in the simulation environment, and determine second sensor data, second elevation map data, and RGB image data;
A first obtaining module 140, configured to input the second sensor data, the second elevation map data, and the second target control signal to a first target controller, and obtain target position information of each joint of the quadruped robot;
a second obtaining module 150, configured to construct training data according to the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal;
And the second training module 160 is configured to train the second controller to be trained according to the training data, so as to obtain a trained second target controller, where the second target controller includes a second target internal perception network, a second target external perception network, and a second target controller network.
According to an embodiment of the present application, the first training module 120 is further configured to input the first sensor data to a first internal sensing network in the first controller to obtain a first encoded vector, input the first elevation map data to a first external sensing network in the first controller to obtain a second encoded vector, splice a first target vector corresponding to a first target control signal, the first encoded vector and the second encoded vector to obtain a first spliced vector, input the first spliced vector to a first controller network in the first controller to obtain first predicted position information of each joint of the quadruped robot and an actual control signal of the quadruped robot, and train the first controller according to the actual control signal and the first target control signal to obtain the first target controller.
In one embodiment of the present application, the first training module 120 is further configured to obtain a reward function of the first controller according to the actual control signal and the first target control signal, adjust a parameter of the first controller according to the reward function until the reward function meets a training end condition, and determine the first controller after the last adjustment of the parameter as the first target controller.
In one embodiment of the present application, the first training module 120 is further configured to obtain a target speed and a target angular speed from the first target control signal, obtain an actual speed, an actual angular speed, and an actual torque from the actual control signal, determine a speed reward function according to the target speed and the actual speed, and obtain a speed reward weight of the speed reward function, determine an angular speed reward function according to the target angular speed and the actual angular speed, and obtain an angular speed reward weight of the angular speed reward function, determine a torque reward function according to the actual torque, and obtain a torque reward weight of the torque reward function, and obtain a reward function of the first controller according to the speed reward function and the speed reward weight, the angular speed reward function and the angular speed reward weight, the torque reward function, and the torque reward weight.
In one embodiment of the present application, the second obtaining module 150 is further configured to correlate the target position information of each joint, the RGB image data, the second sensor data, and the second target control signal to construct training data.
In one embodiment of the present application, the second training module 160 is further configured to input the RGB image data, the second sensor data, and the second target control signal to the second controller, obtain the predicted position information of each joint of the quadruped robot, and train the second controller according to the predicted position information of each joint and the target position information of each joint, to obtain a trained second target controller.
In one embodiment of the present application, the second training module 160 is further configured to obtain a loss function according to the predicted position information of each joint and the target position information of each joint, adjust the parameters of the second controller according to the loss function until the loss function meets the training ending condition, and determine the second controller after the last adjustment of the parameters as the second target controller.
The second training module 160 is further configured to input the second sensor data to a second internal sensing network in the second controller to obtain a third encoded vector, input the RGB image data to a second external sensing network in the second controller to obtain a fourth encoded vector, splice a second target vector corresponding to a second target control signal, the third encoded vector and the fourth encoded vector to obtain a second spliced vector, and input the second spliced vector to a second controller network in the second controller to obtain predicted position information of each joint of the quadruped robot.
The training device of the quadruped robot controller based on reinforcement learning provided by the application controls the quadruped robot to execute a first target control signal in the simulation environment and determines first sensor data and first elevation map data; trains the first controller according to the first sensor data, the first elevation map data and the first target control signal to obtain a trained first target controller, wherein the first target controller comprises a first target internal perception network, a first target external perception network and a first target controller network; controls the quadruped robot to execute a second target control signal in the simulation environment and determines second sensor data, second elevation map data and RGB image data; inputs the second sensor data, the second elevation map data and the second target control signal to the first target controller to acquire the target position information of each joint of the quadruped robot; constructs training data according to the target position information of each joint, the RGB image data, the second sensor data and the second target control signal; and trains the second controller to be trained according to the training data to obtain a trained second target controller, wherein the second target controller comprises a second target internal perception network, a second target external perception network and a second target controller network. By using RGB image data as visual perception information, the device is better suited to complex illumination and obstacle environments and can acquire more ground information; the quadruped robot can be controlled more accurately and efficiently, the complexity and cost of acquiring visual perception information are reduced, and a solid foundation is laid for the quadruped robot to subsequently execute tasks safely and smoothly.
In order to implement the above embodiments, the present application also proposes an electronic device 2000, as shown in fig. 6, including a memory 210, a processor 220, and a computer program stored in the memory 210 and executable on the processor 220, where the processor implements the training method of the quadruped robot controller based on reinforcement learning according to the first aspect when executing the program.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the training method of the quadruped robot controller based on reinforcement learning according to the first aspect.
In order to implement the above embodiments, the present application also proposes a computer program product comprising a computer program which, when executed by a processor, implements the training method of the quadruped robot controller based on reinforcement learning according to the first aspect.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially or in a different order, as long as the desired results of the embodiments disclosed in the present application are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.