Disclosure of Invention
The application discloses a control method, a device and a storage medium for straight-knee walking of a humanoid robot, which are used for improving the straight-knee walking performance of the humanoid robot in different scenes.
In order to meet the higher requirements of the humanoid robot for naturalness, attractiveness and human-likeness in high-fidelity walking tasks, and to break through the technical bottlenecks of conventional model-driven control and conventional reinforcement learning methods in terms of policy expression capability, control stability and training guidance, the application particularly provides a generative adversarial imitation learning method (Wasserstein-GAIL) based on optimal transport distance, which is used for realizing a humanoid robot walking control strategy with a natural straight-knee gait.
The first aspect of the application discloses a control method for straight-knee walking of a humanoid robot, which comprises the following steps:
acquiring human gait data, wherein the human gait data are gait data of human straight knee walking;
performing motion redirection processing on the human gait data, and converting the human gait data into reference motion sequence data of a target humanoid robot;
modeling a target humanoid robot walking task as a speed-conditioned Markov decision process, constructing an adversarial network under this model, and generating a maximized expected discounted return function according to an environment variable distribution and a decision strategy;
Constructing a speed tracking reward function according to the centroid linear speed and the angular speed of the target humanoid robot;
constructing a soft-boundary Wasserstein loss function for the discriminator by using the real data distribution, the generated data distribution, and a gradient penalty term over samples from a specific distribution, and constructing a style reward function from the output of the discriminator;
Constructing a PD controller so that the PD controller converts the action output by the decision strategy into a torque signal of the target humanoid robot;
after imitation learning is performed by using the constructed adversarial network, selecting physical parameters of the target humanoid robot involved in the straight-knee walking training according to their degree of correlation, and setting a prior distribution for each physical parameter;
performing real-world test motions with the target humanoid robot, and collecting operation data corresponding to the physical parameters;
and combining the collected operation data with the prior distributions corresponding to the physical parameters, feeding the updated posterior distributions back to the imitation learning stage, and optimizing the control strategy of the imitation learning stage.
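For illustration only (not part of the claimed steps), the PD-controller conversion from policy actions to joint torques can be sketched as follows; the gains and numbers are assumptions:

```python
import numpy as np

def pd_torque(q_target, q, qd, kp, kd):
    """PD control law: tau = Kp*(q_target - q) - Kd*qd.

    q_target: joint-position targets output by the decision policy,
    q, qd:    measured joint positions and velocities,
    kp, kd:   per-joint proportional and derivative gains.
    """
    q_target, q, qd = (np.asarray(a, dtype=float) for a in (q_target, q, qd))
    return kp * (q_target - q) - kd * qd

# Example: three joints; the policy asks joint 1 to extend by 0.1 rad.
tau = pd_torque(q_target=[0.0, 0.1, 0.0],
                q=[0.0, 0.0, 0.0],
                qd=[0.0, 0.2, 0.0],
                kp=np.array([50.0, 50.0, 50.0]),
                kd=np.array([2.0, 2.0, 2.0]))
# tau[1] = 50*0.1 - 2*0.2 = 4.6
```

In practice the gains are tuned per joint, and the PD loop runs at a higher rate than the policy.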
Optionally, performing motion redirection processing on the human gait data, and converting the human gait data into reference motion sequence data of the target humanoid robot, including:
performing skeleton topology unification and skeleton binding treatment on a human skeleton and a humanoid robot skeleton to generate an original skeleton, wherein the original skeleton is provided with a plurality of key joints;
carrying out coordinate-system unification processing and root normalization processing according to the human gait data and the original skeleton to generate key joint data of the humanoid robot;
and carrying out multi-objective inverse kinematics optimization solution on the key joint data to generate a reference motion sequence, wherein the multi-objective inverse kinematics optimization solution is used for mapping the Cartesian positions of the key joints and the pose of the end effector to corresponding joint angles.
Optionally, the step of performing skeleton topology unification and skeleton binding processing on the human skeleton and the humanoid robot skeleton to generate an original skeleton includes:
constructing kinematic trees of the human skeleton and the humanoid robot skeleton;
performing skeleton merging on the kinematic trees of the human skeleton and the humanoid robot skeleton to generate an original skeleton, retaining a single bone between key joints;
and selecting the key joints and recording the length of each bone segment.
Optionally, the step of performing multi-objective inverse kinematics optimization solution on the key joint data to generate a reference motion sequence includes:
constructing a key joint position matching loss function according to the actual position of a key joint of the humanoid robot and the expected target position of the key joint in key joint data;
constructing an end effector posture matching loss function according to the actual posture of the end effector of the humanoid robot and the expected target posture in the key joint data;
constructing a joint minimum displacement loss function according to joint angle data between adjacent frames in the key joint data;
generating an objective function according to the key joint position matching loss function, the end effector posture matching loss function and the joint minimum displacement loss function;
and carrying out multi-objective inverse kinematics optimization solution on the key joint data according to the objective function, and generating a reference motion sequence.
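As an illustration of how the three loss terms above might be combined into one objective (the weights and the posture-error encoding, here a simple vector difference, are assumptions):

```python
import numpy as np

def ik_objective(p_act, p_ref, e_act, e_ref, q, q_prev,
                 w_pos=1.0, w_ori=0.5, w_smooth=0.1):
    """Weighted multi-objective IK loss: key-joint position matching,
    end-effector posture matching, and minimum joint displacement
    between adjacent frames."""
    pos_loss = np.sum((np.asarray(p_act) - np.asarray(p_ref)) ** 2)
    ori_loss = np.sum((np.asarray(e_act) - np.asarray(e_ref)) ** 2)
    smooth_loss = np.sum((np.asarray(q) - np.asarray(q_prev)) ** 2)
    return w_pos * pos_loss + w_ori * ori_loss + w_smooth * smooth_loss

# Perfect position/posture match, small joint motion between frames:
loss = ik_objective(p_act=[[0, 0, 1]], p_ref=[[0, 0, 1]],
                    e_act=[0, 0, 0], e_ref=[0, 0, 0],
                    q=[0.1, 0.2], q_prev=[0.1, 0.1])
# only the smoothness term contributes: 0.1 * 0.1**2 = 0.001
```

An off-the-shelf nonlinear solver would then minimize this objective over the joint angles frame by frame.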
Optionally, the step of generating the objective function from the key joint position matching loss function, the end effector pose matching loss function, and the joint minimum displacement loss function comprises:
determining acquisition scene parameters of human gait data;
performing texture comparison analysis and inclination comparison analysis between the acquisition scene parameters and ground scene parameters to generate gait deformation data;
determining key joints with deformation on an original skeleton according to gait deformation data;
generating key joint position weights for the deformed key joints;
determining a posture change degree parameter according to the gait deformation data, and generating an end posture weight according to the posture change degree parameter;
Determining a minimum displacement loss weight;
and generating an objective function according to the key joint position matching loss function, the end effector posture matching loss function, the joint minimum displacement loss function, the key joint position weight, the end posture weight and the minimum displacement loss weight.
Optionally, after the step of performing multi-objective inverse kinematics optimization solution on the key joint data to generate the reference motion sequence, and before the step of modeling the target humanoid robot walking task as a speed-conditioned Markov decision process, constructing an adversarial network under this model, and generating a maximized expected discounted return function according to the environment variable distribution and the decision strategy, the control method further comprises:
and carrying out trajectory quality optimization processing on the reference motion sequence.
The second aspect of the application discloses a control device for straight-knee walking of a humanoid robot, which comprises:
the first acquisition unit is used for acquiring human gait data, wherein the human gait data are gait data of human straight knee walking;
the redirection unit is used for performing action redirection processing on the human gait data and converting the human gait data into reference motion sequence data of the target humanoid robot;
the first construction unit is used for modeling the walking task of the target humanoid robot as a speed-conditioned Markov decision process, constructing an adversarial network under this model, and generating a maximized expected discounted return function according to the environment variable distribution and the decision strategy;
the second construction unit is used for constructing a speed tracking reward function according to the centroid linear speed and the angular speed of the target humanoid robot;
the third construction unit is used for constructing a soft-boundary Wasserstein loss function for the discriminator by using the real data distribution, the generated data distribution, and a gradient penalty term over samples from a specific distribution, and constructing a style reward function from the output of the discriminator;
a fourth construction unit for constructing the PD controller so that the PD controller converts the action output by the decision strategy into a torque signal of the target humanoid robot;
the setting unit is used for selecting physical parameters of the target humanoid robot involved in the straight-knee walking training according to their degree of correlation after imitation learning is completed by using the constructed adversarial network, and setting a prior distribution for each physical parameter;
the second acquisition unit is used for performing real-world test motions with the target humanoid robot and collecting operation data corresponding to the physical parameters;
and the optimizing unit is used for combining the collected operation data with the prior distributions corresponding to the physical parameters, feeding the updated posterior distributions back to the imitation learning stage, and optimizing the control strategy of the imitation learning stage.
Optionally, the redirecting unit includes:
the first generation module is used for carrying out skeleton topology unification and skeleton binding processing on the human skeleton and the humanoid robot skeleton to generate an original skeleton, wherein a plurality of key joints are arranged on the original skeleton;
the second generation module is used for carrying out coordinate system unified processing and root standardized processing according to the human gait data and the original skeleton to generate key joint data of the humanoid robot;
the third generation module is used for carrying out multi-objective inverse kinematics optimization solution on the key joint data to generate a reference motion sequence, wherein the multi-objective inverse kinematics optimization solution is used for mapping the Cartesian positions of the key joints and the pose of the end effector to corresponding joint angles.
Optionally, the first generating module includes:
constructing kinematic trees of the human skeleton and the humanoid robot skeleton;
performing skeleton merging on the kinematic trees of the human skeleton and the humanoid robot skeleton to generate an original skeleton, retaining a single bone between key joints;
and selecting the key joints and recording the length of each bone segment.
Optionally, the third generating module includes:
The first construction submodule is used for constructing a key joint position matching loss function according to the actual position of a key joint of the humanoid robot and the expected target position of the key joint in key joint data;
The second construction submodule is used for constructing an end effector posture matching loss function according to the actual posture of the end effector of the humanoid robot and the expected target posture in key joint data;
The third construction submodule is used for constructing a joint minimum displacement loss function according to joint angle data between adjacent frames in the key joint data;
The first generation submodule is used for generating an objective function according to the key joint position matching loss function, the end effector posture matching loss function and the joint minimum displacement loss function;
And the second generation submodule is used for carrying out multi-objective inverse kinematics optimization solution on the key joint data according to the objective function and generating a reference motion sequence.
Optionally, the first generating sub-module includes:
determining acquisition scene parameters of human gait data;
performing texture comparison analysis and inclination comparison analysis between the acquisition scene parameters and ground scene parameters to generate gait deformation data;
determining key joints with deformation on an original skeleton according to gait deformation data;
generating key joint position weights for the deformed key joints;
determining a posture change degree parameter according to the gait deformation data, and generating an end posture weight according to the posture change degree parameter;
Determining a minimum displacement loss weight;
and generating an objective function according to the key joint position matching loss function, the end effector posture matching loss function, the joint minimum displacement loss function, the key joint position weight, the end posture weight and the minimum displacement loss weight.
Optionally, after the third generation module and before the first construction unit, the control device further comprises:
and the optimization module is used for carrying out trajectory quality optimization processing on the reference motion sequence.
The third aspect of the present application provides a control device for straight-knee walking of a humanoid robot, comprising:
a processor, a memory, an input-output unit, and a bus;
The processor is connected with the memory, the input/output unit and the bus;
the memory stores a program that the processor invokes to perform the control method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the application provides a computer-readable storage medium having a program stored thereon, which, when executed on a computer, performs the control method of the first aspect or any optional implementation of the first aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
In the application, human gait data is acquired first, the human gait data being gait data of human straight-knee walking. Motion redirection processing is performed on the human gait data, converting it into reference motion sequence data of the target humanoid robot. The target humanoid robot walking task is modeled as a speed-conditioned Markov decision process, an adversarial network is constructed under this model, and a maximized expected discounted return function is generated according to the environment variable distribution and the decision strategy. A speed tracking reward function is constructed according to the centroid linear velocity and angular velocity of the target humanoid robot. A soft-boundary Wasserstein loss function for the discriminator is constructed using the real data distribution, the generated data distribution, and a gradient penalty term over samples from a specific distribution, and a style reward function is constructed from the output of the discriminator. A PD controller is constructed so that it converts the actions output by the decision strategy into torque signals for the target humanoid robot. After imitation learning is performed using the constructed adversarial network, physical parameters of the target humanoid robot involved in the straight-knee walking training are selected according to their degree of correlation, and a prior distribution is set for each physical parameter. Real-world test motions are performed with the target humanoid robot, and operation data corresponding to the physical parameters are collected. The collected operation data are combined with the prior distributions corresponding to the physical parameters, the updated posterior distributions are fed back to the imitation learning stage, and the control strategy of the imitation learning stage is optimized.
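As an illustration of the final posterior-update step (the application does not specify a distribution family; a Gaussian prior with known observation noise is assumed here purely for the sketch):

```python
import numpy as np

def gaussian_posterior(mu0, var0, obs, obs_var):
    """Conjugate update of a Gaussian prior N(mu0, var0) over one
    physical parameter, given run data `obs` with (assumed known)
    Gaussian observation-noise variance `obs_var`."""
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    post_var = 1.0 / (1.0 / var0 + n / obs_var)
    post_mu = post_var * (mu0 / var0 + obs.sum() / obs_var)
    return post_mu, post_var

# Prior belief about a friction-like parameter; real-robot runs pull it down.
mu, var = gaussian_posterior(mu0=1.0, var0=0.25,
                             obs=[0.80, 0.70, 0.75], obs_var=0.04)
```

The posterior mean lands between the prior mean and the observed values, and the posterior variance shrinks below the prior variance, which is the behavior the feedback loop relies on.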
By introducing the Generative Adversarial Imitation Learning (GAIL) framework and taking real human gait data as demonstrations, the decision strategy of the humanoid robot is trained to approximate human gait characteristics in the behavior space, so that a natural straight-knee walking capability can be obtained directly without reward function design. On this basis, in order to further improve the stability and precision of the imitation effect, a Wasserstein distance based on optimal transport theory is introduced to construct an improved GAIL discriminator. The Wasserstein distance has better gradient properties and stronger distribution-fitting capability during training, and can improve the sensitivity of policy learning to fine motion differences. This improvement ensures that the robot not only captures the macroscopic morphological characteristics of human gait, but also reproduces its dynamic coordination and subtle rhythm more accurately. Additionally, the physical parameters are inferred from prior knowledge and the operation data of the humanoid robot, reducing the influence of the simulation-to-reality gap. This step works in concert with the above improvements, provides more accurate parameter support for straight-knee gait generation, and further improves the walking performance of the humanoid robot in different real environments.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when", "once", "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting the [described condition or event]" or "in response to detecting the [described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places throughout this specification are not necessarily all referring to the same embodiment, but mean "in one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the prior art, most mainstream humanoid robot control schemes are model-driven methods or reinforcement-learning-based policy learning methods. For model-based control, a typical method is a gait-generating controller based on the Linear Inverted Pendulum Model (LIPM); to keep the model computationally simple, the assumption that the centroid height is constant is often imposed. However, this assumption directly prevents the robot from achieving a straight-knee gait, so that the robot has to maintain walking stability with bent knees. If the centroid height constraint is relaxed, the gait generation model exhibits nonlinear characteristics, so the complexity of solving the system increases sharply, and kinematic singularities may even occur (such as the loss of a controllable degree of freedom when the knee joint is fully extended), affecting stability and control precision.
Secondly, the policy learning method based on reinforcement learning can directly optimize the control strategy in a high-dimensional control space and has good adaptability. However, when learning a specific target motion such as straight-knee walking, the large redundancy of the humanoid robot means that the search space contains a vast number of motion strategies that are physically feasible yet differ greatly in behavior characteristics. Without a targeted reward function to guide it, the learning process easily falls into local optima and can hardly converge to a natural straight-knee gait. Manually designing high-quality reward functions is not only costly but also lacks generality.
In summary, neither of these two humanoid robot control schemes can train the humanoid robot to walk well with straight knees, so the straight-knee walking performance of the humanoid robot in real environments is degraded.
In order to meet the higher requirements of the humanoid robot for naturalness, attractiveness and human-likeness in high-fidelity walking tasks, and to break through the technical bottlenecks of conventional model-driven control and conventional reinforcement learning methods in terms of policy expression capability, control stability and training guidance, the application provides a generative adversarial imitation learning method (Wasserstein-GAIL) based on optimal transport distance, which is used for realizing a humanoid robot walking control strategy with a natural straight-knee gait.
The technical scheme of the application introduces the idea of imitation learning for emerging business scenes with high requirements on robot motion expressiveness, such as catwalk display, immersive reception and stage performance. Taking real human gait data as a reference template, the motion trajectory of the humanoid robot is brought close to that of humans in temporal characteristics, posture style and dynamic rhythm through policy learning and training, thereby meeting scene requirements for anthropomorphic, natural and fluent motion.
The technical scheme adopts an improved GAIL framework as the core means, and introduces the Wasserstein distance to replace the traditional classification-based discriminator loss function, thereby improving the training stability of imitation learning and the policy-distribution fitting capability. This not only effectively avoids the dependence on a high-quality reward function in traditional reinforcement learning, but also significantly enhances the naturalness and expressive tension of the straight-knee gait generated by the robot.
As a breakthrough over the prior art, and to solve the problem that existing model-driven methods can hardly generate a straight-knee gait under the constant-centroid-height assumption, the Wasserstein-GAIL imitation learning framework serves as a brand-new control strategy generation paradigm with the following breakthroughs:
1. Eliminating model simplification constraints: simplified dynamic models such as the linear inverted pendulum are no longer relied upon, fundamentally avoiding the unnatural gait caused by model assumptions;
2. Avoiding reward function design: policy alignment is performed by imitating human demonstration data, replacing the complex and tedious manual reward-term design process of traditional reinforcement learning;
3. Improving learning stability and generalization capability: after the Wasserstein distance is introduced, the difference measure between the human and robot policy distributions is more continuous and robust, effectively improving convergence efficiency during training and the quality of the generated policy.
In conclusion, by constructing the Wasserstein-GAIL policy training mechanism, the application forms a straight-knee gait generation method that does not depend on modeling simplification or external reward guidance, has highly humanoid characteristics and good adaptability, can meet the technical requirements of typical scenes such as high-end service and performance interaction, and has remarkable engineering value and commercial application potential.
Based on the above, the application discloses a control method, a device and a storage medium for straight-knee walking of a humanoid robot, which are used for improving the straight-knee walking performance of the humanoid robot in different scenes.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The method of the present application may be applied to a server, a device, a terminal, or other apparatus having logic processing capability; the present application is not limited thereto. For convenience of description, the following description takes a terminal as the execution body.
Referring to fig. 1, the present application provides an embodiment of a control method for straight-knee walking of a humanoid robot, including:
101. Acquiring human gait data, wherein the human gait data are gait data of human straight knee walking;
102. Performing motion redirection processing on the human gait data, and converting the human gait data into reference motion sequence data of the target humanoid robot;
In the embodiment of the application, gait data of human straight-knee walking are acquired first, and then motion redirection processing is performed. Specifically, the objective of this step is to take the high-dimensional human gait data obtained from a motion capture device and, through a series of processing flows, generate standard reference trajectory data adapted to the humanoid robot in structure, size and motion style, for use by the subsequent imitation learning module. Specific steps of the motion redirection processing are described in the following embodiments. The human gait data can be acquired from an SMPL model, the AMASS database or a MoCap system, and is usually a sequence of 25-52 3D joints; its structure is inconsistent with that of the robot but has a certain similarity.
103. Modeling the target humanoid robot walking task as a speed-conditioned Markov decision process, constructing an adversarial network under this model, and generating a maximized expected discounted return function according to the environment variable distribution and the decision strategy;
In this embodiment, a core point is the use of Wasserstein adversarial imitation learning for the humanoid robot walking control strategy. Specifically, referring to fig. 9, fig. 9 is a schematic diagram of a humanoid robot walking control strategy based on Wasserstein adversarial imitation learning. Here, human demonstrations represents the natural motion reference data of the humanoid robot after motion redirection obtained in the previous step, and learned robot motions represents the motion data generated by the humanoid robot during training. The discriminator module then scores the similarity of the two groups of data: if the motion data generated by the humanoid robot is closer to the reference data, a higher score reward is obtained, thereby encouraging the behavior generated by the robot in the learning process to become more similar to the natural motion reference data and achieving a natural straight-knee walking effect; the specific details are described later. user commands is the user instruction input to the agent, and robot states is the state of the humanoid robot input to the agent. similarity reward is the similarity reward returned to the agent, generated by the corresponding reward function.
In this embodiment, the terminal models the humanoid robot walking task as a speed-conditioned Markov decision process. The formal definition includes a state space, an action space, a goal space, a policy function, and a reward function.
The state space comprises the current posture state of the robot (such as joint angles, velocities, centroid position, root posture and the like). The action space is the target position of each joint output by the policy network. The goal space is the linear velocity and angular velocity commands input by the user. The policy function represents the distribution of actions given the state and the desired velocity. The reward function consists of two sub-terms: a velocity tracking reward function and a style imitation reward function.
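For illustration, the spaces above can be represented as simple containers; all field names here are assumptions, not from the application:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotState:                 # state space (field names illustrative)
    joint_pos: np.ndarray         # joint angles
    joint_vel: np.ndarray         # joint velocities
    com_pos: np.ndarray           # centroid position
    root_orientation: np.ndarray  # root posture, e.g. a quaternion

@dataclass
class VelocityCommand:            # goal space: user velocity commands
    linear: np.ndarray            # desired centroid linear velocity
    angular: float                # desired angular velocity (yaw rate)

# The action space is the vector of target joint positions that the
# policy network outputs for the PD controller.
cmd = VelocityCommand(linear=np.array([0.5, 0.0]), angular=0.0)
```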
The training goal is to maximize the conditional expected cumulative reward (the expected discounted return) J(π):
J(π) = E_{v∼p(v)} E_{τ∼p(·|π,v)} [ Σ_t γ^t r(s_t, a_t, v) ]
This function represents the expected total reward J(π) of a given policy π. Here v ∼ p(v) denotes sampling a specific environment parameter (such as a task configuration) from the environment-variable distribution p(v), and τ ∼ p(·|π, v) denotes generating a trajectory according to the policy π in that environment. The sum Σ_t γ^t r(s_t, a_t, v) is the discounted accumulation of rewards along this trajectory, where γ is the discount factor and r is the reward function, depending on the current state s_t, the action a_t and the environment variable v. Overall, J(π) measures the average performance of a policy across the possible environments.
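The expected discounted return can be estimated by Monte-Carlo rollouts; a minimal sketch (function names are illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted accumulation sum_t gamma**t * r_t along one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(gamma ** np.arange(rewards.size), rewards))

def estimate_J(sample_env, rollout, policy, n_samples=100, gamma=0.99):
    """Monte-Carlo estimate of J(pi): sample an environment parameter
    v ~ p(v) via `sample_env`, generate a reward trajectory with
    `rollout(policy, v)`, and average the discounted returns."""
    returns = [discounted_return(rollout(policy, sample_env()), gamma)
               for _ in range(n_samples)]
    return float(np.mean(returns))

# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```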
104. Constructing a speed tracking reward function according to the centroid linear speed and the angular speed of the target humanoid robot;
In this embodiment, a velocity tracking reward function r_V needs to be constructed. This embodiment adopts an exponential-decay function to measure the deviation between the current actual velocity and the target velocity, specifically:
r_V = α_v · exp(−σ_v ‖v̂* − v̂‖²) + α_ω · exp(−σ_ω ‖ω̂* − ω̂‖²)
where v̂ and ω̂ are the current centroid linear velocity and angular velocity of the robot, v̂* and ω̂* are the remote-control target centroid linear velocity and angular velocity from external input, α_v and α_ω are a set of hyperparameters controlling the importance of each tracking error, and σ_v and σ_ω are another set of hyperparameters adjusting the tracking accuracy of the corresponding term.
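A minimal sketch of such an exponential-decay tracking reward (the hyperparameter names and values are assumptions):

```python
import numpy as np

def velocity_reward(v, w, v_cmd, w_cmd,
                    a_lin=1.0, a_ang=0.5, s_lin=4.0, s_ang=4.0):
    """Exponential-decay velocity-tracking reward. a_* weight the
    importance of the linear/angular terms; s_* set how sharply the
    reward decays with tracking error."""
    lin_err = np.sum((np.asarray(v_cmd, dtype=float)
                      - np.asarray(v, dtype=float)) ** 2)
    ang_err = (w_cmd - w) ** 2
    return a_lin * np.exp(-s_lin * lin_err) + a_ang * np.exp(-s_ang * ang_err)

# Perfect tracking yields the maximum reward a_lin + a_ang = 1.5;
# the reward decays smoothly as the velocity error grows.
r_perfect = velocity_reward(v=[1.0, 0.0], w=0.0, v_cmd=[1.0, 0.0], w_cmd=0.0)
r_off = velocity_reward(v=[0.5, 0.0], w=0.0, v_cmd=[1.0, 0.0], w_cmd=0.0)
```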
105. Constructing a soft-boundary Wasserstein loss function for the discriminator by using the real data distribution, the generated data distribution, and a gradient penalty term over samples from a specific distribution, and constructing a style reward function from the output of the discriminator;
In this embodiment, to make the robot's behavior style approach the reference data (e.g., real human gait), a Wasserstein critic network is introduced to score the style. This module is constructed based on an improved Wasserstein-1 distance, and a soft-boundary Wasserstein loss function is constructed, as described in detail below.
1) The input and feature extraction modes of the discriminator network are as follows:
The reference data distribution is extracted from continuous N frames of state data in the action data;
the strategy generation distribution is that N frames of states are extracted as well;
and (3) a feature construction function, which extracts local motion-style features such as joint velocity distribution, step frequency and centroid offset from the whole motion data.
2) Design of the discriminator loss function (soft-boundary Wasserstein loss function):
In this embodiment, the unbounded output range of the original WGAN-GP is improved: output-range compression is introduced to prevent extreme reward values and improve training stability, and the following Wasserstein loss function with a soft boundary is proposed:
In this embodiment, with θ as the discriminator parameter, x drawn from the real data distribution P_r, x̃ from the generated data distribution P_g, and x̂ sampled from the specific distribution P_x̂, the loss is:

L(θ) = E_{x̃∼P_g}[tanh(η·D_θ(x̃))] − E_{x∼P_r}[tanh(η·D_θ(x))] + λ·E_{x̂∼P_x̂}[(‖∇_x̂ D_θ(x̂)‖₂ − 1)²]

where D_θ(x) is the output of the discriminator for x, E is the expectation, tanh is the activation function compressing the output range, η and λ are hyper-parameters, and ∇_x̂ is the gradient with respect to x̂.
It can be seen that the loss function takes the WGAN-GP form. Its connection with the Wasserstein distance is that it keeps the discriminator Lipschitz-continuous so that it approximates the Wasserstein distance: by adding a gradient penalty term instead of weight clipping, the function forces the discriminator's gradient norm toward 1 to satisfy the Lipschitz condition, thereby stabilizing the approximation of the Wasserstein distance and guiding generator optimization. The style reward is then calculated; the final imitation-style reward is given by the following function:
The style reward function ensures that if a generated action is scored by the discriminator as "close to the reference distribution", it receives a higher style reward, encouraging the robot to exhibit a natural, anthropomorphic straight-knee walking strategy. The final target reward function is generated from the speed tracking reward function and the imitation-style reward by weighted superposition.
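A minimal numerical sketch of the soft-boundary critic loss is given below. To stay self-contained it assumes a linear discriminator D(x) = w·x + b, whose gradient with respect to x is simply w (a real implementation would use automatic differentiation); all names are illustrative:

```python
import numpy as np

def soft_wasserstein_loss(w, b, x_real, x_gen, eta=0.3, lam=10.0, rng=None):
    """Soft-boundary Wasserstein critic loss with gradient penalty, sketched for a
    linear critic D(x) = w.x + b so that grad_x D(x) = w is known analytically.
    tanh(eta*D) compresses the critic's output range to (-1, 1); the penalty
    pushes the critic's gradient norm at interpolated points x_hat toward 1
    (Lipschitz condition)."""
    rng = rng or np.random.default_rng(0)
    D = lambda x: x @ w + b
    # soft-boundary terms: generated-sample score minus real-sample score
    loss = np.mean(np.tanh(eta * D(x_gen))) - np.mean(np.tanh(eta * D(x_real)))
    # gradient penalty evaluated on interpolates between real and generated samples
    alpha = rng.uniform(size=(len(x_real), 1))
    x_hat = alpha * x_real + (1.0 - alpha) * x_gen
    grad = np.broadcast_to(w, x_hat.shape)  # grad_x D is constant for a linear critic
    gp = np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)
    return loss + lam * gp
```

For a nonlinear critic the penalty term would be computed with autograd at each `x_hat`; the linear case is used here only so the sketch can be checked by hand.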
106. Constructing a PD controller so that the PD controller converts the action output by the decision strategy into a torque signal of the target humanoid robot;
In this embodiment, the control execution mechanism of the terminal is a PD-controller torque map. The action output by the decision strategy network is the desired position of each key joint, and a PD controller converts it into a torque signal that can drive the humanoid robot:
τ = k_p·(θ_t − θ̂) + k_d·(dθ_t/dt − dθ̂/dt)

where τ represents the control torque, θ̂ represents the estimated (measured) joint angle, and θ_t is the target angle value. k_p is the proportional gain coefficient, adjusting the intensity of the control action in relation to the angle error; dθ_t/dt is the derivative of the target angle θ_t with respect to time, i.e., the rate of change of the target angle, and dθ̂/dt is the estimated joint velocity; k_d is the differential gain coefficient, adjusting the control intensity in relation to the velocity error. The formula computes the control torque in the standard control-theoretic manner, determining the control quantity from the angle error and the rate-of-change terms together.
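The PD torque map can be written as a one-line function; the signature is illustrative:

```python
def pd_torque(theta_target, theta, theta_dot_target, theta_dot, kp, kd):
    """PD torque map: tau = kp*(theta_t - theta) + kd*(theta_t_dot - theta_dot).
    Converts a desired joint position (the policy's action) into a drive torque."""
    return kp * (theta_target - theta) + kd * (theta_dot_target - theta_dot)
```

In a typical control loop this is evaluated per joint at a high rate (e.g., every millisecond), with the policy updating `theta_target` at a lower rate.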
107. After imitation learning is performed using the constructed adversarial network, selecting physical parameters of the target humanoid robot in the straight-knee walking training according to their degree of correlation, and setting a prior distribution for each physical parameter;
108. performing live-action test movement for the target humanoid robot, and collecting operation data corresponding to physical parameters;
109. and combining the collected operation data with the prior distribution corresponding to the physical parameters, feeding back the updated posterior distribution to the imitation learning stage, and optimizing the control strategy of the imitation learning stage.
In applying the humanoid robot from the simulation environment to the real environment, differences in physical parameters are a key factor limiting its performance. In this embodiment, a Bayesian method is used to address the sim2real problem.
(1) Determining physical parameters and a priori distributions
Firstly, the physical parameters of the humanoid robot that need to be optimized are clearly defined. In this embodiment, based on the straight-knee walking gait task, these cover parameters that strongly influence the straight-knee walking motion of the humanoid robot, such as the joint friction coefficient, the motor torque constant and the mass distribution parameters. Based on training experience, research data on robots of the same type and preliminary simulation experiment results, a reasonable prior distribution is set for each determined physical parameter. For example, the joint friction coefficient is assumed to follow a uniform distribution over an interval determined by the materials and design of the mechanical components, while a normal distribution, informed by the motor's technical specifications, describes the possible value range of the motor torque constant.
(2) True machine test and data acquisition
A series of well-designed test motions is carried out on the real humanoid robot. These test motions need to include single-joint movements, such as knee flexion-extension and shoulder rotation, for separately acquiring information about the physical parameters associated with each joint, as well as complex whole-body walking movements for comprehensively analyzing the influence of multiple physical parameters on the robot's overall motion. During testing, abundant data are collected in real time by the robot's built-in sensors, such as joint position sensors, force sensors and acceleration sensors. The data types include the change of joint angles over time, motor current consumption, the motion trajectory of the robot's centroid, and the forces and torques applied to each joint.
(3) Bayesian reasoning and parameter updating
Using a Bayesian inference algorithm, the acquired real data are combined with the preset prior distributions, and the posterior distributions of the physical parameters are updated iteratively. The Bayesian formula is P(θ|D) = P(D|θ)·P(θ)/P(D), where P(θ) is the prior distribution of the physical parameter θ, P(D|θ) is the likelihood function of observing the data D under the parameter θ, P(D) is the marginal probability of the data D, and P(θ|D) is the updated posterior distribution. In actual calculation, the posterior distribution is obtained by evaluating the likelihood function under different parameter values and combining it with the prior distribution. For example, using the joint angle and motor current data recorded while the robot walks, the probability of the data appearing under different friction-coefficient and torque-constant assumptions is calculated, and the posterior distributions of these parameters are updated.
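The Bayesian update can be illustrated with a discrete (grid) posterior over candidate parameter values; the function name and the Gaussian likelihood below are assumptions made for the sketch:

```python
import math

def grid_posterior(prior, likelihood, data):
    """Discrete Bayes update: posterior(theta) ∝ P(data | theta) * prior(theta),
    normalized by the evidence P(data) = sum_theta P(data | theta) * prior(theta)."""
    unnorm = {th: likelihood(data, th) * p for th, p in prior.items()}
    evidence = sum(unnorm.values())
    return {th: u / evidence for th, u in unnorm.items()}

# Example: two candidate friction coefficients with a uniform prior; a noisy
# measurement of 0.55 shifts posterior mass toward the nearby candidate 0.6.
gaussian_lik = lambda d, th: math.exp(-(d - th) ** 2 / 0.02)
posterior = grid_posterior({0.2: 0.5, 0.6: 0.5}, gaussian_lik, 0.55)
```

Continuous parameters would use a finer grid or a sampling method (e.g., MCMC); the normalization-by-evidence step is the same.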
(4) Parameter feedback and policy optimization
The updated physical parameters are fed back to the imitation learning stage. In the Wasserstein-distance-based adversarial imitation learning, the control strategy is re-optimized using the new physical parameters. For example, when the dynamic model of the robot is calculated, the updated mass distribution parameters and joint friction coefficients are adopted, so that the straight-knee walking motion generated by the strategy network better conforms to the physical characteristics of the real robot. Meanwhile, the new physical parameters are taken into account in the calculation of the speed tracking reward function and the style imitation reward function, adjusting the weights and computation of the rewards and guiding the robot to learn a straight-knee walking strategy better suited to the real environment. By continuously repeating the processes of real-machine testing, data acquisition, Bayesian reasoning and strategy optimization, the optimal physical parameters and control strategy for the real environment are gradually approximated, ensuring that the robot achieves stable and natural straight-knee walking in real scenes.
In this embodiment, human gait data, i.e., gait data of straight-knee human walking, are acquired first. Action redirection processing is performed on the human gait data, converting them into reference motion sequence data of the target humanoid robot. The walking task of the target humanoid robot is modeled as a speed-conditioned Markov decision process, an adversarial network is constructed under this modeling, and a maximum expected discounted return function is generated according to the environment-variable distribution and the decision strategy. A speed tracking reward function is constructed from the centroid linear velocity and angular velocity of the target humanoid robot. A soft-boundary Wasserstein loss function of the discriminator is constructed using gradient penalty terms of the real data distribution, the generated data distribution, a specific distribution and samples from that specific distribution, and a style reward function is constructed from the output of the discriminator. A PD controller is constructed so that it converts the actions output by the decision strategy into torque signals for the target humanoid robot. After imitation learning with the constructed adversarial network is completed, the physical parameters of the target humanoid robot in the straight-knee walking training are selected according to their degree of correlation, and a prior distribution is set for each physical parameter. Real-machine test motions are performed with the target humanoid robot, and operation data corresponding to the physical parameters are collected. The collected operation data are combined with the prior distributions of the physical parameters, the updated posterior distributions are fed back to the imitation learning stage, and the control strategy of the imitation learning stage is optimized.
By introducing the Generative Adversarial Imitation Learning (GAIL) framework and using real human gait data as demonstrations, the decision strategy of the humanoid robot is trained to approximate human gait characteristics in the behavior space, so that natural straight-knee walking capability can be obtained directly without manual reward-function design. On this basis, to further improve the stability and precision of the imitation effect, a Wasserstein distance based on optimal transport theory is introduced and an improved GAIL discriminator is constructed. The Wasserstein distance has better gradient properties and stronger distribution-fitting capability during training, and improves the sensitivity of strategy learning to fine motion differences. This improvement enables the robot not only to capture the macroscopic morphological characteristics of human gait but also to restore its dynamic coordination and subtle rhythm more accurately. Furthermore, the physical parameters are inferred from prior knowledge and the humanoid robot's operation data, reducing the influence of the simulation-reality gap. In cooperation with the above improvements, this provides more accurate parameter support for the generation of straight-knee gait and further improves the walking performance of the humanoid robot in different real environments.
Secondly, the embodiment also has the following beneficial effects:
1. The comprehensive process of redirecting human body to robot motion integrates a plurality of modules such as skeleton topological alignment, coordinate conversion, inverse kinematics optimization, time sequence reconstruction and the like, and forms a complete process of converting human body motion to robot motion.
2. Design of the multi-objective inverse kinematics optimizer, which simultaneously considers end precision, joint direction, physical limitation and motion smoothness, ensures that the output motion is natural and executable.
3. The Wasserstein adversarial imitation framework is introduced to replace the JS-divergence loss function of traditional GAIL, effectively alleviating the problems of unstable training and vanishing gradients.
4. The soft boundary Wasserstein loss function is designed to control the output range of rewards, prevent training failure caused by zero rewards or severe fluctuation, and improve the stability and expressive force of simulated learning.
5. And the style rewards and speed rewards combined training mechanism is used for enabling the robot to simulate the action style of human and accurately track the target speed by fusing the simulated rewards and the speed tracking rewards.
Referring to fig. 2, the present application provides an embodiment of a method for action redirection processing, including:
201. performing skeleton topology unification and skeleton binding treatment on a human skeleton and a humanoid robot skeleton to generate an original skeleton, wherein the original skeleton is provided with a plurality of key joints;
In this embodiment, after the terminal obtains the gait data of straight-knee human walking, the data need to be unified according to the skeleton structures of the human body and the humanoid robot so that subsequent data processing can be adapted to the humanoid robot. The terminal therefore performs skeleton topology unification and skeleton binding on the human skeleton and the humanoid robot skeleton to generate an original skeleton provided with a plurality of key joints. The specific manner in which the original skeleton is generated is described in the following embodiments.
202. Carrying out coordinate system unified processing and root standardized processing according to the human gait data and the original skeleton to generate key joint data of the humanoid robot;
in this embodiment, the human gait data is generally represented in a world coordinate system, and the humanoid robot control needs local coordinates, so that the terminal needs to perform coordinate system unified processing and root standardized processing according to the human gait data and the original skeleton, and generates key joint data of the humanoid robot. First, each frame of human body posture data is converted into local coordinates with pelvis as a root by using SE (3) homogeneous transformation. Next, the rotation data is expressed using quaternions or rotation vectors to ensure interpolation smoothness. Finally, all joint rotations are expressed relative to the parent nodes of the skeleton, so that the joint rotations can be directly used as target values for subsequent kinematic solutions.
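The root-normalization step — expressing a world-frame joint position in the pelvis (root) frame via the inverse SE(3) transform — might look like the following illustrative sketch:

```python
import numpy as np

def world_to_root_local(p_world, root_R, root_t):
    """Express a world-frame joint position in the pelvis (root) frame:
    p_local = R_root^T @ (p_world - t_root), i.e. the inverse of the root's
    SE(3) pose (rotation root_R, translation root_t)."""
    return root_R.T @ (np.asarray(p_world, float) - np.asarray(root_t, float))
```

Applied frame by frame, this removes the global drift of the walking motion, so that the remaining joint data describe the posture relative to the pelvis, as required for the subsequent kinematic solution.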
203. And carrying out multi-objective inverse kinematics optimization solution on the key joint data to generate a reference motion sequence, wherein the multi-objective inverse kinematics optimization solution is used for mapping Cartesian positions of the key joints and the pose of the end effector to corresponding joint angular directions.
In this embodiment, the terminal performs multi-objective inverse kinematics optimization solution on the key joint data to generate a corresponding reference motion sequence, where the multi-objective inverse kinematics optimization solution is used to map the cartesian position of the key joint and the pose of the end effector to a corresponding joint angular direction, and the specific mode is described in the subsequent embodiment.
Referring to fig. 3, the present application provides an embodiment of a method for generating an original skeleton, including:
301. constructing a kinematic tree of a human skeleton and a human-shaped robot skeleton;
302. The method comprises the steps of performing skeleton merging on kinematic trees of a human skeleton and a human-shaped robot skeleton to generate an original skeleton, and reserving one skeleton between key joints;
303. the key joints were selected and the length of each bone segment was recorded.
In this embodiment, the input human gait data is usually a sequence of 25-52 3D joints, whose structure is inconsistent with that of the robot but shows a certain similarity: the human skeleton and the humanoid robot skeleton can both be abstracted as homeomorphic graphs. Exploiting this property, this embodiment proposes an intermediate skeleton structure, the original skeleton, which retains the geometric and hierarchical properties of both skeletons. On this basis, kinematic trees of the human skeleton and the robot skeleton are first constructed and then merged into a unified original skeleton, retaining only one bone between key joints. The user manually selects n key joints, e.g., "hip-knee-ankle" for the legs and "shoulder-elbow-wrist" for the arms, and the length ratio of each bone segment is recorded.
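The "retain one bone between key joints" merging step can be sketched as pruning a child-to-parent kinematic tree so that each selected key joint is re-parented to its nearest key-joint ancestor (names illustrative):

```python
def prune_to_key_joints(parent, key_joints):
    """Collapse a kinematic tree (child -> parent dict) so that only the selected
    key joints remain, each re-parented to its nearest key-joint ancestor --
    i.e. exactly one bone is kept between consecutive key joints."""
    key = set(key_joints)
    pruned = {}
    for j in key_joints:
        p = parent.get(j)
        while p is not None and p not in key:
            p = parent.get(p)   # skip intermediate (non-key) joints
        pruned[j] = p
    return pruned
```

For example, with a leg chain hip → knee → shin → ankle and key joints {hip, knee, ankle}, the intermediate "shin" node is skipped and the ankle is attached directly to the knee.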
Referring to fig. 4, the present application provides an embodiment of a method of generating a reference motion sequence, comprising:
401. Constructing a key joint position matching loss function according to the actual position of a key joint of the humanoid robot and the expected target position of the key joint in key joint data;
402. constructing an end effector posture matching loss function according to the actual posture of the end effector of the humanoid robot and the expected target posture in the key joint data;
403. Constructing a joint minimum displacement loss function according to joint angle data between adjacent frames in the key joint data;
404. Generating an objective function according to the key joint position matching loss function, the end effector gesture matching loss function and the joint minimum displacement loss function;
405. and carrying out multi-objective inverse kinematics optimization solution on the key joint data according to the objective function, and generating a reference motion sequence.
In order to map the Cartesian positions of key joints and the pose of the end effector to the corresponding joint angular orientations, the whole-body inverse kinematics is modeled as a gradient-based multi-objective optimization problem in this embodiment. The optimization problem includes the following three goals:
The three targets C_1, C_2 and C_3 are specifically defined as follows:
C_1 is the key-joint position matching loss (the smaller, the better), used to ensure that the positions of the key joints of the humanoid robot are as close as possible to the target positions in the human gait data. Here p_k^r denotes the desired target position of the robot's key joints (elbows, knees, shoulders, etc.) obtained from the human body data, p_k(θ) denotes the actual position of key joint k computed from the current robot joint angles, and k runs over the key joints.
C_2 is the end-effector pose matching loss (the smaller, the better), used to make the position and pose of end parts such as the hands, soles and head reproduce the original motion as faithfully as possible. Here p_e^r denotes the desired target position and pose of the robot's end effector obtained from the human body data, and p_e(θ) denotes the actual pose of the end effector computed from the current robot joint angles.
C_3 is the joint minimum-displacement loss, which encourages the variation of the joint angles θ between adjacent frames to be as small as possible, maintaining motion continuity and smoothness. For humanoid robots with higher redundancy, C_3 helps select a more natural action sequence when multiple sets of solutions exist.
Finally, the generated objective function C is a weighted sum of the three objective functions; the weights k_i need to be adjusted to the specific situation (described in detail in a subsequent embodiment), and the joint angles are then optimized iteratively to make the total error as small as possible:

C(θ) = k_1·C_1 + k_2·C_2 + k_3·C_3

where k_i is the weight of each sub-objective, corresponding respectively to the key-joint position loss, the end-pose loss and the minimum-displacement loss. During optimization, the preset joint-angle ranges and joint-velocity constraints must be satisfied, so that the generated data better fits the motion of the humanoid robot while its state still matches the human gait data.
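A toy version of the weighted multi-objective optimization, using finite-difference gradient descent in place of a full inverse-kinematics solver (purely illustrative; a real solver would use analytic Jacobians and enforce the joint constraints):

```python
def ik_objective(theta, losses, weights):
    """Weighted multi-objective IK cost: C(theta) = sum_i k_i * C_i(theta)."""
    return sum(k * L(theta) for k, L in zip(weights, losses))

def minimize_ik(theta0, losses, weights, lr=0.1, steps=200, eps=1e-5):
    """Gradient descent on the weighted cost via central finite differences."""
    theta = list(theta0)
    for _ in range(steps):
        for i in range(len(theta)):
            t_hi = theta[:]; t_hi[i] += eps
            t_lo = theta[:]; t_lo[i] -= eps
            g = (ik_objective(t_hi, losses, weights)
                 - ik_objective(t_lo, losses, weights)) / (2 * eps)
            theta[i] -= lr * g
    return theta
```

With two competing quadratic sub-objectives of equal weight, the optimizer settles on the compromise between their individual minima, which is exactly the balancing role the weights k_i play among C_1, C_2 and C_3.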
Referring to fig. 5, the present application provides an embodiment of a method for generating an objective function, including:
501. determining acquisition scene parameters of human gait data;
In this embodiment, the scene in which the human gait data were acquired must first be determined and the corresponding environmental parameters recorded; a horizontal wooden floor in a windless environment serves as the standard scene. For environmental-parameter calibration, the ground flatness is measured with a laser range finder, with the error controlled within ±0.5 mm.
In this embodiment, the subjects providing the human gait data have a height range of 160-190 cm and a weight of 50-90 kg, and wear tight sports suits to reduce interference from clothing. A 10-minute dynamic warm-up precedes acquisition, and 3 standardized gait tests (e.g., straight-knee walking, cornering) are completed to establish baseline data.
502. Performing texture contrast analysis and inclination contrast analysis on the acquired scene parameters and the standard scene parameters to generate gait deformation data;
The terminal then performs texture contrast analysis and inclination contrast analysis on the acquired scene parameters against the standard scene parameters to generate gait deformation data. Specifically, texture contrast analysis is performed first: taking the wooden floor as the standard texture, a ground friction coefficient model (μ = F_friction/F_normal) is established, and the friction characteristics of different materials (e.g., carpet μ = 0.3, tile μ = 0.6) are measured with a pressure distribution tester.
And then, slope comparison analysis is carried out, a slope geometric model is constructed by taking the horizontal ground as a standard, and the slope inclination angle (theta) is obtained through three-dimensional laser scanning.
Human gait deformation data are generated through the texture contrast analysis and inclination contrast analysis. Specifically, the deformation data can be obtained directly by table lookup from empirical data: different textures and inclinations cause characteristic deformations of human gait, and the corresponding deformation data can be derived from historical data.
In this embodiment, in order to ensure accuracy, the acquired human gait data is subjected to label generation, specifically, the texture comparison analysis result and the inclination comparison analysis result are used as labels, then the labeled human gait data are input into a gait biomechanical model (such as OpenSim) trained in advance, and the moment change of the joints is calculated through reverse dynamics, so that different key joint deformation data including rotational deformation and angular velocity deformation are generated.
503. Determining key joints with deformation on an original skeleton according to gait deformation data;
504. generating key joint position weights for the deformed key joints;
Next, the terminal determines the deformed key joints on the original skeleton according to the key-joint deformation data (gait deformation data) and generates key-joint position weights for them: the larger the deformation, the larger the adjusted weight. The deformation of each key joint under the scene transformation is analyzed through the gait biomechanical model, and the joint weights are adjusted accordingly, increasing the fit between the subsequently generated data and the actual acquisition scene and improving the subsequent training effect.
505. Determining pose change degree parameters according to gait deformation data, and generating tail end pose weights according to the pose change degree parameters;
In this embodiment, the terminal determines the pose-change-degree parameter from the gait deformation data; similarly, this parameter is obtained by analyzing how the human gait data change, after each frame's action is completed, relative to the same action in the standard scene. The end-pose weight is then generated from the pose-change-degree parameter.
506. Determining a minimum displacement loss weight;
The minimum displacement loss weight is not affected by the scene, so the same weight parameter is adopted, and no change is needed.
507. And generating an objective function according to the key joint position matching loss function, the end effector gesture matching loss function, the joint minimum displacement loss function, the key joint position weight, the end gesture weight and the minimum displacement loss weight.
An objective function is generated from the key-joint position matching loss function, the end-effector pose matching loss function, the joint minimum-displacement loss function, the key-joint position weight, the end-pose weight and the minimum-displacement loss weight; the formula is similar to that of step 405 and is not repeated here. In this embodiment, the deformation amount of the straight-knee gait data in different scenes is determined by analyzing the scene changes and is then decomposed into joint deformation and pose deformation. Finally, the corresponding weight parameters are adjusted so that the subsequent multi-objective inverse kinematics optimization solution is more relevant to the scene, improving the subsequent training effect.
Referring to fig. 6, the present application provides an embodiment of a method of processing a reference motion sequence, comprising:
601. And carrying out track quality optimization processing on the reference motion sequence.
In this embodiment, after solving the whole joint angle sequence, in order to ensure the quality of data, the terminal needs to further optimize the track quality by the following steps:
first, the linear velocity and angular velocity of the root and each joint are calculated using the difference between adjacent frames.
Linear interpolation is then used for position and spherical linear interpolation (Slerp) is used for direction to generate a continuous track between discrete frames.
And finally, smoothing the joint position and the joint speed by using an exponential moving average filter (EMA), eliminating jump and abnormal peak, and improving the action naturalness and the control robustness.
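The finite-difference velocity computation and EMA smoothing steps above can be sketched as follows (illustrative):

```python
def finite_diff_velocity(positions, dt):
    """Per-frame velocity via adjacent-frame differences: v_t = (p_{t+1} - p_t)/dt."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]

def ema_smooth(signal, alpha=0.3):
    """Exponential moving average filter: y_t = alpha*x_t + (1-alpha)*y_{t-1}.
    Suppresses jumps and outlier peaks while preserving the trajectory shape."""
    out = [signal[0]]
    for x in signal[1:]:
        out.append(alpha * x + (1.0 - alpha) * out[-1])
    return out
```

Position interpolation between discrete frames would be linear for translations and Slerp for orientations (e.g., via a quaternion library); the filter above is applied afterward to both joint positions and the differenced velocities.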
Referring to fig. 7, the present application provides an embodiment of a control device for walking with a straight knee of a humanoid robot, including:
the first acquisition unit 701 is configured to acquire human gait data, where the human gait data is gait data of walking with the straight knee of the human body;
a redirecting unit 702, configured to perform action redirecting processing on the human gait data, and convert the human gait data into reference motion sequence data of the target humanoid robot;
optionally, the redirecting unit 702 includes:
The first generation module is used for carrying out skeleton topology unification and skeleton binding treatment on the human skeleton and the humanoid robot skeleton to generate an original skeleton, and a plurality of key joints are arranged on the original skeleton;
Optionally, the first generation module comprises the steps of constructing a kinematic tree of a human skeleton and a human-shaped robot skeleton, carrying out skeleton combination on the kinematic tree of the human skeleton and the human-shaped robot skeleton to generate an original skeleton, reserving one skeleton among key joints, selecting the key joints and recording the length of each skeleton.
The second generation module is used for carrying out coordinate system unified processing and root standardized processing according to the human gait data and the original skeleton to generate key joint data of the humanoid robot;
The third generation module is used for carrying out multi-objective inverse kinematics optimization solution on the key joint data to generate a reference motion sequence, and the multi-objective inverse kinematics optimization solution is used for mapping Cartesian positions of the key joints and the pose of the end effector to corresponding joint angular directions.
Optionally, the third generating module includes:
The first construction submodule is used for constructing a key joint position matching loss function according to the actual position of a key joint of the humanoid robot and the expected target position of the key joint in key joint data;
The second construction submodule is used for constructing an end effector posture matching loss function according to the actual posture of the end effector of the humanoid robot and the expected target posture in key joint data;
The third construction submodule is used for constructing a joint minimum displacement loss function according to joint angle data between adjacent frames in the key joint data;
The first generation submodule is used for generating an objective function according to the key joint position matching loss function, the end effector posture matching loss function and the joint minimum displacement loss function;
The first generation submodule comprises the steps of determining acquisition scene parameters of human gait data, carrying out texture contrast analysis and inclination contrast analysis on the acquisition scene parameters and the flat scene parameters to generate gait deformation data, determining deformed key joints on an original skeleton according to the gait deformation data, generating key joint position weights for the deformed key joints, determining pose change degree parameters according to the gait deformation data, generating tail end pose weights according to the pose change degree parameters, determining minimum displacement loss weights, and generating objective functions according to a key joint position matching loss function, an end effector pose matching loss function, a joint minimum displacement loss function, key joint position weights, tail end pose weights and minimum displacement loss weights.
And the second generation submodule is used for carrying out multi-objective inverse kinematics optimization solution on the key joint data according to the objective function and generating a reference motion sequence.
And the optimizing module is used for carrying out track quality optimizing processing on the reference motion sequence.
A first construction unit 703, configured to model the target humanoid robot walking task as a markov decision process with a speed condition, construct an countermeasure network under the modeling, and generate a maximum expected discount return function according to the environmental variable distribution and the decision strategy;
A second construction unit 704, configured to construct a speed tracking reward function with a centroid linear speed and an angular speed of the target humanoid robot;
a third construction unit 705, configured to construct a soft-boundary Wasserstein loss function of the discriminator using gradient penalty terms over the real data distribution, the generated data distribution, a specific distribution, and samples from the specific distribution, and to construct a style reward function from the output of the discriminator;
A fourth construction unit 706, configured to construct a PD controller that converts the actions output by the decision strategy into torque signals for the target humanoid robot;
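A single-joint sketch of that conversion is shown below: the policy action is interpreted as a desired joint position, and the PD law turns the position and velocity errors into a torque, clamped to an actuator limit. The gain and limit values are illustrative assumptions.

```python
def pd_torque(q_des, q, qd, kp=50.0, kd=2.0, tau_max=100.0):
    # Standard PD law: tau = kp * (q_des - q) - kd * qd, where q_des is the
    # action output by the decision strategy. Gains and the torque limit
    # are hypothetical values for illustration.
    tau = kp * (q_des - q) - kd * qd
    return max(-tau_max, min(tau_max, tau))  # clamp to actuator limits
```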
A setting unit 707, configured to, after completing imitation learning with the constructed adversarial network, select the physical parameters of the target humanoid robot relevant to straight-knee walking training, and set a prior distribution for each physical parameter;
The second acquisition unit 708 is used for having the target humanoid robot perform real-world test motions and collecting operation data corresponding to the physical parameters;
The optimizing unit 709 is configured to combine the collected operation data with the prior distribution of each physical parameter, feed the updated posterior distribution back to the imitation learning stage, and optimize the control strategy of the imitation learning stage.
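A minimal sketch of combining a prior with collected data is the conjugate Gaussian update below, assuming a Gaussian prior over one physical parameter and Gaussian measurement noise of known variance; the actual inference scheme of the application is not specified here, so this is illustrative only.

```python
def gaussian_posterior(mu0, var0, obs, obs_var):
    # Conjugate Gaussian update for one physical parameter:
    #   prior N(mu0, var0), observations with known noise variance obs_var.
    # Returns the posterior mean and variance, which would be fed back to
    # the imitation learning stage to refine the simulated dynamics.
    n = len(obs)
    obs_mean = sum(obs) / n
    post_var = 1.0 / (1.0 / var0 + n / obs_var)
    post_mu = post_var * (mu0 / var0 + n * obs_mean / obs_var)
    return post_mu, post_var
```

Each real-robot test batch shrinks the posterior variance, so the distribution sampled during training tightens around the physically measured parameter value.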
Referring to Fig. 8, the present application provides a control device for straight-knee walking of a humanoid robot, comprising:
A processor 801, a memory 802, an input output unit 803, and a bus 804.
The processor 801 is connected to a memory 802, an input/output unit 803, and a bus 804.
The memory 802 holds a program, and the processor 801 calls the program to execute the control method shown in Figs. 1 to 6.
The present application provides a computer-readable storage medium having a program stored thereon which, when executed on a computer, performs the control method shown in Figs. 1 to 6.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.