Disclosure of Invention
The embodiment of the invention provides a deep reinforcement learning-based unmanned vehicle navigation method, which aims to solve the technical problems that existing unmanned vehicle navigation methods adapt poorly to the environment, generalize poorly from a training environment to an unknown environment, and require a long training time.
In one aspect of the present invention, a method for navigating an unmanned vehicle based on deep reinforcement learning is provided, including:
step S1, obtaining a depth image through an RGBD depth camera on the unmanned vehicle, down-sampling the obtained depth image to obtain an image with a resolution of 160 × 120, then performing bilinear interpolation processing to obtain a depth image with a size of 80 × 80 × 1, and forming a depth image matrix from all depth images of size 80 × 80 × 1;
step S2, calculating, from the wheel speed odometer of the unmanned vehicle, the positioning relative to the starting point, taking the x coordinate of the positioning as the first row of a second depth image and the y coordinate as the first column of the second depth image, and further integrating the second depth images to form a second depth image matrix representing the state of the unmanned vehicle;
and step S3, comparing the values in the second depth image matrix one by one to find the minimum value of the second depth pixel matrix (for example with a quicksort algorithm), comparing the minimum value with a set threshold value, controlling the motion of the unmanned vehicle in a kinematic constraint mode when the minimum value is larger than the set threshold value, and, when the minimum value is smaller than the set threshold value, inputting the second depth image into a deep learning network, constructing a Markov state space, deciding the next action randomly or according to the deep learning network, and comparing the minimum value with the threshold value again until the minimum value is larger than the set threshold value.
Further, in step S1, the specific process of down-sampling the acquired depth image to an image with a resolution of 160 × 120 and then performing bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1 includes: smoothing the image by using a Gaussian pyramid algorithm while retaining all boundary features of the image, obtaining an image with a resolution of 160 × 120 by stepwise down-sampling, and then processing the down-sampled 160 × 120 image by bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1.
Further, in the present invention,
the bilinear interpolation method is used for processing the down-sampled 160 × 120 image to obtain the depth image with a size of 80 × 80 × 1; the specific process is that the pixels are first linearly interpolated in one direction in the image matrix according to the following formulas, and then linearly interpolated in the other direction:

f(x, y1) = ((x2 - x)/(x2 - x1))·f(x1, y1) + ((x - x1)/(x2 - x1))·f(x2, y1)
f(x, y2) = ((x2 - x)/(x2 - x1))·f(x1, y2) + ((x - x1)/(x2 - x1))·f(x2, y2)
f(x, y) = ((y2 - y)/(y2 - y1))·f(x, y1) + ((y - y1)/(y2 - y1))·f(x, y2)

wherein x is the coordinate of the pixel on the x axis in the image matrix, y is the coordinate of the pixel on the y axis in the image matrix, x1, x2, y1 and y2 are the coordinates of the four known neighbouring pixels, and f denotes the pixel value.
Further, in step S3, the set threshold is adjusted according to the actual vehicle speed: it is adjusted to be larger when the turning radius of the unmanned vehicle is larger, and smaller when the turning radius is smaller; if the set threshold is too large, the training time becomes long, and if it is too small, the vehicle may collide with an obstacle.
Further, in the present invention,
the specific calculation process for controlling the motion of the unmanned vehicle in a kinematic constraint mode is carried out according to the following formula:

v = K·√(x_g² + y_g²)

wherein x_g and y_g are the coordinates of the target point in a Cartesian coordinate system, and K is a first scale coefficient;
ω = K_ω·(θ_g - θ)

wherein θ_g is the direction of the target point, θ is the direction of the current point, (θ_g - θ) is the difference between the two angles, and K_ω is the second scale coefficient.
Further, in the present invention,
in step S3, the deep learning network is a convolutional neural network including four convolutional layers and two fully-connected layers, and the deep learning network performs gradient descent processing on the policy function π_θ(s, a) according to the following formula:

dθ = ∇_θ log π_θ(s, a)·A(s)

wherein θ is a parameter of the neural network, A(s) is an advantage function for evaluating the policy gradient update, and π_θ(s, a) is the policy function;
the deep learning network performs gradient descent processing on the evaluation function V(s, θ_v) according to the following formula:

dθ_v = ∂(R + γ·V(s′, θ_v) - V(s, θ_v))² / ∂θ_v

wherein R is the corresponding reward value, γ is the discount coefficient, V is the state value function, and s′ is the next state.
Further, the reward penalty value R includes a penalty value for approaching an obstacle; the final value of a single round is the sum of all reward and penalty values, which specifically include a collision penalty value, a straight-going or turning reward value, a reward value for driving toward the target point, a penalty value for deviating from the target point, and a penalty value for approaching an obstacle.
Further, in the present invention,
the reward value for going straight or turning is calculated according to the following formula:
(0.1*v)/(|ω|+0.1)
wherein v is the velocity value of the unmanned vehicle, and omega is the angular velocity of the unmanned vehicle;
the penalty value for approaching an obstacle is calculated according to the following formula:
-1/(x_min - 0.4)
wherein x_min is the minimum value within the second depth image matrix.
Further, in step S3, the Markov state space is composed of a plurality of arrays, and each array includes at least the current state data of the unmanned vehicle, the current motion data of the unmanned vehicle, the reward value data corresponding to the current unmanned vehicle, and the next state data of the unmanned vehicle.
Further, in step S1, the method further includes preprocessing the depth image to reduce the salt-and-pepper noise of alternating bright and dark dots in the image, where the preprocessing includes at least median filtering, image cropping, and fast-marching restoration.
In summary, the embodiment of the invention has the following beneficial effects:
according to the navigation method of the unmanned vehicle based on deep reinforcement learning, the early-stage construction of the robot state space is optimized by combining a kinematic constraint model; under the same training time, the state space constructed based on the training mode provided herein is more reasonable and effective, so that the network learning efficiency is higher, the error convergence value is smaller, and the obstacle avoidance effect in an unknown environment is better;
the problem of unmanned vehicle navigation in an unknown environment is solved by an end-to-end motion decision navigation mode that dispenses with a map; meanwhile, the invention can be used for map-building work in an unknown environment, so that the trouble of manually controlling equipment to collect the map is avoided and the map collection efficiency is improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the navigation method of the unmanned vehicle based on deep reinforcement learning provided by the invention adopts a training mode based on minimum depth-of-field information and optimizes the early-stage construction of the robot state space by combining a kinematic constraint model, that is, it reduces training time by means of artificial guidance. Under the same training time, the state space constructed based on the training mode provided herein is more reasonable and effective, the network learning efficiency is higher, the error convergence value is smaller, and the obstacle avoidance effect in an unknown environment is better. The method also overcomes the limitation that the DQN algorithm can only make the robot output a finite set of discrete actions, and enables the robot to output execution actions within continuous speed and steering-angle intervals.
Fig. 2 is a schematic diagram of an embodiment of a deep reinforcement learning-based unmanned vehicle navigation method according to the present invention. In this embodiment, the method comprises the steps of:
step S1, obtaining a depth image through an RGBD depth camera on the unmanned vehicle, down-sampling the obtained depth image to obtain an image with a resolution of 160 × 120, then performing bilinear interpolation processing to obtain a depth image with a size of 80 × 80 × 1, and forming a depth image matrix from all depth images of size 80 × 80 × 1;
in a specific embodiment, the specific process of sampling the acquired depth image to obtain an image with a resolution of 160 × 120 and then performing secondary linear interpolation to obtain a depth image with a size of 80 × 1 includes the steps of smoothing the image by using a gaussian pyramid algorithm, retaining all boundary characteristic values of the image, and obtaining an image with a resolution of 160 × 120 through gradient down-sampling; the Gaussian pyramid is an existing algorithm, is commonly used in image downsampling, and can smoothly process an image on the premise of better retaining the characteristics of the image; the boundary characteristic value is a characteristic point in computer vision, refers to places with sharp changes of corners and textures and the like in an image, and particularly refers to a pixel with a large first-order derivative in a pixel matrix, and can refer to an SIFT operator detection algorithm; and then processing the 160 × 120 image after the down sampling by an image quadratic linear interpolation method to obtain a depth image with the size of 80 × 1 as an observed state, wherein the larger the size, the more the GPU memory is spent in the deep learning process, the longer the learning time is required, but the smaller the size, the boundary information in the image cannot be fully reserved, and the learning result is influenced. In one embodiment, this value may be set or changed based on the GPU capability of the computer.
Specifically, the bilinear interpolation of the down-sampled 160 × 120 image performs linear interpolation on the pixels in one direction of the image matrix according to the following formulas, and then performs linear interpolation in the other direction:

f(x, y1) = ((x2 - x)/(x2 - x1))·f(x1, y1) + ((x - x1)/(x2 - x1))·f(x2, y1)
f(x, y2) = ((x2 - x)/(x2 - x1))·f(x1, y2) + ((x - x1)/(x2 - x1))·f(x2, y2)
f(x, y) = ((y2 - y)/(y2 - y1))·f(x, y1) + ((y - y1)/(y2 - y1))·f(x, y2)

wherein x is the coordinate of the pixel on the x axis in the image matrix, y is the coordinate of the pixel on the y axis in the image matrix, x1, x2, y1 and y2 are the coordinates of the four known neighbouring pixels, and f denotes the pixel value.
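This step-S1 pipeline can be illustrated with the following minimal sketch, assuming OpenCV and NumPy, a 640 × 480 raw depth frame, and an illustrative function name preprocess_depth:

```python
import cv2
import numpy as np

def preprocess_depth(depth: np.ndarray) -> np.ndarray:
    # Gaussian-pyramid down-sampling: each cv2.pyrDown call blurs with a
    # Gaussian kernel and halves both dimensions, preserving edge structure.
    img = depth.astype(np.float32)
    while img.shape[1] > 160:          # e.g. 640x480 -> 320x240 -> 160x120
        img = cv2.pyrDown(img)
    # Bilinear interpolation down to the 80x80 observation size.
    return cv2.resize(img, (80, 80), interpolation=cv2.INTER_LINEAR)
```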
Step S2, calculating, from the wheel speed odometer of the unmanned vehicle, the positioning relative to the starting point; the x coordinate of the positioning is written into the first row of the depth image and the y coordinate into its first column to form a second depth image, and the second depth images are integrated into a second depth image matrix representing the state of the unmanned vehicle. A coordinate transformation is established that unifies the coordinates in the actual environment with the acquired image coordinates; the position of the camera is adjusted according to the position of the unmanned vehicle in actual space and converted into the global coordinates of the map. Finally, the positioning precision of the wheel speed odometer is calibrated, so that the actual spatial position and the position in the image are unified and the precision is improved.
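A minimal sketch of this state construction follows; writing the scalar coordinates across the whole first row and first column is one reading of the description above, and the function and variable names are assumptions:

```python
import numpy as np

def build_state(depth_80: np.ndarray, x: float, y: float) -> np.ndarray:
    # Embed odometry-derived relative coordinates into the 80x80 depth image.
    state = depth_80.copy()
    state[0, :] = x   # x coordinate relative to the start point -> first row
    state[:, 0] = y   # y coordinate relative to the start point -> first column
    return state
```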
And step S3, comparing the values in the second depth image matrix one by one to find the minimum value of the second depth pixel matrix (for example with a quicksort algorithm), comparing the minimum value with a set threshold value, controlling the motion of the unmanned vehicle in a kinematic constraint mode when the minimum value is larger than the set threshold value, and, when the minimum value is smaller than the set threshold value, inputting the second depth image into a deep learning network, constructing a Markov state space, deciding the next action randomly or according to the deep learning network, and repeatedly comparing the minimum value with the threshold value until the minimum value is larger than the set threshold value.
In a specific embodiment, the unmanned vehicle state obtained through the preceding image processing includes the positioning and the depth image. The training mode based on the minimum value of the depth image improves the training speed of the model. As shown in fig. 3, the minimum value in the second depth pixel matrix is found by value-by-value comparison using an existing mature algorithm (for example, a quicksort algorithm). When the minimum value is greater than the previously set threshold value (specifically 0.7 m in this embodiment), the motion of the robot is controlled in a point-to-point kinematic constraint manner so that the robot moves smoothly to the target point. During the movement, once the minimum value in the depth image falls below the threshold, the depth image is input into the deep learning network, a Markov state space is constructed, and the next action is decided randomly or according to the network; if the minimum value exceeds the threshold again, the next action of the robot is again constrained kinematically, and the process repeats in a loop, as sketched below.
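The switching logic of step S3 can be illustrated with the following minimal sketch, assuming the 0.7 m threshold of this embodiment; get_state, kinematic_step, network_step and at_goal are hypothetical stand-ins for the interfaces described above:

```python
import numpy as np

THRESHOLD = 0.7  # metres; tuned to vehicle speed and turning radius

def navigate(get_state, kinematic_step, network_step, at_goal):
    while not at_goal():
        state = get_state()              # second depth image matrix
        # np.min stands in for the value-by-value / quicksort scan.
        if np.min(state) > THRESHOLD:
            kinematic_step()             # free space: kinematic constraint mode
        else:
            network_step(state)          # obstacle near: deep RL decision
```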
specifically, controlling the motion of the unmanned vehicle in the kinematic constraint mode means controlling the robot to move smoothly from the current point to the target point under kinematic constraints, and the motion parameters from the current point to the target point are calculated according to the following formula:

v = K·√(x_g² + y_g²)

wherein x_g and y_g are the coordinates of the target point in a Cartesian coordinate system; K is a first scale coefficient, used for calibration because the kinematic parameters differ between unmanned vehicle platforms, and its specific value can be adjusted and calibrated for the specific application; v is the movement speed of the unmanned vehicle;
ω = K_ω·(θ_g - θ)

wherein θ_g is the direction of the target point, θ is the direction of the current point, and (θ_g - θ) is the difference between the two angles, taking a value in (-π, π]; K_ω is a second scale coefficient calibrated in a specific experiment or embodiment, and ω is the angular velocity of the unmanned vehicle motion.
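Under the formulas above, a minimal sketch of the kinematic constraint controller follows; the gain values, the use of atan2 to obtain the target direction θ_g, and the function name are assumptions:

```python
import math

K, K_OMEGA = 0.5, 1.0  # illustrative gains; calibrate per platform

def kinematic_control(xg: float, yg: float, theta: float):
    # Linear speed proportional to the remaining distance to the target.
    v = K * math.hypot(xg, yg)
    # Heading error wrapped into a +/- pi interval before the turn-rate gain.
    theta_g = math.atan2(yg, xg)
    err = (theta_g - theta + math.pi) % (2 * math.pi) - math.pi
    omega = K_OMEGA * err
    return v, omega
```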
Specifically, the minimum depth threshold value needs to be set according to the actual vehicle speed: if the turning radius of the unmanned vehicle increases, the value should be increased appropriately, and vice versa. If the threshold is set too large, the training time becomes longer; if it is set too small, the vehicle may collide with an obstacle. This is also a very important point to check in later inspection.
In this embodiment, the purpose of inputting the second depth image into the deep learning network is to learn with the A3C algorithm; the neural network established by the present invention fits the policy function π_θ(s, a) and the evaluation function V(s, θ_v), so as to evaluate whether the decisions made in the present invention are reasonable;
the deep learning network is a convolutional neural network comprising four convolutional layers and two fully-connected layers, and parameters such as the number of layers, the learning rate and the greedy coefficient of the neural network need to be controlled. The number of convolutional layers is determined by the size of the processed image, namely the 80 × 80 × 1 depth image, and a four-layer convolutional network is built to better extract the image details at each layer. The learning rate can be set neither too low nor too high: too low a learning rate makes the learning time excessively long, while too high a learning rate can cause convergence to a local optimum; according to the actual learning process, the learning rate is adjusted to 10^-6;
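A possible realisation of such a network is sketched below in PyTorch; only the 80 × 80 × 1 input and the counts of convolutional and fully-connected layers come from the text, while the kernel sizes, strides and channel counts are assumptions:

```python
import torch
import torch.nn as nn

class NavNet(nn.Module):
    def __init__(self, n_actions: int = 2):
        super().__init__()
        # Four convolutional layers over the 1-channel 80x80 depth input.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),   # 80 -> 38
            nn.Conv2d(32, 32, 5, stride=2), nn.ReLU(),  # 38 -> 17
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),  # 17 -> 8
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),  # 8 -> 6
        )
        # First fully-connected layer; the actor/critic heads form the second.
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 6 * 6, 256), nn.ReLU())
        self.policy = nn.Linear(256, n_actions)  # actor head, e.g. (v, omega)
        self.value = nn.Linear(256, 1)           # critic head V(s)

    def forward(self, x: torch.Tensor):
        h = self.fc(self.conv(x))
        return self.policy(h), self.value(h)
```

Driving both heads with an optimiser such as torch.optim.Adam at the 10^-6 learning rate mentioned above would complete the setup.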
The deep learning network performs gradient descent processing on the policy function π_θ(s, a) according to the following formula:

dθ = ∇_θ log π_θ(s, a)·A(s)

wherein θ is a parameter of the neural network, A(s) is an advantage function for evaluating the policy gradient update, and π_θ(s, a) is the policy function fitted by the network;
the deep learning network performs gradient descent processing on the evaluation function V(s, θ_v) according to the following formula:

dθ_v = ∂(R + γ·V(s′, θ_v) - V(s, θ_v))² / ∂θ_v

wherein R is the corresponding reward value, γ is the discount coefficient, V is the state value function, and s′ is the next state.
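These two updates can be sketched as a combined A3C-style loss in PyTorch; the discount value 0.99, the 0.5 critic weighting and all names are assumptions, with the advantage taken as R + γ·V(s′) - V(s) as in the reconstructed formulas above:

```python
import torch

def a3c_losses(log_prob: torch.Tensor, value: torch.Tensor,
               next_value: torch.Tensor, reward: torch.Tensor,
               gamma: float = 0.99) -> torch.Tensor:
    # Advantage A(s) = R + gamma*V(s') - V(s), detached so the actor term
    # does not backpropagate through the critic.
    advantage = reward + gamma * next_value.detach() - value.detach()
    policy_loss = -(log_prob * advantage).mean()   # descent on -log(pi)*A
    td_error = reward + gamma * next_value.detach() - value
    value_loss = td_error.pow(2).mean()            # (R + gamma*V(s') - V(s))^2
    return policy_loss + 0.5 * value_loss
```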
In this embodiment, a reward penalty value, that is, a reward value, is introduced when learning in the deep learning network; the specific reward penalty rules are shown in fig. 4, where v is the speed of the robot with a range of [0.1, 0.6] and ω is the angular velocity of the robot with a range of [-1, 1]. The reward value is larger when the unmanned vehicle moves straight and smaller when it turns. When the minimum value x_min of the depth image falls below 0.7 m, a reward penalty value R close to the obstacle, that is, a penalty value for approaching the obstacle, is applied. The final value of a single round is the sum of all reward and penalty values, which specifically include a collision penalty value, a straight-going or turning reward value, a reward value for driving toward the target point, a penalty value for deviating from the target point, and a penalty value for approaching an obstacle.
The collision penalty value is -20, the reward value for driving toward the target point is 4, and the penalty value for deviating from the target point is -2;
the reward value for going straight or turning is calculated according to the following formula:
(0.1*v)/(|ω|+0.1)
wherein v is the velocity value of the unmanned vehicle, and omega is the angular velocity of the unmanned vehicle;
the penalty value for approaching an obstacle is calculated according to the following formula:
-1/(x_min - 0.4)
wherein x_min is the minimum value within the second depth image matrix.
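The listed terms can be combined as in the following sketch; the constants (-20, 4, -2), the straight-going term and the obstacle term come from the text, while the function signature, the boolean flags and the 0.7 m trigger are assumptions based on this embodiment:

```python
def step_reward(v: float, omega: float, x_min: float,
                collided: bool, toward_target: bool) -> float:
    r = 0.1 * v / (abs(omega) + 0.1)       # straight-going / turning term
    r += 4.0 if toward_target else -2.0    # toward / away from the target point
    if x_min < 0.7:                        # near an obstacle (embodiment threshold)
        r += -1.0 / (x_min - 0.4)          # approaching-obstacle penalty
    if collided:
        r += -20.0                         # collision penalty
    return r
```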
In this embodiment, the Markov state space is composed of a plurality of arrays; a single array includes at least the current state data of the unmanned vehicle, the current motion data of the unmanned vehicle, the reward value data corresponding to the current unmanned vehicle, and the next state data of the unmanned vehicle. The Markov state space provides the data set for model training.
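A minimal sketch of such an array store, with assumed names and capacity, follows; each entry is a (state, action, reward, next_state) tuple as described above:

```python
from collections import deque

class TransitionBuffer:
    def __init__(self, capacity: int = 10000):
        # Bounded store of Markov transitions for model training.
        self.data = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))
```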
According to the navigation method of the unmanned vehicle based on deep reinforcement learning, in order for the mobile robot to obtain better obstacle avoidance capability, the simulated training environment to be designed should have a certain complexity. The environment should include narrow passable road sections, walls, barriers with edges, and smooth barriers, as shown in fig. 5. Model learning therefore needs to be performed in a training environment to accumulate sufficient data, which improves the decision speed in practical application; navigation can then be realized when the trained strategy is used in the actual navigation process. The training strategy specifically includes performing accumulated training with a training amount of about 30,000 steps and a training time of about 6 hours for each training mode; the comparison of the final error curves is shown in fig. 6, where the horizontal axis represents the number of iterations and the vertical axis represents the error value, and where curve 1 (direct training) and curve 2 (the training rule of the present invention) are results smoothed by median-average filtering.
In this embodiment, in order to verify the training effect, 7 points may be designated in the test environment for navigation: the robot passes through positions 1 to 7 in sequence in the kinematic constraint mode, and when the robot comes too close (less than 0.6 m) to an obstacle, obstacle avoidance control is performed by the model in the deep reinforcement learning network of the present invention, thereby testing the ability of the two training modes to explore an unknown environment and plan a path through it.
According to the deep reinforcement learning-based unmanned vehicle navigation method, when unmanned vehicle navigation is performed in an indoor environment, in order to improve the effect and feasibility of the algorithm, the depth image acquired by a real depth camera needs to be preprocessed in step S1, considering that it contains salt-and-pepper noise of alternating bright and dark dots. Then, in step S2, the coordinate transformation is established and the position of the camera is converted into the global coordinates of the map; the positioning precision of the wheel speed odometer is calibrated; and the trained model is put into a real unmanned vehicle for navigation in a real environment.
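A hedged sketch of this preprocessing chain follows, assuming OpenCV and treating zero-valued pixels as missing depth; the crop margins, kernel size and 8-bit rescaling are illustrative choices, not taken from the text:

```python
import cv2
import numpy as np

def denoise_depth(depth: np.ndarray) -> np.ndarray:
    # Rescale depth to 8-bit, since cv2.inpaint requires an 8-bit image.
    img = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    img = cv2.medianBlur(img, 5)            # suppress salt-and-pepper noise
    img = img[10:-10, 10:-10]               # crop unreliable image borders
    mask = (img == 0).astype(np.uint8)      # mask of missing-depth pixels
    # cv2.INPAINT_TELEA implements the fast-marching restoration method.
    return cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
```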
In summary, the embodiment of the invention has the following beneficial effects:
according to the navigation method of the unmanned vehicle based on deep reinforcement learning, the early-stage construction of the robot state space is optimized by combining a kinematic constraint model; under the same training time, the state space constructed based on the training mode provided herein is more reasonable and effective, so that the network learning efficiency is higher, the error convergence value is smaller, and the obstacle avoidance effect in an unknown environment is better;
the problem of unmanned vehicle navigation in an unknown environment is solved by an end-to-end motion decision navigation mode that dispenses with a map; meanwhile, the invention can be used for map-building work in an unknown environment, so that the trouble of manually controlling equipment to collect the map is avoided and the map collection efficiency is improved.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.