Disclosure of Invention
      The embodiment of the invention provides a deep reinforcement learning-based unmanned vehicle navigation method, which aims to solve the technical problems that existing unmanned vehicle navigation methods adapt poorly to the environment, generalize poorly from the training environment to an unknown environment, and require a long training time.
      In one aspect of the present invention, a method for navigating an unmanned vehicle based on deep reinforcement learning is provided, including:
      step S1, obtaining a depth image through an RGB-D depth camera on the unmanned vehicle, down-sampling the obtained depth image to obtain an image with a resolution of 160 × 120, then performing bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1, and forming a depth image matrix from all depth images with the size of 80 × 80 × 1;
      step S2, calculating, by means of the wheel speed odometer of the unmanned vehicle, the position of the unmanned vehicle relative to the starting point, and, on the basis of the depth image matrix, taking the x coordinate of this position as the first row of a second depth image and the y coordinate as the first column of the second depth image, and integrating the second depth images to form a second depth image matrix representing the state of the unmanned vehicle;
      and step S3, comparing the values in the second depth image matrix one by one to determine the minimum value in the second depth image matrix, for example by using a quicksort algorithm, and comparing the minimum value with a set threshold; when the minimum value is larger than the set threshold, controlling the motion of the unmanned vehicle in a kinematic constraint mode; when the minimum value is smaller than the set threshold, inputting the second depth image into a deep learning network, constructing a Markov state space, and deciding the next action either randomly or according to the deep learning network; and comparing the minimum value with the threshold again until the minimum value is larger than the set threshold.
      Further, in step S1, the specific process of down-sampling the acquired depth image to obtain an image with a resolution of 160 × 120 and then performing bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1 includes: smoothing the image by using a Gaussian pyramid algorithm while retaining all boundary features of the image, obtaining an image with a resolution of 160 × 120 by successive down-sampling, and then processing the down-sampled 160 × 120 image by an image bilinear interpolation method to obtain a depth image with a size of 80 × 80 × 1.
      Further, in the present invention,
      processing the down-sampled 160 × 120 image by the image bilinear interpolation method to obtain the depth image with the size of 80 × 80 × 1 specifically includes linearly interpolating the pixels along one direction of the image matrix according to the following formula, and then linearly interpolating along the other direction:
      
      wherein x is the coordinate coefficient of the pixel on the x axis in the image matrix, and y is the coordinate coefficient of the pixel on the y axis in the image matrix.
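      The formula referred to above is not reproduced in this text. For orientation only, the standard bilinear interpolation on a unit pixel cell, which matches the description of interpolating along one axis and then along the other, can be written as follows. The corner notation f(0,0), f(1,0), f(0,1), f(1,1) is an assumption introduced here, and x and y are taken as fractional offsets inside the cell.

```latex
f(x, y) \approx f(0,0)\,(1-x)(1-y) + f(1,0)\,x\,(1-y) + f(0,1)\,(1-x)\,y + f(1,1)\,x\,y,
\qquad 0 \le x \le 1,\; 0 \le y \le 1
```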
      Further, in step S3, the set threshold is adjusted according to the actual vehicle speed: the threshold is set larger when the turning radius of the unmanned vehicle is larger and smaller when the turning radius is smaller; if the threshold is set too large, the training time becomes long, and if it is set too small, the vehicle may collide with an obstacle.
      Further, in the present invention,
      the motion of the unmanned vehicle in the kinematic constraint mode is calculated according to the following formula:
      
      wherein $x_g$ and $y_g$ are the coordinates of the target point in a Cartesian coordinate system, and $K$ is a first proportionality coefficient;
      $\omega = K_\omega(\theta_g \ominus \theta)$
      wherein $\theta_g$ is the direction of the target point, $\theta$ is the direction of the current point, $\ominus$ denotes the difference between the two angles of the target point and the current point, and $K_\omega$ is a second proportionality coefficient.
      Further, in the present invention,
      in step S3, the deep learning network is a convolutional neural network including four convolutional layers and two fully-connected layers, and the deep learning network performs gradient descent on the policy function $\pi_\theta(s, a)$ according to the following formula:
      
      wherein $\theta$ is a parameter of the neural network, $A(s)$ is an advantage function for evaluating the policy gradient update, and $\pi$ is the policy function;
      the deep learning network performs gradient descent on the evaluation function $V(s, \theta_v)$ according to the following formula:
      
      wherein $R$ is the corresponding reward value, $\gamma$ is the greedy coefficient, $V$ is the state value function, and $v$ is the speed value of the unmanned vehicle.
      Further, the reward penalty value R includes a penalty value for approaching an obstacle, and the final penalty value of a single round is the sum of all penalty values, which specifically include a collision penalty value, a penalty value for driving straight or turning, a penalty value for driving toward the target point, a penalty value for deviating from the target point, and a penalty value for approaching an obstacle.
      Further, in the present invention,
      the penalty value for driving straight or turning is calculated according to the following formula:
      (0.1*v)/(|ω|+0.1)
      wherein v is the velocity value of the unmanned vehicle, and ω is the angular velocity of the unmanned vehicle;
      the penalty value for approaching an obstacle is calculated according to the following formula:
      -1/(x-0.4)
      where x is the minimum value within the second depth image matrix.
      Further, in step S3, the Markov state space is composed of a plurality of arrays, and each array includes at least the current state data of the unmanned vehicle, the current action data of the unmanned vehicle, the reward value data corresponding to the current state and action, and the next state data of the unmanned vehicle.
      Further, step S1 further includes preprocessing the depth image to reduce the alternating black-and-white bright and dark point noise in the image, where the preprocessing includes at least median filtering, image cropping, and fast-marching inpainting.
      In summary, the embodiment of the invention has the following beneficial effects:
      according to the deep reinforcement learning-based unmanned vehicle navigation method, the construction of the robot's state space in the early stage is optimized by combining a kinematic constraint model; for the same training time, the state space constructed with the training mode proposed herein is more reasonable and effective, so that the network learns more efficiently, the error converges to a smaller value, and obstacle avoidance in an unknown environment is better;
      the problem of unmanned vehicle navigation in an unknown environment is solved by an end-to-end motion-decision navigation mode that dispenses with the use of a map; meanwhile, the invention can be used for map building in an unknown environment, which avoids the trouble of manually controlling equipment to collect the map and improves the map collection efficiency.
    
    
      Detailed Description
      In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
      As shown in fig. 1, the deep reinforcement learning-based unmanned vehicle navigation method provided by the invention proposes a training mode based on minimum depth information and optimizes the construction of the robot's state space in the early stage by combining a kinematic constraint model, that is, it reduces the training time by means of manual guidance. For the same training time, the state space constructed with the training mode proposed herein is more reasonable and effective, the network learns more efficiently, the error converges to a smaller value, and obstacle avoidance in an unknown environment is better. The method also overcomes the limitation that the DQN algorithm only lets the robot output a limited set of actions, and enables the robot to output actions within continuous numerical ranges of speed and steering angle.
      Fig. 2 is a schematic diagram of an embodiment of a deep reinforcement learning-based unmanned vehicle navigation method according to the present invention. In this embodiment, the method comprises the steps of:
      step S1, obtaining a depth image through an RGB-D depth camera on the unmanned vehicle, down-sampling the obtained depth image to obtain an image with a resolution of 160 × 120, then performing bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1, and forming a depth image matrix from all depth images with the size of 80 × 80 × 1;
      in a specific embodiment, the specific process of down-sampling the acquired depth image to obtain an image with a resolution of 160 × 120 and then performing bilinear interpolation to obtain a depth image with a size of 80 × 80 × 1 is as follows: the image is smoothed by using a Gaussian pyramid algorithm while all boundary feature values of the image are retained, and an image with a resolution of 160 × 120 is obtained through successive down-sampling. The Gaussian pyramid is an existing algorithm commonly used in image down-sampling, which smooths an image while largely preserving its features. The boundary feature values are feature points in the computer-vision sense, that is, places in the image where corners, textures and the like change sharply, and in particular pixels with a large first-order derivative in the pixel matrix; reference may be made to the SIFT operator detection algorithm. The down-sampled 160 × 120 image is then processed by an image bilinear interpolation method to obtain a depth image with a size of 80 × 80 × 1 as the observed state. The larger the size, the more GPU memory is consumed during deep learning and the longer the learning time; the smaller the size, the less fully the boundary information in the image is retained, which affects the learning result. In one embodiment, this size may be set or changed according to the GPU capability of the computer.
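      The smoothing and down-sampling described above can be sketched as a single Gaussian-pyramid reduction step applied repeatedly. This is a minimal pure-Python illustration, not the embodiment's implementation: the 5-tap binomial kernel, the border handling by edge replication, and the function names `pyr_down` and `downsample_to` are all assumptions introduced here.

```python
# Illustrative sketch only: Gaussian-pyramid reduction in pure Python.
# Kernel weights and border replication are assumptions, not taken from the source.
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def _clamp(i, n):
    """Clamp index i into [0, n - 1] so borders reuse the edge pixel."""
    return max(0, min(n - 1, i))


def blur_rows(img):
    """Convolve every row with the 5-tap kernel (horizontal smoothing)."""
    h, w = len(img), len(img[0])
    return [[sum(KERNEL[k] * img[r][_clamp(c + k - 2, w)] for k in range(5))
             for c in range(w)] for r in range(h)]


def transpose(img):
    return [list(col) for col in zip(*img)]


def pyr_down(img):
    """Smooth horizontally and vertically, then keep every second row and column."""
    blurred = transpose(blur_rows(transpose(blur_rows(img))))
    return [row[::2] for row in blurred[::2]]


def downsample_to(img, target_h, target_w):
    """Apply pyr_down while the image is still larger than the target in both axes."""
    while len(img) > target_h and len(img[0]) > target_w:
        img = pyr_down(img)
    return img
```

      For example, a 640 × 480 frame (an assumed camera resolution, not stated in the text) would reach 160 × 120 after two reductions with `downsample_to(frame, 120, 160)`.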
      Specifically, processing the down-sampled 160 × 120 image by the image bilinear interpolation method includes linearly interpolating the pixels along one direction of the image matrix according to the following formula, and then linearly interpolating along the other direction:
      
      wherein x is the coordinate coefficient of the pixel on the x axis in the image matrix, and y is the coordinate coefficient of the pixel on the y axis in the image matrix.
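      The two-pass interpolation described above (first along one axis, then along the other) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bilinear_resize` is hypothetical, the image is a list of rows, and a corner-aligned sampling grid is assumed; image libraries may map output pixels to input coordinates differently.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D list of depth values with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row (corner-aligned, assumed).
        fy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        dy = fy - y0
        row = []
        for j in range(out_w):
            fx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            dx = fx - x0
            # First pass: interpolate along x on the two neighbouring rows.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            # Second pass: interpolate the two results along y.
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

      Called as `bilinear_resize(image_120_rows_by_160_cols, 80, 80)`, the sketch yields an 80 × 80 grid; the single channel is implicit in the scalar depth values.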
      Step S2, calculating, by means of the wheel speed odometer of the unmanned vehicle, the position of the unmanned vehicle relative to the starting point, and, on the basis of the depth image matrix, taking the x coordinate of this position as the first row of a second depth image and the y coordinate as the first column of the second depth image, and integrating the second depth images to form a second depth image matrix representing the state of the unmanned vehicle. A coordinate transformation is established so that coordinates in the actual environment and the acquired image coordinates are related by a unified transformation; the position of the camera is adjusted according to the position of the unmanned vehicle in the actual space and is converted into global map coordinates; finally, the positioning accuracy of the wheel speed odometer is calibrated, so that the position in actual space and the position in the image are unified and the accuracy is improved.
      And step S3, comparing the values in the second depth image matrix one by one to determine the minimum value in the second depth image matrix, for example by using a quicksort algorithm, and comparing the minimum value with a set threshold; when the minimum value is larger than the set threshold, controlling the motion of the unmanned vehicle in a kinematic constraint mode; when the minimum value is smaller than the set threshold, inputting the second depth image into a deep learning network, constructing a Markov state space, and deciding the next action either randomly or according to the deep learning network; and repeatedly comparing the minimum value with the threshold until the minimum value is larger than the set threshold.
      In a specific embodiment, the state of the unmanned vehicle, which comprises the positioning and the depth image, is obtained through the preceding image processing. The training mode based on the minimum value of the depth image improves the training speed of the model. As shown in fig. 3, the minimum value in the second depth image matrix is found by value-by-value comparison using an existing mature algorithm, for example a quicksort algorithm. When the minimum value is greater than the previously set threshold, specifically 0.7 m in this embodiment, the motion of the robot is controlled in a point-to-point kinematic constraint mode so that the robot moves smoothly toward the target point. During the movement, once the minimum value in the depth image falls below the threshold, the depth image is input into the deep learning network, a Markov state space is constructed, and the next action is decided either randomly or according to the network. If the minimum value rises above the threshold again, the next action of the robot is again determined by the kinematic constraint, and this process is repeated cyclically;
      specifically, controlling the motion of the unmanned vehicle in the kinematic constraint mode means controlling the robot to move smoothly from the current point to the target point under kinematic constraints, the motion parameters from the current point to the target point being calculated according to the following formula:
      
      wherein $x_g$ and $y_g$ are the coordinates of the target point in a Cartesian coordinate system; $K$ is a first proportionality coefficient, used for calibration because the kinematic parameters of unmanned vehicle platforms differ, and its specific value can be adjusted and calibrated according to the specific application; and $v$ is the movement speed of the unmanned vehicle;
      $\omega = K_\omega(\theta_g \ominus \theta)$
      wherein $\theta_g$ is the direction of the target point, $\theta$ is the direction of the current point, $\ominus$ denotes the difference between the two angles of the target point and the current point, which takes a value in $(-\pi, \pi]$, $K_\omega$ is a second proportionality coefficient calibrated in a specific experiment or embodiment, and $\omega$ is the angular velocity of the unmanned vehicle.
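      A point-to-point controller of the kind described above can be sketched as follows. Because the linear-speed formula is not reproduced in this text, the sketch assumes that v is proportional to the Euclidean distance to the target and that the target direction is the bearing from the current point to the target; the gains `k_v` and `k_w` are placeholder values, and using the ranges [0.1, 0.6] and [-1, 1] quoted for the reward rule as output limits is also an assumption.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return a - 2.0 * math.pi * math.ceil((a - math.pi) / (2.0 * math.pi))


def kinematic_command(x, y, theta, x_g, y_g,
                      k_v=0.5, k_w=1.0, v_max=0.6, w_max=1.0):
    """Return (v, omega) steering the vehicle from (x, y, theta) toward (x_g, y_g)."""
    distance = math.hypot(x_g - x, y_g - y)
    theta_g = math.atan2(y_g - y, x_g - x)       # bearing to the target (assumed)
    v = min(k_v * distance, v_max)               # assumed proportional law, capped
    omega = k_w * wrap_angle(theta_g - theta)    # omega = K_w * (theta_g (-) theta)
    omega = max(-w_max, min(w_max, omega))       # assumed saturation
    return v, omega
```

      The wrapped difference keeps the vehicle turning the short way round, which is the role of the angle-difference operator in the expression for ω.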
      Specifically, the minimum depth threshold needs to be set according to the actual vehicle speed; if the turning radius of the unmanned vehicle is larger, the value should be increased appropriately, and vice versa. If the threshold is set too large, the training time becomes longer; if it is set too small, the vehicle may collide with an obstacle. This is also an important point to check in later testing.
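      The threshold test that switches between kinematic control and the learned policy can be sketched as below. The names `min_depth` and `select_mode` are hypothetical, the matrix is assumed to hold depth values in metres, and treating a minimum exactly equal to the threshold as the network case is an assumption, since the text only defines the strictly larger and strictly smaller cases.

```python
def min_depth(depth_matrix):
    """Smallest depth value in a 2-D list, found by a single linear pass."""
    return min(min(row) for row in depth_matrix)


def select_mode(depth_matrix, threshold=0.7):
    """Return 'kinematic' when the nearest obstacle is beyond the threshold,
    otherwise 'network' (the deep learning network decides the next action)."""
    if min_depth(depth_matrix) > threshold:
        return "kinematic"
    return "network"
```

      A linear pass returns the same minimum as sorting the values and taking the first element, so either approach satisfies the comparison described in step S3.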
      In this embodiment, the second depth image is input into the deep learning network in order to learn with the A3C algorithm; the neural network established by the invention is used to represent the policy function $\pi_\theta(s, a)$ and the evaluation function $V(s, \theta_v)$, so as to evaluate whether the decisions made in the invention are reasonable;
      the deep learning network is a convolutional neural network comprising four convolutional layers and two fully-connected layers, and parameters such as the number of layers, the learning rate and the greedy coefficient of the neural network need to be controlled; the number of layers of the convolutional neural network is determined according to the size of the processed image, namely the 80 × 80 × 1 depth image, and a four-layer convolutional neural network is built to better extract the image details at each layer; the learning rate must be set neither too low nor too high, since too low a learning rate makes learning take too long and too high a learning rate can cause convergence to a local optimum, and the learning rate is adjusted to $10^{-6}$ according to the actual learning process;
      The deep learning network performs gradient descent on the policy function $\pi_\theta(s, a)$ according to the following formula:
      
      wherein $\theta$ is a parameter of the neural network, $A(s)$ is an advantage function for evaluating the policy gradient update, and $\pi$ is the policy function;
      the deep learning network performs gradient descent on the evaluation function $V(s, \theta_v)$ according to the following formula:
      
      wherein $R$ is the corresponding reward value, $\gamma$ is the greedy coefficient, $V$ is the state value function, and $v$ is the speed value of the unmanned vehicle.
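      The two update formulas referred to above are not reproduced in this text. For orientation only, the updates of the standard A3C algorithm, written with the symbols defined above, take the following form. The immediate reward r, the recursive accumulation of R, and the definition of A(s) as R minus V(s, θ_v) come from the standard algorithm and are assumptions with respect to this document.

```latex
\begin{aligned}
R &\leftarrow r + \gamma R, \qquad A(s) = R - V(s, \theta_v) \\
d\theta &\leftarrow d\theta + \nabla_{\theta} \log \pi_{\theta}(s, a)\, A(s) \\
d\theta_v &\leftarrow d\theta_v + \frac{\partial \left( R - V(s, \theta_v) \right)^{2}}{\partial \theta_v}
\end{aligned}
```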
      In this embodiment, a reward penalty value, that is, a reward value, is introduced when learning in the deep learning network. The specific reward and penalty rule is shown in fig. 4, where v is the speed of the robot, with a value range of [0.1, 0.6], and ω is the angular speed of the robot, with a value range of [-1, 1]; the reward value is larger when the unmanned vehicle drives straight and smaller when it turns. When the minimum value x of the depth image is detected to be below 0.7 m, a reward penalty value R for approaching the obstacle, that is, a penalty value for approaching the obstacle, is taken into account. The final penalty value of a single round is the sum of all penalty values, which specifically include a collision penalty value, a penalty value for driving straight or turning, a penalty value for driving toward the target point, a penalty value for deviating from the target point, and a penalty value for approaching an obstacle.
      The collision penalty value is -20, the penalty value for driving toward the target point is 4, and the penalty value for deviating from the target point is -2;
      the penalty value for driving straight or turning is calculated according to the following formula:
      (0.1*v)/(|ω|+0.1)
      wherein v is the velocity value of the unmanned vehicle, and ω is the angular velocity of the unmanned vehicle;
      the penalty value for approaching an obstacle is calculated according to the following formula:
      -1/(x-0.4)
      where x is the minimum value within the second depth image matrix.
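      The reward rule above can be collected into one per-step function, sketched here under stated assumptions. The function name `step_reward` and its boolean inputs are hypothetical; how the individual terms combine within one step is not spelled out in the text, so the sketch simply adds every term that applies. The obstacle term is evaluated only for 0.4 < x < 0.7, a guard added here because the expression -1/(x-0.4) is undefined at x = 0.4 and changes sign below it.

```python
def step_reward(v, omega, min_dist, collided, toward_target,
                near_threshold=0.7, offset=0.4):
    """Sum the reward and penalty terms that apply to one control step."""
    reward = 0.0
    if collided:
        reward += -20.0                       # collision penalty
    reward += 4.0 if toward_target else -2.0  # toward / away from the target point
    # Straight-or-turning term: large for straight driving, small when turning.
    reward += (0.1 * v) / (abs(omega) + 0.1)
    # Obstacle-proximity term, applied only inside the assumed valid range.
    if offset < min_dist < near_threshold:
        reward += -1.0 / (min_dist - offset)
    return reward
```

      The value for a single round is then the sum of `step_reward` over all steps of that round, matching the statement that the final value is the sum of all penalty values.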
      In this embodiment, the Markov state space is composed of a plurality of arrays; a single array includes at least the current state data of the unmanned vehicle, the current action data of the unmanned vehicle, the reward value data corresponding to the current state and action, and the next state data of the unmanned vehicle. The Markov state space provides the data set for model training.
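      The arrays that make up the Markov state space can be sketched as a list of transition records. The names `Transition` and `RolloutBuffer` are hypothetical, and the `discounted_returns` helper, with its placeholder γ of 0.99, follows the usual A3C practice of accumulating rewards backwards; neither the helper nor the value of γ is specified in the text.

```python
from collections import namedtuple

# One array of the state space: (state, action, reward, next state).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class RolloutBuffer:
    """Ordered store of transitions collected during one rollout."""

    def __init__(self):
        self.items = []

    def add(self, state, action, reward, next_state):
        self.items.append(Transition(state, action, reward, next_state))

    def discounted_returns(self, gamma=0.99, bootstrap=0.0):
        """Return R_t = r_t + gamma * R_{t+1} for every stored step.

        `bootstrap` is the value estimate after the last step (0 if terminal).
        """
        returns, running = [], bootstrap
        for item in reversed(self.items):
            running = item.reward + gamma * running
            returns.append(running)
        returns.reverse()
        return returns
```

      Each record carries exactly the four fields listed above, so the buffer can serve directly as the training data set.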
      According to the deep reinforcement learning-based unmanned vehicle navigation method, in order for the mobile robot to acquire better obstacle avoidance capability, the simulation training environment should have a certain complexity. As shown in fig. 5, the environment should include narrow passable road sections, walls, obstacles with edges, and smooth obstacles. The model is trained in this training environment to accumulate sufficient data, which improves the decision speed in practical application; navigation can, however, also be achieved if the training strategy is applied directly during actual navigation. Specifically, accumulated training is carried out with each training mode, with a training amount of about 3 thousand steps and a training time of about 6 hours; the final error curves are compared in fig. 6, where the horizontal axis represents the number of iterations and the vertical axis represents the error value, curve 1 corresponds to direct training and curve 2 to the training rule of the present invention, the curves being obtained by median-average filtering.
      In this embodiment, in order to verify the training effect, 7 points may be designated in the test environment for navigation. The robot passes through positions 1 to 7 in sequence under kinematic constraints, and when the robot comes too close to an obstacle (less than 0.6 m), obstacle avoidance control is performed by the model in the deep reinforcement learning network of the invention, so that the exploration of the unknown environment and the path planning capability of at least two training modes can be compared.
      According to the deep reinforcement learning-based unmanned vehicle navigation method, if unmanned vehicle navigation is performed in an indoor environment, then, to improve the effect and feasibility of the algorithm, the depth image acquired by a real depth camera needs to be preprocessed in step S1, considering that it contains some alternating black-and-white bright and dark point noise; then, in step S2, the coordinate transformation is established, the position of the camera is converted into global map coordinates, and the positioning accuracy of the wheel speed odometer is calibrated; finally, the trained model is deployed on a real unmanned vehicle for navigation in a real environment.
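      Two of the preprocessing operations named for step S1, median filtering and image cropping, can be sketched in pure Python as follows. The 3 × 3 window, the shrunken window at the borders, and the function names are assumptions made for illustration; the fast-marching restoration step is not sketched here.

```python
from statistics import median


def median_filter_3x3(img):
    """Replace each pixel by the median of its 3 x 3 neighbourhood.

    Near the borders the window shrinks to the pixels that exist (assumed).
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(median(window))
        out.append(row)
    return out


def crop(img, top, bottom, left, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in img[top:bottom]]
```

      A median filter suits isolated bright and dark points because a single outlier inside the window never becomes the median.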
      In summary, the embodiment of the invention has the following beneficial effects:
      according to the deep reinforcement learning-based unmanned vehicle navigation method, the construction of the robot's state space in the early stage is optimized by combining a kinematic constraint model; for the same training time, the state space constructed with the training mode proposed herein is more reasonable and effective, so that the network learns more efficiently, the error converges to a smaller value, and obstacle avoidance in an unknown environment is better;
      the problem of unmanned vehicle navigation in an unknown environment is solved by an end-to-end motion-decision navigation mode that dispenses with the use of a map; meanwhile, the invention can be used for map building in an unknown environment, which avoids the trouble of manually controlling equipment to collect the map and improves the map collection efficiency.
      While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.