Man-machine co-fusion automatic driving decision method based on a fast-slow system
Technical Field
The invention relates to the field of automatic driving and artificial intelligence, in particular to a driving decision method based on cooperative work of a deep reinforcement learning algorithm and a large language model.
Background
With the continuous evolution of artificial intelligence and autonomous driving technology, increasing demands are placed on the autonomous decision-making capability of vehicles in complex traffic environments. Current automatic driving decision-making mainly falls into two major categories: rule-driven expert systems and data-driven systems (such as end-to-end systems based on deep learning or deep reinforcement learning). However, both approaches have limitations in practical applications:
Rule-driven expert systems are highly controllable but inflexible. Conventional autonomous driving systems typically rely on a large number of pre-written rules or state-machine-based decision logic; on urban roads or highways, the system executes operations according to fixed trigger conditions. Although this approach offers high interpretability and controllability and can guarantee safe, lawful driving in common scenarios, it often lacks sufficient adaptive capability when coping with sudden events or long-tail scenarios. When the external environment exceeds the design envelope, it is difficult to make reasonable decisions in real time by relying solely on pre-written rules, and the robustness and generality of the system degrade significantly.
Data-driven systems are highly flexible but lack controllability and interpretability. End-to-end automatic driving decision systems based on deep reinforcement learning, which have emerged in recent years, can learn excellent driving strategies in changing traffic environments thanks to the strong representational capability of neural networks, and adapt well to environmental changes. However, such methods often suffer from the "black box" problem, which makes it difficult to visualize or interpret their internal decision logic. Once a safety threat occurs on a real road, or a decision is made that is inconsistent with the driver's expectations, there is no clear means to intervene or quickly revise it. In addition, data-driven algorithms usually cannot actively "understand" the high-level preferences or intent of a human occupant, such as the subjective need to "arrive as soon as possible" or to "enjoy the scenery along the way", making it difficult to realize a "co-fused" driving pattern.
The requirements of man-machine cooperation and personalization are increasingly prominent. As automatic driving technology gradually moves from testing to practical application, people care not only about the safety and efficiency of the system in individual scenarios, but also increasingly about the personalized experience of passengers and the sense of control over the vehicle's driving strategy. Different users may have different demands on the same road section, and purely algorithm-driven or purely rule-driven approaches often struggle to accommodate these subjective demands well.
Disclosure of Invention
The invention provides a man-machine co-fusion automatic driving decision method based on a fast-slow system, which aims to solve the problems in the prior art of insufficient fusion of human driving goals and intentions and of inadequate interpretability and safety. The automatic driving decision is formed by a fast system based on Deep Reinforcement Learning (DRL) and a slow system based on a Large Language Model (LLM). The fast system is responsible for real-time driving decision and control and can respond quickly in a short-horizon, high-frequency dynamic traffic environment; the slow system makes high-level decisions and target-lane selection by understanding and analyzing the human user's instructions in combination with environment perception information, and transmits this information to the fast system for execution. By introducing target-lane and human-instruction information into the observation space of the fast system and designing corresponding network structures and reward functions, the system can flexibly follow human instructions while ensuring safe driving, realizing "man-machine co-fusion" automatic driving.
The aim of the invention is achieved by the following technical scheme:
A man-machine co-fusion automatic driving decision method based on a fast-slow system comprises the following steps:
S1, data acquisition and environment perception: acquire real-time vehicle and environment state information:
Deploy multi-modal sensors on the vehicle and, by sensing the external environment and the vehicle's own state in real time, acquire the following key information: ① the ego-vehicle state, including the position $p_t$, speed $v_t$ and acceleration $a_t$; and ② the surrounding environment state, including adjacent vehicle positions and speeds, lane-line information, traffic signals and obstacle positions.
Specifically, the multimode sensor comprises a camera, a millimeter wave radar, a laser radar and a GPS/IMU.
The recognized information is then fused and time-synchronized into the vehicle coordinate system or the global coordinate system to obtain the key elements: the ego-vehicle position $p_t$, the ego-vehicle speed $v_t$, and the relative positions $\Delta p_t^i$ and relative velocities $\Delta v_t^i$ of surrounding vehicles or obstacles. At time $t$, the preprocessed information is organized into a state vector $s_t$, which contains the following elements:

$$s_t = \left[\, p_t,\; v_t,\; a_t,\; \{\Delta p_t^i,\; \Delta v_t^i\}_{i=1}^{N} \,\right]$$

where $N$ is the number of other vehicles within the ego vehicle's sensing range. The time-synchronized state vector $s_t$ is one of the important inputs for the subsequent high-level parsing and low-level control decisions.
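As an illustration of how such a state vector might be assembled, the following is a minimal Python sketch; the field names of the fused perception output (pos, vel, acc) and the fixed neighbour count are assumptions for illustration, not part of the patent.

```python
import numpy as np

def build_state_vector(ego, neighbors, n_max=5):
    """Assemble the time-synchronized state vector s_t described above.

    `ego` and each entry of `neighbors` are assumed to be dicts produced by a
    (hypothetical) fusion front end, e.g. {"pos": (x, y), "vel": (vx, vy), "acc": (ax, ay)}.
    Neighbors beyond `n_max` are dropped; missing slots are zero-padded so the
    vector length stays fixed for the downstream networks.
    """
    ego_part = [*ego["pos"], *ego["vel"], *ego["acc"]]
    rows = []
    for nb in neighbors[:n_max]:
        rel_pos = np.subtract(nb["pos"], ego["pos"])   # relative position
        rel_vel = np.subtract(nb["vel"], ego["vel"])   # relative velocity
        rows.extend([*rel_pos, *rel_vel])
    rows.extend([0.0] * (4 * (n_max - min(len(neighbors), n_max))))  # zero padding
    return np.array(ego_part + rows, dtype=np.float32)
```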
S2, slow-system (LLM) analysis and high-level instruction generation:
A Large Language Model (LLM) serves as the slow system. The information input to the LLM includes the LLM's identity-positioning (role) information, the real-time vehicle and environment state information, and the human instruction information.
The identity-positioning information enables the LLM to establish its role;
For example:
"you are a large language model, please play a mature driver assistant now, can provide accurate and correct advice and guidance for the driver in the complex urban driving scene, you will get detailed driving scene description and human intention indication in the current scene, you need to fully understand the human intention and give appropriate desired lanes and driving style in combination with the current scene.
The human instruction information consists of abstract or specific driving instructions entered by the user (driver or passenger) via voice or text, such as "I want to save driving time", "I want to enjoy the scenery along the way", or "I want to bypass the construction section". Voice input is transcribed into text by a speech recognition (ASR) module; text input is obtained directly from the in-vehicle human-machine interface.
The real-time vehicle and environment state information is obtained in step S1 and converted into a standardized natural-language description.
Within the real-time vehicle and environment state information, the LLM first screens the surrounding scene states and performs a classification judgment. The LLM is then informed of the autonomous vehicle's own situation, including its position, speed and acceleration. Next, based on the road topology, the positions, speeds and accelerations of the relevant surrounding vehicles are provided to the LLM, in particular the vehicle that may collide with the autonomous vehicle and the nearest surrounding vehicle; if there are no other vehicles nearby, the LLM is informed of this as well.
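A minimal sketch of how the fused state could be rendered into the natural-language scene description fed to the slow system; the field names (lane, s, speed, acc, distance, id) are illustrative assumptions, not the patent's fixed interface.

```python
def describe_scene(ego, neighbors):
    """Turn fused state information into the natural-language scene description
    given to the slow system (LLM). Field names are illustrative assumptions."""
    lines = [
        f"Ego vehicle: lane {ego['lane']}, position {ego['s']:.1f} m, "
        f"speed {ego['speed']:.1f} m/s, acceleration {ego['acc']:.1f} m/s^2."
    ]
    if not neighbors:
        lines.append("There are no other vehicles nearby.")
    else:
        nearest = min(neighbors, key=lambda n: n["distance"])
        for n in neighbors:
            lines.append(
                f"Vehicle {n['id']}: lane {n['lane']}, {n['distance']:.1f} m away, "
                f"speed {n['speed']:.1f} m/s."
            )
        lines.append(f"The nearest vehicle is vehicle {nearest['id']}.")
    return " ".join(lines)
```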
The above information is input into the Large Language Model (LLM). Using its natural-language understanding and reasoning capability, the LLM jointly analyzes the human intention and the external environment constraints, structurally maps abstract instructions (such as "safety first" or "minimize lane changes") through its internal semantic vector representation, and generates high-level strategy information.
To constrain the output format of the LLM and require it to improve decision quality through reasoning, the system information requires the LLM to output its decision in a fixed format. For example, the LLM is asked to reason step by step ("reason: ...", repeated until a decision is reached) and, once the decision is reached, to append a line of the form "### <desired lane> ### <desired driving style>".
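The following sketch illustrates one way such a fixed-format contract could be enforced and parsed; the exact prompt wording and the regular expression are assumptions for illustration, not the patent's prescribed implementation.

```python
import re

SYSTEM_PROMPT = (
    "You are an experienced driving assistant. Reason step by step about the "
    "scene and the human instruction, then finish with a line of the form: "
    "### <desired lane> ### <desired driving style>"
)

def parse_llm_decision(reply: str):
    """Extract (desired_lane, driving_style) from the LLM reply; return None if
    the reply does not follow the required format, so the caller can re-prompt."""
    match = re.search(r"###\s*(.+?)\s*###\s*(.+)", reply)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```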
To effectively guide the fast system (DRL, Deep Reinforcement Learning), the slow system must output clear driving-strategy elements, including the target lane $l_t^{\text{tar}}$ and the driving mode $m_t$. The high-level instruction output by the slow system is:

$$g_t = \left( l_t^{\text{tar}},\; m_t \right)$$

where:

the target lane $l_t^{\text{tar}}$ designates, according to the semantic analysis and road information, the lane the vehicle should preferentially select or keep;

the driving mode $m_t$ takes values such as "FAST", "COMFORT" or "energy saving (ECO)".
If the LLM does not return a valid driving strategy, it is required to re-reason and output the decision again, and it is emphasized that the output must follow the format "### <desired lane> ### <desired driving style>".
The slow system will periodically (or upon detection of a user instruction or environmental change) recalculate and update the high level policy information to ensure that the user needs are continuously met in a dynamic scenario.
S3, real-time decision and control execution of a fast system (DRL):
The fast system is constructed based on a Deep Reinforcement Learning (DRL) method.
The high-level instruction $g_t$ output by the slow system is combined with the vehicle/environment state $s_t$ to form the extended observation space of the fast system (DRL):

$$o_t = \left[\, s_t,\; g_t \,\right]$$

It includes the vehicle state $s_t$ at time $t$, extracted from the fused perception information (S1), and the instruction elements given by the slow system at the current moment, such as the target lane and driving mode.
The observation information finally obtained by the vehicle and the high-level instruction together form an observation matrix, each row of which represents one vehicle and contains that vehicle's position, speed, acceleration and high-level instruction information. In addition, so that the DRL agent can identify the ego vehicle, the first row of the matrix holds the ego-vehicle information, and the remaining rows are sorted by the Euclidean distance between each surrounding vehicle and the ego vehicle. The last two columns of the matrix carry the high-level instruction information: the ordinate of the desired lane's centre line and a number encoding the driving style. Surrounding vehicles have no high-level instruction of their own, so their own current ordinate and a default driving style are used instead.
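A minimal NumPy sketch of the observation-matrix construction described above; the per-vehicle fields, the matrix width and the padding scheme are illustrative assumptions.

```python
import numpy as np

def build_observation(ego, neighbors, target_lane_y, style_id, n_rows=6):
    """Build the extended observation matrix: row 0 is the ego vehicle, the
    remaining rows are surrounding vehicles sorted by Euclidean distance, and
    the last two columns hold the high-level instruction (desired-lane centre
    line ordinate and a numeric driving-style code). Vehicles are assumed to be
    dicts such as {"x": ..., "y": ..., "vx": ..., "vy": ..., "ax": ..., "ay": ...}."""
    def row(veh, lane_y, style):
        return [veh["x"], veh["y"], veh["vx"], veh["vy"], veh["ax"], veh["ay"],
                lane_y, style]

    default_style = 0                      # default style code for other vehicles
    obs = [row(ego, target_lane_y, style_id)]
    ordered = sorted(
        neighbors,
        key=lambda v: np.hypot(v["x"] - ego["x"], v["y"] - ego["y"]),
    )
    for veh in ordered[: n_rows - 1]:
        obs.append(row(veh, veh["y"], default_style))   # neighbours keep their own y
    while len(obs) < n_rows:
        obs.append([0.0] * 8)                           # zero padding for empty rows
    return np.asarray(obs, dtype=np.float32)
```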
The invention uses a Deep Reinforcement Learning (DRL) network to make real-time decisions on $o_t$ and obtain the low-level control action:

$$a_t = \pi_\theta(o_t) = \left( \delta_t,\; a_t^{\text{acc}} \right)$$

where $\delta_t$ denotes the steering-wheel angle and $a_t^{\text{acc}}$ denotes the acceleration (deceleration) or throttle/brake control quantity.
The Deep Reinforcement Learning (DRL) network is trained with a policy-gradient-class algorithm. Define the policy $\pi_\theta(a \mid s)$ and optimize the following expected return:

$$J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \left[ r(s, a) \right]$$

where $\theta$ denotes the parameters of the policy network and $r(s, a)$ is the reward obtained by taking action $a$ in state $s$; $d^{\pi_\theta}$ is the state-visitation distribution induced by the policy: for a given agent policy $\pi_\theta$, when the agent starts from the initial state distribution $\rho_0$ and runs for infinitely many steps with discount factor $\gamma$, its discounted state distribution is defined as

$$d^{\pi_\theta}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, P\!\left(s_t = s \mid \rho_0, \pi_\theta\right)$$

$s \sim d^{\pi_\theta}$ denotes randomly sampling a state from the policy-induced state distribution; $a \sim \pi_\theta(\cdot \mid s)$ denotes randomly sampling an action from the policy in state $s$; $\mathbb{E}[\cdot]$ denotes the expectation under these conditions.
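The patent only specifies a policy-gradient-class algorithm; as an illustration of the objective above, the following is a generic REINFORCE-style update sketch in PyTorch, not the exact training procedure used.

```python
import torch

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE-style update of J(theta): ascend E[ log pi_theta(a|o) * G_t ].
    `policy` is assumed to map an observation tensor to a torch.distributions
    Distribution; `trajectory` is a list of (obs, action, reward) tuples
    collected with the current policy."""
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):            # discounted return G_t
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    loss = 0.0
    for (obs, action, _), g_t in zip(trajectory, returns):
        dist = policy(obs)                          # pi_theta(. | o_t)
        loss = loss - dist.log_prob(action).sum() * g_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```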
The reward is designed as follows:

$$r = w_{\text{safe}}\, r_{\text{safe}} + w_{\text{comp}}\, r_{\text{comp}} + w_{\text{eff}}\, r_{\text{eff}}$$

where:

$r_{\text{safe}}$ relates to safe driving: a positive reward is given when a safe following distance is kept and no collision or violation occurs, and a penalty is applied when a dangerous situation arises;

$r_{\text{comp}}$ rewards compliance with the slow-system command (e.g. target lane, driving mode): a negative reward is given if the vehicle's current lane or speed differs significantly from the target requirement;

$r_{\text{eff}}$ relates to driving efficiency, comfort and similar factors, e.g. reducing unnecessary lane changes and avoiding frequent acceleration and deceleration. The weighting coefficients $w_{\text{safe}}$, $w_{\text{comp}}$ and $w_{\text{eff}}$ are allocated according to the scenario and requirements, as sketched below.
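A minimal sketch of how the weighted reward could be composed; the field names of `info` and the numeric weights and penalties are illustrative assumptions, not values fixed by the invention.

```python
def compute_reward(info, weights=(1.0, 1.0, 0.5)):
    """Weighted combination of the three reward terms described above.
    The fields of `info` (crashed, safe_gap, lane, target_lane, speed_ok,
    lane_changed, harsh_accel) are illustrative assumptions."""
    w_safe, w_comp, w_eff = weights

    r_safe = -10.0 if info["crashed"] else (1.0 if info["safe_gap"] else 0.0)
    r_comp = 1.0 if info["lane"] == info["target_lane"] else -1.0
    r_eff = (1.0 if info["speed_ok"] else 0.0) \
        - 0.1 * info["lane_changed"] - 0.1 * info["harsh_accel"]

    return w_safe * r_safe + w_comp * r_comp + w_eff * r_eff
```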
Through training, the fast-system network gradually learns how to make optimal driving decisions while ensuring safety and obeying instructions, and outputs control signals at high frequency (100 Hz) during actual operation.
In addition, a safety filtering module can be introduced to check the low-level actions $a_t$ output by the fast-system network during execution; if a potentially dangerous action or an output that violates traffic rules is detected, the action is clipped or an alarm is raised to ensure driving safety.
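A minimal sketch of such a safety filtering module; the clipping thresholds and the gap-based override rule are illustrative assumptions rather than values prescribed by the invention.

```python
import numpy as np

def safety_filter(action, obs, max_steer=0.5, max_accel=3.0, min_gap=10.0):
    """Clip the raw fast-system action and override it when a dangerous state is
    detected. `obs` is assumed to be a dict carrying a `gap_ahead` field."""
    steer, accel = action
    steer = float(np.clip(steer, -max_steer, max_steer))
    accel = float(np.clip(accel, -max_accel, max_accel))

    gap_ahead = obs.get("gap_ahead", np.inf)     # distance to the leading vehicle
    if gap_ahead < min_gap and accel > 0.0:      # too close: forbid acceleration
        accel = -max_accel                       # command braking and raise a warning
        print("safety filter: emergency braking triggered")
    return steer, accel
```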
S4, man-machine co-fusion and dynamic feedback:
Dynamic analysis of the slow system: when the external environment or user instruction changes, the slow system performs semantic analysis and high-level planning again and, using the current vehicle state $s_t$ and the new user requirement, generates a new round of high-level instruction $g_t$, such as switching the target lane or adjusting the driving mode.
Immediate response of the fast system: once the new $g_t$ is obtained, it is incorporated into the extended observation space $o_t$. For example, if a road section is congested due to construction and the slow system issues the instruction "reduce speed and try to change lane to avoid the congestion", the fast system updates its lane-change and speed-control actions so as to enter a suitable lane in the shortest time while keeping a safe distance.
The system can interact with the user multiple times throughout the process. If the user is not satisfied with the selected scheme, a new adjustment instruction can be entered; the slow system then re-plans the route or speed, and the fast system immediately executes the new high-level instruction.
Advantageous effects
Compared with the prior art, the invention has the following advantages:
(1) The method has controllability and flexibility, realizes the controllability of algorithm behavior by incorporating information such as user instructions, target lanes and the like into the observation space of the DRL, simultaneously reserves the high-efficiency learning and quick response capability of the DRL to unknown complex scenes, and overcomes the limitation of lack of adaptability of a pure rule method.
(2) Enhanced interpretability and human-machine interaction: the slow system, based on a large language model, can generate interpretable descriptions, according to the user requirement or usage scenario, of how a certain lane was chosen or how safety and travel time were balanced, and the high-level instruction can be adjusted or queried through multiple rounds of interaction, realizing man-machine co-fusion in a real sense.
(3) Safety and robustness, namely designing a reward function with safety priority in a training stage, timely correcting potential dangerous actions through a safety filtering mechanism and rule constraint in an execution stage, and adaptively adjusting long-tail scenes or emergency events, thereby improving the overall robustness of the system.
(4) The framework is suitable for a high-level unmanned system and can be applied to a driving auxiliary scene requiring man-machine cooperation, a large language model part can be replaced or upgraded according to user requirements or service scenes, and a DRL model of a fast system can be coupled with other AI algorithms, so that the framework has good expandability.
Drawings
FIG. 1 is a process flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the method of the present invention;
FIG. 3 is a schematic diagram of a slow system process flow of the method of the present invention;
FIG. 4 is a schematic diagram of the observation space and training process in the system of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the present invention simulating a two-lane road environment;
FIG. 6 is a graph showing the performance test and comparison of the method of the present invention and the baseline methods in various scenarios.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Examples
The man-machine co-fusion automatic driving decision method based on the fast-slow system has the overall flow shown in Figure 1 and comprises the following steps:
S1, acquiring real-time vehicle and environment state information by data acquisition and environment perception
S11, building a test environment: before the test starts, a virtual simulation environment or a closed test field with various typical road conditions (high-speed multi-lane roads, junction areas, two-way two-lane roads, etc.) is built, and traffic participants such as surrounding vehicles, pedestrians and traffic lights are configured to simulate a real road environment. Next, the specific two-way two-lane case illustrated in Fig. 5 is described.
S12, vehicle and sensor initialization: the vehicle (hereinafter the tested autonomous vehicle) is equipped with a camera, millimeter-wave radar, lidar, GPS/IMU and other on-board sensing devices, as well as an on-board communication device and a computing unit that can share data with a cloud controller or an edge server. The vehicle is started and the sensors are calibrated to obtain the initial state information (vehicle position, speed, lane information, etc.), and at the same time it is confirmed that network communication is normal.
S2, slow system (LLM) analysis and high-level instruction generation
S21, user instruction acquisition: the tested autonomous vehicle receives a high-level instruction from the driver or a passenger, in voice or text form, such as "I want to arrive at the company as soon as possible". Voice input is converted by a speech recognition (ASR) module into text that the large language model can parse; text can be entered directly on the in-vehicle human-machine interface.
S22, slow-system analysis and understanding: the user instruction and the current vehicle/environment state are transmitted to the slow system (large language model, LLM) together, as shown in Figure 3; through semantic analysis and reasoning, combined with road information that is preset or acquired in real time (speed limits, construction information, traffic flow, etc.), the slow system generates the corresponding high-level decision information.
S23, data packaging and visualization: the slow system packages the analysis result into a data packet in a specified format and records the high-level intention, for example:
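A hypothetical example of such a packet, written here as a Python dictionary; the field names and values are illustrative only, since the patent does not fix the exact schema:

```python
# Hypothetical packaged slow-system output (illustrative schema).
slow_system_packet = {
    "timestamp": 1718000000.0,
    "user_instruction": "I want to arrive at the company as soon as possible",
    "high_level_intent": "minimise travel time while staying safe",
    "desired_lane": 1,          # index of the target lane
    "driving_style": "FAST",    # one of FAST / COMFORT / ECO
    "reasoning": "Traffic in lane 1 is lighter, so overtaking there saves time.",
}
```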
s3, real-time decision and control execution of a fast system (DRL)
And S31, constructing an observation space, wherein the high-level decision information (such as a target lane) output by the slow system is included in the observation space of the fast system, as shown in fig. 4, and the observation space also comprises the state (speed, position and acceleration) of the vehicle, surrounding traffic elements and the like.
Denote the state vector of the ego vehicle at time $t$ as $s_t$; consistent with step S1, it can be expressed as:

$$s_t = \left[\, p_t,\; v_t,\; a_t,\; \{\Delta p_t^i,\; \Delta v_t^i\}_{i=1}^{N} \,\right]$$
S32, action-space definition: the fast system, through the deep reinforcement learning network combined with a low-level PID controller, outputs the low-level driving control command:

$$a_t = \left( a_t^{\text{acc}},\; \delta_t \right)$$

where $a_t^{\text{acc}}$ denotes the acceleration (or deceleration) and $\delta_t$ denotes the steering-wheel angle.
S33, training and deployment: the fast system can be trained with a Policy Gradient or Value-based DRL algorithm. In the training stage, the policy parameters are continuously optimized through a large number of simulated interactions (or combined with closed test-field trials), so that the system achieves good safety and compliance with slow-system instructions in complex traffic environments.
Reward function design and collaborative decision-making: the reward function is a vital component of Deep Reinforcement Learning (DRL); through it, the DRL model learns how to make appropriate decisions to achieve the desired behavior. In the present invention, the objective of the reward-function design is to balance vehicle safety, driving efficiency, and compliance with slow-system commands.
To account for safety, efficiency and the degree of compliance with human instructions, the reward function $r$ is divided into the following parts:

$$r = w_{\text{safe}}\, r_{\text{safe}} + w_{\text{eff}}\, r_{\text{eff}} + w_{\text{comp}}\, r_{\text{comp}}$$

where $r_{\text{safe}}$ is the safety reward: a positive reward is obtained if the vehicle keeps a safe distance to the vehicle ahead and no collision occurs; $r_{\text{eff}}$ is the efficiency reward: the reward is higher if the vehicle travels at a stable speed that meets the requirement of the fast or energy-saving mode; $r_{\text{comp}}$ is the compliance reward: the better the vehicle matches the target lane or speed range issued by the slow system, the larger the reward. The weights $w_{\text{safe}}$, $w_{\text{eff}}$ and $w_{\text{comp}}$ balance the importance of the different reward terms; they can be set or dynamically adjusted according to the actual application scenario so as to achieve an appropriate balance among the different objectives (safety, efficiency and instruction compliance).
Taking the two-way two-lane road as an example, the specific structure of the reward function is as follows:
Efficiency reward term ($r_{\text{eff}}$):
The high-speed reward aims to encourage the vehicle to maintain a high driving speed, meeting the command requirement of the "fast" driving mode. Based on the vehicle's current speed and the desired speed range, the reward function first computes a "normalized" value of the current vehicle speed:

$$\text{scaled\_speed} = \mathrm{lmap}\!\left(\text{forward\_speed},\; \text{forward\_speed\_range},\; [-1, 1]\right)$$

where forward_speed is the actual forward speed of the vehicle (computed from the cosine between the vehicle's velocity vector and the heading direction) and is mapped into the range [-1, 1];

forward_speed_range is the defined ideal speed range, consisting of the desired minimum speed min_target_speed and the desired maximum speed max_target_speed;

the lmap function (linear mapping) is computed as follows:

$$\mathrm{lmap}\!\left(v,\, [x_0, x_1],\, [y_0, y_1]\right) = y_0 + (v - x_0)\,\frac{y_1 - y_0}{x_1 - x_0}$$
If the vehicle's speed is high and falls within the desired range, the reward value increases. The specific reward formula is:

$$r_{\text{eff}} = \mathrm{np.clip}\!\left(\text{scaled\_speed},\; -1,\; 1\right)$$

where scaled_speed is the computed "normalized speed value" and np.clip is a NumPy library function that restricts elements to a specified range; if an element lies outside this range, np.clip truncates it to the boundary value. The behavior of np.clip can be summarized as:

$$\mathrm{np.clip}(x,\, \min,\, \max) = \begin{cases} \min, & x < \min \\ x, & \min \le x \le \max \\ \max, & x > \max \end{cases}$$

where x is the input value to be processed, min is the specified lower bound (the input is truncated to min if it is smaller) and max is the specified upper bound (the input is truncated to max if it is larger). In the present invention, np.clip is used to keep the "normalized speed value" (scaled_speed) within [-1, 1]. The normalized speed value (scaled_speed) is obtained by linearly mapping the vehicle's actual driving speed, converting the speed into a uniform scale (-1 to 1) to facilitate the subsequent reward computation.
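A minimal Python sketch of the efficiency reward term as described above, assuming an illustrative desired speed range of 20–30 m/s:

```python
import numpy as np

def lmap(v, x, y):
    """Linear map of value v from range x=[x0, x1] to range y=[y0, y1]."""
    return y[0] + (v - x[0]) * (y[1] - y[0]) / (x[1] - x[0])

def efficiency_reward(forward_speed, forward_speed_range=(20.0, 30.0)):
    """Efficiency reward term: linearly map the forward speed onto [-1, 1]
    relative to the desired speed range, then clip to [-1, 1] as described
    above. The numeric speed range is an illustrative assumption."""
    scaled_speed = lmap(forward_speed, forward_speed_range, [-1.0, 1.0])
    return float(np.clip(scaled_speed, -1.0, 1.0))
```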
Safety reward term ($r_{\text{safe}}$):
The collision reward is used to penalize vehicle collisions and is the guarantee of safe autonomous driving behavior. If the vehicle collides, the flag self.vehicle.crashed equals 1 and the reward is negative; if no collision occurs, this reward is zero.
The weight coefficient of this reward term is very large in magnitude (-10), so that the vehicle avoids collisions as much as possible.
Preference reward term ($r_{\text{comp}}$):
The preference reward rewards the vehicle for choosing behaviors that match the user's preference. For example, a positive reward is given when the vehicle is close to the desired target lane, and a negative reward when it deviates from the target lane. This reward ensures that the vehicle selects the best path according to the user instruction as far as possible.
In this way, the system ensures that the vehicle is able to follow the user-set driving objective and encourages the vehicle to execute the preference policy.
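A minimal sketch of one possible shaping of this preference reward, based on the lateral deviation from the desired lane centre line; the linear form and the lane width are illustrative assumptions:

```python
def preference_reward(ego_y, target_lane_y, lane_width=4.0):
    """Preference (instruction-compliance) reward term: positive when the ego
    vehicle is close to the desired lane centre line, negative when it deviates
    by more than one lane width."""
    deviation = abs(ego_y - target_lane_y)
    return 1.0 - 2.0 * min(deviation / lane_width, 1.0)  # value in [-1, 1]
```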
Comprehensive reward calculation:
Finally, taking the weights of all reward terms into account, the system computes the final reward value as a weighted sum of the sub-rewards:

$$r = w_{\text{safe}}\, r_{\text{safe}} + w_{\text{eff}}\, r_{\text{eff}} + w_{\text{comp}}\, r_{\text{comp}}$$

where $r_{\text{safe}}$, $r_{\text{eff}}$ and $r_{\text{comp}}$ represent the reward terms relating to safety, efficiency and instruction compliance, respectively.
To ensure the comparability of reward values across different scenarios, the reward is normalized by mapping it onto the interval [0, 1], so that all reward terms lie on the same scale:

$$r_{\text{norm}} = \mathrm{lmap}\!\left(r,\; [r_{\min},\; r_{\max}],\; [0, 1]\right)$$
That is, a linear mapping maps the reward value from its configured minimum $r_{\min}$ to its configured maximum $r_{\max}$ onto the range [0, 1], ensuring that the reward values output by the system operate on a uniform scale.

S34, fast-slow system cooperation: the slow system continuously outputs high-level information according to the user instruction and the traffic situation, so that the target lane or driving mode can be updated in real time; in each decision period (e.g. 0.1 s or less), the fast system rapidly infers on the new observation state and outputs the decision information.
S4, man-machine co-fusion and dynamic feedback
S41, handling of changed or conflicting instructions: if the user changes the instruction (e.g. switching from "drive fast" to "safety first"), the slow system re-plans the high-level information and updates the target lane and driving mode; during human-machine interaction, the system can dynamically display the reason for, or the expected effect of, the decision made by the vehicle.
S42, multi-round interaction and user feedback: if, while the vehicle is running, the user issues an instruction that conflicts with road traffic regulations, the slow system promptly refuses it or suggests a modification via text or the interface, thereby avoiding dangerous behavior; the user can also provide new instructions to the vehicle at any time during the drive.
The whole flow of the invention is carried out in a cyclic iteration mode in actual operation:
1) In step S1, the slow system continuously monitors user instructions and environmental changes;
2) In steps S2 and S3, the fast system makes decisions according to the updated observation information and the reward function;
3) In step S4, the system strategy and reward allocation are dynamically regulated at the human-machine interaction level;
4) This process is repeated until the vehicle trip is completed or the user command is completed.
The simulation environment for training is built with Highway-Env and Gymnasium, in which the position and heading of the vehicle are controlled by a closed-loop PID:

$$v_{\text{lat},r} = -K_{p,\text{lat}}\,\Delta_{\text{lat}}, \qquad \psi_r = \psi_L + \arcsin\!\left(\frac{v_{\text{lat},r}}{v}\right)$$

where $\Delta_{\text{lat}}$ is the lateral distance of the vehicle relative to the centre line of the corresponding target lane, $v_{\text{lat},r}$ is the lateral-velocity control command, and $\psi_r$ is the resulting heading control command for the vehicle.

$$\dot{\psi}_r = K_{p,\psi}\left(\psi_r - \psi\right), \qquad \delta = \arcsin\!\left(\frac{l}{2v}\,\dot{\psi}_r\right)$$

where $\psi_L$ is the heading of the lane, $\psi_r$ is the target heading determined by the heading and position of the desired lane, $\dot{\psi}_r$ is the heading-rate (lateral) control command, $\delta$ is the front-wheel angle control quantity, and $K_{p,\text{lat}}$ and $K_{p,\psi}$ are the control gains for position and heading angle, respectively.

The motion of the vehicle is realized according to the kinematic bicycle model:

$$\dot{x} = v\cos(\psi + \beta), \qquad \dot{y} = v\sin(\psi + \beta), \qquad \dot{\psi} = \frac{2v}{l}\sin\beta, \qquad \dot{v} = a, \qquad \beta = \arctan\!\left(\tfrac{1}{2}\tan\delta\right)$$

where $(x, y)$ is the position of the vehicle, $v$ is its forward speed, $a$ is its acceleration command, and $\beta$ is the slip angle at the centre of gravity. The surrounding vehicles use the IDM algorithm for longitudinal control and the MOBIL lane-change strategy for lateral control.
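A minimal Python sketch of a closed-loop lateral controller and a kinematic bicycle step in the style described above (as used by Highway-Env-like environments); the gains, vehicle length and time step are illustrative assumptions.

```python
import numpy as np

def steering_control(lat_error, heading, lane_heading, speed,
                     k_lat=0.8, k_heading=2.0, length=5.0):
    """Closed-loop lateral controller: a position loop produces a lateral-velocity
    command, which is converted into a target heading and then into a front-wheel
    angle. Gains and vehicle length are illustrative assumptions."""
    v = max(speed, 1e-3)                                    # avoid division by zero
    v_lat_cmd = -k_lat * lat_error                          # lateral position loop
    heading_ref = lane_heading + np.arcsin(np.clip(v_lat_cmd / v, -1.0, 1.0))
    heading_rate_cmd = k_heading * (heading_ref - heading)  # heading loop
    slip_cmd = np.arcsin(np.clip(length / 2.0 / v * heading_rate_cmd, -1.0, 1.0))
    return float(np.arctan(2.0 * np.tan(slip_cmd)))         # front-wheel angle

def bicycle_step(x, y, heading, speed, accel, steer, length=5.0, dt=0.1):
    """One step of the kinematic bicycle model used for vehicle motion."""
    beta = np.arctan(0.5 * np.tan(steer))                   # slip angle at CoG
    x += speed * np.cos(heading + beta) * dt
    y += speed * np.sin(heading + beta) * dt
    heading += speed * np.sin(beta) / (length / 2.0) * dt
    speed += accel * dt
    return x, y, heading, speed
```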
Several scenarios are selected for testing, including a high-speed car-following scenario, a ramp lane-change scenario and an overtaking-on-the-opposite-lane scenario. The LLM-based SOTA algorithm (Dilu), the value-based DRL algorithm DQN and the policy-based DRL algorithm PPO are selected for performance comparison. The results are shown in Figure 6: the proposed model achieves the highest success rate across all scenarios.
A statistical analysis of the results was also performed; the data are shown in the table below, and the model is found to achieve the best balance among safety, efficiency and compliance with human guidance across multiple scenarios:
the above description is only illustrative of the preferred embodiments of the application and is not intended to limit the scope of the application in any way. Any alterations or modifications of the application, which are obvious to those skilled in the art based on the teachings disclosed above, are intended to be equally effective embodiments, and are intended to be within the scope of the appended claims.