
CN120066281B - A human-machine collaborative autonomous driving decision-making method based on fast and slow systems - Google Patents

A human-machine collaborative autonomous driving decision-making method based on fast and slow systems

Info

Publication number
CN120066281B
CN120066281B (application CN202510536060.3A)
Authority
CN
China
Prior art keywords
vehicle
information
human
speed
fast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510536060.3A
Other languages
Chinese (zh)
Other versions
CN120066281A (en)
Inventor
孙剑
徐成凯
杭鹏
刘佳琦
房世玉
郭翼成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202510536060.3A priority Critical patent/CN120066281B/en
Publication of CN120066281A publication Critical patent/CN120066281A/en
Application granted granted Critical
Publication of CN120066281B publication Critical patent/CN120066281B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08Interaction between the driver and the control system
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • B60W2050/0028Mathematical models, e.g. for simulation
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0043Signal treatments, identification of variables or parameters, parameter estimation or state estimation

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract


The present invention discloses a human-machine collaborative autonomous driving decision-making method based on fast and slow systems, aiming to balance safety, flexibility and controllability in autonomous driving scenarios. The model consists of a fast system based on deep reinforcement learning and a slow system based on a large language model: the fast system is responsible for real-time driving decision-making and control and can respond quickly in short-horizon, high-frequency dynamic traffic environments; the slow system understands and parses the instructions of human users, combines them with environmental perception information to make high-level decisions and target-lane selections, and passes this information to the fast system for execution. By introducing the target lane and human command information into the observation space of the fast system, and designing the corresponding network structure and reward function, the system can obey human instructions while ensuring safe driving, thereby realizing "human-machine collaborative" autonomous driving. The present invention has good versatility, scalability and interpretability.

Description

A human-machine collaborative autonomous driving decision-making method based on fast and slow systems
Technical Field
The invention relates to the fields of autonomous driving and artificial intelligence, and in particular to a driving decision-making method based on the cooperative operation of a deep reinforcement learning algorithm and a large language model.
Background
With the continuous evolution of artificial intelligence and autonomous driving technology, increasing demands are placed on the autonomous decision-making capability of vehicles in complex traffic environments. Autonomous driving decision-making at the current stage mainly follows two approaches: rule-driven expert systems and data-driven methods (such as end-to-end systems based on deep learning or deep reinforcement learning). However, both approaches have limitations in practical applications:
Rule-driven expert systems are highly controllable but inflexible. Conventional autonomous driving systems typically rely on a large amount of pre-written decision logic based on rules or state machines; on urban roads or highways, the system performs operations according to fixed trigger conditions. Although this approach offers high interpretability and controllability and can guarantee safe, lawful driving in common scenarios, it often lacks sufficient adaptive capacity when coping with sudden events or long-tail scenarios. When the external environment exceeds the design envelope, pre-written rules alone can hardly produce reasonable decisions in real time, and the robustness and generality of the system drop markedly.
Data-driven methods are highly flexible but lack controllability and interpretability. End-to-end autonomous driving decision systems based on deep reinforcement learning, which have emerged in recent years, can learn superior driving strategies in varied traffic environments thanks to the strong representational power of neural networks, and adapt well to environmental changes. However, such methods often suffer from the "black box" problem: their internal decision logic is difficult to visualize or interpret. Once a safety threat arises on a real road, or a decision inconsistent with the driver's expectations is made, there is no clear means to intervene or quickly revise it. In addition, data-driven algorithms usually cannot actively "understand" the high-level preferences or intent of a human occupant, such as the subjective need to "arrive as soon as possible" or "enjoy the scenery along the way", which makes a "human-machine collaborative" driving pattern difficult to realize.
The requirements of human-machine cooperation and personalization are increasingly prominent. As autonomous driving technology gradually moves from testing to practical application, people care not only about the safety and efficiency of the system in a single scenario, but also increasingly about the personalized experience of passengers and their sense of control over the vehicle's driving strategy. Different users may have different demands on the same road section, and purely algorithm-driven or purely rule-driven approaches often struggle to accommodate these subjective demands.
Disclosure of Invention
The invention provides a human-machine collaborative autonomous driving decision-making method based on fast and slow systems, which aims to solve the problems of the prior art: the lack of integration of human driving goals and intent, and insufficient interpretability and safety. The autonomous driving decision is made jointly by a fast system based on deep reinforcement learning (Deep Reinforcement Learning, DRL) and a slow system based on a large language model (Large Language Model, LLM). The fast system is responsible for real-time driving decision-making and control and can respond quickly in short-horizon, high-frequency dynamic traffic environments; the slow system makes high-level decisions and target-lane selections by understanding and analyzing the instructions of human users combined with environment perception information, and transmits this information to the fast system for execution. By introducing target-lane and human-instruction information into the observation space of the fast system and designing the corresponding network structure and reward function, the system can flexibly obey human instructions while ensuring safe driving, realizing "human-machine collaborative" autonomous driving.
The aim of the invention is achieved by the following technical scheme:
A human-machine collaborative autonomous driving decision-making method based on fast and slow systems comprises the following steps:
S1, data acquisition and environment perception are carried out to acquire real-time vehicle and environment state information:
Multi-modal sensors are deployed on the vehicle and, through real-time sensing of the external environment and the vehicle's own state, the following key information is acquired: ① the vehicle's own state, including its position p_ego, speed v_ego and acceleration a_ego; ② the surrounding environment state, including adjacent vehicle positions and speeds, lane line information, traffic signals and obstacle positions.
Specifically, the multi-modal sensors comprise a camera, millimeter-wave radar, LiDAR and GPS/IMU.
The identified information is then fused and synchronized into the vehicle coordinate system or a global coordinate system to obtain the key elements: vehicle position p_ego, vehicle speed v_ego, and the relative positions Δp_i and relative velocities Δv_i of surrounding vehicles or obstacles. At time t, the preprocessed information is organized into a state vector s_t containing the following elements:

s_t = [p_ego, v_ego, a_ego, Δp_1, Δv_1, ..., Δp_N, Δv_N]

where N is the number of other vehicles within the vehicle's sensing range. The time-synchronized state vector s_t is one of the important inputs for the subsequent high-level parsing and low-level control decisions.
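As a concrete illustration of how the fused perception output can be packed into the state vector s_t described above, a minimal Python sketch follows. The field names, the fixed slot count n_max and the zero padding are illustrative assumptions, not part of the invention.

```python
import numpy as np

def build_state_vector(ego, others, n_max=5):
    """Organize fused, time-synchronized perception data into the state vector s_t.

    ego:    dict with 'position' (x, y), 'speed' and 'acceleration' of the ego vehicle
    others: list of dicts with 'position' and 'speed' of surrounding vehicles,
            already expressed in the ego (or global) coordinate frame
    n_max:  number of surrounding-vehicle slots kept in the vector (assumed)
    """
    state = [ego["position"][0], ego["position"][1], ego["speed"], ego["acceleration"]]
    # Sort surrounding vehicles by Euclidean distance to the ego vehicle
    others = sorted(others, key=lambda o: np.hypot(o["position"][0] - ego["position"][0],
                                                   o["position"][1] - ego["position"][1]))
    for o in others[:n_max]:
        # Relative position and relative speed of each surrounding vehicle
        state += [o["position"][0] - ego["position"][0],
                  o["position"][1] - ego["position"][1],
                  o["speed"] - ego["speed"]]
    # Pad with zeros when fewer than n_max vehicles are in sensing range
    state += [0.0] * (4 + 3 * n_max - len(state))
    return np.asarray(state, dtype=np.float32)
```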
S2, analyzing a slow system (LLM) and generating a high-level instruction:
A large language model (LLM) is used as the slow system. The information input to the LLM includes the LLM's identity positioning (role) information, real-time vehicle and environment state information, and human instruction information.
The LLM identity positioning information lets the LLM confirm its own role;
For example:
"you are a large language model, please play a mature driver assistant now, can provide accurate and correct advice and guidance for the driver in the complex urban driving scene, you will get detailed driving scene description and human intention indication in the current scene, you need to fully understand the human intention and give appropriate desired lanes and driving style in combination with the current scene.
The human instruction information consists of abstract or concrete driving instructions input by the user (driver or passenger) through voice or text, such as "I want to save travel time", "I want to enjoy the scenery along the way", or "I want to bypass the construction section". If input as voice, the speech is transcribed into text by a speech recognition (ASR) module; if input as text, it is obtained directly through the in-vehicle human-machine interface.
The real-time vehicle and environment state information is obtained in step S1 and is converted into a standardized expression conforming to natural-language conventions.
Within the real-time vehicle and environment state information, the LLM first screens the surrounding scene state and classifies it. The LLM is then informed of the autonomous vehicle's own situation, including position, speed and acceleration. Next, based on the road topology information, the positions, speeds and accelerations of the surrounding vehicles are provided to the LLM, with emphasis on vehicles that may collide with the autonomous vehicle and the nearest surrounding vehicle; if there are no other vehicles nearby, the LLM is informed of this as well.
The above information is input into the large language model (LLM), whose natural-language understanding and reasoning capabilities are used to jointly analyze human intent and external environment constraints; abstract instructions (such as "safety first" or "minimize lane changes") are structurally mapped through semantic vector representations inside the large language model, generating high-level strategy information.
To constrain the output format of the LLM and require it to improve decision quality through reasoning, the LLM is required to output its decision in a fixed format. For example, the system information requires output of the form "reason: ... (the reasoning is repeated until a decision is reached) ### <desired lane> ### <desired driving style>".
To effectively guide the fast system (DRL, Deep Reinforcement Learning), the slow system must output explicit driving strategy elements, including the target lane (l_target) and the driving mode (m_drive). The high-level instruction I output by the slow system is:

I = (l_target, m_drive)

where:
Target lane (l_target): based on semantic analysis and road information, designates the lane the vehicle should preferentially select or keep;
Driving mode (m_drive): e.g. "FAST", "COMFORT" or "ECO" (energy saving);
If the LLM does not return a valid driving strategy, it is required to reconsider and output the decision again, with emphasis that the output must follow the format "### <desired lane> ### <desired driving style>".
The slow system will periodically (or upon detection of a user instruction or environmental change) recalculate and update the high level policy information to ensure that the user needs are continuously met in a dynamic scenario.
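A minimal sketch of this slow-system step is given below: it assembles the role prompt, the scene description and the human instruction into one prompt, queries a large language model, and parses the fixed "### <desired lane> ### <desired driving style>" format, retrying on malformed output. The query_llm callable and the helper names are placeholders assumed for illustration, not an interface prescribed by the invention.

```python
import re

ROLE_PROMPT = ("You are a mature driving assistant for complex urban scenes. "
               "Given the scene description and the human intention, state your reasoning, "
               "then finish with '### <desired lane> ### <desired driving style>'.")

def slow_system_decide(query_llm, scene_description, human_instruction, max_retries=3):
    """Query the LLM and parse the high-level instruction I = (target lane, driving mode)."""
    prompt = (f"{ROLE_PROMPT}\nScene: {scene_description}\n"
              f"Human instruction: {human_instruction}")
    for _ in range(max_retries):
        answer = query_llm(prompt)
        match = re.search(r"###\s*(\S+)\s*###\s*(\S+)", answer)
        if match:
            target_lane, driving_mode = match.group(1), match.group(2).upper()
            if driving_mode in {"FAST", "COMFORT", "ECO"}:
                return target_lane, driving_mode
        # Malformed output: ask the LLM to reconsider and respect the required format
        prompt += ("\nYour previous answer did not follow the required format. "
                   "Answer again, ending with '### <desired lane> ### <desired driving style>'.")
    return None, None  # caller may fall back to the current lane and a default style
```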
S3, real-time decision and control execution of a fast system (DRL):
The fast system is constructed based on a Deep Reinforcement Learning (DRL) method.
The high-level instruction I output by the slow system is combined with the vehicle/environment state s_t to form the extended observation space of the fast system (DRL):

o_t = [s_t, I] = [s_t, l_target, m_drive]

o_t includes the vehicle's state s_t at time t, extracted from the integrated perception information of step S1, together with the instruction elements given by the slow system at the current moment, such as the target lane and driving mode.
The observation information finally obtained by the vehicle and the high-level instruction together form an observation space matrix. Each row of the matrix represents the information of one vehicle, including that vehicle's position, speed, acceleration and high-level instruction information. In addition, so that the DRL agent can identify the ego-vehicle information, the first row of the matrix is the ego vehicle, and the remaining rows are ordered by the Euclidean distance between each surrounding vehicle and the autonomous vehicle. The last two columns of the matrix carry the high-level instruction information: the y coordinate of the desired lane center line and the numeric code of the driving style, respectively. Surrounding vehicles have no corresponding high-level instruction, so their own y coordinate in the current state and the default driving style are used directly.
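The observation-matrix layout described above can be sketched as follows. The column ordering, the numeric driving-style coding and the fixed row count are assumptions made only for illustration.

```python
import numpy as np

STYLE_CODE = {"FAST": 0, "COMFORT": 1, "ECO": 2}  # assumed numeric coding of driving styles

def build_observation(ego_row, other_rows, desired_lane_y, driving_mode, n_rows=6):
    """Build the DRL observation matrix.

    Each row: [x, y, vx, vy, ax, ay, desired_lane_y, style_code].
    Row 0 is the ego vehicle; the remaining rows are surrounding vehicles sorted by
    Euclidean distance to the ego vehicle. Surrounding vehicles carry no high-level
    instruction, so their own y coordinate and the default style are used instead.
    """
    ego = list(ego_row) + [desired_lane_y, STYLE_CODE[driving_mode]]
    others = sorted(other_rows,
                    key=lambda r: np.hypot(r[0] - ego_row[0], r[1] - ego_row[1]))
    rows = [ego]
    for r in others[: n_rows - 1]:
        rows.append(list(r) + [r[1], STYLE_CODE["COMFORT"]])  # own y, default style
    while len(rows) < n_rows:                                  # pad missing vehicles
        rows.append([0.0] * len(ego))
    return np.asarray(rows, dtype=np.float32)
```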
The invention uses a deep reinforcement learning (DRL) network to make real-time decisions on o_t and obtain the low-level control action:

a_t = π_θ(o_t) = (δ_t, acc_t)

where δ_t denotes the steering-wheel angle and acc_t denotes the acceleration/deceleration or throttle/brake control amount.
The deep reinforcement learning (DRL) network is trained with a policy-gradient class algorithm: define the policy π_θ and optimize the following expected return function:

J(θ) = E_{s ~ d^{π_θ}, a ~ π_θ(·|s)} [ r(s, a) ]

where θ denotes the parameters of the policy network and r(s, a) is the reward obtained by taking action a in state s; d^{π_θ} is the state visitation distribution induced by the policy: for a given agent policy π_θ, when the agent starts from the initial state distribution ρ_0 and runs an infinite number of steps with discount factor γ, its discounted state distribution is defined as:

d^{π_θ}(s) = (1 - γ) Σ_{t=0}^{∞} γ^t Pr(s_t = s | s_0 ~ ρ_0, π_θ)

s ~ d^{π_θ} denotes sampling a state from the policy-induced state distribution, a ~ π_θ(·|s) denotes sampling an action from the policy in state s, and E denotes the expectation under these conditions.
The reward is designed as:

R = w_1 R_safe + w_2 R_instr + w_3 R_eff

where:
R_safe relates to safe driving: a positive reward is given when a safe following distance is kept and no collision or violation occurs, and a penalty is given when a dangerous situation occurs;
R_instr rewards compliance with the slow-system command (e.g. target lane, driving mode): a negative reward is given if the vehicle's current actual lane or speed differs significantly from the target requirement;
R_eff relates to factors such as driving efficiency and comfort, e.g. reducing unnecessary lane changes and avoiding frequent acceleration and deceleration. The weighting coefficients w_1, w_2, w_3 are assigned according to different scenarios or requirements.
Through training, the fast-system network gradually learns how to make optimal driving decisions while ensuring safety and obeying instructions, and during actual operation it outputs control signals at high frequency (100 Hz).
In addition, the low-level actions a_t output by the fast-system network during execution can be passed through a safety filtering module: if a potential danger or a command violating traffic rules is detected, the module clips the actions or raises an alarm to ensure the driving safety of the vehicle.
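A minimal sketch of such a safety filtering module is shown below; the steering bound, the minimum time headway and the emergency braking value are illustrative assumptions rather than values prescribed by the invention.

```python
import numpy as np

def safety_filter(action, gap_to_leader, ego_speed,
                  max_steering=0.5, min_time_headway=1.5, emergency_brake=-5.0):
    """Clip or override a (steering, acceleration) action before it is executed.

    gap_to_leader: longitudinal gap to the nearest leading vehicle [m]
    ego_speed:     current forward speed of the ego vehicle [m/s]
    """
    steering, accel = action
    # Clip the steering command to a mechanical/comfort limit
    steering = float(np.clip(steering, -max_steering, max_steering))
    # If the time headway falls below the threshold, force braking and raise an alert
    if ego_speed > 0.0 and gap_to_leader / ego_speed < min_time_headway:
        accel = emergency_brake
        print("safety filter: headway too small, overriding with emergency braking")
    return steering, accel
```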
S4, human-machine collaboration and dynamic feedback:
Dynamic analysis by the slow system: when the external environment or the user instruction changes, the slow system performs semantic analysis and high-level planning again, and uses the current vehicle state s_t together with the new user requirements to generate a new round of high-level instructions I, such as switching the target lane or adjusting the driving mode.
Immediate response by the fast system: after the new instruction I is acquired, it is incorporated into the extended observation space o_t. For example, if a road section is congested due to construction and the slow system gives the instruction "reduce speed and try to change lanes to avoid the congestion", the fast system updates its lane-change and speed-control actions so as to enter a suitable lane in the shortest time while keeping a safe distance.
The system can interact with the user multiple times throughout the process. If the user is not satisfied with the selected scheme, the user can input an adjustment instruction again; the slow system then re-plans the route or vehicle speed, and the fast system immediately executes the new high-level instruction.
Advantageous effects
Compared with the prior art, the invention has the following advantages:
(1) Controllability and flexibility: by incorporating information such as user instructions and the target lane into the DRL observation space, the algorithm's behavior becomes controllable, while the DRL's ability to learn efficiently and respond quickly to unknown complex scenarios is retained, overcoming the limited adaptability of purely rule-based methods.
(2) Enhanced interpretability and human-machine interaction: the slow system, based on a large language model, can generate an interpretable account of, for example, why a certain lane was chosen or how safety and travel time were balanced, according to the user requirement or usage scenario, and the high-level instruction can be adjusted or queried through multiple interactions, realizing human-machine collaboration in a real sense.
(3) Safety and robustness: a safety-first reward function is designed in the training stage, potentially dangerous actions are corrected in time by a safety filtering mechanism and rule constraints in the execution stage, and long-tail scenarios or emergencies are handled adaptively, improving the overall robustness of the system.
(4) Extensibility: the framework is suitable for highly automated driving systems and can also be applied to driving-assistance scenarios requiring human-machine cooperation; the large language model component can be replaced or upgraded according to user requirements or business scenarios, and the fast system's DRL model can be coupled with other AI algorithms, giving the framework good extensibility.
Drawings
FIG. 1 is a process flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the method of the present invention;
FIG. 3 is a schematic diagram of a slow system process flow of the method of the present invention;
FIG. 4 is a schematic diagram of the observation space and training process in the system of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the present invention simulating a two-lane road environment;
FIG. 6 is a graph showing the effect test and comparison of the method of the present invention and the comparison method in various scenes.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Examples
The human-machine collaborative autonomous driving decision-making method based on fast and slow systems has the overall flow shown in Figure 1 and comprises the following steps:
S1, acquiring real-time vehicle and environment state information by data acquisition and environment perception
S11, building a test environment: before the test starts, a virtual simulation environment or a closed test course containing various typical road conditions (multi-lane highway, merging area, two-way two-lane road, etc.) is built, and traffic participants such as surrounding vehicles, pedestrians and traffic lights are configured to simulate a real road environment. A specific two-way two-lane case is illustrated in Fig. 5.
S12, initializing the vehicle and sensors: the vehicle under test (hereinafter, the autonomous vehicle under test) is equipped with on-board sensing devices such as a camera, millimeter-wave radar, LiDAR and GPS/IMU, as well as an on-board communication device and a computing unit, and can share data with a cloud controller or an edge server. The vehicle is started and the sensors are calibrated to obtain the initial state information (vehicle position, speed, lane information, etc.), and it is confirmed that network communication is working normally.
S2 slow-system (LLM) parsing and high-level instruction generation
S21, acquiring the user instruction: the autonomous vehicle under test receives a high-level instruction from the driver or a passenger, in voice or text form, such as "I want to arrive at the company as soon as possible". Voice input is converted by a speech recognition (ASR) module into text that the large language model can parse; text can be entered directly through the in-vehicle human-machine interface.
S22, slow-system analysis and understanding: the user instruction and the current vehicle/environment state are passed together to the slow system (large language model, LLM), as shown in Figure 3. Through semantic analysis and reasoning, combined with preset or real-time road information (speed limits, construction information, traffic flow, etc.), the slow system generates the corresponding high-level decision information.
S23, data packaging and visualization: the slow system packages the analysis result into a data packet in a specified format and records the high-level intent, for example:
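The content of such a packaged high-level intent record could look like the sketch below; the field names and values are illustrative assumptions only.

```python
# Illustrative slow-system output packet (field names and values are assumptions)
high_level_packet = {
    "timestamp": 17.3,                     # s, time at which the decision was made
    "user_instruction": "I want to arrive at the company as soon as possible",
    "reasoning": "The right lane is blocked by a slow truck; overtaking on the left is safe.",
    "target_lane": 1,                      # desired lane index
    "driving_mode": "FAST",                # one of FAST / COMFORT / ECO
}
print(high_level_packet)
```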
s3, real-time decision and control execution of a fast system (DRL)
S31, constructing the observation space: the high-level decision information output by the slow system (such as the target lane) is included in the observation space of the fast system, as shown in Fig. 4; the observation space also contains the ego vehicle's state (speed, position, acceleration), surrounding traffic elements, and so on.
Denote the state vector of the vehicle at time t as s_t; it can be expressed as:

s_t = [p_ego, v_ego, a_ego, Δp_1, Δv_1, ..., Δp_N, Δv_N]
S32, defining the action space: the fast system outputs the low-level driving control command through the deep reinforcement learning network combined with a low-level PID controller:

a_t = (acc_t, δ_t)

where acc_t denotes the acceleration (or deceleration) and δ_t denotes the steering-wheel angle.
S33, training and deployment: the fast system can be trained with a policy-gradient or value-based DRL algorithm. In the training stage, the policy parameters are continuously optimized through a large number of simulation interactions (optionally combined with closed-course tests), so that the system achieves good safety and good compliance with slow-system instructions in complex traffic environments.
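As one possible realization of this training step, the sketch below uses the PPO implementation from the stable-baselines3 library (an assumption for illustration; the invention does not prescribe a specific library) on a Highway-Env environment, assuming the environment has been configured or wrapped elsewhere so that each observation row already carries the slow-system instruction columns.

```python
import gymnasium as gym
import highway_env  # noqa: F401  (importing registers the highway environments)
from stable_baselines3 import PPO

# Assumed setup: the environment configuration/wrapper adds the desired-lane and
# driving-style columns to each observation row, as described in step S31.
env = gym.make("highway-v0")

model = PPO("MlpPolicy", env, verbose=1)   # policy-gradient fast system
model.learn(total_timesteps=200_000)       # optimize the policy in simulation
model.save("fast_system_ppo")
```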
Reward function design and collaborative decision-making: the reward function is a vital component of deep reinforcement learning (DRL); through it, the DRL model learns how to make appropriate decisions to achieve the desired behavior. In the present invention, the reward function is designed to balance vehicle safety, driving efficiency, and compliance with slow-system commands.
To account for safety, efficiency, and the degree of compliance with human instructions, the reward function R of the invention is divided into the following parts:

R = w_1 R_safe + w_2 R_eff + w_3 R_instr

where R_safe is the safety reward: a positive reward is obtained if a safe distance to the leading vehicle is kept and no collision occurs; R_eff is the efficiency reward: the reward is higher if the vehicle's speed is stable and meets the requirements of the fast or energy-saving mode; R_instr is the instruction-compliance reward: the better the vehicle matches the target lane or speed interval issued by the slow system, the larger the reward. The weights w_1, w_2, w_3 can be set or dynamically adjusted according to actual requirements; they balance the importance of the different reward terms, and their values can be tuned for the application scenario to achieve an appropriate trade-off among safety, efficiency and instruction compliance.
Taking the two-way two-lane road as an example, the specific structure of the reward function is as follows:
Efficiency reward term (R_eff):
The high-speed reward encourages the vehicle to maintain a high driving speed, meeting the command requirements of the "fast" driving mode. Based on the vehicle's current speed and the desired speed range, the reward function first calculates a "normalized" value of the current vehicle speed:

scaled_speed = lmap(forward_speed, forward_speed_range, [-1, 1])
where forward_speed is the actual forward speed of the vehicle (calculated from the cosine between the vehicle's velocity vector and its heading direction), mapped into the range [-1, 1];
forward_speed_range is the defined ideal speed range, consisting of a desired minimum speed min_target_speed and a desired maximum speed max_target_speed;
The lmap function performs a linear mapping and is calculated as:

lmap(v, [x_min, x_max], [y_min, y_max]) = y_min + (y_max - y_min) · (v - x_min) / (x_max - x_min)
If the vehicle's speed is high and lies within the desired range, the reward value increases. The specific reward formula is:

R_eff = np.clip(scaled_speed, 0, 1)

where scaled_speed is the calculated "normalized speed value" and np.clip is the NumPy library function for restricting the elements of an array to a specified range: if an element is outside the range, np.clip truncates it to the boundary value. In general, np.clip(x, min, max) takes an input value x to be processed, a lower bound min (inputs below it are truncated to min) and an upper bound max (inputs above it are truncated to max). In the present invention, np.clip is used to keep the "normalized speed value" within [-1, 1]. The normalized speed value (scaled_speed) is obtained by linearly mapping the vehicle's actual driving speed, converting the speed into a uniform scale (-1 to 1) to facilitate subsequent reward calculations.
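A small sketch of the lmap linear mapping and of the resulting high-speed (efficiency) reward term is given below, mirroring the description above; the default speed range and the final clipping to [0, 1] are assumptions.

```python
import numpy as np

def lmap(v, x_range, y_range):
    """Linearly map v from the interval x_range = [x0, x1] to y_range = [y0, y1]."""
    x0, x1 = x_range
    y0, y1 = y_range
    return y0 + (v - x0) * (y1 - y0) / (x1 - x0)

def high_speed_reward(forward_speed, target_speed_range=(20.0, 30.0)):
    """Efficiency reward: larger when the forward speed lies in the desired range."""
    scaled_speed = lmap(forward_speed, target_speed_range, [-1.0, 1.0])
    scaled_speed = np.clip(scaled_speed, -1.0, 1.0)   # keep the normalized value in [-1, 1]
    return float(np.clip(scaled_speed, 0.0, 1.0))     # reward only speeds at/above the range
```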
Safety reward term (R_safe):
The collision reward penalizes vehicle collisions and is the guarantee of safe autonomous-driving behavior. If the vehicle collides (self.vehicle.crashed is 1), the reward is negative; if the vehicle does not collide, this term is zero.
The weight coefficient of this reward term is large in magnitude (-10), so that the vehicle avoids collisions as much as possible.
Preference (instruction-compliance) reward term (R_instr):
The preference reward rewards the vehicle for choosing behaviors that match the user's preferences. For example, a positive reward is given when the vehicle approaches the desired distance to the target lane, and a negative reward is given if the vehicle deviates from the target lane. This reward ensures that the vehicle selects the best possible path according to the user's instructions.
In this way, the system ensures that the vehicle follows the driving objective set by the user and is encouraged to execute the preferred policy.
Comprehensive rewards calculation:
Finally, taking the weights of all reward terms into account, the system computes the final reward value as a weighted sum of the sub-rewards:

R = w_1 R_safe + w_2 R_eff + w_3 R_instr

where R_safe, R_eff and R_instr denote the reward terms for safety, efficiency and instruction compliance, respectively.
To ensure comparability of the reward values across different scenarios, the reward is normalized by mapping the reward value to the interval [0, 1], so that all reward terms lie in the same scale range:
a linear mapping takes the reward value from its configured minimum and maximum to the range [0, 1], ensuring that the reward values output by the system operate on a uniform scale. The fast and slow systems then cooperate: the slow system continuously outputs high-level information according to the user instruction and the traffic situation, so that the target lane or driving mode can be updated in real time, while the fast system performs rapid inference based on the new observation state in each decision period (e.g. 0.1 s or less) and outputs decision information.
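The weighted sum of the sub-rewards and the subsequent linear mapping to [0, 1] can be sketched as follows; the example weights and the configured reward bounds are assumptions.

```python
def total_reward(crashed, r_eff, r_instr,
                 w_collision=-10.0, w_eff=1.0, w_instr=1.0,
                 reward_range=(-10.0, 2.0)):
    """Weighted sum of the sub-rewards, linearly mapped to [0, 1].

    crashed: 1 (or True) if the vehicle collided in this step, else 0 (safety term)
    r_eff:   efficiency (high-speed) reward, assumed in [0, 1]
    r_instr: instruction-compliance / preference reward, assumed in [0, 1]
    """
    raw = w_collision * float(crashed) + w_eff * r_eff + w_instr * r_instr
    lo, hi = reward_range                   # configured minimum / maximum raw reward
    return min(max((raw - lo) / (hi - lo), 0.0), 1.0)
```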
S4, human-machine collaboration and dynamic feedback
S41, handling instruction changes or conflicts: if the user changes the instruction (for example, switching from "drive fast" to "safety first"), the slow system re-plans the high-level information and updates the target lane and driving mode; during human-machine interaction, the system can dynamically display the reason for, or the expected effect of, the decision made by the vehicle.
S42, multi-round interaction and user feedback: if, while the vehicle is running, the user issues an instruction that conflicts with road traffic regulations, the slow system promptly refuses it or suggests a modification via text or the interface, avoiding dangerous behavior; the user can also provide new instructions to the vehicle at any time during the drive.
In actual operation, the whole flow of the invention proceeds as a cyclic iteration (sketched in code after the following list):
1) In step S1, user instructions and environmental changes are continuously monitored for the slow system;
2) In steps S2 and S3, the fast system makes decisions according to the updated observation information and the reward function;
3) In step S4, the system strategy and reward allocation are dynamically adjusted at the human-machine interaction level;
4) This process is repeated until the vehicle trip is completed or the user command is completed.
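A compact sketch of this cyclic interplay between the two systems is shown below. The injected callables (slow_decide, build_obs, fast_policy) stand for the components sketched earlier; their signatures are assumptions, not interfaces fixed by the invention.

```python
def drive_loop(env, slow_decide, build_obs, fast_policy, slow_period=20):
    """Run the fast/slow cooperative decision loop until the episode ends.

    slow_decide(raw_obs) -> (target_lane, driving_mode)   # slow system (LLM), low frequency
    build_obs(raw_obs, instruction) -> observation        # extended observation (step S31)
    fast_policy(observation) -> action                    # trained DRL policy, every step
    """
    raw_obs, _ = env.reset()
    instruction = slow_decide(raw_obs)          # initial high-level instruction
    done, step = False, 0
    while not done:
        if step % slow_period == 0:             # slow system re-plans periodically
            instruction = slow_decide(raw_obs)  # (or when a new user instruction arrives)
        action = fast_policy(build_obs(raw_obs, instruction))
        raw_obs, _reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        step += 1
```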
The training simulation environment is built with Highway-Env and Gymnasium, in which the position and heading of the vehicle are controlled by a closed-loop PID scheme:

v_lat,cmd = -K_p,lat · Δy_lat
ψ_cmd = ψ_lane + arcsin( clip(v_lat,cmd / v, -1, 1) )
δ_cmd = K_p,heading · (ψ_cmd - ψ)

where Δy_lat is the lateral offset of the vehicle relative to the corresponding target-lane center line, v_lat,cmd is the lateral velocity control command, δ_cmd is the front-wheel steering control command, ψ_lane is the lane heading, ψ_cmd is the target heading toward the heading and position of the desired lane, and K_p,lat and K_p,heading are the control gains for position and heading angle, respectively.

The motion of the vehicle is propagated with a kinematic bicycle model:

ẋ = v cos(ψ + β),  ẏ = v sin(ψ + β),  ψ̇ = (v / l_r) sin β,  v̇ = a

where (x, y) is the vehicle position, v is the forward speed, a is the acceleration command, β is the slip angle at the center of gravity, and l_r is the distance from the center of gravity to the rear axle. Surrounding vehicles use the IDM model for longitudinal control and the MOBIL lane-change strategy for lateral control.
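The closed-loop lateral control and the kinematic bicycle propagation described above can be sketched as follows (the structure follows the style of Highway-Env's controlled vehicle; the gains, time step and half-length convention are assumptions).

```python
import numpy as np

def lateral_control(lateral_offset, heading, lane_heading, speed,
                    kp_lateral=0.6, kp_heading=2.0, length=5.0):
    """Closed-loop PID-style steering toward the target lane center line."""
    v_lat_cmd = -kp_lateral * lateral_offset                        # lateral velocity command
    heading_cmd = lane_heading + np.arcsin(np.clip(v_lat_cmd / max(speed, 1e-3), -1.0, 1.0))
    heading_rate_cmd = kp_heading * (heading_cmd - heading)         # heading control
    # Front-wheel steering command that realizes the desired heading rate
    return float(np.arcsin(np.clip(heading_rate_cmd * length / 2 / max(speed, 1e-3), -1.0, 1.0)))

def bicycle_step(x, y, heading, speed, accel, steering, dt=0.05, length=5.0):
    """Propagate the kinematic bicycle model by one time step."""
    beta = np.arctan(0.5 * np.tan(steering))             # slip angle at the center of gravity
    x += speed * np.cos(heading + beta) * dt
    y += speed * np.sin(heading + beta) * dt
    heading += speed * np.sin(beta) / (length / 2) * dt
    speed += accel * dt
    return x, y, heading, speed
```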
Several scenarios were selected for testing, including high-speed car-following, ramp merging, and overtaking via the opposite lane. The LLM-based SOTA algorithm DiLu, the value-based DRL algorithm DQN, and the policy-based DRL algorithm PPO were selected for performance comparison. The results are shown in Figure 6: the proposed model achieves the highest success rate across all scenarios.
A statistical analysis of the results was also carried out; the data are shown in the table below. The model is found to achieve the best balance among safety, efficiency, and compliance with human guidance across multiple scenarios:
the above description is only illustrative of the preferred embodiments of the application and is not intended to limit the scope of the application in any way. Any alterations or modifications of the application, which are obvious to those skilled in the art based on the teachings disclosed above, are intended to be equally effective embodiments, and are intended to be within the scope of the appended claims.

Claims (5)

1. A human-machine collaborative autonomous driving decision-making method based on fast and slow systems, characterized by comprising the following steps:
s1, data acquisition and environment sensing are carried out, and real-time vehicle and environment state information is acquired;
s2, analyzing a slow system and generating a high-level instruction;
S3, real-time decision and control execution of a fast system;
S4, man-machine co-fusion and dynamic feedback;
step S2 is specifically as follows:
The large language model LLM is used as a slow system, and the information input into the LLM comprises the identity positioning information, the real-time vehicle and environment state information and the human instruction information of the LLM;
the LLM identity positioning information is used for enabling the LLM to confirm the identity positioning of the LLM;
the human instruction information consists of abstract or concrete driving instructions input by a user through voice or text;
the real-time vehicle and environment state information is obtained by the step S1, and is required to be converted into a standard expression conforming to natural language rules;
The information is input into a large language model, and human intentions and external environment constraints are comprehensively analyzed by utilizing natural language understanding and reasoning capabilities of the information;
to effectively guide the fast system DRL, the slow system outputs explicit driving strategy elements, including the target lane l_target and the driving mode m_drive; the high-level instruction I output by the slow system is:

I = (l_target, m_drive)

where:
the target lane l_target designates, based on semantic analysis and road information, the lane the vehicle should preferentially select or keep;
the driving mode m_drive includes "fast", "comfortable" and "energy saving";
the slow system recalculates and updates the high-level policy information periodically, or when a user instruction or environmental change is detected, to ensure that the user requirements are continuously met in dynamic scenarios;
In step S3, the fast system is constructed based on the deep reinforcement learning DRL method, specifically as follows:
the high-level instruction I output by the slow system is combined with the vehicle/environment state s_t to form the extended observation space of the fast system:

o_t = [s_t, I] = [s_t, l_target, m_drive]

o_t includes the vehicle's state s_t at time t together with the target-lane and driving-mode instruction elements given by the slow system at the current moment;
the deep reinforcement learning network makes real-time decisions on o_t to obtain the low-level control action:

a_t = π_θ(o_t) = (δ_t, acc_t)

where δ_t denotes the steering-wheel angle and acc_t denotes the acceleration/deceleration or throttle/brake control amount;
training of the deep reinforcement learning network adopts a policy-gradient class algorithm: define the policy π_θ and optimize the following expected return function:

J(θ) = E_{s ~ d^{π_θ}, a ~ π_θ(·|s)} [ r(s, a) ]

where θ denotes the parameters of the policy network and r(s, a) is the reward obtained by taking action a in state s; d^{π_θ} is the state visitation distribution induced by the policy: for a given agent policy π_θ, when the agent starts from the initial state distribution ρ_0 and runs an infinite number of steps with discount factor γ, its discounted state distribution is defined as:

d^{π_θ}(s) = (1 - γ) Σ_{t=0}^{∞} γ^t Pr(s_t = s | s_0 ~ ρ_0, π_θ)

s ~ d^{π_θ} denotes sampling a state from the policy-induced state distribution, a ~ π_θ(·|s) denotes sampling an action from the policy in state s, and E denotes the expectation under these conditions;
wherein the reward is designed as:

R = w_1 R_safe + w_2 R_instr + w_3 R_eff

where:
R_safe relates to safe driving: a positive reward is given when a safe following distance is kept and no collision or violation occurs, and a penalty is given when a dangerous situation occurs;
R_instr rewards compliance with the slow-system instruction: a negative reward is given if the vehicle's current actual lane or speed differs significantly from the target requirement;
R_eff relates to factors such as driving efficiency and comfort, e.g. reducing unnecessary lane changes and avoiding frequent acceleration and deceleration; w_1, w_2, w_3 are weighting coefficients, set according to different scenarios or requirements;
through training, the fast-system network gradually learns how to make optimal driving decisions while ensuring safety and obeying instructions, and outputs control signals at high frequency during actual operation;
in addition, the low-level action a_t output by the fast-system network during execution is checked by a safety filtering module: if a potential danger or a command violating traffic rules is detected, the module clips the action or raises an alarm to ensure the driving safety of the vehicle.
2. The human-machine collaborative autonomous driving decision-making method based on fast and slow systems according to claim 1, wherein step S1 is specifically:
deploying multi-modal sensors on the vehicle and, through real-time sensing of the external environment and the vehicle's own state, acquiring key information: ① the vehicle's own state, including its position p_ego, speed v_ego and acceleration a_ego; ② the surrounding environment state, including adjacent vehicle positions and speeds, lane line information, traffic signals and obstacle positions;
the identified information is then fused and synchronized into the vehicle coordinate system or a global coordinate system to obtain the following elements: vehicle position p_ego, vehicle speed v_ego, and the relative positions Δp_i and relative velocities Δv_i of surrounding vehicles or obstacles; at time t, the preprocessed information is organized into a state vector s_t containing the following elements:

s_t = [p_ego, v_ego, a_ego, Δp_1, Δv_1, ..., Δp_N, Δv_N]

where N is the number of other vehicles within the vehicle's sensing range.
3. The human-machine collaborative autonomous driving decision-making method based on fast and slow systems according to claim 1, wherein the human instruction information consists of abstract or concrete driving instructions input by a user through voice or text: if input as voice, the speech is transcribed into text by a speech recognition module; if input as text, it is obtained directly through the in-vehicle human-machine interaction interface.
4. The human-machine collaborative autonomous driving decision-making method based on fast and slow systems according to claim 1, wherein the observation information obtained by the vehicle and the high-level instruction together form an observation space matrix, each row of the matrix representing the information of one vehicle, including that vehicle's position, speed, acceleration and high-level instruction information; in addition, so that the DRL agent can identify the ego-vehicle information, the first row of the matrix is the ego-vehicle information, and the remaining rows are arranged according to the Euclidean distance between the surrounding vehicles and the autonomous vehicle; the last two columns of the matrix are the high-level instruction information, namely the y coordinate of the desired lane center line and the numeric code of the driving style; surrounding vehicles have no corresponding high-level instruction, so their own y coordinate in the current state and the default driving style are used directly.
5. The human-machine collaborative autonomous driving decision-making method based on fast and slow systems according to claim 1, wherein in step S4 the human-machine collaboration and dynamic feedback comprise:
dynamic analysis by the slow system: when the external environment or the user instruction changes, the slow system performs semantic analysis and high-level planning again, and uses the current vehicle state s_t together with the new user requirements to generate a new round of high-level instructions;
immediate response by the fast system: after the new instruction I is acquired, it is incorporated into the extended observation space o_t, and the low-level action is quickly adjusted through the DRL network;
multi-round interaction and improved user experience: the system can interact with the user multiple times throughout the process; if the user is not satisfied with the selected scheme, the user can input an adjustment instruction again, the slow system re-plans the route or vehicle speed, and the fast system immediately executes the new high-level instruction.
CN202510536060.3A 2025-04-27 2025-04-27 A human-machine collaborative autonomous driving decision-making method based on fast and slow systems Active CN120066281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510536060.3A CN120066281B (en) 2025-04-27 2025-04-27 A human-machine collaborative autonomous driving decision-making method based on fast and slow systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510536060.3A CN120066281B (en) 2025-04-27 2025-04-27 A human-machine collaborative autonomous driving decision-making method based on fast and slow systems

Publications (2)

Publication Number Publication Date
CN120066281A CN120066281A (en) 2025-05-30
CN120066281B true CN120066281B (en) 2025-07-15

Family

ID=95802666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510536060.3A Active CN120066281B (en) 2025-04-27 2025-04-27 A human-machine collaborative autonomous driving decision-making method based on fast and slow systems

Country Status (1)

Country Link
CN (1) CN120066281B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564029A (en) * 2022-11-14 2023-01-03 吉林大学 A High Consistency Human-Machine Hybrid Decision-Making Method Based on Hybrid Augmented Intelligence
CN116353585A (en) * 2023-04-06 2023-06-30 余姚市机器人研究中心 Automatic driving automobile external human-computer interaction system and method based on Lu Yun cooperation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117719535A (en) * 2023-11-23 2024-03-19 同济大学 An interactive adaptive decision-making control method for autonomous vehicles using human feedback

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564029A (en) * 2022-11-14 2023-01-03 吉林大学 A High Consistency Human-Machine Hybrid Decision-Making Method Based on Hybrid Augmented Intelligence
CN116353585A (en) * 2023-04-06 2023-06-30 余姚市机器人研究中心 Automatic driving automobile external human-computer interaction system and method based on Lu Yun cooperation

Also Published As

Publication number Publication date
CN120066281A (en) 2025-05-30

Similar Documents

Publication Publication Date Title
KR102166811B1 (en) Method and Apparatus for Controlling of Autonomous Vehicle using Deep Reinforcement Learning and Driver Assistance System
Li et al. Combined trajectory planning and tracking for autonomous vehicle considering driving styles
CN109598934B (en) Rule and learning model-based method for enabling unmanned vehicle to drive away from high speed
CN111289978A (en) Method and system for making decision on unmanned driving behavior of vehicle
CN114013443A (en) A lane-changing decision control method for autonomous vehicles based on hierarchical reinforcement learning
CN114707364B (en) Ramp vehicle convergence simulation method, device, equipment and readable storage medium
CN114987538B (en) A cooperative lane-changing method considering multi-objective optimization in a connected autonomous driving environment
CN118212808B (en) Method, system and equipment for planning traffic decision of signalless intersection
CN116935671A (en) Automatic road intersection management method based on projection type constraint strategy optimization
CN117429431A (en) Channel switching decision and time delay compensation control method and device based on prediction information
Vasquez et al. Multi-objective autonomous braking system using naturalistic dataset
CN114802306A (en) Intelligent vehicle integrated decision-making system based on man-machine co-driving concept
Guo et al. Toward human-like behavior generation in urban environment based on Markov decision process with hybrid potential maps
CN117312841A (en) Method for formulating training data of automatic driving vehicle, electronic equipment and medium
CN114360290A (en) Method for selecting vehicle group lanes in front of intersection based on reinforcement learning
Islam et al. Enhancing longitudinal velocity control with attention mechanism-based deep deterministic policy gradient (ddpg) for safety and comfort
CN117962926A (en) Autonomous driving decision system based on deep reinforcement learning
CN119568155A (en) Automatic driving vehicle expressway intelligent lane changing method based on reinforcement learning
CN118451430A (en) Method and processor unit for optimizing the consumption of fully automatic or semi-automatic driving maneuvers of a motor vehicle, and correspondingly equipped motor vehicle and system
CN112835362B (en) Automatic lane change planning method and device, electronic equipment and storage medium
Yuan et al. From naturalistic traffic data to learning-based driving policy: A sim-to-real study
Zhang et al. A game theoretic four-stage model predictive controller for highway driving
CN120066281B (en) A human-machine collaborative autonomous driving decision-making method based on fast and slow systems
Yu et al. Game-Theoretic Model Predictive Control for Safety-Assured Autonomous Vehicle Overtaking in Mixed-Autonomy Environment
CN118004217A (en) Autonomous driving behavior decision system considering social preferences of surrounding vehicles

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant