CN115545188A - Multitask offline data sharing method and system based on uncertainty estimation - Google Patents
Multitask offline data sharing method and system based on uncertainty estimation
- Publication number
- CN115545188A (application CN202211307085.9A)
- Authority
- CN
- China
- Prior art keywords
- learning
- uncertainty
- multitask
- offline
- value function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Feedback Control In General (AREA)
Abstract
The invention relates generally to the technical field of reinforcement learning and provides a multitask offline data sharing method and system based on uncertainty estimation. The method comprises the following steps: providing a multitask offline data set, wherein the multitask offline data set comprises a plurality of tasks; performing data sharing by using the multitask offline data set to generate a mixed data set; and performing offline policy learning according to the mixed data set, wherein the offline policy learning comprises: training a plurality of value function networks according to the mixed data set and generating a plurality of prediction results; performing an uncertainty calculation using the standard deviation of the plurality of prediction results; and performing policy learning based on the result of the uncertainty calculation. The invention greatly improves the efficiency of data sharing, uses an approximate Bayesian posterior to measure the uncertainty of the data, fits the application scenario of offline reinforcement learning, and can be used for large-scale robot tasks.
Description
Technical Field
The present invention relates generally to the field of reinforcement learning. Specifically, the invention relates to a multitask offline data sharing method and system based on uncertainty estimation.
Background
In offline reinforcement learning, an agent learns a policy from a fixed data set. Consider a multitask offline learning scenario in which each task corresponds to an independent offline data set. Since the experience contained in each offline data set is limited, the offline samples can only cover a limited part of the state and action spaces. During learning, the policy optimized by a value-function-based reinforcement learning algorithm deviates from the policy space covered by the offline samples, so the algorithm's estimate of the target value function becomes inaccurate or even diverges.
A direct idea for solving this problem is to enlarge the amount of data for a task by using the data of other tasks in the multitask data set, thereby improving the coverage of the state-action space. However, because different tasks have different learning objectives, the behavior policies used when sampling their data also differ considerably. Direct data augmentation therefore introduces a severe distribution shift problem into offline reinforcement learning. To address the distribution shift problem, existing data sharing methods include data sharing based on human prior knowledge, data sharing based on a policy distance measure, and offline data sharing based on a pessimistic value function.
Data sharing based on human prior knowledge assumes some knowledge of how strongly different tasks are related. For example, the tasks of a quadruped robot may include standing, walking, jumping, running, and so on; when learning the running task, sharing the data of the walking task may be considered, and when learning the jumping task, sharing the data of the standing task may be considered, since the agent needs to stand before it jumps. Based on a subjective understanding of how the tasks are related, a human can weight the tasks and give larger weights to more strongly related tasks during learning, thereby reducing the distribution shift problem.
Data sharing based on a policy distance measure means that, to measure the relation between a shared data set and the original task data, the distance between the two behavior policies can be computed from the given data sets. For example, for each data set, the mapping from states to actions is fitted by supervised learning to obtain a behavior policy. When sharing, the distance between the behavior policies of the tasks (such as the KL divergence or the maximum mean discrepancy) is computed; two tasks whose policies are close can be considered related, and their data can be shared, while tasks whose policies are far apart do not share data, which reduces the distribution shift problem.
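For illustration only, the following minimal Python sketch shows the policy-distance idea described above under strong simplifying assumptions: each behavior policy is summarized by a state-independent diagonal Gaussian fitted to the dataset's actions (a full implementation would fit a state-conditioned policy by supervised learning), and the KL threshold, function names, and toy data are all illustrative, not part of the prior art being summarized.

```python
import numpy as np

def fit_gaussian_policy(actions):
    # Crude behavior-policy summary: a state-independent diagonal Gaussian over actions.
    mu = actions.mean(axis=0)
    sigma = actions.std(axis=0) + 1e-6
    return mu, sigma

def gaussian_kl(p, q):
    # Closed-form KL(p || q) between diagonal Gaussians p = (mu_p, sigma_p), q = (mu_q, sigma_q).
    mu_p, sig_p = p
    mu_q, sig_q = q
    return float(np.sum(np.log(sig_q / sig_p)
                        + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sig_q ** 2)
                        - 0.5))

def should_share(actions_main, actions_other, kl_threshold=1.0):
    # Share data only if the symmetrized policy distance is small.
    p = fit_gaussian_policy(actions_main)
    q = fit_gaussian_policy(actions_other)
    distance = 0.5 * (gaussian_kl(p, q) + gaussian_kl(q, p))
    return distance < kl_threshold

# Toy usage with random action batches (action dimension 4).
rng = np.random.default_rng(0)
print(should_share(rng.normal(0.0, 1.0, size=(1000, 4)),
                   rng.normal(0.2, 1.1, size=(1000, 4))))
```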
Offline data sharing based on a pessimistic value function trains a pessimistic value function on the shared data set with an offline reinforcement learning method, thereby measuring the value of each state-action pair for the task. If the value function of a state-action pair is low, there may be two reasons: first, the state-action pair cannot yield a high reward signal, resulting in a small value function estimate; second, the state-action pair is irrelevant to the task and has nothing to do with the optimal policy for solving it. The second case is usually caused by sharing data between unrelated tasks. Based on the pessimistic value function estimates, the data in the shared data set can be ranked, and the high-value data are selected for training.
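Similarly, a minimal sketch of the pessimistic-value-based selection described above is given below; the flat array of precomputed pessimistic Q estimates, the 50% cutoff, and the function name are assumptions made only for illustration.

```python
import numpy as np

def select_by_pessimistic_value(pessimistic_q, keep_fraction=0.5):
    # Rank shared samples by their pessimistic value estimate and keep the top fraction.
    order = np.argsort(pessimistic_q)[::-1]      # highest estimated value first
    n_keep = int(len(order) * keep_fraction)
    return order[:n_keep]                        # indices of the samples kept for training

# Toy usage: keep the half of 1000 shared samples with the highest estimates.
q_est = np.random.default_rng(0).normal(size=1000)
kept_indices = select_by_pessimistic_value(q_est)
```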
However, the existing offline data sharing method still has the following problems:
1. Data selection needs to be performed according to certain rules. The prior art selects data based on human knowledge, policy distance, or value functions. Human knowledge lacks an automated process, is highly subjective, and becomes difficult to apply when the number of tasks is large. Both policy distance and value function estimation require additional computation modules: the policy distance requires deep neural networks to fit the distribution of the behavior policy in each data set, which is often a mixture policy, and the value function estimate must be learned with a policy evaluation method and the Bellman operator. After estimation, the value functions of the offline shared samples must be sorted and the subset with higher value functions selected for policy learning, which requires additional computation.
2. Data selection and multitask offline learning are separated. After data selection, the agent obtains an offline multitask data set, from which it subsequently learns using an offline reinforcement learning method. However, there is a gap between data selection and multitask offline learning: the two use different algorithms and different learning mechanisms. The offline learning method ultimately used cannot influence the data selection process, and if the data selection mechanism is weak, the subsequent offline learning is strongly affected.
Disclosure of Invention
To at least partially solve the above problems in the prior art, the present invention provides a multitask data sharing method for offline reinforcement learning, comprising the following steps:
providing a multitask offline data set, the multitask offline data set comprising a plurality of tasks;
performing data sharing by using the multitask offline data set to generate a mixed data set;
performing offline policy learning according to the mixed data set, wherein the offline policy learning comprises:
training a plurality of value function networks according to the mixed data set and generating a plurality of prediction results;
performing an uncertainty calculation using standard deviations of the plurality of predicted outcomes; and
performing policy learning based on a result of the uncertainty calculation.
In one embodiment of the invention, it is provided that the data sharing using the multitask offline data set to generate a mixed data set comprises the following steps:
selecting a main task and a shared task among the plurality of tasks, wherein data is shared from the shared task while learning the main task;
performing reward relabeling on the data in the shared task, wherein the reward of the sample in the shared task is recalculated according to the reward function of the main task; and
mixing the shared task with the main task to generate the mixed data set, as illustrated in the sketch below.
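A minimal sketch of this reward relabeling and mixing step follows; the dictionary-of-arrays dataset layout and the signature of the main task's reward function are illustrative assumptions, not requirements of the embodiment.

```python
import numpy as np

def relabel_and_mix(main_data, shared_data, main_reward_fn):
    # Each dataset is a dict of aligned arrays: 's' (states), 'a' (actions),
    # 'r' (rewards), 's_next' (next states).
    # Recompute the shared samples' rewards under the main task's reward function
    # so that the mixed data set has a uniform reward function.
    relabeled_r = np.array([
        main_reward_fn(s, a, s_next)
        for s, a, s_next in zip(shared_data["s"], shared_data["a"], shared_data["s_next"])
    ])
    # Concatenate the relabeled shared data with the main task data.
    return {
        "s": np.concatenate([main_data["s"], shared_data["s"]]),
        "a": np.concatenate([main_data["a"], shared_data["a"]]),
        "r": np.concatenate([main_data["r"], relabeled_r]),
        "s_next": np.concatenate([main_data["s_next"], shared_data["s_next"]]),
    }
```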
In one embodiment of the invention, it is provided that the plurality of value function networks have the same network structure and different initialization parameters, wherein the plurality of value function networks are trained using a stochastic gradient method to estimate a Bayesian posterior distribution of the value function.
In one embodiment of the invention, it is provided that the value function is learned by the critic model of an actor-critic algorithm and iterated through the Bellman operator, wherein the method comprises the following steps:
representing experience stored in the mixed dataset as a set of state transition tuples (s, a, r, s '), where s represents state, a represents action, r represents reward, and s' represents state at the next time;
setting a learning objective y of the value function Q(s, a) according to the Bellman operator, expressed as:
y = r + γ max_a′ Q(s′, a′),
wherein r represents the single-step environment reward, γ represents the discount factor of the reward over time, and a′ represents the greedy action at the next time step;
expressing the Bellman loss L as: L = (Q(s, a) − y)²; and
training the value function by minimizing the Bellman loss L, as illustrated in the sketch below.
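A minimal sketch of this value function training step is given below, assuming PyTorch; the network sizes, the separate target network, and the externally supplied greedy next action a′ are illustrative assumptions rather than requirements of the embodiment.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Value network Q(s, a): takes the concatenation of state and action, outputs a scalar.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def bellman_loss(q_net, q_target, batch, next_action, gamma=0.99):
    # L = (Q(s, a) - y)^2 with y = r + gamma * Q_target(s', a'),
    # where a' is the greedy action at the next time step (here supplied externally).
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * q_target(s_next, next_action)
    return ((q_net(s, a) - y) ** 2).mean()

# Toy usage: one gradient step on a random batch (state_dim=8, action_dim=2).
q, q_tgt = QNetwork(8, 2), QNetwork(8, 2)
opt = torch.optim.Adam(q.parameters(), lr=3e-4)
s, a = torch.randn(32, 8), torch.randn(32, 2)
r, s2, a2 = torch.randn(32), torch.randn(32, 8), torch.randn(32, 2)  # a2 stands in for the greedy next action
loss = bellman_loss(q, q_tgt, (s, a, r, s2), a2)
opt.zero_grad(); loss.backward(); opt.step()
```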
In one embodiment of the present invention, it is provided that the uncertainty Γ(s, a) of the state-action pair (s, a) is calculated using the standard deviation of the plurality of prediction results, expressed as:
Γ(s, a) = Std(Q_i(s, a)),
where i ∈ [1, K] and K represents the number of value function networks; a sketch of this computation is given below.
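For illustration, assuming an ensemble of K value networks with the interface of the QNetwork sketch above (each mapping a batch of states and actions to a batch of Q-values), the uncertainty of this embodiment can be computed as follows.

```python
import torch

def ensemble_uncertainty(q_ensemble, s, a):
    # Gamma(s, a) = Std(Q_i(s, a)) over the K networks in the ensemble.
    with torch.no_grad():
        q_values = torch.stack([q(s, a) for q in q_ensemble], dim=0)  # shape (K, batch)
    return q_values.std(dim=0)  # standard deviation across the K predictions
```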
In one embodiment of the invention, it is provided that performing policy learning based on the result of the uncertainty calculation comprises:
resetting the learning objective y using the result of the uncertainty calculation as a penalty in the value function learning, expressed as:
y = r + γ max_a′ Q(s′, a′) − Γ(s′, a′); and
performing policy learning based on the penalized learning objective, wherein the policy output is obtained by optimizing min_i Q_i, i ∈ [1, K], as sketched below.
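A sketch of the penalized learning target and the min-Q policy objective follows, under the same PyTorch assumptions as above; using the ensemble mean for Q(s′, a′) in the penalized target is an additional assumption, since the embodiment does not specify which network supplies that term.

```python
import torch

def penalized_target(q_target_ensemble, r, s_next, a_next, gamma=0.99):
    # y = r + gamma * Q(s', a') - Gamma(s', a'), with Q taken as the ensemble mean
    # and Gamma as the standard deviation across the ensemble.
    with torch.no_grad():
        q_next = torch.stack([q(s_next, a_next) for q in q_target_ensemble], dim=0)
        return r + gamma * (q_next.mean(dim=0) - q_next.std(dim=0))

def policy_objective(q_ensemble, s, policy_action):
    # Actor loss: maximize min_i Q_i(s, pi(s)), i.e. minimize its negative.
    q_values = torch.stack([q(s, policy_action) for q in q_ensemble], dim=0)
    return -q_values.min(dim=0).values.mean()
```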
The invention also provides a multitask offline data sharing system based on uncertainty estimation, comprising:
a data sharing module configured to perform the following acts:
providing a multitask offline data set, the multitask offline data set comprising a plurality of tasks; and
performing data sharing by using the multitask offline data set to generate a mixed data set;
a policy learning module configured to perform offline policy learning according to the mixed data set.
In one embodiment of the invention, it is provided that the policy learning module comprises:
a value function learning module configured to train a plurality of value function networks from the mixed dataset and generate a plurality of predicted results;
an uncertainty metric module configured to perform an uncertainty calculation using the standard deviation of the plurality of prediction results; and
a policy learning module configured to perform policy learning based on a result of the uncertainty calculation.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps according to the method.
The present invention also provides a computer system, comprising:
a processor configured to execute machine executable instructions; and
a memory having stored thereon machine executable instructions which, when executed by the processor, perform steps according to the method.
The invention has at least the following beneficial effects: the invention provides a multitask offline data sharing method and system based on uncertainty estimation, in which the extra data selection module is removed when multitask data are shared, and uncertainty estimation and policy learning are performed directly on the shared multitask data; the original single value network is expanded into an ensemble of value function networks, the standard deviation of the ensemble's value function estimates is used as the uncertainty prediction, and the ensemble approximately estimates the posterior distribution of the value function, providing a theoretical guarantee for the uncertainty measure; in addition, the uncertainty is used as a penalty in value function learning during updating, so that samples with larger uncertainty receive a larger penalty in the value function learning process, and a stable offline policy can be learned on this basis.
Compared with the data selection method based on the magnitude of a pessimistic value function, the present method does not need to perform additional data selection and learns directly from all shared data, greatly improving the efficiency of data sharing. In addition, the invention uses an approximate Bayesian posterior to measure the uncertainty of the data and uses the uncertainty to measure the value of the data, which better fits the application scenario of offline reinforcement learning and has a theoretical guarantee. Moreover, the ensemble model implementation can be applied to high-dimensional state and action spaces and can be used for large-scale robot tasks.
Drawings
To further clarify the advantages and features that may be present in various embodiments of the present invention, a more particular description of various embodiments of the invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
FIG. 1 illustrates a computer system implementing systems and/or methods in accordance with the present invention.
FIG. 2 is a flow chart illustrating a multitask offline data sharing method based on uncertainty estimation according to an embodiment of the present invention.
FIG. 3 is a block diagram of a multitask offline data sharing method based on uncertainty estimation according to an embodiment of the present invention.
FIG. 4 shows a diagram of an ensemble-model-based uncertainty measure in one embodiment of the present invention.
FIG. 5 is a graph comparing the effect of the present method and the prior art on a physics simulator in one embodiment of the present invention.
Detailed Description
It should be noted that the components in the figures may be exaggerated and not necessarily to scale for illustrative purposes. In the figures, identical or functionally identical components are provided with the same reference symbols.
In the present invention, "disposed on", "disposed over", and "disposed above" do not exclude the presence of an intermediate element in between, unless otherwise specified. Furthermore, "disposed on or above" merely indicates the relative positional relationship between two components; in certain cases, for example after reversing the product orientation, it can also be converted to "disposed under or below", and vice versa.
In the present invention, the embodiments are only intended to illustrate the aspects of the present invention, and should not be construed as limiting.
In the present invention, the terms "a" and "an" do not exclude the presence of a plurality of elements, unless otherwise specified.
It is further noted herein that in embodiments of the present invention, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that, given the teachings of the present invention, required components or assemblies may be added as needed for a particular situation. Furthermore, features from different embodiments of the invention may be combined with each other, unless otherwise indicated. For example, a feature of the second embodiment may be substituted for a corresponding or functionally equivalent or similar feature of the first embodiment, and the resulting embodiments are likewise within the scope of the disclosure or recitation of the present application.
It is also noted herein that, within the scope of the present invention, the terms "same", "equal", and the like do not mean that two values are absolutely equal, but allow some reasonable error; that is, the terms also encompass "substantially the same" and "substantially equal". By analogy, in the present invention, the direction terms "perpendicular", "parallel", and the like likewise cover "substantially perpendicular" and "substantially parallel".
In the present invention, the module may be implemented in software, hardware, firmware, or a combination thereof.
The numbering of the steps of the methods of the present invention does not limit the order of execution of the steps of the methods. Unless specifically stated, the method steps may be performed in a different order.
The invention is further elucidated with reference to the drawings in conjunction with the detailed description.
FIG. 1 illustrates a computer system 100 implementing systems and/or methods in accordance with the present invention. Unless specifically stated otherwise, a method and/or system in accordance with the present invention may be implemented in the computer system 100 shown in FIG. 1 for purposes of the present invention, or the present invention may be implemented in a distributed fashion across a network, such as a local area network or the Internet, among multiple computer systems 100 in accordance with the present invention. Computer system 100 of the present invention may comprise various types of computer systems, such as hand-held devices, laptop computers, personal Digital Assistants (PDAs), multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, network servers, tablet computers, and the like.
As shown in FIG. 1, computer system 100 includes a processor 111, a system bus 101, a system memory 102, a video adapter 105, an audio adapter 107, a hard drive interface 109, an optical drive interface 113, a network interface 114, and a Universal Serial Bus (USB) interface 112. The system bus 101 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system bus 101 is used for communication between the respective bus devices. In addition to the bus devices or interfaces shown in fig. 1, other bus devices or interfaces are also contemplated. The system memory 102 includes a Read Only Memory (ROM) 103 and a Random Access Memory (RAM) 104, where the ROM 103 may store, for example, basic input/output system (BIOS) data of basic routines for implementing information transfer at start-up, and the RAM 104 is used to provide a fast-access operating memory for the system. The computer system 100 further includes a hard disk drive 109 for reading from and writing to a hard disk 110, an optical drive interface 113 for reading from or writing to optical media such as a CD-ROM, and the like. Hard disk 110 may store, for example, an operating system and application programs. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer system 100. Computer system 100 may also include a video adapter 105 for image processing and/or image output for connecting an output device such as a display 106. Computer system 100 may also include an audio adapter 107 for audio processing and/or audio output, for connecting output devices such as speakers 108. In addition, computer system 100 may also include a network interface 114 for network connections, where network interface 114 may connect to the Internet 116 through a network device, such as a router 115, where the connection may be wired or wireless. In addition, computer system 100 may also include a universal serial bus interface (USB) 112 for connecting peripheral devices, including, for example, a keyboard 117, a mouse 118, and other peripheral devices, such as a microphone, a camera, and the like.
When the present invention is implemented on the computer system 100 described in FIG. 1, the extra data selection module can be removed when sharing multitask data, and uncertainty estimation and policy learning can be performed directly on the shared multitask data; the original single value network is expanded into an ensemble of value function networks, the standard deviation of the ensemble's value function estimates is used as the uncertainty prediction, and the ensemble approximately estimates the posterior distribution of the value function, providing a theoretical guarantee for the uncertainty measure; in addition, the uncertainty is used as a penalty in value function learning during updating, so that samples with larger uncertainty receive a larger penalty in the value function learning process, and a stable offline policy can be learned on this basis.
Furthermore, embodiments may be provided as a computer program product that may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines such as a computer, network of computers, or other electronic devices, may result in the one or more machines performing operations in accordance with embodiments of the present invention. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), and magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions.
Moreover, embodiments may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of one or more data signals embodied in and/or modulated by a carrier wave or other propagation medium via a communication link (e.g., a modem and/or network connection). Thus, a machine-readable medium as used herein may include, but is not necessarily required to be, such a carrier wave.
FIG. 2 is a flow chart illustrating a multitask offline data sharing method based on uncertainty estimation according to an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:
Step 201, providing a multitask offline data set, wherein the multitask offline data set comprises a plurality of tasks.
Step 202, performing data sharing by using the multitask offline data set to generate a mixed data set.
Step 203, performing offline policy learning according to the mixed data set.
The system for operating the multitask offline data sharing method based on uncertainty estimation can comprise a data sharing module and a policy learning module.
Wherein the data sharing module can perform the following actions:
providing a multitask offline data set, wherein the multitask offline data set comprises a plurality of tasks; and
performing data sharing by using the multitask offline data set to generate a mixed data set;
the strategy learning module can perform offline strategy learning according to the mixed data set, wherein the strategy learning module comprises:
a value function learning module that trains a plurality of value function networks from the mixed dataset and generates a plurality of prediction results;
an uncertainty metric module that performs an uncertainty calculation using the standard deviation of the plurality of prediction results; and
a policy learning module that performs policy learning based on a result of the uncertainty calculation.
The method and system of the present invention will be described in detail with reference to the following examples.
FIG. 3 is a block diagram of a multitask offline data sharing method based on uncertainty estimation according to an embodiment of the present invention. As shown in FIG. 3, the method may include both data sharing and policy learning aspects.
In the data sharing stage, the complex data selection module is removed, and the data of other tasks can be shared directly. Taking FIG. 3 as an example, assume that the main task is A1 and that data are to be shared from task Ai when learning task A1. All data of task Ai are then reward-relabeled and combined with the A1 data into a mixed offline data set for policy learning. Reward relabeling recalculates the rewards of the samples in task Ai using the reward function of task A1, so that the whole shared data set has a uniform reward function. Since an uncertainty measure is used in the policy learning stage, directly sharing data does not cause a distribution shift problem even if the correlation between task Ai and task A1 is low.
The uncertainty-based multitask policy learning stage comprises the following three modules.
In the value function learning module, the value function is learned using the actor-critic algorithm of reinforcement learning: the critic model learns the value function and iterates through the Bellman operator. The experience stored in the offline data set is a set of state transition tuples (s, a, r, s′), each tuple comprising a state s, an action a, a reward r, and a next-time state s′. The Bellman operator sets the learning target of the value function Q(s, a) to y = r + γ max_a′ Q(s′, a′), where a′ is the greedy action at the next time step. The Bellman loss is defined as L = (Q(s, a) − y)². A single value function can be trained by minimizing this loss function. In this patent, the original value function network is expanded into an ensemble network comprising about five independent networks. Each network has the same network structure but different initialization parameters and is trained iteratively on its own. When trained with a stochastic gradient method, the different value function networks generate different gradients, so their optimization directions differ slightly and their parameters end up different. It can be shown theoretically that the ensemble value function model approximates a Bayesian posterior of the value function.
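For illustration, a minimal sketch of constructing such an ensemble is given below; PyTorch is assumed, K = 5 follows the "about five independent networks" mentioned above, and the constructor argument, seeding scheme, and learning rate are illustrative choices.

```python
import torch
import torch.nn as nn

def build_ensemble(make_q_net, k=5, lr=3e-4, seed=0):
    # K value networks with identical architecture but different random initialization,
    # each with its own stochastic-gradient optimizer so that training drives them apart.
    ensemble, optimizers = [], []
    for i in range(k):
        torch.manual_seed(seed + i)   # different initialization per member
        q = make_q_net()              # identical architecture for every member
        ensemble.append(q)
        optimizers.append(torch.optim.Adam(q.parameters(), lr=lr))
    return ensemble, optimizers

# Toy usage: a tiny value network over the concatenation torch.cat([s, a], dim=-1) of dimension 10.
ensemble, opts = build_ensemble(lambda: nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1)))
```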
In the uncertainty measurement module, on the basis of the learned ensemble value function, the standard deviation of the K value function predictions is used as the uncertainty estimate Γ(s, a) of the state-action pair (s, a). Formally, Γ(s, a) = Std(Q_i(s, a)), where i ∈ [1, K]. In an offline data set, the uncertainty measures how each state-action pair is distributed within the data set. If the uncertainty is high, the shared data are far from the data distribution of the original task and the value function estimate may be inaccurate. If the uncertainty is low, the shared data are close to the original task, and only a small uncertainty penalty should be applied in learning.
FIG. 4 shows a diagram of the ensemble-model-based uncertainty measure in one embodiment of the present invention. In the single-dataset task shown in the left diagram of FIG. 4, the first data points 201 are concentrated: areas near these data points have lower uncertainty, while the remaining areas have higher uncertainty. This shows that the ensemble-based uncertainty accurately measures the distribution of samples in the state-action space. The right diagram shows the uncertainty distribution after sharing the data of the remaining two tasks, where the second data points 202 and the third data points 203 come from the offline data of the other tasks. It can be seen that the uncertainty over the whole space changes after offline data sharing, and data sharing reduces the uncertainty in many areas. Since the uncertainty is used as a penalty in value function estimation, the penalty on the value function is smaller in regions with lower uncertainty, so the policy can be iterated and optimized over a wider region.
In the policy learning module, uncertainty-based policy learning is performed on the basis of the uncertainty measure. First, using the uncertainty as a penalty in value function learning, the learning target is reset to y = r + γ max_a′ Q(s′, a′) − Γ(s′, a′). The larger the uncertainty, the larger the penalty term contained in the value function, which yields a conservative value estimate. Shared data unrelated to the optimal policy of the main task receive greater uncertainty under the ensemble model, which reduces their influence on learning. Policy learning is then carried out on this basis: the policy output is optimized against min_i Q_i, i ∈ [1, K], meaning that the minimum over the ensemble of value function networks is used for optimization. The policy is thus optimized against a lower bound of the value function under the Bayesian posterior estimate, ensuring that a conservative and stable policy is learned.
On the basis of this model design, the effect of multitask offline learning is evaluated using robot multitask offline data sets. The uncertainty measure shown in FIG. 4 corresponds to a two-dimensional input, whereas in an actual robot task the state tends to be high-dimensional. For example, the state of a robotic arm typically includes information such as the position and velocity of each joint, usually 10-20 dimensions. The value function network takes the combination of state and action as input and outputs an estimate of the value function. The uncertainty of high-dimensional state-action pairs can be measured by the ensemble model, so that samples with high uncertainty are penalized in learning, reducing the distribution shift caused by multitask data sharing. The proposed method can be tested in environments such as robotic arms and quadruped robots using robot simulation environments such as the DeepMind Control Suite, comprehensively evaluating its application value and practical effect. In addition, the policy learned in simulation can be transferred to a real robot scenario.
In the prior art, the data selection method based on the magnitude of a pessimistic value function selects shared samples with higher value functions under the given task for training, so that high-value samples can be screened out. However, this method requires an additional data selection module: after value function learning, the value functions of the data must be sorted and the high-value shared samples selected before the policy is learned. Moreover, since the value function is iterated continuously, the data must be re-selected according to the latest value function after each update, which incurs a high computational cost.
Compared with the data selection method based on the magnitude of a pessimistic value function, the present method does not need to perform additional data selection and learns directly from all shared data, greatly improving the efficiency of data sharing. In addition, the invention uses an approximate Bayesian posterior to measure the uncertainty of the data and uses the uncertainty to measure the value of the data, which better fits the application scenario of offline reinforcement learning and has a theoretical guarantee. Moreover, the ensemble model implementation can be applied to high-dimensional state and action spaces and can be used for large-scale robot tasks.
FIG. 5 is a graph comparing the performance of the present method and the prior art on a physics simulator in one embodiment of the present invention. As shown in FIG. 5, the method is trained on the DeepMind Control Suite physics simulator, which contains a plurality of robot-related tasks. The results show that the model achieves a better effect than other current related methods.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (10)
1. A multitask offline data sharing method based on uncertainty estimation is characterized by comprising the following steps:
providing a multitask offline data set, the multitask offline data set comprising a plurality of tasks;
performing data sharing by using the multitask offline data set to generate a mixed data set; and
performing offline policy learning according to the mixed data set, comprising:
training a plurality of value function networks according to the mixed data set and generating a plurality of prediction results;
performing an uncertainty calculation using standard deviations of the plurality of predicted outcomes; and
performing policy learning based on the result of the uncertainty calculation.
2. The method of claim 1, wherein the step of sharing data with the multitask offline data set to generate a mixed data set comprises the steps of:
selecting a main task and a shared task among the plurality of tasks, wherein data is shared from the shared task while learning the main task;
performing reward relabeling on the data in the shared task, wherein the reward of the sample in the shared task is recalculated according to the reward function of the main task; and
mixing the shared task with the main task to generate the mixed data set.
3. The uncertainty estimation based multitask offline data sharing method according to claim 2, wherein the plurality of value function networks have the same network structure and different initialization parameters, and wherein the plurality of value function networks are trained using a stochastic gradient method to estimate a Bayesian posterior distribution of the value function.
4. The uncertainty estimation-based multitask offline data sharing method according to claim 3, characterized by learning the value function through the critic model of an actor-critic algorithm and iterating through the Bellman operator, comprising the following steps:
representing experience stored in the mixed dataset as a set of state transition tuples (s, a, r, s '), where s represents state, a represents action, r represents reward, and s' represents state at the next time;
setting a learning objective y of the value function Q(s, a) according to the Bellman operator, expressed as:
y = r + γ max_a′ Q(s′, a′),
wherein r represents the single-step environment reward, γ represents the discount factor of the reward over time, and a′ represents the greedy action at the next time step;
expressing the Bellman loss L as: L = (Q(s, a) − y)²; and
training the value function by minimizing the Bellman loss L.
5. The method of claim 4, wherein the uncertainty Γ(s, a) of the state-action pair (s, a) is computed using the standard deviation of the plurality of prediction results, expressed as:
Γ(s, a) = Std(Q_i(s, a)),
where i ∈ [1, K] and K represents the number of value function networks.
6. The method of claim 5, wherein performing policy learning based on the result of the uncertainty calculation comprises:
resetting the learning objective y using the result of the uncertainty calculation as a penalty in the value function learning, expressed as:
y = r + γ max_a′ Q(s′, a′) − Γ(s′, a′); and
performing policy learning according to the penalized learning objective, wherein the policy output is obtained by optimizing min_i Q_i, i ∈ [1, K].
7. A multitask offline data sharing system based on uncertainty estimation, the system comprising:
a data sharing module configured to perform the following actions:
providing a multitask offline data set, wherein the multitask offline data set comprises a plurality of tasks; and
performing data sharing by using the multitask offline data set to generate a mixed data set; and
a policy learning module configured to perform offline policy learning according to the mixed data set.
8. The uncertainty estimation based multitask offline data sharing system according to claim 7, wherein said policy learning module comprises:
a value function learning module configured to train a plurality of value function networks from the mixed dataset and generate a plurality of prediction results;
an uncertainty metric module configured to perform an uncertainty calculation using the standard deviation of the plurality of prediction results; and
a policy learning module configured to perform policy learning based on a result of the uncertainty calculation.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to one of claims 1 to 6.
10. A computer system, comprising:
a processor configured to execute machine executable instructions; and
memory having stored thereon machine executable instructions which, when executed by the processor, perform the steps of the method according to one of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211307085.9A CN115545188B (en) | 2022-10-24 | 2022-10-24 | Multi-task offline data sharing method and system based on uncertainty estimation |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115545188A true CN115545188A (en) | 2022-12-30 |
| CN115545188B CN115545188B (en) | 2024-06-14 |
Family
ID=84718760
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211307085.9A Active CN115545188B (en) | 2022-10-24 | 2022-10-24 | Multi-task offline data sharing method and system based on uncertainty estimation |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115545188B (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106779072A (en) * | 2016-12-23 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of enhancing based on bootstrapping DQN learns deep search method |
| US20200090048A1 (en) * | 2017-05-19 | 2020-03-19 | Deepmind Technologies Limited | Multi-task neural network systems with task-specific policies and a shared policy |
| CN112824061A (en) * | 2019-11-20 | 2021-05-21 | 辉达公司 | Guiding uncertainty-awareness policy optimization: combining model-free and model-based strategies for efficient sample learning |
| US20220188623A1 (en) * | 2020-12-10 | 2022-06-16 | Palo Alto Research Center Incorporated | Explainable deep reinforcement learning using a factorized function |
| CN113449786A (en) * | 2021-06-22 | 2021-09-28 | 华东师范大学 | Reinforced learning confrontation defense method based on style migration |
Non-Patent Citations (3)
| Title |
|---|
| WANG, T. M. et al.: "Modular transfer learning with transition mismatch compensation for excessive disturbance rejection", International Journal of Machine Learning and Cybernetics, 9 September 2022 (2022-09-09), pages 295-311 * |
| PANG Wenyan et al.: "Optimal output regulation for partially linear discrete-time systems based on reinforcement learning", Acta Automatica Sinica, vol. 48, no. 9, 9 October 2022 (2022-10-09), pages 2242-2253 * |
| YANG Pei; TAN Qi; DING Yuehua: "A transfer learning model for nonlinear regression", Computer Science, no. 08, 15 August 2009 (2009-08-15), pages 218-220 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115972211A (en) * | 2023-02-06 | 2023-04-18 | 南京大学 | Control strategy offline training method based on model uncertainty and behavior prior |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115545188B (en) | 2024-06-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110119844B (en) | Robot motion decision method, system and device introducing emotion regulation and control mechanism | |
| US20210142200A1 (en) | Probabilistic decision making system and methods of use | |
| Wang et al. | VNE-TD: A virtual network embedding algorithm based on temporal-difference learning | |
| CN116702850A (en) | Method, system, article of manufacture, and apparatus for mapping workloads | |
| WO2020137114A1 (en) | Training device, estimation device, training method, estimation method, training program, and estimation program | |
| CN112465148B (en) | Network parameter updating method and device of multi-agent system and terminal equipment | |
| US20230222385A1 (en) | Evaluation method, evaluation apparatus, and non-transitory computer-readable recording medium storing evaluation program | |
| CN112016611B (en) | Training method, device and electronic device for generator network and strategy generation network | |
| WO2018143019A1 (en) | Information processing device, information processing method, and program recording medium | |
| Carvalho et al. | Predictive representations: Building blocks of intelligence | |
| CN117540203B (en) | A multi-directional course learning training method and device for cooperative navigation of swarm robots | |
| CN115859805A (en) | Adaptive Sequential Experimental Design Method and Device Based on Mixed Addition Criteria | |
| CN114781248A (en) | Offline reinforcement learning method and device based on state offset correction | |
| CN110909878A (en) | Training method and device of neural network model for estimating resource usage share | |
| CN116360257B (en) | A value-driven cross-domain strategy generalization method and system | |
| CN115545188B (en) | Multi-task offline data sharing method and system based on uncertainty estimation | |
| JP2013235512A (en) | Apparatus, program, and method for solving mathematical programming problem | |
| Alegre et al. | Multi-step generalized policy improvement by leveraging approximate models | |
| Wu et al. | Goal exploration augmentation via pre-trained skills for sparse-reward long-horizon goal-conditioned reinforcement learning | |
| JP7464115B2 (en) | Learning device, learning method, and learning program | |
| JP7421391B2 (en) | Learning methods and programs | |
| JP6726312B2 (en) | Simulation method, system, and program | |
| Li et al. | Reducing reinforcement learning to KWIK online regression | |
| Ammar et al. | Reinforcement learning transfer using a sparse coded inter-task mapping | |
| CN118504605B (en) | Autonomous decision-making method and device based on double-target guidance |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |