Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in FIG. 1, the current commodity recommendation flow mainly comprises: performing recall processing on candidate commodities in a candidate commodity library by adopting different recall algorithms (including but not limited to collaborative filtering, vectorization recall, category recall, label recall, new commodity recall, popularity recall and the like) to obtain a recalled commodity set; estimating the click rate and conversion rate of the recalled commodities with a click rate estimation model and a conversion rate estimation model, and sorting the recalled commodities according to the click rate and the conversion rate to obtain sorted recalled commodities; and further screening the sorted recalled commodities according to supplemental strategies such as push intervention rules, diversity strategies and repeated recommendation strategies to obtain a final recommended commodity list.
Currently, various sorting algorithms are provided in the related art to implement the sorting of recalled commodities. The first category comprises traditional methods such as collaborative filtering, LR (Logistic Regression) + GBDT (Gradient Boosting Decision Tree), and FM (Factorization Machines); the second category comprises deep-learning-based methods such as Wide & Deep and DeepFM; and the third category comprises deep reinforcement learning models such as Policy Gradient, DQN (Deep Q-Network), and Actor-Critic.
However, when such a sorting algorithm faces a new push requirement, the new requirement can affect the exposure ordering of commodities, so that the clicks and conversions of the recommended commodities fail to reach the expected effect, and the recommendation fairness and overall click rate may even decrease.
For recommendation scenarios that consider the click and conversion values of commodities, many recommendation methods distinguish values by static grouping, assigning a unified score to the commodities in each group once or periodically. However, this approach fails to consider the dynamic characteristics of commodity value: the value of a commodity may change in real time with the recommendation strategy and user feedback (such as exposure, clicks and conversions). For example, when the daily exposure and click conversion of one commodity already meet the push requirement, other unexposed commodities should be recommended instead, or commodities with higher click conversion value can be promoted, so as to achieve recommendation fairness.
In order to solve the above problems, the present disclosure improves the ranking layer in the related art. As shown in FIG. 2, MCTS (Monte Carlo Tree Search) is mainly used to explore the recalled commodities; strategies such as commodity value (the value estimation model in FIG. 2), repeated recommendation penalty (the cost estimation model in FIG. 2) and popularity penalty are merged into the exploration process; objectives such as click rate, conversion rate and user step size are comprehensively considered for recommendation; and the sorted recalled commodities are further screened according to diversity strategies to obtain a final recommended commodity list.
According to an embodiment of the present disclosure, the present disclosure provides a data processing method.
FIG. 3 is a flow chart of a data processing method according to an embodiment of the present disclosure, as shown in FIG. 3, including the steps of:
step S302, a recall data set is obtained.
The recall data set in the above step may be a set of recall data determined by a recall algorithm in the data recommendation process, where each piece of recall data may be candidate data in a candidate data set that is successfully matched with the user. The recall data set differs in different data recommendation scenarios; for example, for the commodity recommendation scenario shown in FIG. 2, the recall data set may be a recalled commodity set determined by the recall algorithm.
Step S304, constructing a search tree corresponding to the recall data set, wherein the search tree comprises a root node and a plurality of data nodes positioned at different levels, each data node is used for representing recall data in the recall data set, each data node is used for storing push value and search times of the corresponding recall data, and the push value is used for representing the value of a feedback result received after the corresponding recall data is pushed to a target object.
The search tree in the above step may be a tree constructed by simulating the push process of recall data within a limited time. The search tree may include a root node, representing a request to push data to a user, and a plurality of data nodes at different levels, each data node representing one piece of recall data. In the simulation process, nodes may be selected based on their push values and search counts, the estimated click rate and conversion rate may be used as the state transition probabilities when expanding nodes, and each piece of recall data may be pushed to the target object (that is, the user) in a simulated manner. The feedback result fed back by the target object for the pushed recall data may be clicking the pushed recall data, consulting, leaving, and the like, and may be determined according to the specific data push scenario.
In order to ensure that the click rate and conversion rate of the target data pushed to the target object achieve the expected effect, a value estimation model may be constructed to estimate the rewards corresponding to different feedback results. For the "leave" action of a user, the negative of the estimated reward may be used as a cost; for the "click" and "consult" actions of the user, the click and conversion value may be estimated and used as the reward. Meanwhile, to ensure recommendation fairness, a certain penalty may be given to recall data with more exposure, and a higher value may be given to some recall data with potential. It should be noted that a higher value does not directly expose the recall data, but only gives it more push opportunities in the short term; if the click rate/conversion rate of the recall data does not improve, the probability of selecting that recall data is reduced, and pushing of the recall data is stopped.
In addition, considering the influence of the repeated recommendation penalty strategy on the ranking, a cost estimation model may be constructed for recall data repeatedly pushed within a short time, and the cost of each repeated push may be estimated according to target object characteristics, background information, and the like. The background information here may refer to the time interval since the recall data was last presented, the historical presentation location, the page, the user's preference for or acceptance of repeatedly recommended data, and the like.
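As an aid to understanding, the following is a minimal sketch of what such value and cost estimation interfaces might look like. All function names, features and coefficients here are illustrative assumptions, not the disclosure's actual models (which are learned estimation models).

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    action: str       # "click", "consult" (conversion) or "leave"
    exposures: int    # how often the recall data has already been exposed
    potential: float  # assumed potential score for low-exposure data, in [0, 1]

def estimate_reward(fb: Feedback) -> float:
    """Hypothetical value estimation: "click"/"consult" yield positive value,
    "leave" yields a negative reward (treated as a cost); fairness is encouraged
    by penalizing heavily exposed data and boosting promising data."""
    base = {"click": 3.0, "consult": 100.0, "leave": -1.0}[fb.action]
    fairness = -0.01 * max(0, fb.exposures - 100)  # exposure penalty (assumed form)
    fairness += 5.0 * fb.potential                 # potential boost (assumed form)
    return base + fairness

def estimate_repeat_cost(interval_s: float, position: int, acceptance: float) -> float:
    """Hypothetical cost estimation for a repeated push: shorter repeat intervals,
    more prominent positions and lower user acceptance of repetition cost more."""
    return (1.0 / (1.0 + interval_s / 3600.0)) * (1.0 - acceptance) * 10.0 / (1 + position)
```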
In some alternative embodiments, an MDP (Markov Decision Process) tree with an MCTS structure may be employed. In each simulation, the evaluated state values are stored on the nodes of the tree and accumulated through loop iterations of the four steps of selection, expansion, simulation and back-propagation, where each data node of the tree stores two state values: the push value V(s) and the number of searches. The push value here is determined based on both the reward and the cost.
The parameters used in the representation of each node, the state transition process and the actual operation of the MDP tree are as follows:
The state s_t can be represented by the user preference, the user request and the system state;
the Action a can be the system selecting one piece of recall data from the candidate recommendation list to recommend to the user;
the Transition P_a(s'|s) can be the state transition probability, where the successor state is obtained through user feedback. The Transition probability is generally equivalent to the probability of the corresponding user behavior, and is estimated through a probability network (comprising a click rate estimation model and a conversion rate estimation model);
the Reward r(s_t, a, s_{t+1}) can be an index measuring recommendation accuracy and exposure/conversion satisfaction after the user takes an action. The index is given by the value estimation model, which combines the exposure, click and conversion conditions of the recall data and considers recommendation fairness as well as the value brought by clicks and conversions;
the Cost c(s_t, a, s_{t+1}) can be the cost incurred when the user takes a certain action, such as a repeated recommendation within a short period; the cost can be estimated according to the time interval between repeated occurrences, the page type, the position, and the user's acceptance of repeated recommendations;
the discount rate γ is used to measure the contribution of long-term rewards to the current value; it is generally considered that the greater the search depth, the higher the uncertainty of the behavior, and the lower its contribution to the current value.
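A minimal sketch of how such a node might be represented follows, assuming each node stores the push value V(s) and the search count N(s) together with the quantities listed above; the field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    state: str = "root"           # s_t: e.g. a feedback state such as "click"
    action: Optional[str] = None  # a: the recall data pushed to reach this node
    prior: float = 1.0            # P_a(s'|s) from the CTR/CVR probability network
    reward: float = 0.0           # r(s_t, a, s_{t+1}) from the value estimation model
    push_value: float = 0.0       # V(s): first stored state value
    visits: int = 0               # N(s): second stored state value (search count)
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)

GAMMA = 0.9  # discount rate gamma (the example value used later in this description)
```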
Step S306, determining target data in the recall data set based on the push value of each recall data in the recall data set, wherein the target data is data pushed to the target object.
In this embodiment, the push value of each piece of recall data reflects its click and conversion value. In order to maximize the overall value, all recall data may be sorted according to their push values, a plurality of pieces of recall data ranked highest may be selected as the target data, and the target data may then be pushed to the target user according to a certain push rule, where the push rule may be set according to different application scenarios and push requirements, which is not specifically limited in the present disclosure.
Through the above scheme, after the recall data set is obtained, a search tree corresponding to the recall data set can be constructed, and the target data pushed to the target object is then determined based on the push value of each piece of recall data in the search tree, thereby achieving the purpose of sorting the recall data. It is easy to note that the target data pushed to the target object is determined based on the push value of each piece of recall data, and the push value characterizes the value of the feedback result received after the corresponding recall data is pushed to the target object, so the push value can embody the click and conversion effect of the recall data, thereby ensuring that the clicks and conversions of the target data pushed to the target object achieve the expected effect. In addition, because the push value of the recall data is determined during the construction of the search tree, the push values of different recall data can be adjusted according to actual push requirements, and the sorting result of all recall data adjusted accordingly, so that the sorting result is not disturbed by a new push intervention rule and the push requirements on the target data pushed to the target object are satisfied. This solves the technical problem in the related art that the sorting algorithm is easily interfered with by different push requirements, so that the clicks and conversions of the recommended commodities cannot reach the expected effect.
Constructing the search tree corresponding to the recall data set comprises: step A, determining a target node to be explored, wherein the target node is the root node or a node representing recall data in the recall data set; step B, expanding new child nodes below the target node, and determining the rewards of the new child nodes by using the value estimation model; step C, simulating a target child node among the new child nodes, and determining the push value of the last child node when the simulation ends; step D, reversely iterating over the search tree based on the push value of the last child node, updating the push values of the target node and of each layer of child nodes below the target node, and updating the search count of the target node; and repeatedly executing steps A to D until the exploration time of the search tree reaches a preset exploration time or the exploration depth reaches a preset exploration depth.
The preset exploration time in the above steps may be a simulated construction time of the search tree, and the preset exploration depth may be a simulated exploration depth of the search tree; both may be set according to actual needs and are not specifically limited in the present disclosure.
In the embodiment of the disclosure, the search tree is explored starting from the root node; after the root node has been explored, other data nodes can be explored in turn. Therefore, a selection step (i.e., step A) is first executed to determine the target node to be explored this time. An expansion step (i.e., step B) is then executed to expand the target node through an action and to expand all feedback results generated by that action, where a new child node is created for each new feedback result, and the value estimation model can be used to evaluate each feedback result to obtain the reward of the corresponding child node. A simulation step (i.e., step C) is further executed to randomly select one child node for simulation until the simulation ends (the simulation end condition can be set as needed, for example, a set simulation time or a set simulation depth is reached), at which point the push value of the simulation termination state (incorporating the reward, the cost and the like) can be given. Finally, a back-propagation step (i.e., step D) can be executed to update the push value of the last child node and to update the push value of the target node by back-propagation.
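The loop can be summarized in code roughly as follows. This is a schematic sketch only; the select, expand and simulate helpers are assumed to behave as sketched in the subsections below, and backpropagate (step D) is likewise an assumed helper.

```python
import random

def mcts_search(root: TreeNode, n_iterations: int, max_depth: int) -> None:
    """Schematic of steps A-D, repeated until the exploration budget is spent."""
    for _ in range(n_iterations):                 # or: until preset exploration time elapses
        node = select(root)                       # step A: choose the target node
        expand(node)                              # step B: add children per feedback result,
                                                  #         scored by the value estimation model
        child = random.choice(node.children)      # step C: randomly pick a child and simulate
        leaf_value = simulate(child, max_depth)   #         until preset time/depth is reached
        backpropagate(child, leaf_value)          # step D: update push values and search counts
```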
In some alternative embodiments, for the MDP tree shown in FIG. 4, in the first loop iteration the root node may be selected as the target node, and an action may then be selected (assume that pushing Item 1 is selected), as shown by the solid circle in FIG. 4. The child nodes corresponding to the feedback results after pushing Item 1 (including clicking, leaving and converting) may then be expanded according to the state transition probabilities, corresponding to the three ellipses shown in the upper left part of FIG. 4, and the value estimation model may be used to evaluate the reward corresponding to each child node; assume the rewards corresponding to the three child nodes are 3, 0 and 100, respectively. One child node is then randomly selected for simulation; assume the child node corresponding to clicking is selected, as shown by the solid ellipse in FIG. 4. At that node, the next action is again selected according to the state transition probabilities (assume that pushing Item k is selected), the child nodes after pushing Item k (including clicking, leaving and converting) are expanded, and the value estimation model is used to evaluate the reward corresponding to each of them. The simulation continues in this way until the preset exploration depth is reached, at which point the push value of the last child node can be estimated by the value estimation model, and the push values along the explored path are finally updated by back-propagation.
It should be noted that the push value of a child node of the previous layer may be obtained by taking the maximum of the total values of the child nodes of the next layer (i.e., the sum of the reward and the discounted push value, weighted by the state transition probability), but is not limited thereto.
Through the above scheme, the steps of selection, expansion, simulation and back-propagation are executed cyclically, so that the push value of each piece of recall data is continuously updated across the loop iterations. This ensures that the push value meets the push requirement and truly reflects the click and conversion value of the recall data, improving the accuracy of the push value and thereby the accuracy of data pushing.
Optionally, determining the target node to be explored comprises: traversing from the root node to determine whether an unexpanded node exists; in response to an unexpanded node existing, determining the unexpanded node as the target node; and in response to no unexpanded node existing, determining the target node based on the push value and search count of each data node.
The unexpanded node in the above embodiment may mean that the node has an action that is not simulated, or that the node has a child node that is not simulated.
In the embodiment of the disclosure, the search tree is constructed by starting to expand the next node only after the expansion of one node is finished. Therefore, at the start of each iteration, it is first determined whether an unexpanded node exists. If so, that node can be used directly as the target node, and the subsequent expansion, simulation and back-propagation steps are executed. If not, the target node can be selected through UCT (Upper Confidence Bound applied to Trees), whose calculation formula is as follows:

UCT(s, a) = Q(s, a) + c · √(ln N(s) / N(s, a))

where Q(s, a) is the push value of taking action a in state s, N(s) is the number of searches of state s, N(s, a) is the number of times action a has been performed in state s, and c is an exploration constant. Here the Q(s, a) term encourages "utilization" (exploitation), and the c · √(ln N(s) / N(s, a)) term encourages "exploration".
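A sketch of how this selection rule might be implemented on the node structure above follows; the exploration constant c and the treatment of unvisited children are assumptions.

```python
import math

def uct_score(parent: "TreeNode", child: "TreeNode", c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")  # unvisited actions are tried first
    exploitation = child.push_value                                      # Q(s, a): "utilization"
    exploration = c * math.sqrt(math.log(parent.visits) / child.visits)  # "exploration" term
    return exploitation + exploration

def select(root: "TreeNode") -> "TreeNode":
    """Traverse from the root, following the highest-UCT child at each level,
    and stop at the first node that has no expanded children."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: uct_score(node, ch))
    return node
```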
In some alternative embodiments, for the MDP tree shown in FIG. 4: in the first loop iteration, the root node may be directly determined as the target node, and its push value and search count are both 0. In the second loop iteration, because the root node is not yet fully expanded, the root node may still be determined as the target node; at this point its push value and search count are no longer 0, having been updated in the previous loop iteration. In the third loop iteration, the root node has been fully expanded, so traversal continues past it and the node corresponding to Item 1 is determined as the target node.
According to the above scheme, a judgment result is obtained by determining whether an unexpanded node exists, and the target node is determined in different ways for different judgment results, thereby improving the efficiency and accuracy of determining the target node.
Optionally, expanding the new child node below the target node includes obtaining a state transition probability corresponding to the target node, determining a target execution operation corresponding to the target node based on the state transition probability, and creating the new child node below the target node based on the target execution operation.
The state transition probability in the above steps may be a probability determined based on the click rate estimated by the click rate estimation model and the conversion rate estimated by the conversion rate estimation model, and is used for judging the operations the user may perform on the recall data; the target execution operation may be the execution operation corresponding to the maximum probability among the state transition probabilities.
In some alternative embodiments, for the MDP tree shown in FIG. 4: after the target node is determined in the first loop iteration, an unexpanded action, i.e., pushing Item 1, may be selected; the target execution operations that the user may perform, namely clicking, leaving and consulting, are then determined according to the state transition probabilities, and three child nodes are created at the next level of the target node, each with a push value and search count of 0. In the second loop iteration, after the target node is determined, another unexpanded action, i.e., pushing Item 2, may be selected; the target execution operations that the user may perform, namely clicking, leaving and consulting, are again determined according to the state transition probabilities, and three child nodes are created at the next level of the target node, each with a push value and search count of 0.
Through the above scheme, child nodes are expanded according to the state transition probabilities, ensuring that the expanded child nodes truly reflect the click and conversion value of the recall data, improving the accuracy of the push value and thereby the accuracy of data pushing.
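A sketch of the expansion step under these assumptions follows; the transition_probabilities stand-in and the example rewards mirror the FIG. 4 walkthrough rather than real model outputs.

```python
def transition_probabilities(item: str) -> dict:
    """Hypothetical stand-in for the probability network: the click rate comes
    from the CTR model, the conversion rate from the CVR model, and "leave"
    absorbs the remaining probability mass."""
    ctr, cvr = 0.10, 0.01  # assumed model outputs for this item
    return {"click": ctr, "consult": cvr, "leave": 1.0 - ctr - cvr}

EXAMPLE_REWARDS = {"click": 3.0, "leave": 0.0, "consult": 100.0}  # FIG. 4 example values

def expand(node: "TreeNode", item: str = "Item 1") -> None:
    """Create one child per feedback result, storing its transition probability
    and the reward given by the value estimation model; push value and search
    count both start at 0."""
    for feedback, p in transition_probabilities(item).items():
        child = TreeNode(state=feedback, action=item, prior=p,
                         reward=EXAMPLE_REWARDS[feedback], parent=node)
        node.children.append(child)
```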
Optionally, simulating the target child node among the new child nodes and determining the push value corresponding to the last child node when the simulation ends comprises: determining the target child node based on the probabilities corresponding to the new child nodes; simulating the target child node; determining that the simulation ends when the simulation time reaches a preset simulation time or the simulation depth reaches a preset simulation depth; and determining the push value of the last child node.
The preset simulation time in the above steps may be a preset duration for which the child node performs simulation, and may be set according to actual needs, which is not specifically limited in the present disclosure. The preset simulation depth in the above steps may be a preset depth to which the child node performs simulation, and may be determined according to user habits or the average step size. In most data push scenarios, the user step size tends to be small; as shown in FIG. 5, user step sizes are mostly concentrated below 5 steps. Thus, the exploration depth may be determined by adding a small increment to the user's historical step size. It should be noted that, for a new user, the historical step size may be estimated from the average step size of all users of the site, or the average historical step size of similar users may be used directly as the user's historical step size.
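For instance, the depth choice could be sketched as below; the increment of 1 and the fallback order for new users are assumptions.

```python
from typing import Optional

def exploration_depth(history_steps: Optional[int],
                      similar_user_avg: Optional[float] = None,
                      site_avg: float = 5.0) -> int:
    """Preset simulation depth = the user's historical step size plus a small
    increase; new users fall back to similar users' average, then the site average."""
    if history_steps is None:
        fallback = similar_user_avg if similar_user_avg is not None else site_avg
        history_steps = round(fallback)
    return history_steps + 1  # assumed small increment
```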
In the embodiment of the disclosure, after the child nodes are expanded, one child node can be selected and simulated according to the MDP until the set simulation time or simulation depth is reached. The reward of the last child node can then be estimated by the value estimation model, and the push value of that child node updated based on the estimated reward. If the recall data is repeatedly pushed, the cost of the last node can additionally be estimated by the cost estimation model, and the push value of the child node updated based on both the estimated reward and the cost. For child nodes at different levels, discount rates corresponding to the different levels can be set, with a heavier discount applied the deeper the level, so that the corresponding push value is updated as the product of the reward and the discount rate; for example, the discount rate may be 0.9, but is not limited to this value.
In some alternative embodiments, for the MDP tree shown in FIG. 4: after the three child nodes are created in the first loop iteration, one child node (i.e., the child node corresponding to clicking) may be selected for simulation; Item k may then be selected according to the state transition probabilities, three child nodes of the next level created, and the child node corresponding to clicking selected. At this point it is determined that the exploration depth has been reached, so the push value of the child node corresponding to clicking can be determined: assuming the estimated reward of that child node is 10 and the simulation depth is 3, the push value of the node is updated as V(y) = γ³ × reward = 0.9³ × 10 ≈ 7, where γ represents the discount rate, which discounts more heavily the deeper the search depth. In the second loop iteration, after three child nodes are created, one child node (i.e., the child node corresponding to clicking) may likewise be selected for simulation; Item k' may then be selected according to the state transition probabilities, three child nodes of the next level created, and the child node corresponding to clicking selected. When the exploration depth is reached, the push value of that last child node is estimated in the same manner.
According to the above scheme, the target child node is determined based on the probabilities corresponding to the new child nodes and simulated, whether the simulation has ended is determined by the preset simulation time or preset simulation depth, and the push value of the last child node is then determined, thereby achieving the effect of adjusting the push value in real time and improving the push accuracy.
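A sketch of the simulation step follows, assuming the rollout follows the transition probabilities and the last child's estimated reward is discounted by γ raised to the depth, matching the example above.

```python
import random

def simulate(node: "TreeNode", max_depth: int, gamma: float = 0.9) -> float:
    """Roll forward from the chosen child until the preset depth is reached,
    then return the discounted estimate of the last child node's reward as its
    push value."""
    current, depth = node, 0
    while depth < max_depth:
        probs = transition_probabilities("Item k")  # assumed model call
        feedback = random.choices(list(probs), weights=list(probs.values()))[0]
        current = TreeNode(state=feedback, parent=current)
        depth += 1
    terminal_reward = 10.0                                   # value model estimate (example)
    current.push_value = (gamma ** depth) * terminal_reward  # 0.9**3 * 10, given as ≈7 above
    return current.push_value
```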
Optionally, before determining the push value of the last child node, it is determined whether the associated data corresponding to the last child node is repeated data that has been repeatedly pushed within a target time period; in response to the associated data being repeated data, the last child node is processed by using the cost estimation model and the value estimation model to obtain the push value of the last child node; and in response to the associated data not being repeated data, the last child node is processed by using the value estimation model to obtain the push value of the last child node.
The target period of time in the above steps may be a preset short period of time, for example, may be the entire construction process of the search tree, or may be a history period of time before the construction of the search tree, but is not limited thereto.
In the embodiment of the disclosure, for recall data repeatedly pushed in a short time, in order to avoid bad experience brought to users by repeated pushing, whether recall data corresponding to the last child node is repeated data can be determined first, if so, the push value is determined by combining the cost estimation model and the estimation result of the value estimation model, and if not, the push value is determined only by the estimation result of the value estimation model.
Through the above scheme, different determination flows are provided for repeated data and non-repeated data, thereby improving the accuracy with which the push value is determined and, in turn, the accuracy of data pushing.
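Sketched in code, the branch might look as follows, reusing the hypothetical estimators from the earlier sketch; the feature values passed in are placeholders.

```python
def last_node_push_value(node: "TreeNode", is_repeated: bool,
                         gamma: float = 0.9, depth: int = 3) -> float:
    """Repeated data: combine the value and cost estimation models.
    Non-repeated data: use the value estimation model only."""
    reward = estimate_reward(Feedback(node.state, exposures=0, potential=0.0))
    value = (gamma ** depth) * reward
    if is_repeated:  # repeatedly pushed within the target time period
        value -= estimate_repeat_cost(interval_s=600.0, position=1, acceptance=0.5)
    return value
```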
Optionally, reversely iterating over the search tree based on the push value of the last child node and updating the push value of each layer of child nodes below the target node comprises: step a, obtaining the total value of the expanded node based on the push value, reward and state transition probability of at least one child node located at the current layer; step b, updating the push value of the parent node located at the upper layer based on the maximum of the total values of the expanded nodes; and repeatedly executing steps a to b until the update of the push value of the root node is completed.
The update may consist of replacing the current push value of the parent node with the maximum of the total values of the expanded nodes, or of superimposing the maximum of the total values of the expanded nodes onto the current push value of the parent node. In the embodiment of the present disclosure, the case in which the current push value of the parent node is replaced with the maximum of the total values of the expanded nodes is described as an example.
In the back-propagation process, the push value of each child node is discounted, so the product of the push value and the discount rate is obtained first, then the reward of the child node is added, and finally the result is multiplied by the state transition probability of the child node to obtain the final total value. In some alternative embodiments, for the MDP tree shown in FIG. 4, in the first loop iteration the push value of node t may be updated as V(t) = max(P(click|t) × [r(y) + γV(y)] + P(leave|t) × [r(y') + γV(y')] + P(convert|t) × [r(y'') + γV(y'')]) = max(0.1 × (0 + 0.9 × 7) + 0.79 × (0 + 0) + 0.01 × (0 + 0)) = 0.63, and the push value of node s may then be updated as V(s) = max_{a∈{1,2,…,k}}(P_a(t|s) × [r(t, a, s') + γV(s')]) = max(0.1 × (3 + 0.9 × 0.63) + 0.89 × (0 + 0) + 0.01 × (100 + 0)) = 1.356 (action: Item 1). At this point the first loop iteration ends, the search count N of the root node is updated to 1, and its push value is updated to 1.356. In the second loop iteration, the push value of node m may be updated as V(m) = max(P(click|n) × [r(n) + γV(n)] + P(leave|n) × [r(n') + γV(n')] + P(convert|n) × [r(n'') + γV(n'')]) = max(0.1 × (0 + 0.9 × 70) + 0.78 × (0 + 0) + 0.02 × (0 + 0)) = 6.3, and the push value of node s may then be updated as V(s) = max_{a∈{1,2,…,k}}(P_a(t|s) × [r(t, a, s') + γV(s')]) = max(0.1 × (3 + 0.9 × 0.63) + 0.89 × (0 + 0) + 0.01 × (100 + 0) (action: Item 1), 0.15 × (5 + 0.9 × 6.3) + 0.8 × (0 + 0) + 0.05 × (80 + 0) (action: Item 2)) = max(1.356, 4.8505) = 4.8505. At this point the second loop iteration ends, the search count N of the root node is updated to 2, and its push value is updated to 4.8505.
It should be noted that, in the above case, the ordering Item 2 > Item 1 may be selected as the final result.
Through the above scheme, the total value is determined from the push value, the reward and the state transition probability, and the push value of each node is then updated through back-propagation, so that the push value of each node is accurately determined, improving the accuracy of determining the target data and thereby the push accuracy.
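The arithmetic of the first iteration above can be checked with a short sketch; the probabilities and rewards are the example values from FIG. 4, and the max over actions matches step b.

```python
GAMMA = 0.9

def total_value(p: float, reward: float, child_v: float) -> float:
    """P × [r + γV]: one child's contribution to its action's total value."""
    return p * (reward + GAMMA * child_v)

# V(t): action "Item k" with children (click, leave, consult); V(y) = 7 for click
v_t = (total_value(0.10, 0.0, 7.0)
       + total_value(0.79, 0.0, 0.0)
       + total_value(0.01, 0.0, 0.0))
print(round(v_t, 2))  # 0.63

# V(s): max over expanded actions; only Item 1 is expanded in the first iteration
item1 = (total_value(0.10, 3.0, v_t)      # click child
         + total_value(0.89, 0.0, 0.0)    # leave child
         + total_value(0.01, 100.0, 0.0)) # consult child
v_s = max([item1])
print(round(v_s, 3))  # ≈ 1.357 (given as 1.356 in the example above)
```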
A preferred embodiment of the present disclosure will be described in detail below with reference to FIGS. 4 and 6, taking a commodity recommendation scenario as an example. First, a user request is received, where the user request may differ across scenarios: in a search scenario, the user request can be the search text entered by the user; in a list page recommendation scenario, the user request can be the search text entered by the user; and in a detail page recommendation scenario, the user request can be the behavior information of the user clicking into the current commodity detail page. In addition, the search results and the user's history of browsing and clicking actions may be used as background information.
The specific flow of the scheme is as follows. As shown in FIG. 6, the user can enter the information to be queried in the search box; suppose the user searches for "popular list" in the search box, so that "popular list" is determined as the user request. After the request is received, the recommendation result decision process can be started: the recall data can be sorted in combination with the MCTS tree, and recommendation results, such as "recommendation result 1", "recommendation result 2" and the like, are given below the search results, as shown by the solid circles in FIG. 4. Then, by way of simulation, it is assumed that if Item 1 is recommended (e.g., as recommendation result 1), the user may click the item to enter its detail page, may directly click the "consult" button to convert by leaving contact information, or may browse and leave. Suppose the user clicks an item in the recommendation results and enters the detail page, at which point the detail page again displays recommendation results. Further, assuming the user clicks data such as "allied diary", "content property", "authentication rating" and "recommendation", a series of candidates (Item a to Item k) can be given by the search; observing these commodities as list page recommendations, and assuming Item 1 is clicked, it can be judged whether sufficient subsequent value will be brought.
In the technical scheme of the present disclosure, the acquisition, storage, application and the like of the user personal information involved all conform to the provisions of relevant laws and regulations and do not violate public order and good customs.
According to an embodiment of the present disclosure, the present disclosure provides a data processing apparatus, which is used to implement the foregoing embodiments and preferred implementations; what has already been described will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 7 is a schematic diagram of a data processing apparatus according to the present disclosure, as shown in FIG. 7, including an acquisition module 72 for acquiring a recall data set, a construction module 74 for constructing a search tree corresponding to the recall data set, wherein the search tree includes a root node and a plurality of data nodes at different levels, each data node for characterizing recall data in the recall data set, each data node for storing a push value and a search number of the corresponding recall data, the push value for characterizing a value of a feedback result received after the corresponding recall data has been pushed to a target object, and a decision module 76 for determining target data in the recall data set based on the push value of each recall data in the recall data set, wherein the target data is data pushed to the target object.
The construction module comprises a first determining unit, an expansion unit, a simulation unit, a second determining unit and an execution unit, wherein the first determining unit is used for determining the target node to be explored, the target node being the root node or a node representing recall data in the recall data set; the expansion unit is used for expanding new child nodes below the target node and determining the rewards of the new child nodes by using the value estimation model; the simulation unit is used for simulating the target child node among the new child nodes and determining the push value of the last child node when the simulation ends; the second determining unit is used for reversely iterating over the search tree based on the push value of the last child node and updating the push values of each layer of child nodes below the target node; and the execution unit is used for repeatedly executing the functions of the first determining unit, the expansion unit, the simulation unit and the second determining unit until the exploration time of the search tree reaches the preset exploration time or the exploration depth reaches the preset exploration depth.
Optionally, the first determining unit comprises a traversing subunit, a first node determining subunit and a second node determining subunit, wherein the traversing subunit is used for traversing from a root node to determine whether unexpanded nodes exist, the first node determining subunit is used for determining that the unexpanded nodes are target nodes in response to the existence of the unexpanded nodes, and the second node determining subunit is used for determining the target nodes based on the push value and the searching times of each data node in response to the absence of the unexpanded nodes.
Optionally, the expansion unit comprises a probability acquisition subunit, an operation determination subunit and a creation subunit, wherein the probability acquisition subunit is used for acquiring the state transition probability corresponding to the target node, the operation determination subunit is used for determining the target execution operation corresponding to the target node based on the state transition probability, and the creation subunit is used for creating a new child node below the target node based on the target execution operation.
Optionally, the simulation unit comprises a probability determination subunit, a simulation subunit, a simulation determination subunit and a value determination subunit, wherein the probability determination subunit is used for determining the target child node based on the probabilities corresponding to the new child nodes; the simulation subunit is used for simulating the target child node; the simulation determination subunit is used for determining that the simulation ends when the simulation time reaches the preset simulation time or the simulation depth reaches the preset simulation depth; and the value determination subunit is used for determining the push value corresponding to the last child node.
Optionally, the simulation unit further comprises a data determination subunit, a first processing subunit and a second processing subunit, wherein the data determination subunit is used for determining whether the associated data corresponding to the last child node is repeated data that has been repeatedly pushed within the target time period; the first processing subunit is used for, in response to the associated data being repeated data, processing the last child node by using the cost estimation model and the value estimation model to obtain the push value of the last child node; and the second processing subunit is used for, in response to the associated data not being repeated data, processing the last child node by using the value estimation model to obtain the push value of the last child node.
Optionally, the second determining unit comprises a value acquisition subunit, a value updating subunit and an execution subunit, wherein the value acquisition subunit is used for obtaining the total value of the expanded node based on the push value, reward and state transition probability of at least one child node located at the current layer; the value updating subunit is used for updating the push value of the parent node located at the upper layer based on the maximum of the total values of the expanded nodes, the parent node being a node having an association relationship with the at least one child node; and the execution subunit is used for repeatedly executing the functions of the value acquisition subunit and the value updating subunit until the update of the push value of the root node is completed.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the device 800 are connected to the I/O interface 805, including an input unit 806, such as a keyboard, a mouse, etc., an output unit 807, such as various types of displays, speakers, etc., a storage unit 808, such as a magnetic disk, optical disk, etc., and a communication unit 809, such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.