CN112001748A

CN112001748A - Data expansion method and equipment based on label propagation

Info

Publication number: CN112001748A
Application number: CN202010819988.XA
Authority: CN
Inventors: 刘楠
Original assignee: Guangzhou Zhangtao Network Technology Co ltd
Current assignee: Guangzhou Zhangtao Network Technology Co ltd
Priority date: 2020-08-14
Filing date: 2020-08-14
Publication date: 2020-11-27

Abstract

The application provides a data expansion scheme based on label propagation. In this scheme, a first data set comprising labeled user data and unlabeled user data is obtained; a second data set, in which all user data carry labels, is generated from the first data set by using a label propagation algorithm; a Lookalike model is trained on the second data set; and target users are found through expansion with the Lookalike model. Compared with the prior art, the scheme effectively expands the seed user data through graph computation and improves the expansion effect of the Lookalike model, so that target users can be found more accurately, the user experience is improved, and the cost of purchasing data is reduced.

Description

Data expansion method and equipment based on label propagation

Technical Field

The application relates to the technical field of information, in particular to a data expansion technology based on label propagation.

Background

In recent years, advances in information technology have driven the development of many industries. The advertising and marketing industry needs to help brands operate and promote their products, but advertising budgets are often wasted because of untargeted, broad placement. Expanding the audience with a Lookalike approach can save advertising spend and enable precise marketing. The Lookalike user expansion model requires seed users, that is, user data with labels. By analyzing the data features of the seed users, similar users can be found in massive data sets through those features, thereby achieving audience expansion.

However, Lookalike models in prior art schemes rely solely on seed user data for similar-audience expansion. Cold-start projects often have very little labeled data, and labeled data is expensive and difficult to obtain. With limited labeled data, the model has too few training samples and cannot learn rich features. As a result, existing Lookalike audience expansion models play only a limited role for new products, new brands, and new projects.

Disclosure of Invention

An object of the application is to provide a data expansion method and device based on label propagation, which effectively combine a label propagation algorithm with Lookalike similar-audience expansion, thereby improving the effect of data expansion.

According to an aspect of the present application, a data expansion method based on tag propagation is provided, wherein the method includes:

acquiring a first data set, wherein the first data set comprises labeled user data and unlabeled user data;

generating a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled;

and training a Lookalike model based on the second data set, and finding target users through expansion with the Lookalike model.

According to another aspect of the present application, there is also provided a data expansion apparatus based on tag propagation, wherein the apparatus includes:

an input module, configured to acquire a first data set, wherein the first data set comprises labeled user data and unlabeled user data;

a label propagation module, configured to generate a second data set according to the first data set by using a label propagation algorithm, where all user data of the second data set are labeled;

and an expansion module, configured to train a Lookalike model based on the second data set and to find target users through expansion with the Lookalike model.

According to yet another aspect of the present application, there is also provided a computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the tag propagation based data expansion method.

According to yet another aspect of the present application, there is also provided a computer readable medium having stored thereon computer program instructions executable by a processor to implement the tag propagation-based data expansion method.

In the scheme provided by the application, a first data set comprising labeled user data and unlabeled user data is first obtained; a second data set, in which all user data carry labels, is then generated from the first data set by using a label propagation algorithm; a Lookalike model is trained on the second data set; and target users are found through expansion with the Lookalike model. Compared with the prior art, the scheme effectively expands the seed user data through graph computation and improves the expansion effect of the Lookalike model, so that target users can be found more accurately, the user experience is improved, and the cost of purchasing data is reduced.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a flow chart of a data expansion method based on tag propagation according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a fully connected diagram according to an embodiment of the present application;

FIG. 3 is a diagram illustrating results of an iterative execution using a tag propagation algorithm according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a data expansion device based on tag propagation according to an embodiment of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The embodiment of the application provides a data expansion method based on label propagation. The method effectively expands the seed user data through graph computation and improves the expansion effect of the Lookalike model, so that target users can be found more accurately, the cost of purchasing data is reduced, accurate recommendations can be made to the target users, and the user experience is improved.

In a practical scenario, the device performing the method may be user equipment, a network device, or a device formed by integrating user equipment and a network device through a network. The user equipment includes, but is not limited to, terminal devices such as smartphones, tablet computers, and personal computers (PCs). The network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud-computing-based set of computers. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a type of distributed computing in which a collection of loosely coupled computers forms one virtual computer.

FIG. 1 is a flowchart of a data expansion method based on tag propagation according to an embodiment of the present application, where the method includes step S101, step S102, and step S103.

Step S101, a first data set is obtained, wherein the first data set comprises labeled user data and unlabeled user data.

For example, consider a sales scenario for maternal and infant products. Suppose a small amount of data on paying users of maternal and infant products (i.e., the labeled user data) is available, and the merchant needs help finding more users who are similar to the known paying users and are likely to purchase the products. First, in step S101, the first data set is input. In this scenario, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (i.e., the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (i.e., the unlabeled user data). Typically $m \ll n$; that is, the amount of labeled data is much smaller than the amount of unlabeled data.

Step S102, generating a second data set according to the first data set by using a label propagation algorithm, wherein all user data of the second data set are provided with labels.

The Label Propagation Algorithm is one of the classic algorithms for non-overlapping community detection. The basic idea is to take the label that occurs most often among a node's neighbors as the label of that node. Each node carries a label representing the community to which it belongs, and as labels propagate, nodes sharing the same label form a "community" structure. The propagation process is as follows: (1) initially, assign each node a unique label; (2) each node updates its own label to the label that occurs most often among its neighbors; (3) repeat the previous step until no node's label changes.

In the iterative process, node labels can be updated synchronously or asynchronously. In synchronous updating, the label of node z in iteration t is based on the labels its neighbors held in iteration t-1. In asynchronous updating, the label of node z in iteration t is based on the iteration-t labels of the neighbors that have already been updated in iteration t and the iteration-(t-1) labels of the neighbors that have not yet been updated.
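The three steps and the two update modes can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the application: the function names, the adjacency-list input, the `seed` and `max_iter` parameters, and the tie-breaking rule (a node keeps its current label when that label is already among the most frequent ones, otherwise it picks one of them at random) are all assumptions. Every node is assumed to have at least one neighbor.

```python
import random
from collections import Counter


def most_frequent_labels(node, graph, labels):
    """Return the set of labels that occur most often among a node's neighbors."""
    counts = Counter(labels[nb] for nb in graph[node])
    top = max(counts.values())
    return {lab for lab, c in counts.items() if c == top}


def label_propagation(graph, asynchronous=True, max_iter=100, seed=0):
    """graph maps each node to a list of its neighbors (undirected)."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}   # (1) every node starts with a unique label
    order = list(graph)
    for _ in range(max_iter):
        rng.shuffle(order)
        # Synchronous: read from a frozen snapshot of the previous iteration.
        # Asynchronous: read the live dict, so updates made earlier in this
        # sweep are already visible to nodes processed later.
        source = labels if asynchronous else dict(labels)
        changed = False
        for node in order:                    # (2) adopt the most frequent neighbor label
            best = most_frequent_labels(node, graph, source)
            if labels[node] not in best:      # assumed tie-breaking rule
                labels[node] = rng.choice(sorted(best))
                changed = True
        if not changed:                       # (3) stop when no label changes
            break
    return labels


# Two disconnected triangles: each one should end up as its own community.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
communities = label_propagation(graph, asynchronous=True)
```

In this toy graph the nodes of each triangle end up sharing one label, and the two triangles keep different labels. Which label value wins depends on the random visiting order.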

In some embodiments, the step S102 includes: determining, by using a label propagation algorithm, labels for the unlabeled user data in the first data set according to the labeled user data in the first data set, and generating the second data set.

For example, in the sales scenario for maternal and infant products, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (i.e., the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (i.e., the unlabeled user data). In step S102, labels are output for all of the data in $Y_n$; the data sets $Y_m$ and $Y_n$ can then be merged to generate the second data set.
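One possible in-memory representation of the two data sets is sketched below. The `UserRecord` type, its field names, the convention that `None` marks unlabeled data, and the `build_second_data_set` helper are hypothetical and chosen only for illustration; the propagated labels are supplied by hand here rather than computed.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class UserRecord:
    user_id: str
    features: Sequence[float]
    label: Optional[int] = None   # None marks unlabeled user data


def build_second_data_set(first_data_set: List[UserRecord],
                          propagated: Dict[str, int]) -> List[UserRecord]:
    """Merge Y_m and Y_n: keep existing labels, fill in propagated ones."""
    second = []
    for record in first_data_set:
        if record.label is None:
            record = replace(record, label=propagated[record.user_id])
        second.append(record)
    return second


# First data set: one labeled paying user (label 1) and two unlabeled users.
first_data_set = [
    UserRecord("u1", (0.9, 0.8), 1),
    UserRecord("u2", (0.85, 0.7)),
    UserRecord("u3", (0.1, 0.2)),
]
# Labels that a label propagation step might output for the unlabeled users.
propagated_labels = {"u2": 1, "u3": 0}
second_data_set = build_second_data_set(first_data_set, propagated_labels)
```

The first data set is left unchanged, and every record in `second_data_set` carries a label.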

In some embodiments, the step S102 includes: establishing a fully connected graph from the labeled and unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node in the fully connected graph; and letting each labeled node propagate its label along the edges to all nodes.

For example, as shown in FIG. 2, a fully connected graph is established in which each item of labeled or unlabeled user data serves as a node.

In some embodiments, letting each labeled node propagate its label along the edges to all nodes includes: setting the weight of the edge between two nodes by using a weight formula, wherein a node connected by a larger-weight edge influences its adjacent node more readily; defining a probability propagation matrix, wherein each node adds up the label values propagated by its surrounding nodes according to the weights and updates its own probability distribution with the result; and repeatedly propagating node labels with the probability propagation matrix until convergence.

For example, after the fully connected graph is built, the weight $w_{ij}$ of the edge between two points i and j can be set by using a weight formula: the smaller the distance between the two points, the larger the weight and the higher their similarity. The specific weight formula is as follows:

Then, an $(m+n) \times (m+n)$ probability propagation matrix $T$ may be defined, where $T_{ij}$ represents the probability that the label of node j propagates to node i:
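Neither formula is reproduced in this text. As a hedged illustration only, one form that is widely used in the label propagation literature, and that is assumed here rather than taken from the application, uses a Gaussian kernel over the Euclidean distance $d_{ij}$ between feature vectors $x_i$ and $x_j$ with a bandwidth hyperparameter $\alpha$, and normalizes each column of the weight matrix. The symbols $x_i$, $d_{ij}$, and $\alpha$ are assumptions.

```latex
w_{ij} = \exp\!\left(-\frac{d_{ij}^{2}}{\alpha^{2}}\right),
\qquad
d_{ij}^{2} = \lVert x_i - x_j \rVert^{2}

T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_{k=1}^{m+n} w_{kj}}
```

Under this assumed form, a smaller distance gives a larger weight, and each column of $T$ sums to 1.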

Specifically, each node adds up the label values propagated by its surrounding nodes according to the weights and updates its own probability distribution with the result. The labeled data are then clamped, that is, their probability distributions are reset to their initial values. These steps are repeated until convergence.
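The propagate, clamp, and repeat loop can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the application's implementation: the Gaussian kernel weight, the `alpha` bandwidth, the row-normalization step, the `tol` convergence threshold, and all function names are assumptions borrowed from the standard formulation of the algorithm, and the toy data includes a second class (label 0) so that there is something to discriminate.

```python
import math


def gaussian_weights(points, alpha):
    """Fully connected graph: w[i][j] = exp(-d_ij^2 / alpha^2) (assumed kernel)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = math.exp(-d2 / alpha ** 2)
    return w


def propagate_labels(points, labels, num_classes, alpha=1.0, max_iter=300, tol=1e-6):
    """labels[i] is a class index for labeled nodes and None for unlabeled nodes."""
    n = len(points)
    w = gaussian_weights(points, alpha)
    # t[i][j]: probability that node j's label propagates to node i (columns sum to 1).
    col_sums = [sum(w[k][j] for k in range(n)) for j in range(n)]
    t = [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    def initial_row(label):
        if label is None:                      # unlabeled: start from a uniform distribution
            return [1.0 / num_classes] * num_classes
        return [1.0 if c == label else 0.0 for c in range(num_classes)]

    f = [initial_row(lab) for lab in labels]
    for _ in range(max_iter):
        # Each node sums the label values of all nodes, weighted by t.
        new_f = [[sum(t[i][j] * f[j][c] for j in range(n)) for c in range(num_classes)]
                 for i in range(n)]
        for i in range(n):                     # keep each row a probability distribution
            total = sum(new_f[i])
            new_f[i] = [v / total for v in new_f[i]]
        for i, lab in enumerate(labels):       # clamp labeled nodes to their initial values
            if lab is not None:
                new_f[i] = initial_row(lab)
        delta = max(abs(new_f[i][c] - f[i][c]) for i in range(n) for c in range(num_classes))
        f = new_f
        if delta < tol:                        # converged
            break
    predicted = [max(range(num_classes), key=lambda c: f[i][c]) for i in range(n)]
    return predicted, f


# Two well-separated clusters with one labeled node each (1 = paying, 0 = non-paying).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [1, None, None, 0, None, None]
predicted, distribution = propagate_labels(points, labels, num_classes=2)
```

In this toy example the unlabeled nodes take the label of the labeled node in their own cluster, so `predicted` is `[1, 1, 1, 0, 0, 0]`. The pure-Python loops cost O((m+n)^2) per iteration and are meant only to show the steps; a data set of realistic size would need a vectorized or sparse implementation.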

Taking the sales scenario for maternal and infant products as an example, the edges between the labeled data of paying users and the unlabeled data of other users can be defined by users' browsing of maternal and infant products, with the browsing frequency serving as the weight. When a new user browses the same products as a paying user, and does so very frequently, the weight w of the edge between the new user and the paying user in the graph computation is very high. The label propagation algorithm therefore judges the two users to be very close, meaning that they are very similar, and propagates the paying label to the new user, achieving similar-audience expansion.
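The text does not specify how browsing frequency is turned into an edge weight. One simple possibility, assumed here purely for illustration, is to sum, over the products both users browsed, the smaller of the two browse counts. The function name, the dictionary layout, and the sample counts are hypothetical.

```python
def co_browsing_weight(browses_a, browses_b):
    """Edge weight from shared browsing: sum of the smaller count per shared product."""
    shared = browses_a.keys() & browses_b.keys()
    return sum(min(browses_a[p], browses_b[p]) for p in shared)


# Browse counts per product (hypothetical data).
paying_user = {"stroller": 12, "diapers": 30}
new_user = {"stroller": 3, "diapers": 25, "phone": 40}
casual_user = {"phone": 50, "diapers": 1}

w_new = co_browsing_weight(paying_user, new_user)        # 3 + 25 = 28
w_casual = co_browsing_weight(paying_user, casual_user)  # 1
```

The new user, who frequently browses the same products as the paying user, gets a much heavier edge than the casual user and is therefore more likely to receive the paying label during propagation.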

As shown in FIG. 3, sample data on consumers of maternal and infant products is input, comprising 50,000 labeled records and 1,000,000 unlabeled records. The final result after 300 iterations of the label propagation algorithm is shown in FIG. 3, and the accuracy can reach 95%. Besides stopping the iterative algorithm by checking for convergence, held-out test labels can be used to evaluate the result once every several iterations and observe the overall training performance of the model, which is very effective for judging whether training has overfitted. Subsequently, the 50,000 labeled records and the 1,000,000 originally unlabeled records can be merged and used as paying seed user data for Lookalike similar-audience expansion.
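Periodic evaluation on held-out test labels can be sketched as a small wrapper around any iterative propagation step. The wrapper, its parameter names, and the toy five-node chain used to exercise it are assumptions for illustration; the chain stands in for a real propagation step and is not the 1,050,000-record experiment described above.

```python
def run_with_monitoring(step, predict, state, held_out, max_iter=300, eval_every=10):
    """Iterate `step` until it reports convergence, scoring held-out labels periodically.

    step(state) returns (new_state, converged); predict(state) returns a list of labels.
    held_out maps node index to true label; these labels are hidden from propagation.
    """
    history = []
    for iteration in range(1, max_iter + 1):
        state, converged = step(state)
        if converged or iteration % eval_every == 0:
            predicted = predict(state)
            correct = sum(predicted[i] == y for i, y in held_out.items())
            history.append((iteration, correct / len(held_out)))
        if converged:
            break
    return state, history


# Toy propagation on a 5-node chain: scores[i] is the probability of label 1.
def chain_step(scores, tol=1e-9):
    new = list(scores)
    for i in range(1, len(scores) - 1):
        new[i] = (scores[i - 1] + scores[i + 1]) / 2   # average of both neighbors
    # The end nodes are never modified: node 0 stays at 1.0 and node 4 at 0.0 (clamped).
    delta = max(abs(a - b) for a, b in zip(new, scores))
    return new, delta < tol


def chain_predict(scores):
    return [1 if s > 0.5 else 0 for s in scores]


final_scores, history = run_with_monitoring(
    chain_step, chain_predict,
    state=[1.0, 0.5, 0.5, 0.5, 0.0],
    held_out={1: 1, 3: 0},
    eval_every=1,
)
```

Each entry of `history` is an `(iteration, accuracy)` pair. A held-out accuracy that stops improving, or starts falling, while iterations continue is one signal that further iterations are not helping.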

Step S103, training a Lookalike model based on the second data set, and finding target users through expansion with the Lookalike model.

For example, the Lookalike model may be a classification prediction model, and the type of model may include, but is not limited to, logistic regression, support vector machines (SVM), decision trees, random forests, naive Bayes, neural networks, and the like. In the sales scenario for maternal and infant products, the Lookalike model can be used to obtain a payment-intention score for each user and to output the users whose score is higher than a preset threshold (i.e., the target users); maternal and infant products can then be accurately recommended to these target users.
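As one concrete instance of such a classifier, the sketch below trains a logistic regression by batch gradient descent on a tiny, fully labeled data set and keeps the candidates whose score exceeds a threshold. Logistic regression is only one of the model types listed above; the function names, the learning rate, the epoch count, the 0.5 threshold, and the sample data are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss; returns weights and bias."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def intent_score(x, w, b):
    """Payment-intention score in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def find_target_users(candidates, w, b, threshold=0.5):
    """Return the ids of candidates whose score is higher than the preset threshold."""
    return [uid for uid, x in candidates.items() if intent_score(x, w, b) > threshold]


# Second data set: every record is labeled (1 = paying, 0 = non-paying).
features = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.7), (0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
targets = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(features, targets)

# Candidate users to score (hypothetical).
candidates = {"c1": (0.95, 0.9), "c2": (0.05, 0.1), "c3": (0.8, 0.85)}
target_users = find_target_users(candidates, weights, bias)
```

Here `c1` and `c3` resemble the paying users and are returned as target users, while `c2` is not. In practice the threshold would be tuned to the campaign's budget and precision requirements.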

FIG. 4 is a schematic diagram of a data expansion device based on tag propagation according to an embodiment of the present application, where the device includes an input module 401, a tag propagation module 402, and an expansion module 403.

The input module 401 obtains a first data set, where the first data set includes tagged user data and untagged user data.

For example, consider a sales scenario for maternal and infant products. Suppose a small amount of data on paying users of maternal and infant products (i.e., the labeled user data) is available, and the merchant needs help finding more users who are similar to the known paying users and are likely to purchase the products. First, the input module 401 inputs the first data set. In this scenario, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (i.e., the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (i.e., the unlabeled user data). Typically $m \ll n$; that is, the amount of labeled data is much smaller than the amount of unlabeled data.

The label propagation module 402 generates a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled.

In some embodiments, the label propagation module 402 determines a label corresponding to the user data without a label in the first data set according to the user data with a label in the first data set by using a label propagation algorithm, and generates the second data set.

For example, in the sales scenario for maternal and infant products, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (i.e., the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (i.e., the unlabeled user data). The label propagation module 402 outputs labels for all of the data in $Y_n$; the data sets $Y_m$ and $Y_n$ can then be merged to generate the second data set.

In some embodiments, the label propagation module 402 establishes a fully connected graph from the labeled and unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node in the fully connected graph, and lets each labeled node propagate its label along the edges to all nodes.

The expansion module 403 trains a Lookalike model based on the second data set and finds target users through expansion with the Lookalike model.

In summary, the application effectively expands the seed user data through graph computation and improves the expansion effect of the Lookalike model, so that target users can be found more accurately, merchants can recommend products to the target users precisely, the merchants' sales efficiency is improved, and the users' purchasing experience is improved.

In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. Herein, some embodiments of the present application provide a computing device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the methods and/or aspects of the embodiments of the present application as described above.

Furthermore, some embodiments of the present application also provide a computer readable medium, on which computer program instructions are stored, the computer readable instructions being executable by a processor to implement the methods and/or aspects of the foregoing embodiments of the present application.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In some embodiments, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A data expansion method based on tag propagation, wherein the method comprises:

2. The method of claim 1, wherein generating a second data set from the first data set using a label propagation algorithm comprises:

and determining a label corresponding to the user data without the label in the first data set according to the user data with the label in the first data set by using a label propagation algorithm, and generating the second data set.

3. The method of claim 2, wherein determining a label corresponding to unlabeled user data within the first data set from labeled user data within the first data set using a label propagation algorithm comprises:

establishing a full connection graph according to the user data with the label and the user data without the label in the first data set, wherein each user data with the label or user data without the label is used as one node in the full connection graph;

each labeled node is allowed to propagate through the edge to all nodes.

4. The method of claim 3, wherein letting each tagged node propagate through edges to all nodes comprises:

setting the weight of an edge between two nodes by using a weight formula, wherein the nodes of the edge with large weight are easier to influence adjacent nodes;

defining a probability propagation matrix, wherein the propagation probability of each node is that the labeled values propagated by the nodes around the node are added according to the weight and are updated to the probability distribution of the node;

and repeatedly using the probability propagation matrix to execute propagation node labels until convergence.

5. A data expansion apparatus based on tag propagation, wherein the apparatus comprises:

6. The device of claim 5, wherein the tag propagation module is to:

7. The apparatus of claim 6, wherein the tag propagation module is to:

each labeled node is allowed to propagate through the edge to all nodes.

8. The apparatus of claim 7, wherein having each tagged node propagate through edges to all nodes comprises:

9. A computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the method of any of claims 1 to 4.

10. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any one of claims 1 to 4.