CN112001748A - Data expansion method and equipment based on label propagation - Google Patents

Publication number
CN112001748A
Authority
CN
China
Prior art keywords
label
data set
data
node
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010819988.XA
Other languages
Chinese (zh)
Inventor
刘楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhangtao Network Technology Co ltd
Original Assignee
Guangzhou Zhangtao Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Zhangtao Network Technology Co ltd
Priority to CN202010819988.XA
Publication of CN112001748A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0249 Advertisements based upon budgets or funds
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data expansion scheme based on label propagation. In the scheme, a first data set containing both labeled and unlabeled user data is obtained; a label propagation algorithm then generates a second data set from the first data set, in which all user data carry labels; a Lookalike model is trained on the second data set and used to expand the audience and find target users. Compared with the prior art, the scheme effectively expands the seed user data through graph computation and improves the expansion performance of the Lookalike model, so that target users can be found more accurately, the user experience is improved, and data purchase costs are reduced.

Description

Data expansion method and equipment based on label propagation
Technical Field
The application relates to the technical field of information, in particular to a data expansion technology based on label propagation.
Background
In recent years, advances in information technology have driven development across industries. The advertising and marketing industry helps brands operate and promote their products, but advertising budgets are often wasted on broad, untargeted placement. Expanding the user base with a Lookalike approach can reduce advertising spend and enable precision marketing. A Lookalike user-expansion model requires seed users, that is, labeled user data; by analyzing the data features of the seed users, it finds users with similar features in massive data sets, thereby expanding the audience.
However, the Lookalike model in prior-art schemes relies solely on seed user data for similar-audience expansion. Many cold-start projects have very little labeled data, and labeled data is expensive and difficult to obtain. With so few labels, the model has too few training samples and cannot learn rich features, so current Lookalike audience-expansion models play only a limited role for new products, new brands, and new projects.
Disclosure of Invention
An object of the application is to provide a data expansion method and device based on label propagation, which effectively combine a label propagation algorithm and similar population expansion of Lookalike, thereby improving the effect of data expansion.
According to one aspect of the present application, a data expansion method based on label propagation is provided, wherein the method includes:
acquiring a first data set, wherein the first data set comprises labeled user data and unlabeled user data;
generating a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled;
training a Lookalike model based on the second data set, and using the Lookalike model to expand the audience and find target users.
According to another aspect of the present application, a data expansion device based on label propagation is also provided, wherein the device includes:
an input module, configured to acquire a first data set, wherein the first data set comprises labeled user data and unlabeled user data;
a label propagation module, configured to generate a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled;
an expansion module, configured to train a Lookalike model based on the second data set and to use the Lookalike model to expand the audience and find target users.
According to yet another aspect of the present application, there is also provided a computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the label propagation-based data expansion method.
According to yet another aspect of the present application, there is also provided a computer readable medium having stored thereon computer program instructions executable by a processor to implement the label propagation-based data expansion method.
In the scheme provided by the application, a first data set comprising labeled and unlabeled user data is first obtained; a second data set, in which all user data carry labels, is then generated from the first data set by a label propagation algorithm; and a Lookalike model is trained on the second data set and used to expand the audience and find target users. Compared with the prior art, the scheme effectively expands the seed user data through graph computation and improves the expansion performance of the Lookalike model, so that target users can be found more accurately, the user experience is improved, and data purchase costs are reduced.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a data expansion method based on label propagation according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a fully connected graph according to an embodiment of the present application;
FIG. 3 is a diagram illustrating results of iterative execution of a label propagation algorithm according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data expansion device based on label propagation according to an embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The embodiment of the application provides a data expansion method based on label propagation, seed user data are effectively expanded through a graph calculation mode, and the expansion effect of a Lookalike model is improved, so that a target user can be found more accurately, the data purchase cost is saved, accurate recommendation can be performed on the target user, and the user experience is improved.
In a practical scenario, the device performing the method may be a user equipment, a network device, or a device formed by integrating the user equipment and the network device through a network. The user equipment includes, but is not limited to, a terminal device such as a smartphone, a tablet computer, or a Personal Computer (PC), and the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud computing-based set of computers. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a type of distributed computing in which a collection of loosely coupled computers forms one virtual computer.
FIG. 1 is a flowchart of a data expansion method based on label propagation according to an embodiment of the present application, where the method includes step S101, step S102, and step S103.
Step S101, a first data set is obtained, wherein the first data set comprises labeled user data and unlabeled user data.
For example, in a sales scenario for mother-and-baby products, suppose a small amount of data on paying users of mother-and-baby products is available (the labeled user data), and a merchant needs help finding more users who resemble the known paying users and are likely to buy the products. In step S101, the first data set is input. In this scenario, the first data set may include a data set of paying users $Y_m = \{y_1, \dots, y_m\}$ (the labeled user data) and an unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (the unlabeled user data). Typically $m \ll n$: the amount of labeled data is much smaller than the amount of unlabeled data.
Step S102, generating a second data set according to the first data set by using a label propagation algorithm, wherein all user data of the second data set are provided with labels.
The Label Propagation Algorithm (LPA) is one of the classic algorithms for non-overlapping community detection. Its basic idea is that each node adopts the label that occurs most often among the labels of its neighbor nodes. Each node carries a label indicating the community it belongs to, and as labels propagate, nodes sharing the same label form a community structure. The propagation process is as follows: (1) initially, assign each node a unique label; (2) each node updates its label to the label that occurs most often among its neighbors; (3) repeat step (2) until no node's label changes.
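The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the application: the function name `label_propagation_communities`, the toy graph, the fixed visiting order, and the rule of breaking ties by taking the largest label are all assumptions made here so that the run is deterministic (the classic algorithm usually breaks ties at random).

```python
from collections import Counter

def label_propagation_communities(adjacency, max_sweeps=100):
    """Majority-vote label propagation on an undirected graph.

    adjacency: dict mapping each node to a list of its neighbor nodes.
    Ties are broken by taking the largest label (an arbitrary but
    deterministic choice made for this sketch).
    """
    labels = {node: node for node in adjacency}  # step 1: unique labels
    for _ in range(max_sweeps):
        changed = False
        for node in sorted(adjacency):  # fixed visiting order
            neighbors = adjacency[node]
            if not neighbors:
                continue
            # step 2: adopt the most frequent label among the neighbors
            counts = Counter(labels[n] for n in neighbors)
            top = max(counts.values())
            best = max(label for label, c in counts.items() if c == top)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:  # step 3: stop once no label changes
            break
    return labels

# Two triangles joined by a single edge between nodes 2 and 3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation_communities(graph)
```

On this toy graph each triangle ends up sharing one label, so the two triangles are recovered as two communities.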
During iteration, node labels can be updated synchronously or asynchronously. In a synchronous update, the label of node z at iteration t is based on the labels its neighbors held at iteration t-1. In an asynchronous update, the label of node z at iteration t is based on the iteration-t labels of neighbors that have already been updated in iteration t and the iteration t-1 labels of neighbors that have not yet been updated.
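The difference between the two update modes shows up even on a two-node graph. The sketch below is illustrative only; the helper names (`majority_label`, `synchronous_sweep`, `asynchronous_sweep`) and the example labels are assumptions. With synchronous updates the two nodes swap labels on every sweep and never settle, while a single asynchronous sweep makes them agree.

```python
from collections import Counter

def majority_label(node, adjacency, labels):
    """Most frequent label among a node's neighbors (largest label on ties)."""
    counts = Counter(labels[n] for n in adjacency[node])
    top = max(counts.values())
    return max(label for label, c in counts.items() if c == top)

def synchronous_sweep(adjacency, labels):
    # Every node reads only the labels from the previous iteration (t-1).
    return {node: majority_label(node, adjacency, labels) for node in adjacency}

def asynchronous_sweep(adjacency, labels):
    # Nodes are updated one by one; later nodes already see the labels
    # that earlier nodes received in the current iteration (t).
    updated = dict(labels)
    for node in sorted(adjacency):
        updated[node] = majority_label(node, adjacency, updated)
    return updated

pair = {"a": ["b"], "b": ["a"]}
start = {"a": "red", "b": "blue"}

sync_once = synchronous_sweep(pair, start)       # labels are swapped
sync_twice = synchronous_sweep(pair, sync_once)  # swapped back again
async_once = asynchronous_sweep(pair, start)     # both nodes agree
```

This oscillation is one reason asynchronous updates are often preferred for majority-vote label propagation on graphs with bipartite structure.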
In some embodiments, step S102 includes: using a label propagation algorithm to determine, from the labeled user data in the first data set, the labels corresponding to the unlabeled user data in the first data set, and generating the second data set.
For example, in the mother-and-baby product sales scenario, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (the unlabeled user data). Step S102 outputs labels for all data in $Y_n$; the data sets $Y_m$ and $Y_n$ can then be merged to generate the second data set.
In some embodiments, step S102 includes: building a fully connected graph from the labeled and unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node of the graph; and letting each labeled node propagate its label along the edges to all nodes.
For example, as shown in FIG. 2, a fully connected graph is built in which each item of labeled or unlabeled user data is a node.
In some embodiments, letting each labeled node propagate its label along the edges to all nodes includes: setting the weight of the edge between two nodes using a weight formula, wherein a node connected by a heavily weighted edge influences its adjacent node more easily; defining a probability propagation matrix, wherein each node sums the label values propagated by its surrounding nodes, weighted by propagation probability, and updates its own probability distribution accordingly; and repeatedly applying the probability propagation matrix to propagate node labels until convergence.
For example, after the fully connected graph is built, the weight $w_{ij}$ of the edge between two nodes i and j can be set using a weight formula: the smaller the distance between the two nodes, the larger the weight and the higher their similarity. The specific weight formula is as follows:
[Formula image BDA0002634112480000051: weight $w_{ij}$ of the edge between nodes i and j; not reproduced in text]
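The weight formula is available only as an image in the original publication. For illustration, one widely used choice that satisfies the stated property (smaller distance, larger weight) is a Gaussian kernel on the Euclidean distance between feature vectors. The form below is an assumption, not a transcription of the figure: $x_i^{d}$ denotes the $d$-th of $D$ features of node $i$, $d_{ij}$ the Euclidean distance between nodes $i$ and $j$, and $\sigma$ a bandwidth hyperparameter.

```latex
w_{ij} = \exp\left(-\frac{d_{ij}^{2}}{\sigma^{2}}\right)
       = \exp\left(-\frac{\sum_{d=1}^{D}\left(x_i^{d}-x_j^{d}\right)^{2}}{\sigma^{2}}\right)
```

With this choice $w_{ij}$ equals 1 when the two nodes coincide and decays toward 0 as they move apart; $\sigma$ controls how quickly the weight falls off.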
Then, an $(m+n) \times (m+n)$ probability propagation matrix $T$ may be defined, where $T_{ij}$ represents the probability that the label of node j propagates to node i:
[Formula image BDA0002634112480000052: definition of the propagation probability $T_{ij}$; not reproduced in text]
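The definition of $T$ is likewise available only as an image. A common construction, assumed here for illustration rather than taken from the figure, normalizes the edge weights so that the probabilities of node $j$'s label propagating to all $m+n$ nodes sum to 1:

```latex
T_{ij} = P(j \rightarrow i) = \frac{w_{ij}}{\sum_{k=1}^{m+n} w_{kj}}
```

Under this assumption each column of $T$ sums to 1, and a larger edge weight $w_{ij}$ directly yields a larger propagation probability from node $j$ to node $i$.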
Specifically, in each propagation step, every node sums the label values propagated by its surrounding nodes, weighted by propagation probability, and updates its own probability distribution. The labeled data are then clamped: the probability distributions of the labeled nodes are reset to their initial values. These steps are repeated until convergence.
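The propagate, clamp, and repeat loop can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the application's implementation: the function name `propagate_labels`, the Gaussian-kernel weights, the column normalization of the propagation matrix, the row normalization of the label distributions, and the toy feature vectors are all choices made here for the example.

```python
import math

def propagate_labels(features, seeds, num_classes, sigma=1.0,
                     max_iters=1000, tol=1e-9):
    """Return one class-probability row per node.

    features: list of feature vectors, one per node.
    seeds: dict mapping labeled node index -> class index.
    """
    n = len(features)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Edge weights on the fully connected graph (assumed Gaussian kernel).
    w = [[math.exp(-sqdist(features[i], features[j]) / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # Propagation matrix: t[i][j] is the probability of j's label reaching i.
    col_sums = [sum(w[k][j] for k in range(n)) for j in range(n)]
    t = [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    def clamp(y):
        # Reset labeled nodes to their initial one-hot distributions.
        for node, cls in seeds.items():
            y[node] = [1.0 if c == cls else 0.0 for c in range(num_classes)]
        return y

    y = clamp([[1.0 / num_classes] * num_classes for _ in range(n)])
    for _ in range(max_iters):
        # Propagate: weighted sum of the neighbors' label distributions.
        new_y = [[sum(t[i][j] * y[j][c] for j in range(n))
                  for c in range(num_classes)] for i in range(n)]
        # Renormalize each row so it remains a probability distribution.
        new_y = [[v / sum(row) for v in row] for row in new_y]
        new_y = clamp(new_y)
        delta = max(abs(new_y[i][c] - y[i][c])
                    for i in range(n) for c in range(num_classes))
        y = new_y
        if delta < tol:  # converged
            break
    return y

# Toy data: node 0 is a known paying user (class 1), node 1 a known
# non-paying user (class 0); nodes 2 to 4 are unlabeled.
features = [[0.0, 0.0], [5.0, 5.0], [0.3, 0.1], [4.8, 5.2], [0.1, 0.4]]
seeds = {0: 1, 1: 0}
probs = propagate_labels(features, seeds, num_classes=2)
predicted = [max(range(2), key=lambda c: row[c]) for row in probs]
```

In this toy run the unlabeled nodes close to node 0 receive the paying label and the node close to node 1 receives the non-paying label. Building the full matrix costs memory quadratic in the number of nodes, so a data set of realistic size would need a sparse or approximate graph instead.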
Taking the mother-and-baby product sales scenario as an example, the edges between the labeled data of paying users and the other, unlabeled user data can be defined by users' browsing records for mother-and-baby products, with the browsing frequency as the weight. When a new user browses the same products as a paying user and does so very frequently, the weight w of the edge between the new user and the paying user in the graph computation is very high. The label propagation algorithm then treats the two users as very close, meaning they are very similar, and propagates the paying label to the new user, thereby achieving similar-audience expansion.
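The text does not give an exact formula for turning browsing frequency into an edge weight. As one hypothetical possibility, the sketch below sums, over the products both users have browsed, the smaller of the two view counts; the function name `browsing_edge_weight`, the weighting rule, and the sample data are all assumptions made for illustration.

```python
def browsing_edge_weight(views_a, views_b):
    """Edge weight from two users' product view counts.

    views_a, views_b: dict mapping product id -> number of views.
    Assumed rule: overlap of view counts on products browsed by both users.
    """
    shared = views_a.keys() & views_b.keys()
    return sum(min(views_a[p], views_b[p]) for p in shared)

paying_user = {"stroller": 6, "diapers": 9, "bottle": 2}
new_user = {"stroller": 5, "diapers": 12, "phone_case": 1}
casual_user = {"phone_case": 4, "bottle": 1}

w_new = browsing_edge_weight(paying_user, new_user)        # 5 + 9 = 14
w_casual = browsing_edge_weight(paying_user, casual_user)  # 1
```

The new user, who frequently browses the same products as the paying user, gets a much heavier edge than the casual user, which is the behavior the paragraph above describes.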
As shown in FIG. 3, sample data on mother-and-baby product consumers is input, with 50,000 labeled records and 1,000,000 unlabeled records. The final result after 300 iterations of the label propagation algorithm is shown in FIG. 3; the accuracy can reach 95%. Besides stopping the iteration when convergence is detected, held-out test labels can be used to evaluate the result once every several iterations to observe the overall training performance of the model, which is very effective for judging whether training has overfitted. The 50,000 labeled records and the 1,000,000 originally unlabeled records can then be merged and used as paying seed user data for Lookalike similar-audience expansion.
Step S103: train a Lookalike model based on the second data set, and use the Lookalike model to expand the audience and find target users.
For example, the Lookalike model may be a classification prediction model; suitable model types include, but are not limited to, logistic regression, SVM, decision trees, random forests, naive Bayes, and neural networks. In the mother-and-baby product sales scenario, the Lookalike model can produce a payment-intent score for each user and output the users whose score exceeds a preset threshold (the target users), to whom mother-and-baby products can then be recommended accurately.
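As a concrete illustration of scoring and thresholding, the sketch below trains a small logistic regression (one of the model types listed above) by gradient descent and selects candidates whose score exceeds a threshold. Everything specific here is an assumption for the example: the function names, the two made-up features, the toy training rows standing in for the second data set, and the threshold of 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(rows[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            for k in range(dim):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def intent_score(x, weights, bias):
    """Payment-intent score in (0, 1) for one user's feature vector."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

# Toy stand-in for the second data set: every row carries a label
# (1 = paying user, 0 = non-paying user).
train_x = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.9],
           [0.1, 0.2], [0.2, 0.1], [0.1, 0.1]]
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)

candidates = {"u1": [0.85, 0.9], "u2": [0.15, 0.1], "u3": [0.8, 0.7]}
THRESHOLD = 0.5  # preset threshold, chosen arbitrarily for this sketch
targets = sorted(user for user, x in candidates.items()
                 if intent_score(x, weights, bias) > THRESHOLD)
```

Here `targets` contains the candidates whose features resemble the paying users. In practice the threshold would be tuned against the desired audience size and precision.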
FIG. 4 is a schematic diagram of a data expansion device based on label propagation according to an embodiment of the present application, where the device includes an input module 401, a label propagation module 402, and an expansion module 403.
The input module 401 obtains a first data set, where the first data set includes labeled user data and unlabeled user data.
For example, in a sales scenario for mother-and-baby products, suppose a small amount of data on paying users of mother-and-baby products is available (the labeled user data), and a merchant needs help finding more users who resemble the known paying users and are likely to buy the products. The input module 401 first inputs the first data set. In this scenario, the first data set may include a data set of paying users $Y_m = \{y_1, \dots, y_m\}$ (the labeled user data) and an unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (the unlabeled user data). Typically $m \ll n$: the amount of labeled data is much smaller than the amount of unlabeled data.
The label propagation module 402 generates a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled.
The Label Propagation Algorithm (LPA) is one of the classic algorithms for non-overlapping community detection. Its basic idea is that each node adopts the label that occurs most often among the labels of its neighbor nodes. Each node carries a label indicating the community it belongs to, and as labels propagate, nodes sharing the same label form a community structure. The propagation process is as follows: (1) initially, assign each node a unique label; (2) each node updates its label to the label that occurs most often among its neighbors; (3) repeat step (2) until no node's label changes.
During iteration, node labels can be updated synchronously or asynchronously. In a synchronous update, the label of node z at iteration t is based on the labels its neighbors held at iteration t-1. In an asynchronous update, the label of node z at iteration t is based on the iteration-t labels of neighbors that have already been updated in iteration t and the iteration t-1 labels of neighbors that have not yet been updated.
In some embodiments, the label propagation module 402 uses a label propagation algorithm to determine, from the labeled user data in the first data set, the labels corresponding to the unlabeled user data in the first data set, and generates the second data set.
For example, in the mother-and-baby product sales scenario, the first data set may include the paying-user data set $Y_m = \{y_1, \dots, y_m\}$ (the labeled user data) and the unlabeled data set $Y_n = \{y_{m+1}, \dots, y_{m+n}\}$ (the unlabeled user data). The label propagation module 402 outputs labels for all data in $Y_n$; the data sets $Y_m$ and $Y_n$ can then be merged to generate the second data set.
In some embodiments, the label propagation module 402 builds a fully connected graph from the labeled and unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node of the graph, and lets each labeled node propagate its label along the edges to all nodes.
For example, as shown in FIG. 2, a fully connected graph is built in which each item of labeled or unlabeled user data is a node.
In some embodiments, letting each labeled node propagate its label along the edges to all nodes includes: setting the weight of the edge between two nodes using a weight formula, wherein a node connected by a heavily weighted edge influences its adjacent node more easily; defining a probability propagation matrix, wherein each node sums the label values propagated by its surrounding nodes, weighted by propagation probability, and updates its own probability distribution accordingly; and repeatedly applying the probability propagation matrix to propagate node labels until convergence.
For example, after the fully connected graph is built, the weight $w_{ij}$ of the edge between two nodes i and j can be set using a weight formula: the smaller the distance between the two nodes, the larger the weight and the higher their similarity. The specific weight formula is as follows:
[Formula image BDA0002634112480000081: weight $w_{ij}$ of the edge between nodes i and j; not reproduced in text]
Then, an $(m+n) \times (m+n)$ probability propagation matrix $T$ may be defined, where $T_{ij}$ represents the probability that the label of node j propagates to node i:
[Formula image BDA0002634112480000082: definition of the propagation probability $T_{ij}$; not reproduced in text]
Specifically, in each propagation step, every node sums the label values propagated by its surrounding nodes, weighted by propagation probability, and updates its own probability distribution. The labeled data are then clamped: the probability distributions of the labeled nodes are reset to their initial values. These steps are repeated until convergence.
Taking the mother-and-baby product sales scenario as an example, the edges between the labeled data of paying users and the other, unlabeled user data can be defined by users' browsing records for mother-and-baby products, with the browsing frequency as the weight. When a new user browses the same products as a paying user and does so very frequently, the weight w of the edge between the new user and the paying user in the graph computation is very high. The label propagation algorithm then treats the two users as very close, meaning they are very similar, and propagates the paying label to the new user, thereby achieving similar-audience expansion.
As shown in FIG. 3, sample data on mother-and-baby product consumers is input, with 50,000 labeled records and 1,000,000 unlabeled records. The final result after 300 iterations of the label propagation algorithm is shown in FIG. 3; the accuracy can reach 95%. Besides stopping the iteration when convergence is detected, held-out test labels can be used to evaluate the result once every several iterations to observe the overall training performance of the model, which is very effective for judging whether training has overfitted. The 50,000 labeled records and the 1,000,000 originally unlabeled records can then be merged and used as paying seed user data for Lookalike similar-audience expansion.
The expansion module 403 trains a Lookalike model based on the second data set and uses the Lookalike model to expand the audience and find target users.
For example, the Lookalike model may be a classification prediction model; suitable model types include, but are not limited to, logistic regression, SVM, decision trees, random forests, naive Bayes, and neural networks. In the mother-and-baby product sales scenario, the Lookalike model can produce a payment-intent score for each user and output the users whose score exceeds a preset threshold (the target users), to whom mother-and-baby products can then be recommended accurately.
To sum up, the seed user data are effectively expanded through the graph calculation mode, the expansion effect of the Lookalike model is improved, and therefore the target user can be found more accurately, a merchant can recommend products to the target user accurately, the sales efficiency of the merchant is improved, and the purchase experience of the user is improved.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. Herein, some embodiments of the present application provide a computing device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the methods and/or aspects of the embodiments of the present application as described above.
Furthermore, some embodiments of the present application also provide a computer readable medium, on which computer program instructions are stored, the computer readable instructions being executable by a processor to implement the methods and/or aspects of the foregoing embodiments of the present application.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In some embodiments, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A data expansion method based on label propagation, wherein the method comprises:
acquiring a first data set, wherein the first data set comprises labeled user data and unlabeled user data;
generating a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled;
and training on the second data set to obtain a Lookalike model, and using the Lookalike model to find target users by expansion.
2. The method of claim 1, wherein generating a second data set from the first data set using a label propagation algorithm comprises:
determining a label corresponding to the unlabeled user data in the first data set according to the labeled user data in the first data set by using a label propagation algorithm, and generating the second data set.
3. The method of claim 2, wherein determining a label corresponding to unlabeled user data within the first data set from labeled user data within the first data set using a label propagation algorithm comprises:
establishing a fully connected graph from the labeled user data and the unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node in the fully connected graph;
and letting each labeled node propagate its label through the edges to all nodes.
4. The method of claim 3, wherein letting each labeled node propagate through the edges to all nodes comprises:
setting the weight of the edge between every two nodes by using a weight formula, wherein a node on an edge with a larger weight more readily influences its adjacent node;
defining a probability propagation matrix, wherein, for each node, the label values propagated from its surrounding nodes are summed according to the edge weights and the result is used to update the probability distribution of that node;
and repeatedly applying the probability propagation matrix to propagate node labels until convergence.
5. A data expansion apparatus based on label propagation, wherein the apparatus comprises:
an input module, configured to acquire a first data set, wherein the first data set comprises labeled user data and unlabeled user data;
a label propagation module, configured to generate a second data set from the first data set by using a label propagation algorithm, wherein all user data of the second data set are labeled;
and an expansion module, configured to train on the second data set to obtain a Lookalike model and to use the Lookalike model to find target users by expansion.
6. The apparatus of claim 5, wherein the label propagation module is configured to:
determine a label corresponding to the unlabeled user data in the first data set according to the labeled user data in the first data set by using a label propagation algorithm, and generate the second data set.
7. The apparatus of claim 6, wherein the label propagation module is configured to:
establish a fully connected graph from the labeled user data and the unlabeled user data in the first data set, wherein each item of labeled or unlabeled user data serves as one node in the fully connected graph;
and let each labeled node propagate its label through the edges to all nodes.
8. The apparatus of claim 7, wherein letting each labeled node propagate through the edges to all nodes comprises:
setting the weight of the edge between every two nodes by using a weight formula, wherein a node on an edge with a larger weight more readily influences its adjacent node;
defining a probability propagation matrix, wherein, for each node, the label values propagated from its surrounding nodes are summed according to the edge weights and the result is used to update the probability distribution of that node;
and repeatedly applying the probability propagation matrix to propagate node labels until convergence.
9. A computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the device to perform the method of any of claims 1 to 4.
10. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any one of claims 1 to 4.
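
For illustration only, the label propagation steps recited in claims 2 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The claims do not specify the weight formula, so a Gaussian kernel on squared Euclidean distance is assumed here. The function name label_propagation, the parameters sigma, tol and max_iter, the uniform initial distribution for unlabeled nodes, and the clamping of labeled nodes to their known labels are likewise assumptions introduced for the example.

```python
import math


def label_propagation(features, labels, num_classes, sigma=1.0, tol=1e-6, max_iter=1000):
    """Assign a class index to every node; labels[i] is a class index or None."""
    n = len(features)

    # Fully connected graph: one edge weight per pair of distinct nodes.
    # Assumed weight formula: w_ij = exp(-||x_i - x_j||^2 / sigma^2).
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                weights[i][j] = math.exp(-dist2 / sigma ** 2)

    # Probability propagation matrix: each row of weights normalised to sum to 1
    # (a row whose weights are all zero is left as zeros to avoid dividing by zero).
    prob = []
    for row in weights:
        total = sum(row) or 1.0
        prob.append([w / total for w in row])

    def initial(label):
        # One-hot distribution for labeled nodes, uniform for unlabeled nodes.
        if label is None:
            return [1.0 / num_classes] * num_classes
        return [1.0 if c == label else 0.0 for c in range(num_classes)]

    dist = [initial(label) for label in labels]

    for _ in range(max_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                # Labeled nodes keep their known label (clamping).
                updated.append(dist[i])
            else:
                # Weighted sum of the distributions of all other nodes.
                updated.append([
                    sum(prob[i][j] * dist[j][c] for j in range(n))
                    for c in range(num_classes)
                ])
        change = max(
            (abs(a - b) for new, old in zip(updated, dist) for a, b in zip(new, old)),
            default=0.0,
        )
        dist = updated
        if change < tol:  # converged
            break

    # Each node receives its most probable class.
    return [max(range(num_classes), key=lambda c: row[c]) for row in dist]


# Example: two well-separated 1-D clusters, each with a single labeled seed.
example_features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
example_labels = [0, None, None, 1, None, None]
print(label_propagation(example_features, example_labels, num_classes=2))
# prints [0, 0, 0, 1, 1, 1]
```

In terms of claim 1, the returned list plays the role of the labels of the second data set, on which a Lookalike model could then be trained. The choice of that model is outside the scope of this sketch.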
CN202010819988.XA 2020-08-14 2020-08-14 Data expansion method and equipment based on label propagation Pending CN112001748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010819988.XA CN112001748A (en) 2020-08-14 2020-08-14 Data expansion method and equipment based on label propagation

Publications (1)

Publication Number Publication Date
CN112001748A true CN112001748A (en) 2020-11-27

Family

ID=73473230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010819988.XA Pending CN112001748A (en) 2020-08-14 2020-08-14 Data expansion method and equipment based on label propagation

Country Status (1)

Country Link
CN (1) CN112001748A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020188602A1 (en) * 2001-05-07 2002-12-12 Eastman Kodak Company Method for associating semantic information with multiple images in an image database environment
CN103632294A (en) * 2013-12-20 2014-03-12 互动通天图信息技术有限公司 Method for integrating user data based on media and third-party data platform
CN108647983A (en) * 2018-03-16 2018-10-12 北京奇艺世纪科技有限公司 Seed user determines method, apparatus and advertisement placement method, device
CN110399564A (en) * 2019-07-23 2019-11-01 腾讯科技(深圳)有限公司 Account number classification method and device, storage medium and electronic device
CN111382283A (en) * 2020-03-12 2020-07-07 腾讯科技(深圳)有限公司 Resource category label labeling method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Jun, Li Yan, Dang Sha: "Understanding Martech Smart Marketing in One Book" (《一本书读透Martech智慧营销》), 30 June 2020, China Machine Press *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907282A (en) * 2021-02-05 2021-06-04 杭州微洱网络科技有限公司 Architecture application method based on global e-commerce industry advertisement DMP
CN113139125A (en) * 2021-04-21 2021-07-20 北方工业大学 User demand driven service matching method
CN113139125B (en) * 2021-04-21 2024-02-09 北方工业大学 User demand driven service matching method
CN114387477A (en) * 2022-01-18 2022-04-22 中国农业银行股份有限公司 Label classification model training method, label classification method, device and equipment
CN114387477B (en) * 2022-01-18 2025-03-18 中国农业银行股份有限公司 Label classification model training method, label classification method, device and equipment

Similar Documents

Publication Publication Date Title
JP7392668B2 (en) Data processing methods and electronic equipment
US20230102337A1 (en) Method and apparatus for training recommendation model, computer device, and storage medium
US11074295B2 (en) Distributed graph embedding method and apparatus, device, and system
US8990209B2 (en) Distributed scalable clustering and community detection
US9836701B2 (en) Distributed stage-wise parallel machine learning
US11315032B2 (en) Method and system for recommending content items to a user based on tensor factorization
CN113268656A (en) User recommendation method and device, electronic equipment and computer storage medium
CN112001748A (en) Data expansion method and equipment based on label propagation
CN113763077B (en) Method and apparatus for detecting false trade orders
CN113516524B (en) Method and device for pushing information
US12182098B1 (en) Curating ambiguous data for use in a data pipeline through interaction with a data source
CN114443958A (en) A recommendation method, recommendation system and recommendation system training method
CN113609345A (en) Target object association method and device, computing equipment and storage medium
CN115293919A (en) Graph neural network prediction method and system for out-of-distribution generalization of social network
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
US20220027722A1 (en) Deep Relational Factorization Machine Techniques for Content Usage Prediction via Multiple Interaction Types
CN112035581B (en) Model-based task processing method, device, equipment and medium
CN118211590A (en) A method, system, storage medium and terminal for extracting entity relations from documents
CN111507471A (en) A model training method, device, equipment and storage medium
Sulistianingsih et al. GN-PPN: Parallel Girvan-Newman-Based Algorithm to Detect Communities in Graph with Positive and Negative Weights.
CN116975686A (en) Method for training student model, behavior prediction method and device
CN110851600A (en) Text data processing method and device based on deep learning
CN114970758A (en) Data analysis method and device, electronic equipment and computer storage medium
Nainwal et al. Text summarization of amazon customer reviews using NLP
CN109597851B (en) Feature extraction method and device based on incidence relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication; application publication date: 20201127