CN114238304A

CN114238304A - A label generation method, device, computer equipment and storage medium

Info

Publication number: CN114238304A
Application number: CN202111601380.0A
Authority: CN
Inventors: 刘新宇; 王霏; 王彪; 胡玉玮
Original assignee: Shenzhen Xinguodu Digital Technology Co ltd
Current assignee: Shenzhen Xinguodu Digital Technology Co ltd
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-03-25
Anticipated expiration: 2041-12-24
Also published as: CN114238304B

Abstract

The invention discloses a label generation method and device, computer equipment and a storage medium. The method includes: acquiring historical data records containing different types of labels, and constructing a label metadata table based on the labels in the historical data records; categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table; extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table from the extracted labels; and performing label cleaning on the original label data table through a preset label mapping table, and setting the cleaned original label data table as the final label data table. By categorizing the labels in the historical data records into label configuration information, constructing the original label data table on that basis, and then cleaning it to obtain the final label data table, the invention can improve label generation efficiency and label management.

Description

Label generation method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of computer software technologies, and in particular, to a tag generation method and apparatus, a computer device, and a storage medium.

Background

Existing label generation techniques generally rely on manually defining and organizing labels at the database field level, configuring them in the system back end, and implementing the generation logic of each label in corresponding code, which consumes considerable labor. In addition, labels are extracted and computed on traditional relational databases, which suffer from insufficient storage space and low computational efficiency when the data volume is large. Meanwhile, label management requires the metadata of each label to be maintained manually, which can lead to omissions and errors when the number of labels is large.

Disclosure of Invention

The embodiment of the invention provides a label generation method, a label generation device, computer equipment and a storage medium, and aims to improve label generation efficiency and label management effect.

In a first aspect, an embodiment of the present invention provides a tag generation method, including:

acquiring a historical data record containing different types of labels, and constructing a label metadata table based on the labels in the historical data record;

categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table;

extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted label;

and carrying out label cleaning treatment on the original label data table through a preset label mapping table, and setting the original label data table after label cleaning as a final label data table.

In a second aspect, an embodiment of the present invention provides a tag generation apparatus, including:

the data acquisition unit is used for acquiring historical data records containing different types of labels and constructing a label metadata table based on the labels in the historical data records;

the categorizing and summarizing unit is used for categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table;

the tag extraction unit is used for extracting each tag in the tag metadata table based on the tag configuration information and constructing an original tag data table according to the extracted tag;

and the label cleaning unit is used for carrying out label cleaning treatment on the original label data table through a preset label mapping table and setting the original label data table after label cleaning as a final label data table.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the tag generation method according to the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the tag generation method according to the first aspect.

The embodiment of the invention provides a label generation method, a label generation device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring historical data records containing different types of labels, and constructing a label metadata table based on the labels in the historical data records; categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table; extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted labels; and performing label cleaning on the original label data table through a preset label mapping table, and setting the original label data table after label cleaning as a final label data table. According to the embodiment of the invention, the labels in the historical data records are categorized and summarized into label configuration information, the original label data table is constructed according to the label configuration information, and the original label data table is then cleaned with the label mapping table to obtain the final label data table, so that label generation efficiency and label management can be improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic flowchart of a tag generation method according to an embodiment of the present invention;

Fig. 2 is a schematic block diagram of a tag generation apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to Fig. 1, which is a schematic flowchart of a tag generation method according to an embodiment of the present invention, the method specifically includes steps S101 to S104.

S101, acquiring historical data records containing different types of labels, and constructing a label metadata table based on the labels in the historical data records;

S102, categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table;

S103, extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted labels;

and S104, carrying out label cleaning treatment on the original label data table through a preset label mapping table, and setting the original label data table after label cleaning as a final label data table.

In this embodiment, the labels in the historical data records are categorized and summarized into label configuration information, an original label data table is constructed according to the label configuration information, and the original label data table is then cleaned with the label mapping table to obtain the final label data table, which improves label generation efficiency and label management.

In addition, this embodiment also categorizes labels into the label configuration information by way of custom labels, which covers complex labels that cannot be defined by the other label types. During generation, labels with complex logic can be produced through SQL statements.
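As a hedged illustration of a custom label defined by an SQL statement, the sketch below runs a configured query and assigns the label to every enterprise the query returns. It uses Python's built-in `sqlite3` as a stand-in SQL engine; the `enterprise` table, the `revenue` column, the 'high revenue' label and the `compute_custom_tag` helper are all hypothetical and do not appear in the source.

```python
import sqlite3


def compute_custom_tag(conn, tag_name, tag_sql):
    """Run the configured SQL; each enterprise_id it returns receives tag_name."""
    return [(row[0], tag_name) for row in conn.execute(tag_sql)]


# Hypothetical source data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enterprise (enterprise_id TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO enterprise VALUES (?, ?)",
    [("U001", 1200.0), ("U002", 300.0)],
)

# The calculation rule of the custom label is just a configured SQL string.
HIGH_REVENUE_SQL = (
    "SELECT enterprise_id FROM enterprise "
    "WHERE revenue > 1000 ORDER BY enterprise_id"
)
custom_tags = compute_custom_tag(conn, "high revenue", HIGH_REVENUE_SQL)
```

Keeping the rule as data (a string in the label configuration) rather than as code is what lets new complex labels be added without changing the program.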

In one embodiment, the step S102 includes:

when the labels in the historical data record correspond to multiple data values, categorizing the labels as one-column multi-labels;

when the labels in the historical data record correspond to a single data value, categorizing the labels as one-column single-labels;

when the labels in the historical data record correspond to an identification code, categorizing the labels as one-hot encoded labels;

when a label in the historical data record is a table header name, categorizing the label as a table label;

when the data value corresponding to a label in the historical data record is obtained through calculation, categorizing the label as a custom label;

setting the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels as the label configuration information.

In this embodiment, the labels are categorized as one-column multi-labels, one-column single-labels, one-hot encoded labels, table labels or custom labels according to their data values or identifiers in the historical data record, and these five categories together serve as the label configuration information.

For example, suppose one column in the historical data record is 'type of business' and holds the data value 'five hundred strong businesses, listed companies'. The value contains two labels ('five hundred strong businesses' and 'listed companies') separated by the separator ','. By configuring the separator, the labels can be split automatically, so this type of label is defined as 'one-column multi-label'. If one column in the historical data record is 'green channel enterprise', and a qualifying enterprise stores the value 'green channel' in that field, this type of label can be defined as 'one-column single-label'. If a column in the historical data record is named 'is China 500 strong' and stores 0 or 1, that is, 1 if the enterprise is among the China 500 strong and 0 otherwise, the label can be defined as a 'one-hot encoded label'. If a table in the historical data record is named 'green channel enterprise list', the table itself is the label, and the label can be defined as a table label. If the label data must be obtained through calculation, the label can be defined as a 'custom label', and the specific calculation rule is defined by configuring the label calculation logic (SQL).
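The label types in this example can be read as a small dispatch over configuration entries. The following is a minimal Python sketch under stated assumptions: the `TagConfig` fields, the type identifiers (`multi`, `single`, `one_hot`, `table`) and the `extract_tags` helper are hypothetical names introduced for illustration. Custom labels are left out because they depend on configured SQL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TagConfig:
    """One entry of a hypothetical label metadata table."""
    tag_type: str                   # "multi", "single", "one_hot" or "table"
    source: str                     # column name, or table name for "table"
    separator: str = ","            # only used by "multi"
    tag_name: Optional[str] = None  # label emitted by "one_hot" and "table"


def extract_tags(record: dict, config: TagConfig) -> List[str]:
    """Return the labels that one configuration entry yields for one record."""
    if config.tag_type == "table":
        # Presence of the record in the table is itself the label.
        return [config.tag_name]
    value = record.get(config.source)
    if value is None or value == "":
        return []
    if config.tag_type == "multi":
        # Split on the configured separator and drop empty pieces.
        parts = str(value).split(config.separator)
        return [part.strip() for part in parts if part.strip()]
    if config.tag_type == "single":
        return [str(value).strip()]
    if config.tag_type == "one_hot":
        # The column stores 0 or 1; only 1 produces the label.
        return [config.tag_name] if str(value).strip() == "1" else []
    raise ValueError(f"unsupported tag type: {config.tag_type}")


record = {
    "type of business": "five hundred strong businesses, listed companies",
    "green channel enterprise": "green channel",
    "is China 500 strong": 1,
}
multi_tags = extract_tags(record, TagConfig("multi", "type of business"))
```

Because each label type is described by data rather than by dedicated code, adding a new source column only requires a new configuration entry.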

In one embodiment, the step S103 includes:

extracting the labels in the label metadata table in batches based on the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels;

and arranging the batch-extracted labels in sequence to construct the original label data table.

In this embodiment, once the labels have been categorized, every label in the label metadata table corresponds to its own label configuration information, that is, the type to which the label belongs. The labels in the label metadata table are then extracted in batches by type according to the label configuration information and arranged by region or in sequence following the extraction order, and the resulting table is set as the original label data table. The order of the different label types in the original label data table can be set freely; for example, one-column multi-labels may be placed before one-column single-labels, or one-hot encoded labels after custom labels.
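One possible reading of batch extraction with a freely configurable type order is sketched below. The row layout `(enterprise, tag_type, tag)`, the type identifiers and the `build_original_tag_table` helper are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (enterprise, tag_type, tag)


def build_original_tag_table(extracted: Iterable[Row], type_order: List[str]) -> List[Row]:
    """Group rows into one batch per label type, then lay the batches out in type_order."""
    batches = defaultdict(list)
    for row in extracted:
        batches[row[1]].append(row)

    table: List[Row] = []
    for tag_type in type_order:
        table.extend(batches.get(tag_type, []))
    # Types missing from type_order are appended last, in first-seen order.
    for tag_type, rows in batches.items():
        if tag_type not in type_order:
            table.extend(rows)
    return table


rows = [
    ("A", "single", "green channel"),
    ("A", "multi", "listed companies"),
    ("B", "one_hot", "China 500 strong"),
    ("B", "multi", "listed companies"),
]
original_table = build_original_tag_table(rows, ["multi", "single", "one_hot"])
```

Within each batch the rows keep their extraction order, so changing `type_order` only moves whole batches.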

In an embodiment, the tag generation method further includes:

and acquiring an enterprise standard information table, and correcting and completing the enterprise basic information in the original label data table based on the enterprise standard information table so as to correctly associate each label in the original label data table with an enterprise.

In this embodiment, the enterprise basic information in the original label data table (such as the enterprise name, unified code and registration number) is corrected and completed against an existing enterprise standard information table, which contains true and accurate enterprise basic information, so as to ensure that the labels in the original label data table are associated with the correct enterprises and enterprise information.
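A minimal sketch of the correction and completion step follows. The field names (`name`, `unified_code`, `registration_number`), the matching priority (unified code, then registration number, then name) and the `normalize_enterprise` helper are assumptions; the source does not specify how a row is matched against the standard table.

```python
from typing import List

# Assumed matching priority: the most reliable identifier is tried first.
KEYS = ("unified_code", "registration_number", "name")


def normalize_enterprise(row: dict, standard_rows: List[dict]) -> dict:
    """Return a copy of row whose basic information is taken from the standard table."""
    for key in KEYS:
        value = row.get(key)
        if not value:
            continue  # a missing field cannot be used for matching
        for std in standard_rows:
            if std.get(key) == value:
                fixed = dict(row)
                # Overwrite wrong values and fill in missing ones.
                fixed.update({k: std[k] for k in KEYS if std.get(k)})
                return fixed
    return dict(row)  # no match: leave the row unchanged


standard = [
    {"name": "Acme Co., Ltd.", "unified_code": "U001", "registration_number": "R001"},
]
raw = {"name": "Acme", "unified_code": "U001", "registration_number": "", "tag": "green channel"}
fixed_row = normalize_enterprise(raw, standard)
```

Rows that match nothing are returned unchanged, which leaves room for a manual review step if one is desired.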

In one embodiment, the step S104 includes:

based on the label mapping table, mapping labels with different names and the same meaning in the original label data table into a standard label;

and acquiring labels with the same name in the original label data table, and performing duplicate removal processing on the labels with the same name according to the enterprise standard information table.

In this embodiment, the label mapping table specifies how labels from different data tables should be cleaned, together with the standard name of each label. For example, if an enterprise is labeled 'China 500 strong' in one data table and 'five hundred strong' in another, the two labels have different names but the same meaning; mapping both onto the standard label 'China five hundred strong' in the label mapping table ensures the consistency and accuracy of the final label. In one embodiment, a label mapping system may be constructed from the labels in the historical data records and past label configuration experience, and used to determine whether a label name should be mapped and, if so, to which standard name.
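The mapping and de-duplication steps can be sketched as follows. The example assumes that each row already carries an enterprise identifier normalized against the enterprise standard information table; the `clean_tags` helper and the identifiers `U001` and `U002` are hypothetical.

```python
from typing import Dict, List, Tuple


def clean_tags(rows: List[Tuple[str, str]], tag_mapping: Dict[str, str]) -> List[Tuple[str, str]]:
    """Map synonymous labels to their standard name, then drop duplicates per enterprise."""
    seen = set()
    cleaned = []
    for enterprise_id, tag in rows:
        # Labels absent from the mapping table are kept under their own name.
        standard_tag = tag_mapping.get(tag, tag)
        key = (enterprise_id, standard_tag)
        if key in seen:
            continue  # same enterprise, same standard label: duplicate
        seen.add(key)
        cleaned.append(key)
    return cleaned


tag_mapping = {
    "China 500 strong": "China five hundred strong",
    "five hundred strong": "China five hundred strong",
}
rows = [
    ("U001", "China 500 strong"),
    ("U001", "five hundred strong"),
    ("U002", "five hundred strong"),
    ("U001", "listed companies"),
]
final_rows = clean_tags(rows, tag_mapping)
```

Mapping before de-duplicating matters: the first two rows only become duplicates once both names resolve to the same standard label.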

Further, in an embodiment, the tag generation method further includes:

setting an effective period for the label after mapping processing and de-duplication processing;

and disabling the failed label and the expired label in the original label data table.

In this embodiment, a validity period is set for each label so that the label can be used only within that period; once the period is exceeded, the label loses its validity and becomes a failed or expired label. Failed and expired labels are disabled promptly, which prevents a label from being discovered to be invalid or expired only after it has been used and improves the user experience.
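A hedged sketch of validity periods and disabling is shown below. The `TagRecord` fields and the `disable_expired` helper are hypothetical, and the rule that a label stays usable through its last valid day is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class TagRecord:
    enterprise_id: str
    tag: str
    valid_until: date     # last day on which the label may be used (assumed inclusive)
    enabled: bool = True


def disable_expired(records: List[TagRecord], today: date) -> List[TagRecord]:
    """Disable every enabled label whose validity period has passed; return those disabled."""
    disabled = []
    for record in records:
        if record.enabled and today > record.valid_until:
            record.enabled = False
            disabled.append(record)
    return disabled


records = [
    TagRecord("U001", "green channel", date(2022, 12, 31)),
    TagRecord("U002", "green channel", date(2021, 12, 31)),
]
newly_disabled = disable_expired(records, today=date(2022, 3, 25))
```

Running the check on a schedule, rather than at read time, is one way to make sure an expired label is never served.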

In an embodiment, the tag generation method further includes:

loading a Hive configuration file based on Spark SQL, and acquiring Hive metadata information;

storing the tag data table as a Hive table by means of the Hive metadata information;

and performing corresponding operations on the Hive table based on Spark SQL.

In this embodiment, big data technology is used: the label data table is stored and processed with Hive (a data warehouse tool) and Spark (a computing engine), so that label computation can be processed concurrently, the data storage footprint is reduced, label generation efficiency is improved, and performance can be improved linearly by horizontally expanding hardware resources. Specifically, the Hive configuration file is loaded through Spark SQL to obtain the corresponding metadata information, the label data table is stored as a Hive table, and corresponding operations such as queries and updates can then be performed on the Hive table through Spark SQL. Further, by means of distributed computing, millions of labels can be generated in a very short time. These steps effectively guarantee the accuracy and timeliness of the generated label data while requiring very little manual intervention, which greatly improves efficiency compared with existing label generation techniques.
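The Spark SQL and Hive steps might look roughly like the PySpark sketch below. It assumes PySpark is installed and that a Hive metastore is reachable through a `hive-site.xml` in Spark's configuration directory; the table name `tag_data`, the column names and both helper functions are hypothetical. PySpark is imported inside the function so that the query builder can be used without a Spark installation.

```python
def tag_count_query(table_name: str) -> str:
    """Build a Spark SQL query that counts distinct enterprises per label."""
    return (
        "SELECT tag, COUNT(DISTINCT enterprise_id) AS enterprises "
        f"FROM {table_name} GROUP BY tag"
    )


def store_and_query(rows, table_name="tag_data"):
    """Store (enterprise_id, tag) rows as a Hive table and query it through Spark SQL."""
    # Imported lazily: this function needs a working Spark and Hive setup.
    from pyspark.sql import SparkSession

    # enableHiveSupport() connects the session to the Hive metastore.
    spark = (
        SparkSession.builder
        .appName("tag-generation-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )
    df = spark.createDataFrame(rows, schema=["enterprise_id", "tag"])
    # saveAsTable registers the data as a table in the metastore.
    df.write.mode("overwrite").saveAsTable(table_name)
    # Any later operation on the Hive table can be expressed as Spark SQL.
    return spark.sql(tag_count_query(table_name)).collect()
```

With a configured cluster, `store_and_query([("U001", "green channel")])` would write the table and return the per-label counts; the call is not executed here because it depends on the environment.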

Fig. 2 is a schematic block diagram of a tag generation apparatus 200 according to an embodiment of the present invention, where the apparatus 200 includes:

the data acquisition unit 201 is used for acquiring historical data records containing different types of tags and constructing a tag metadata table based on the tags in the historical data records;

a categorizing and summarizing unit 202, configured to categorize and summarize the tags in the historical data records into tag configuration information in the tag metadata table;

the tag extraction unit 203 is configured to extract each tag in the tag metadata table based on the tag configuration information, and construct an original tag data table according to the extracted tag;

and the label cleaning unit 204 is configured to perform label cleaning processing on the original label data table through a preset label mapping table, and set the original label data table after label cleaning as a final label data table.

In one embodiment, the categorizing and summarizing unit 202 comprises:

a first induction unit, configured to categorize the labels as one-column multi-labels when the labels in the historical data record correspond to multiple data values;

a second induction unit, configured to categorize the labels as one-column single-labels when the labels in the historical data record correspond to a single data value;

a third induction unit, configured to categorize the labels as one-hot encoded labels when the labels in the historical data record correspond to an identification code;

a fourth induction unit, configured to categorize a label in the historical data record as a table label when the label is a table header name;

a fifth induction unit, configured to categorize a label as a custom label when the data value corresponding to the label in the historical data record is obtained through calculation;

and a summarizing unit, configured to set the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels as the label configuration information.

In one embodiment, the tag extracting unit 203 includes:

a batch extraction unit, configured to extract the labels in the label metadata table in batches based on the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels;

and a sequence setting unit, configured to arrange the batch-extracted labels in sequence to construct the original label data table.

In one embodiment, the tag generation apparatus 200 further comprises:

and the correction and completion unit is used for acquiring an enterprise standard information table, and correcting and completing the enterprise basic information in the original label data table based on the enterprise standard information table so as to correctly associate each label in the original label data table with the enterprise.

In one embodiment, the label cleaning unit 204 includes:

the label mapping unit is used for mapping labels with different names and the same meaning in the original label data table into a standard label based on the label mapping table;

and the label duplication removing unit is used for acquiring labels with the same name in the original label data table and carrying out duplication removing processing on the labels with the same name according to the enterprise standard information table.

In one embodiment, the tag generation apparatus 200 further comprises:

the time limit setting unit is used for setting the valid time limit for the label after the mapping processing and the de-duplication processing;

and the disabling unit is used for disabling the failed label and the expired label in the original label data table.

In one embodiment, the tag generation apparatus 200 further comprises:

the file loading unit is used for loading the Hive configuration file based on Spark SQL and acquiring Hive metadata information;

a data storage unit, configured to store the tag data table as a Hive table through the metadata information of Hive;

and the table operation unit is used for carrying out corresponding operation on the Hive table based on Spark SQL.

Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.

Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored, and when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiment of the present invention further provides a computer device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the above embodiments when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A label generation method, characterized by comprising:

acquiring historical data records containing different types of labels, and constructing a label metadata table based on the labels in the historical data records;

categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table;

extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted labels;

performing label cleaning on the original label data table through a preset label mapping table, and setting the original label data table after label cleaning as the final label data table.

2. The label generation method according to claim 1, characterized in that categorizing and summarizing the labels in the historical data records into label configuration information in the label metadata table comprises:

when the labels in the historical data record correspond to multiple data values, categorizing the labels as one-column multi-labels;

when the labels in the historical data record correspond to a single data value, categorizing the labels as one-column single-labels;

when the labels in the historical data record correspond to an identification code, categorizing the labels as one-hot encoded labels;

when a label in the historical data record is a table header name, categorizing the label as a table label;

when the data value corresponding to a label in the historical data record is obtained through calculation, categorizing the label as a custom label;

setting the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels as the label configuration information.

3. The label generation method according to claim 2, characterized in that extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted labels, comprises:

extracting the labels in the label metadata table in batches based on the one-column multi-labels, the one-column single-labels, the one-hot encoded labels, the table labels and the custom labels;

arranging the batch-extracted labels in sequence to construct the original label data table.

4. The label generation method according to claim 1, characterized by further comprising:

acquiring an enterprise standard information table, and correcting and completing the enterprise basic information in the original label data table based on the enterprise standard information table, so that each label in the original label data table is correctly associated with an enterprise.

5. The label generation method according to claim 4, characterized in that performing label cleaning on the original label data table through a preset label mapping table, and setting the original label data table after label cleaning as the final label data table, comprises:

mapping labels with different names and the same meaning in the original label data table to a standard label based on the label mapping table;

acquiring labels with the same name in the original label data table, and performing de-duplication processing on the labels with the same name according to the enterprise standard information table.

6. The label generation method according to claim 5, characterized by further comprising:

setting a validity period for the labels after the mapping processing and the de-duplication processing;

disabling the failed labels and the expired labels in the original label data table.

7. The label generation method according to claim 1, characterized by further comprising:

loading a Hive configuration file based on Spark SQL, and acquiring Hive metadata information;

storing the label data table as a Hive table by means of the Hive metadata information;

performing corresponding operations on the Hive table based on Spark SQL.

8. A label generation device, characterized by comprising:

a data acquisition unit, configured to acquire historical data records containing different types of labels, and construct a label metadata table based on the labels in the historical data records;

a categorizing and summarizing unit, configured to categorize and summarize the labels in the historical data records into label configuration information in the label metadata table;

a label extraction unit, configured to extract each label in the label metadata table based on the label configuration information, and construct an original label data table according to the extracted labels;

a label cleaning unit, configured to perform label cleaning on the original label data table through a preset label mapping table, and set the original label data table after label cleaning as the final label data table.

9. A computer device, characterized by comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the label generation method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the label generation method according to any one of claims 1 to 7.