Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, which is a schematic flowchart of a tag generation method according to an embodiment of the present invention, the method specifically includes steps S101 to S104.
S101, acquiring historical data records containing different types of labels, and constructing a label metadata table based on the labels in the historical data records;
S102, classifying and summarizing the labels in the historical data records into label configuration information in the label metadata table;
S103, extracting each label in the label metadata table based on the label configuration information, and constructing an original label data table according to the extracted labels;
S104, performing label cleaning on the original label data table by means of a preset label mapping table, and setting the cleaned original label data table as a final label data table.
In this embodiment, the labels in the historical data records are classified and summarized into label configuration information, an original label data table is constructed according to the label configuration information, and the original label data table is then cleaned with the help of a label mapping table to obtain a final label data table. In this way, both the efficiency of label generation and the effectiveness of label management can be improved.
In addition, this embodiment also allows labels to be classified and summarized into the label configuration information as custom labels, which covers complex labels that the other label types cannot define. During generation, labels with complex logic can be produced through an SQL statement.
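As a minimal sketch of how a custom label with complex logic might be expressed as an SQL statement, the example below uses Python's built-in sqlite3 module in place of a real data warehouse. The table `enterprise_finance`, its columns, the label name 'high growth' and the 20% threshold are all hypothetical and chosen only for illustration; the embodiment does not prescribe any of them.

```python
import sqlite3

# Hypothetical custom-label rule: an enterprise is labeled 'high growth'
# when its revenue grew by more than 20% year over year.
CUSTOM_LABEL_SQL = """
SELECT enterprise_name, 'high growth' AS label
FROM enterprise_finance
WHERE revenue_this_year > revenue_last_year * 1.2
ORDER BY enterprise_name
"""


def compute_custom_labels(rows):
    """Load rows into an in-memory table and run the label SQL against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE enterprise_finance ("
            "enterprise_name TEXT, revenue_last_year REAL, revenue_this_year REAL)"
        )
        conn.executemany("INSERT INTO enterprise_finance VALUES (?, ?, ?)", rows)
        return conn.execute(CUSTOM_LABEL_SQL).fetchall()
    finally:
        conn.close()


# Illustrative sample data (not taken from the embodiment).
sample = [("Acme", 100.0, 150.0), ("Beta", 100.0, 105.0)]
custom_labels = compute_custom_labels(sample)
```

Keeping the rule as a plain SQL string means that a new custom label can be added by configuration alone, without changing the surrounding code.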
In one embodiment, the step S102 includes:
when a label in the historical data records corresponds to multiple data values, classifying the label as a column of multiple labels;
when a label in the historical data records corresponds to a single data value, classifying the label as a column of single labels;
when a label in the historical data records corresponds to an identification code, classifying the label as a one-hot coded label;
when a label in the historical data records is a table header name, classifying the label as a table label;
when the data value corresponding to a label in the historical data records is obtained through calculation, classifying the label as a custom label;
setting the column of multiple labels, the column of single labels, the one-hot coded labels, the table labels and the custom labels as the label configuration information.
In this embodiment, the tags are classified, according to their data values or identifiers in the historical data records, as a column of multiple labels, a column of single labels, one-hot coded labels, table labels or custom labels, and these five types are then collected together as the tag configuration information.
For example, suppose one column in the historical data records is 'type of business' and a data value in it is 'Top 500 enterprise, listed company'. The value contains two labels ('Top 500 enterprise' and 'listed company') separated by the separator ','. By configuring the separator, the labels can be split automatically, so this type of label is defined as 'a column of multiple labels'. If one column in the historical data records is 'green channel enterprise', and each qualifying enterprise stores the value 'green channel' in the corresponding field, this type of label can be defined as 'a column of single labels'. If a column in the historical data records is named 'whether China Top 500' and stores 0 or 1 (1 if the corresponding enterprise is a China Top 500 enterprise, and 0 otherwise), the label can be defined as a 'one-hot coded label'. If a table in the historical data records is named 'green channel enterprise list', the table itself is the label, and the label can be defined as a 'table label'. Label data that must be obtained through calculation can be defined as a 'custom label', with the specific calculation rule defined by configuring the label calculation logic (SQL).
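The four directly extractable label types above can be sketched as a small configuration record plus one extraction function. This is only an illustration under stated assumptions: the `LabelConfig` class, its field names, the type keys ('multi', 'single', 'one_hot', 'table') and the sample values are hypothetical. Custom labels are left out because they depend on separately configured SQL logic.

```python
from dataclasses import dataclass


@dataclass
class LabelConfig:
    label_type: str        # 'multi', 'single', 'one_hot' or 'table' (hypothetical keys)
    source: str            # column name, or table name for a table label
    separator: str = ","   # only used by a column of multiple labels
    label_name: str = ""   # label emitted by 'one_hot' and 'table' types


def extract_labels(row, config, table_name=""):
    """Return the labels that one record yields under one label configuration."""
    if config.label_type == "multi":
        value = row.get(config.source) or ""
        parts = (part.strip() for part in value.split(config.separator))
        return [part for part in parts if part]
    if config.label_type == "single":
        value = (row.get(config.source) or "").strip()
        return [value] if value else []
    if config.label_type == "one_hot":
        flag = str(row.get(config.source, "")).strip()
        return [config.label_name] if flag == "1" else []
    if config.label_type == "table":
        return [config.label_name] if table_name == config.source else []
    raise ValueError(f"unsupported label type: {config.label_type}")


# Illustrative record; the column names and values are sample data only.
row = {
    "type of business": "Top 500 enterprise, listed company",
    "green channel enterprise": "green channel",
    "whether China Top 500": 1,
}
multi = extract_labels(row, LabelConfig("multi", "type of business"))
single = extract_labels(row, LabelConfig("single", "green channel enterprise"))
one_hot = extract_labels(
    row, LabelConfig("one_hot", "whether China Top 500", label_name="China Top 500")
)
table = extract_labels(
    row,
    LabelConfig("table", "green channel enterprise list", label_name="green channel"),
    table_name="green channel enterprise list",
)
```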
In one embodiment, the step S103 includes:
extracting the labels in the label metadata table in batches based on the column of multiple labels, the column of single labels, the one-hot coded labels, the table labels and the custom labels;
arranging the labels extracted in batches in sequence to construct the original label data table.
In this embodiment, once the tags have been classified into different types, every tag in the tag metadata table corresponds to its own tag configuration information, that is, the type to which the tag belongs. The tags in the tag metadata table are then extracted in batches, type by type, according to the tag configuration information. The extracted tags are arranged by region or in extraction order, and the resulting table is set as the original tag data table. The order of the different tag types in the original tag data table can of course be set freely: for example, a column of multiple labels may be placed before a column of single labels, or a one-hot coded label may be placed after a custom label.
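A minimal sketch of the batch-by-type arrangement is shown below, assuming that each extracted label is a dictionary carrying a hypothetical `label_type` key and that the order of types is supplied as configuration. The type keys and field names are illustrative and not taken from the embodiment.

```python
from collections import defaultdict

# Hypothetical default order of label types in the original label data table.
DEFAULT_TYPE_ORDER = ("multi", "single", "one_hot", "table", "custom")


def build_original_label_table(extracted, type_order=DEFAULT_TYPE_ORDER):
    """Group extracted label records by type, then concatenate the batches in order.

    Records whose type is not listed in type_order are left out of the result.
    """
    batches = defaultdict(list)
    for record in extracted:
        batches[record["label_type"]].append(record)
    table = []
    for label_type in type_order:
        table.extend(batches.get(label_type, []))
    return table


# Illustrative records only.
extracted = [
    {"label_type": "custom", "enterprise": "U001", "label": "high growth"},
    {"label_type": "multi", "enterprise": "U001", "label": "listed company"},
    {"label_type": "single", "enterprise": "U002", "label": "green channel"},
]
default_table = build_original_label_table(extracted)
custom_first = build_original_label_table(extracted, ("custom", "single", "multi"))
```

Passing a different `type_order` is enough to place, say, custom labels ahead of one-hot coded labels, which reflects the freely configurable ordering described above.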
In an embodiment, the tag generation method further includes:
acquiring an enterprise standard information table, and correcting and completing the enterprise basic information in the original label data table based on the enterprise standard information table, so that each label in the original label data table is correctly associated with an enterprise.
In this embodiment, the enterprise basic information in the original tag data table (such as the enterprise name, unified code and registration number) is corrected and completed according to an existing enterprise standard information table, which contains true and accurate enterprise basic information. This ensures that the tags in the original tag data table are associated with the correct enterprises and enterprise information.
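One possible way to carry out the correction and completion is sketched below. The field names (`name`, `unified_code`, `registration_number`) and the matching priority (identifiers before the name) are assumptions made for illustration; the embodiment does not specify how records are matched against the standard table.

```python
# Hypothetical names for the enterprise basic information fields.
BASIC_FIELDS = ("name", "unified_code", "registration_number")
# Assumed priority: identifiers are tried before the name, which is misspelled more often.
LOOKUP_ORDER = ("unified_code", "registration_number", "name")


def correct_and_complete(records, standard_table):
    """Overwrite or fill in basic fields of each record from the standard table."""
    indexes = {
        field: {row[field]: row for row in standard_table if row.get(field)}
        for field in BASIC_FIELDS
    }
    result = []
    for record in records:
        match = None
        for field in LOOKUP_ORDER:
            match = indexes[field].get(record.get(field))
            if match is not None:
                break
        fixed = dict(record)
        if match is not None:
            for field in BASIC_FIELDS:
                # Keep the record's own value if the standard row lacks the field.
                fixed[field] = match.get(field) or fixed.get(field)
        result.append(fixed)
    return result


# Illustrative data only.
standard = [
    {"name": "Acme Co., Ltd.", "unified_code": "U001", "registration_number": "R001"}
]
raw = [
    {"name": "Acme", "unified_code": "U001", "registration_number": "", "label": "listed company"},
    {"name": "Unknown Co.", "unified_code": "", "registration_number": "", "label": "green channel"},
]
fixed_rows = correct_and_complete(raw, standard)
```

Records with no match in the standard table are returned unchanged in this sketch; a real pipeline might instead flag them for manual review.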
In one embodiment, the step S104 includes:
based on the label mapping table, mapping labels with different names and the same meaning in the original label data table into a standard label;
and acquiring labels with the same name in the original label data table, and performing duplicate removal processing on the labels with the same name according to the enterprise standard information table.
In this embodiment, the tag mapping table is configured with information on how tags from different data tables should be cleaned, together with the standard name of each tag. For example, an enterprise may be labeled 'China Top 500' in one data table and 'Top Five Hundred' in another. The two labels have different names but the same meaning, so mapping both onto a standard 'China Top Five Hundred' label in the tag mapping table ensures the consistency and accuracy of the final label. In one embodiment, a tag mapping system may be constructed based on the tags in the historical data records and on past tag configuration experience, and this system may be used to determine whether the name of a tag should be mapped and, if so, to which standard name.
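The mapping and deduplication can be sketched as follows. The sketch assumes that each row is an `(enterprise_code, label)` pair whose enterprise code has already been standardized against the enterprise standard information table; the mapping entries and codes are illustrative values only.

```python
# Hypothetical label mapping table: source label name -> standard label name.
LABEL_MAPPING = {
    "China Top 500": "China Top Five Hundred",
    "Top Five Hundred": "China Top Five Hundred",
}


def clean_labels(rows, mapping=LABEL_MAPPING):
    """Map synonymous labels to their standard name, then drop duplicates per enterprise."""
    seen = set()
    cleaned = []
    for enterprise_code, label in rows:
        # Labels without a mapping entry keep their original name.
        standard = mapping.get(label, label)
        key = (enterprise_code, standard)
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Illustrative rows: the first two come from different tables but mean the same thing.
rows = [
    ("U001", "China Top 500"),
    ("U001", "Top Five Hundred"),
    ("U002", "listed company"),
]
cleaned_rows = clean_labels(rows)
```

Mapping before deduplicating matters here: the two rows for enterprise U001 only become recognizable duplicates once both carry the standard name.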
Further, in an embodiment, the tag generation method further includes:
setting a validity period for each label after the mapping and deduplication;
disabling invalid labels and expired labels in the original label data table.
In this embodiment, a validity period is set for each tag, so that the tag can be used only within that period. Once the period has passed, the tag loses its validity and becomes an invalid or expired tag. Invalid and expired tags can be disabled promptly, which prevents a tag from being found to be invalid or expired only after it has been used, and thereby improves the user experience.
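A minimal sketch of the validity check is given below, assuming that each label record stores a hypothetical `valid_from` and `valid_until` date and that disabling is represented by an `enabled` flag. These field names are illustrative and do not come from the embodiment.

```python
from datetime import date


def disable_expired(labels, today):
    """Return copies of the label records with 'enabled' set from the validity period."""
    result = []
    for label in labels:
        updated = dict(label)
        # A label is usable only while today lies inside its validity period.
        updated["enabled"] = label["valid_from"] <= today <= label["valid_until"]
        result.append(updated)
    return result


# Illustrative records only.
labels = [
    {"label": "green channel", "valid_from": date(2023, 1, 1), "valid_until": date(2023, 12, 31)},
    {"label": "listed company", "valid_from": date(2023, 1, 1), "valid_until": date(2030, 12, 31)},
]
checked = disable_expired(labels, today=date(2024, 6, 1))
```

Taking `today` as a parameter rather than reading the system clock keeps the check reproducible when it is run as a scheduled batch job.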
In an embodiment, the tag generation method further includes:
loading a Hive configuration file based on Spark SQL, and acquiring the metadata information of the Hive;
storing the tag data table as a Hive table through the metadata information of the Hive;
and correspondingly operating the Hive table based on Spark SQL.
In this embodiment, big data technology is used: the tag data table is stored and processed with Hive (a data warehouse tool) and Spark (a computing engine), so that the tag calculation can be processed concurrently, the storage space occupied by the data is reduced, tag generation efficiency is improved, and performance can be increased approximately linearly by scaling hardware resources horizontally. Specifically, the Hive configuration file is loaded through Spark SQL to obtain the corresponding metadata information, the tag data table is stored as a Hive table, and operations such as queries and updates may then be performed on the Hive table through Spark SQL. Further, by means of distributed computing, millions of data labels can be generated in a very short time. Through the above steps, the accuracy and timeliness of the generated label data can be effectively guaranteed while only a small amount of manual intervention is needed, which greatly improves efficiency compared with existing label generation techniques.
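The Spark SQL and Hive steps above might look roughly like the following PySpark sketch. It assumes that pyspark is installed and that a `hive-site.xml` file is available on Spark's configuration path, so that the session can read the Hive metadata. The application name, the table name `final_label` and the column names are hypothetical.

```python
def build_count_query(table_name):
    """Build a Spark SQL statement that counts enterprises per label."""
    return f"SELECT label, COUNT(*) AS n FROM {table_name} GROUP BY label"


def create_hive_session(app_name="label-generation"):
    """Create a SparkSession with Hive support enabled.

    Spark picks up the Hive settings from a hive-site.xml on its configuration path.
    """
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def save_and_query(spark, label_rows, table_name="final_label"):
    """Store label rows as a Hive table, then query that table through Spark SQL."""
    df = spark.createDataFrame(label_rows, schema=["enterprise_code", "label"])
    df.write.mode("overwrite").saveAsTable(table_name)
    return spark.sql(build_count_query(table_name))
```

A call such as `save_and_query(create_hive_session(), [("U001", "listed company")])` would write the table and return a DataFrame of label counts, provided a Spark and Hive environment is actually available.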
Fig. 2 is a schematic block diagram of a tag generation apparatus 200 according to an embodiment of the present invention, where the apparatus 200 includes:
the data acquisition unit 201 is used for acquiring historical data records containing different types of tags and constructing a tag metadata table based on the tags in the historical data records;
an induction and summarization unit 202, configured to classify and summarize the tags in the historical data records into tag configuration information in the tag metadata table;
the tag extraction unit 203 is configured to extract each tag in the tag metadata table based on the tag configuration information, and construct an original tag data table according to the extracted tags;
and the label cleaning unit 204 is configured to perform label cleaning on the original label data table by means of a preset label mapping table, and set the cleaned original label data table as a final label data table.
In one embodiment, the induction and summarization unit 202 comprises:
a first induction unit, configured to classify a label as a column of multiple labels when the label corresponds to multiple data values in the historical data records;
a second induction unit, configured to classify a label as a column of single labels when the label corresponds to a single data value in the historical data records;
a third induction unit, configured to classify a label in the historical data records as a one-hot coded label when the label corresponds to an identification code;
a fourth induction unit, configured to classify a label in the historical data records as a table label when the label is a table header name;
a fifth induction unit, configured to classify a label as a custom label when the data value corresponding to the label in the historical data records is obtained through calculation;
and a summarizing unit, configured to set the column of multiple labels, the column of single labels, the one-hot coded labels, the table labels and the custom labels as the label configuration information.
In one embodiment, the tag extracting unit 203 includes:
a batch extraction unit, configured to extract the tags in the tag metadata table in batches based on the column of multiple labels, the column of single labels, the one-hot coded labels, the table labels and the custom labels;
and a sequence setting unit, configured to arrange the labels extracted in batches in sequence so as to construct the original label data table.
In one embodiment, the tag generation apparatus 200 further comprises:
and the correction and completion unit is used for acquiring an enterprise standard information table, and correcting and completing the enterprise basic information in the original label data table based on the enterprise standard information table so as to correctly associate each label in the original label data table with the enterprise.
In one embodiment, the label cleaning unit 204 includes:
the label mapping unit is used for mapping labels with different names and the same meaning in the original label data table into a standard label based on the label mapping table;
and the label duplication removing unit is used for acquiring labels with the same name in the original label data table and carrying out duplication removing processing on the labels with the same name according to the enterprise standard information table.
In one embodiment, the tag generation apparatus 200 further comprises:
the time limit setting unit is used for setting a validity period for each label after the mapping and deduplication;
and the disabling unit is used for disabling invalid labels and expired labels in the original label data table.
In one embodiment, the tag generation apparatus 200 further comprises:
the file loading unit is used for loading the Hive configuration file based on Spark SQL and acquiring the metadata information of the Hive;
a data storage unit, configured to store the tag data table as a Hive table through the metadata information of Hive;
and the table operation unit is used for carrying out corresponding operation on the Hive table based on Spark SQL.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present invention further provides a computer device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided in the above embodiments when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.