
CN113568914B - A data processing method, device, equipment and storage medium - Google Patents

A data processing method, device, equipment and storage medium

Info

Publication number
CN113568914B
CN113568914B
Authority
CN
China
Prior art keywords
field
english
definition
neural network
network model
Prior art date
Legal status
Active
Application number
CN202110863400.5A
Other languages
Chinese (zh)
Other versions
CN113568914A (en)
Inventor
韩若冰
李红
曾凯
Current Assignee
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd filed Critical Shanghai Pudong Development Bank Co Ltd
Priority to CN202110863400.5A priority Critical patent/CN113568914B/en
Publication of CN113568914A publication Critical patent/CN113568914A/en
Application granted granted Critical
Publication of CN113568914B publication Critical patent/CN113568914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract


The present invention discloses a data processing method, device, equipment and storage medium. The method comprises: obtaining a data field requirement table, wherein the data field requirement table comprises a Chinese definition of a field and attribute description information of the field; and inputting the data field requirement table into a fully connected neural network model to obtain an English definition of the field, wherein the fully connected neural network model is obtained by iteratively training a neural network model with a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of field samples and attribute description information of field samples. Through the technical solution of the present invention, the field mapping table can be made into a training sample set in combination with the database development specifications, and field definitions with a unified specification can be generated using a fully connected neural network.

Description

Data processing method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a data processing method, a device, equipment and a storage medium.
Background
In the initial stage of using a database application system, a large amount of initial data must be entered at one time, and the related data often has to be acquired from local or other systems in the process. The traditional approach is to complete this work with the data entry function of the database application system, or to write specific code for the specific business to implement batch data entry. The development of the entry function includes the definition of the database, the definition of the data tables and the definition of the indexes. Database definition is typically done with "sql" scripts, and the specific business code is typically written with big data processing frameworks such as SpringBatch and Spark. The development efficiency of both depends mainly on the developer's technical proficiency and understanding of business knowledge.
The main work of batch engineering is to import different data files into a database according to certain processing rules. A data file is described in matrix form: each row of data represents an independent record, and each column of data represents the same attribute. To ensure that each column of data has a clear and accurate definition (English abbreviation, precision, nullability and the like) after it is imported into the corresponding data table (the database comprises a plurality of data tables), developers must design the data table in detail by combining the database development specification with the actual meaning of the data attribute. However, because different developers understand the specification differently and have different personal experience, the same data attribute may end up defined inconsistently in different data tables. This problem is detrimental to users' understanding of the data and to the long-term maintenance of the database.
Disclosure of Invention
The embodiments of the invention provide a data processing method, device, equipment and storage medium, in which a field mapping table is made into a training sample set in combination with the database development specification and a fully connected neural network is used for model training, so that the network takes data attributes (fields) as input and generates field definitions with a unified specification, thereby unifying and standardizing data table design.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
Acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field;
inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, wherein the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples.
Further, iteratively training a neural network model through the set of target samples, comprising:
inputting Chinese definitions of field samples in the target sample set and attribute description information of the field samples into a neural network model to obtain prediction English definitions;
training parameters of the neural network model according to an objective function formed by the prediction English definition and the English definition of the field sample;
and returning to execute the operation of inputting the Chinese definition of the field sample in the target sample set and the attribute description information of the field sample into the neural network model to obtain the prediction English definition until the fully connected neural network model is obtained.
Further, after inputting the data field requirement table into a fully connected neural network model to obtain the English definition of the field, the method further includes:
determining the precision of the field and the constraint limit of the field according to the English definition of the field;
And creating a database according to the Chinese definition of the field, the English definition of the field, the precision of the field and the constraint limit of the field.
Further, after creating the database according to the Chinese definition of the field, the English definition of the field, the precision of the field and the constraint limit of the field, the method further comprises:
generating an index design table according to English definitions of the fields;
determining a target item set according to the support degree of each field combination in the index design table;
and determining the field combination with the maximum confidence in the target item set as a target joint index.
Further, after determining the field combination with the greatest confidence in the target item set as the target joint index, the method further includes:
generating a data table according to the English definition of the field and the target joint index;
And importing the data corresponding to the English definitions of the fields into a database through a Spark framework.
Further, iteratively training a neural network model through the set of target samples, comprising:
obtaining a target sample set, wherein the target sample set comprises Chinese definition of a field sample, English definition of the field sample and attribute description information of the field sample;
Performing semantic segmentation on the Chinese definition of the field sample and the attribute description information of the field sample to obtain English word combinations corresponding to the Chinese definition of the field sample and English word combinations corresponding to the attribute description information of the field sample;
Determining a field description matrix according to English word combinations corresponding to Chinese definitions of the field samples and English word combinations corresponding to attribute description information of the field samples;
acquiring a first eigenvalue of the field description matrix;
Inputting the first characteristic value into a neural network model to obtain a first vector;
acquiring a predicted English definition corresponding to the first vector;
if the predicted English definition is the same as the English definition of the field sample, judging that the predicted English definition is correct;
and when the accuracy is larger than the set threshold, obtaining the fully-connected neural network model.
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including:
the acquisition module is used for acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field;
The determining module is used for inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples.
Further, the determining module is specifically configured to:
inputting Chinese definitions of field samples in the target sample set and attribute description information of the field samples into a neural network model to obtain prediction English definitions;
training parameters of the neural network model according to an objective function formed by the prediction English definition and the English definition of the field sample;
and returning to execute the operation of inputting the Chinese definition of the field sample in the target sample set and the attribute description information of the field sample into the neural network model to obtain the prediction English definition until the fully connected neural network model is obtained.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the program to implement a data processing method according to any one of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method according to any of the embodiments of the present invention.
In the embodiment of the invention, a data field requirement table is acquired, the data field requirement table comprising the Chinese definition of a field and the attribute description information of the field, and the data field requirement table is input into a fully connected neural network model to obtain the English definition of the field. The fully connected neural network model is obtained by iteratively training a neural network model with a target sample set, the target sample set comprising the Chinese definition of a field sample, the English definition of the field sample and the attribute description information of the field sample. In this way, in combination with the database development specification, the field mapping table is made into a training sample set and used for model training with a fully connected neural network, so that the network takes data attributes (fields) as input and generates field definitions with a unified specification, thereby unifying and standardizing data table design.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a data processing method in an embodiment of the invention;
FIG. 1a is a flow chart of training a fully connected neural network model in an embodiment of the present invention;
FIG. 1b is a flowchart of frequent K-term set computation in an embodiment of the invention;
FIG. 1c is a block diagram of a batch task program in an embodiment of the invention;
FIG. 1d is a block diagram of a tool configuration in an embodiment of the invention;
FIG. 2 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the present invention;
Fig. 4 is a schematic structural view of a computer-readable storage medium containing a computer program in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
The term "comprising" and variants thereof as used herein is intended to be open ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment".
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention, where the method may be applied to a data processing case, and the method may be performed by a data processing device according to an embodiment of the present invention, where the device may be implemented in software and/or hardware, as shown in fig. 1, and the method specifically includes the following steps:
s110, acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field.
S120, inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, wherein the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples.
The target sample set may be a field attribute mapping table obtained by collating database design documents, as shown in Table 1:
TABLE 1
As can be seen from Table 1, fields with similar meaning have the same definition in different data tables, and fields with the same Chinese definition have different attribute descriptions and English definitions in different data tables.
As shown in fig. 1a, inputting the data field requirement table into the fully connected neural network model to obtain the English definition of the field may be implemented by: obtaining a field attribute mapping table; performing word-sense segmentation on the Chinese definitions of the field samples and the attribute description information of the field samples in the field attribute mapping table; vectorizing the segmented words; determining a field description matrix from the vectorized result; obtaining the feature value of the field description matrix; and training the parameters of the neural network model according to the feature value of the field description matrix.
Optionally, iteratively training the neural network model by the target sample set includes:
inputting Chinese definitions of field samples in the target sample set and attribute description information of the field samples into a neural network model to obtain prediction English definitions;
training parameters of the neural network model according to an objective function formed by the prediction English definition and the English definition of the field sample;
and returning to execute the operation of inputting the Chinese definition of the field sample in the target sample set and the attribute description information of the field sample into the neural network model to obtain the prediction English definition until the fully connected neural network model is obtained.
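To make this iteration concrete, the following is a minimal sketch of such a training loop, assuming a PyTorch-style fully connected network; the layer sizes, loss function and optimizer settings are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

# Hypothetical fully connected network: the input is the 6-dimensional feature
# value of a field description, the output is a flattened 2x5 definition matrix.
model = nn.Sequential(
    nn.Linear(6, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # objective built from predicted vs. sample English definitions

def train(features, english_targets, epochs=100):
    """features: N x 6 tensor of field description feature values.
    english_targets: N x 10 tensor of vectorized English definitions."""
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted = model(features)                 # predicted English definitions
        loss = loss_fn(predicted, english_targets)  # objective function
        loss.backward()
        optimizer.step()                            # train the model parameters
    return model
```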
Optionally, after inputting the data field requirement table into a fully connected neural network model to obtain the English definition of the field, the method further includes:
determining the precision of the field and the constraint limit of the field according to the English definition of the field;
And creating a database according to the Chinese definition of the field, the English definition of the field, the precision of the field and the constraint limit of the field.
Optionally, after creating the database according to the Chinese definition of the field, the English definition of the field, the precision of the field, and the constraint limit of the field, the method further includes:
generating an index design table according to English definitions of the fields;
determining a target item set according to the support degree of each field combination in the index design table;
and determining the field combination with the maximum confidence in the target item set as a target joint index.
The index design table may be generated according to the english definition of the field by querying, from a database, a field combination related to the english definition of the field according to the english definition of the field, and generating the index design table according to the field combination related to the english definition of the field.
Wherein the target item set may be a frequent K item set.
It should be noted that, a combination of fields with a confidence level greater than a confidence threshold in the target item set may also be determined as a target joint index. The embodiment of the present invention is not limited thereto.
Specifically, the generated field definition is used as the input of the index definition module, and according to the index association relation extracted by the Apriori calculation unit, a proper field is selected from the input fields to be used as the joint index of the corresponding data table, and the joint index is used as the output of the module.
In one specific example, apriori (association analysis algorithm) is used to mine the index relationships in the index design table. The specific calculation process is as follows. As shown in Table 2, table 2 is an index design table, and A-E are five index fields.
TABLE 2
List number    Index field
1              A, C, D
2              B, C, E
3              A, B, C, E
4              B, E
The "support" of each field combination in table 2, i.e. the probability that one or more fields are indexed in the same data table, is calculated as:
support(x,y)=num(xy)/num(All);
where num (xy) is the number of times the xy joint index occurs and num (All) is the number of times All the joint indexes occur in table 2.
I.e. the mathematical joint probabilities. From the above table:
support(A)=2/4=0.5;
support(A,B)=1/4=0.25。
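As a small illustration of the support calculation, the following Python sketch reproduces the numbers above from Table 2; the data structure and function name are assumptions made for illustration.

```python
# Index design table from Table 2: each row lists the index fields of one data table.
index_table = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support(fields, rows):
    """Support of a field combination: the fraction of data tables in which
    all of the given fields appear together as index fields."""
    fields = set(fields)
    return sum(fields <= row for row in rows) / len(rows)

print(support({"A"}, index_table))       # 2/4 = 0.5
print(support({"A", "B"}, index_table))  # 1/4 = 0.25
```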
The confidence of each field combination is then calculated.
The "frequent k term set" is calculated, i.e. the term set that frequently occurs in the data set may be one or more, since it contains k fields, then it is called the k term set, and the event that satisfies the minimum support threshold is called the frequent k term set. As shown in fig. 1b, when the support degree is calculated, since the denominator part is the same in the calculation formula, the numerator part (i.e., the number of occurrences) is used as the support degree. Wherein the minimum support is set to 2, and a field combination is discarded when the support of the field combination is smaller than the minimum support. In this way, the field combination most suitable as an index can be selected among a plurality of field combinations.
(1) The confidence is obtained by calculation on the basis of the support of the frequent k item sets, and its calculation formula is:
confidence(x|y)=p(xy)/p(y);
where p(xy) is the probability of occurrence of the xy joint index and p(y) is the probability of occurrence of the y index.
I.e. a mathematical conditional probability. From the results of frequent k-term sets, 6 sets of confidence levels can be reached:
(2) A confidence threshold is set (0.8 in this example), and the combinations among the 6 groups whose confidence is greater than the confidence threshold are used as index setting rules; that is, when the three fields B, C and E exist in one data table at the same time, the system takes the three fields B, C and E as a joint index, and the order of the joint index is (B, C, E) or (C, E, B). In practical application, because there are many data tables and fields, the confidence values are computed at relatively high precision, and different field combinations will not share the same confidence.
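A rough Apriori-style sketch of the frequent item set and rule selection steps is given below; the minimum support of 2 occurrences and the 0.8 confidence threshold follow the example, while the function names and data layout are assumptions.

```python
from itertools import combinations

MIN_SUPPORT_COUNT = 2   # minimum number of occurrences for a frequent item set
CONF_THRESHOLD = 0.8    # confidence threshold for accepting a joint-index rule

def count(fields, rows):
    """Number of data tables (rows) in which all given fields are indexed together."""
    return sum(set(fields) <= row for row in rows)

def frequent_itemsets(rows, max_k=3):
    """Enumerate field combinations whose occurrence count meets the minimum support."""
    all_fields = sorted(set().union(*rows))
    frequent = {}
    for k in range(1, max_k + 1):
        for combo in combinations(all_fields, k):
            c = count(combo, rows)
            if c >= MIN_SUPPORT_COUNT:
                frequent[combo] = c
    return frequent

def joint_index_rules(rows):
    """Keep combinations whose confidence p(xy)/p(y) exceeds the threshold;
    `rows` is a list of field sets, e.g. the index_table shown earlier."""
    rules = []
    for combo, c_xy in frequent_itemsets(rows).items():
        if len(combo) < 2:
            continue
        for y in combinations(combo, len(combo) - 1):
            c_y = count(y, rows)
            if c_y and c_xy / c_y >= CONF_THRESHOLD:
                rules.append((combo, c_xy / c_y))
    return rules
```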
The fully trained fully connected neural network takes a data field requirement table (comprising the Chinese definitions and attribute descriptions of fields) as input and outputs a 2×5 matrix; inverse quantization is applied to the matrix to obtain the corresponding English definition. Matching is then carried out in the target sample set to obtain the precision and constraint limits of the field.
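One possible form of the inverse quantization and attribute matching steps is sketched below; the dictionary layouts and key names are illustrative assumptions.

```python
import numpy as np

def dequantize(output_matrix, word_vectors):
    """Map each 1x5 row of the 2x5 network output back to the nearest English
    word by Euclidean distance; word_vectors is assumed to be a dict of
    {english_word: 5-dim numpy vector} from the same Word2Vec model."""
    words = []
    for row in output_matrix:
        distances = {w: np.linalg.norm(row - v) for w, v in word_vectors.items()}
        words.append(min(distances, key=distances.get))
    return "_".join(words)  # e.g. "create_time"

def lookup_attributes(english_definition, target_samples):
    """Match the generated definition against the target sample set to recover
    the field's precision and constraint limits (key names are assumptions)."""
    for sample in target_samples:
        if sample["english_definition"] == english_definition:
            return sample["precision"], sample["constraint"]
    return None, None
```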
Specifically, index design is an important part of database development, and the design result directly affects the efficiency of data queries. The design of joint indexes is especially critical, because a joint index performs better than an equal number of single indexes. In order to improve the efficiency of index use, data mining is performed on the index fields in combination with the database development specification and index design principles, potential association relations among the index fields are analyzed, and an index result conforming to the design specification is output.
Optionally, after determining the field combination with the greatest confidence in the target item set as the target joint index, the method further includes:
generating a data table according to the English definition of the field and the target joint index;
And importing the data corresponding to the English definitions of the fields into a database through a Spark framework.
In particular, the development of batch tasks involves the development of data importers and database generators. Both are characterized by a cumbersome development flow, strongly structured code and a high rate of repeated code in their implementation. In the actual development process, developers perform a large amount of repeated development because of individual differences among development requirements. To address this problem, the embodiment of the invention uniformly encapsulates the structured content of the development program, exposes the customized content through interfaces, and develops an automatic design module that replaces the manual development flow and improves development efficiency.
Specifically, the English definition of the field is used as input, and a corresponding batch task program is generated, so that each data file has a corresponding batch task. After secondary packaging, the generated batch task program structure is shown in fig. 1c: (1) Prepare: this part mainly prepares resources for the batch task, chiefly a database link, an FTP link (used to pull specified data from a remote database to the local machine) and an S3 link (used to obtain object storage resources); to ensure the integrity of the batch task, when resource acquisition fails, the code of the subsequent parts is not executed. (2) Execute: this part takes the field definition result as input, uses the Spark framework as the underlying data processing unit, and imports the data in the data file into the database according to the field definitions and the specified precision. (3) Finish: this part mainly backs up the local file that has been imported into the database to object storage, closes all connections after the backup is completed, and releases the currently occupied resources.
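As one way the Execute stage could look with the Spark framework, the following PySpark sketch reads a data file and writes it to the database according to the generated field definitions; the column mapping, JDBC parameters and function name are assumptions, and the Prepare and Finish stages are omitted.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def run_execute_stage(data_file, table_name, field_definitions, jdbc_url, jdbc_props):
    """field_definitions: {source_column: (english_name, sql_type)}, produced by
    the field definition module, e.g. {"创建时间": ("create_time", "timestamp")}."""
    spark = SparkSession.builder.appName("batch-import").getOrCreate()
    try:
        df = spark.read.csv(data_file, header=True)
        for source_col, (english_name, sql_type) in field_definitions.items():
            # Rename each column to its English definition and cast it to the
            # precision/type determined for that field.
            df = (df.withColumnRenamed(source_col, english_name)
                    .withColumn(english_name, col(english_name).cast(sql_type)))
        df.write.jdbc(url=jdbc_url, table=table_name, mode="append", properties=jdbc_props)
    finally:
        spark.stop()
```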
Specifically, the English definition of the field and the index definition are used as input: the field definition is used as a parameter of the database table-creation statement, and the index definition is used as a parameter of the index-creation statement, generating the corresponding "sql" script and thereby the corresponding data table.
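A minimal sketch of how such an "sql" script could be assembled from the generated field and index definitions follows; the template, table name and tuple layout are illustrative assumptions.

```python
def build_sql_script(table_name, fields, joint_index):
    """fields: list of (english_name, sql_type, constraint) tuples from the field
    definition module; joint_index: ordered field list from the index definition module."""
    column_lines = ",\n  ".join(
        f"{name} {sql_type} {constraint}".strip() for name, sql_type, constraint in fields
    )
    create_table = f"CREATE TABLE {table_name} (\n  {column_lines}\n);"
    create_index = (
        f"CREATE INDEX idx_{table_name}_{'_'.join(joint_index)} "
        f"ON {table_name} ({', '.join(joint_index)});"
    )
    return create_table + "\n" + create_index

# Example: a hypothetical table with a joint index suggested by the Apriori unit.
script = build_sql_script(
    "t_company",
    [("create_time", "TIMESTAMP", "NOT NULL"), ("company_id", "VARCHAR(32)", "NOT NULL")],
    ["create_time", "company_id"],
)
print(script)
```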
In order to improve the definition accuracy of fields and indexes, the system provides a feedback module. The generated field definitions and index definitions are manually checked; if a generated definition does not meet the purchaser's development specification or conflicts with the data field requirement table, it is redefined manually, and the definition result is recorded into the field attribute mapping table and the index design table, so that the field definition module and the index definition module can undergo secondary training and the system accuracy is improved.
Optionally, iteratively training the neural network model by the target sample set includes:
obtaining a target sample set, wherein the target sample set comprises Chinese definition of a field sample, english definition of the field sample and attribute description information of the field sample;
Performing semantic segmentation on the Chinese definition of the field sample and the attribute description information of the field sample to obtain English word combinations corresponding to the Chinese definition of the field sample and English word combinations corresponding to the attribute description information of the field sample;
Determining a field description matrix according to English word combinations corresponding to Chinese definitions of the field samples and English word combinations corresponding to attribute description information of the field samples;
acquiring a first eigenvalue of the field description matrix;
Inputting the first characteristic value into a neural network model to obtain a first vector;
acquiring a predicted English definition corresponding to the first vector;
if the predicted English definition is the same as the English definition of the field sample, judging that the predicted English definition is correct;
and when the accuracy is larger than the set threshold, obtaining the fully-connected neural network model.
In a specific example, because Chinese logic is complex and not easy for a computer to process, the Chinese definition of the field and its attribute description information are semantically segmented into several English word combinations. As shown in Table 3, Table 3 is the result of semantic segmentation of Table 1.
TABLE 3
Since English expressions are simpler relative to Chinese, different field names may form the same word combination after segmentation.
Field vectorization: each word can be converted into a vector with a Word2Vec tool. For example, the word "creation" may be represented after vectorization by [0.1, 0.2, 0.2, 0.8, 0.6]; for words with similar senses, the distance between their vectors is relatively small. The vector of the word "create" may be represented by [0.1, 0.3, 0.2, 0.8, 0.6], and its Euclidean distance from "creation" may be expressed as:
d(a,b) = sqrt(Σ(a_i − b_i)²);
When the Euclidean distance is less than a distance threshold, the two are regarded as the same word. Vectorization makes it possible to calculate the similarity of different words and effectively avoids interference from multiple word forms with the same meaning during network training.
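A small sketch of this vectorization step, using the gensim library as one possible Word2Vec implementation (the toy corpus, 5-dimensional vectors and distance threshold are illustrative assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny illustrative corpus of segmented English word combinations.
corpus = [
    ["create", "time"],
    ["creation", "time"],
    ["company", "effective", "registration", "time"],
]
w2v = Word2Vec(sentences=corpus, vector_size=5, min_count=1)

def same_word(a, b, threshold=0.5):
    """Treat two words as the same word when the Euclidean distance
    between their vectors is below the threshold."""
    return np.linalg.norm(w2v.wv[a] - w2v.wv[b]) < threshold
```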
Field matrixing: the Chinese definition of a field generally consists of 1-2 words and the attribute description information of 4-5 words, so the Chinese definition of the field is represented by a 2×5 matrix and the attribute description information of the field by a 5×5 matrix. When the number of words is insufficient, zero vectors are used as padding; when the number of words exceeds the upper limit, only the leading words are kept. Thus, the matrix expression of the field definition "creation time" and the attribute description information "Company effective registration time" is:
According to the uniqueness of the word vectors, conjunctions, prepositions and the like can be eliminated during field matrixing, which increases the effective information content of the matrix.
Firstly, the Chinese definition (2×5) matrix of the field and the attribute description information (5×5) matrix are spliced and fused to form a field description matrix (7×5).
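The padding, truncation and concatenation just described could be sketched as follows; the helper names and the source of the word vectors are assumptions.

```python
import numpy as np

VEC_DIM = 5  # each word is represented by a 1x5 vector in the example

def to_matrix(words, max_words, word_vectors):
    """Stack word vectors into a fixed-size matrix: pad with zero vectors when
    there are too few words, keep only the leading words when there are too many."""
    rows = [word_vectors.get(w, np.zeros(VEC_DIM)) for w in words[:max_words]]
    while len(rows) < max_words:
        rows.append(np.zeros(VEC_DIM))
    return np.stack(rows)

def field_description_matrix(definition_words, attribute_words, word_vectors):
    """Concatenate the 2x5 Chinese-definition matrix and the 5x5
    attribute-description matrix into the 7x5 field description matrix."""
    definition = to_matrix(definition_words, 2, word_vectors)
    attributes = to_matrix(attribute_words, 5, word_vectors)
    return np.vstack([definition, attributes])  # shape (7, 5)
```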
Then, the field description matrix is convolved (Convolution). Since each word is described by a 1×5 vector, the convolution kernel sizes are 2×5, 3×5 and 4×5, with two kernels of each size. The description matrix is convolved using the different convolution kernels and different sliding distances, so that each description matrix produces multiple 1-dimensional vectors. These 1-dimensional vectors are then max-pooled (Max-Pooling, i.e. the maximum value in each vector is selected as the feature of that vector), and the maxima of the vectors are fused into a 6-dimensional vector, which serves as the feature value of the corresponding field description matrix.
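The convolution and max-pooling step can be illustrated with the following numpy sketch; in the patent these kernels would be learned parameters, whereas here they are random placeholders, an assumption made purely for illustration.

```python
import numpy as np

def conv_max_features(desc_matrix, kernels):
    """Slide each full-width kernel (height h, width 5) down the 7x5 description
    matrix to get a 1-D response vector, then keep only its maximum (max pooling).
    With two kernels each of heights 2, 3 and 4 this yields a 6-dimensional
    feature value for the field description matrix."""
    features = []
    for kernel in kernels:
        h = kernel.shape[0]
        responses = [
            np.sum(desc_matrix[i:i + h, :] * kernel)
            for i in range(desc_matrix.shape[0] - h + 1)
        ]
        features.append(max(responses))
    return np.array(features)

rng = np.random.default_rng(0)
kernels = [rng.standard_normal((h, 5)) for h in (2, 2, 3, 3, 4, 4)]  # placeholder kernels
```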
The feature value is then taken as the input of the neural network model, and a 2×5-dimensional output is obtained. The Euclidean distance between the output vector and the field vectors in the target sample set is calculated, and the minimum is taken as the final mapping result. If the field of the mapping result is the same as the input field, the mapping result is judged to be correct; otherwise it is wrong. When the accuracy reaches the set threshold, the network is considered fully trained, and the fully connected neural network model is obtained.
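The mapping of the network output back to a field and the accuracy-based stopping criterion could be sketched as follows; the dictionary layout of the sample definitions is an assumption.

```python
import numpy as np

def map_to_field(output_matrix, sample_definitions):
    """Map the network's 2x5 output to the closest English definition in the
    target sample set by Euclidean distance; sample_definitions is assumed to
    be {english_definition: 2x5 matrix}."""
    distances = {
        name: np.linalg.norm(output_matrix - matrix)
        for name, matrix in sample_definitions.items()
    }
    return min(distances, key=distances.get)

def accuracy(model_outputs, expected_definitions, sample_definitions):
    """Fraction of samples whose mapped definition equals the expected one;
    training stops once this exceeds the set threshold."""
    correct = sum(
        map_to_field(out, sample_definitions) == expected
        for out, expected in zip(model_outputs, expected_definitions)
    )
    return correct / len(expected_definitions)
```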
In another specific example, as shown in fig. 1d, the design structure framework is composed of 3 major modules and 6 sub-modules, specifically a definition module, a feedback module and a design module. The definition module comprises a field definition module and an index definition module, where the field definition module comprises a field attribute mapping table and a fully connected neural network, and the index definition module comprises an Apriori computing unit and an index design table. The feedback module comprises at least two feedback sub-modules, and the design module comprises a batch task design module and a database design module.
In terms of function, the system takes static files as input and operates in a multi-module, low-coupling mode, removing manual intervention from the development flow and realizing automatic generation of batch tasks. In terms of design, the system introduces a negative-feedback design pattern, so that the system is continuously and autonomously updated and iterated during use. In terms of performance, the definition module adopts multiple machine learning algorithms, overcoming drawbacks such as the high cost of human learning and the variability of learning effects. In terms of efficiency, the system packages the database scripts and batch execution programs in batch tasks in a structured way, reducing code repetition in batch engineering while accommodating the differences between tasks.
According to the technical solution of this embodiment, a data field requirement table is acquired, the data field requirement table comprising the Chinese definitions of fields and the attribute description information of the fields, and the data field requirement table is input into a fully connected neural network model to obtain the English definitions of the fields. The fully connected neural network model is obtained by iteratively training a neural network model with a target sample set, the target sample set comprising the Chinese definitions of field samples, the English definitions of the field samples and the attribute description information of the field samples. In this way, the field mapping table can be made into a training sample set in combination with the database development specification and used to train a fully connected neural network, so that the network takes data attributes (fields) as input and generates field definitions with a unified specification, thereby unifying and standardizing data table design.
Fig. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. The present embodiment may be applicable to the case of data processing, and the apparatus may be implemented in software and/or hardware, and may be integrated in any device that provides a data processing function, as shown in fig. 2, where the data processing apparatus specifically includes an obtaining module 210 and a determining module 220.
The acquiring module 210 is configured to acquire a data field requirement table, where the data field requirement table includes a chinese definition of a field and attribute description information of the field;
The determining module 220 is configured to input the data field requirement table into a fully connected neural network model and obtain an English definition of the field, where the fully connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set includes a Chinese definition of the field sample, an English definition of the field sample, and attribute description information of the field sample.
Optionally, the determining module is specifically configured to:
inputting Chinese definitions of field samples in the target sample set and attribute description information of the field samples into a neural network model to obtain prediction English definitions;
training parameters of the neural network model according to an objective function formed by the prediction English definition and the English definition of the field sample;
and returning to execute the operation of inputting the Chinese definition of the field sample in the target sample set and the attribute description information of the field sample into the neural network model to obtain the prediction English definition until the fully connected neural network model is obtained.
The product can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
According to the technical solution of this embodiment, a data field requirement table is acquired, the data field requirement table comprising the Chinese definitions of fields and the attribute description information of the fields, and the data field requirement table is input into a fully connected neural network model to obtain the English definitions of the fields. The fully connected neural network model is obtained by iteratively training a neural network model with a target sample set, the target sample set comprising the Chinese definitions of field samples, the English definitions of the field samples and the attribute description information of the field samples. In this way, the field mapping table can be made into a training sample set in combination with the database development specification and used to train a fully connected neural network, so that the network takes data attributes (fields) as input and generates field definitions with a unified specification, thereby unifying and standardizing data table design.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Fig. 3 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 3 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 3, the electronic device 12 is in the form of a general purpose computing device. The components of the electronic device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include industry standard architecture (Industry Standard Architecture, ISA) bus, micro channel architecture (Micro Channel Architecture, MCA) bus, enhanced ISA bus, video electronics standards association (Video Electronics Standards Association, VESA) local bus, and peripheral component interconnect (Peripheral Component Interconnect, PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory, RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, commonly referred to as a "hard disk drive"). Although not shown in fig. 3, a disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (Compact Disc-Read Only Memory, CD-ROM), digital versatile disk (Digital Video Disc-Read Only Memory, DVD-ROM), or other optical media, may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The system memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the electronic device 12, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. In the electronic device 12 of the present embodiment, the display 24 is not provided as a separate body but is embedded in the mirror surface, and the display surface of the display 24 and the mirror surface are visually integrated when the display surface of the display 24 is not displayed. Also, electronic device 12 may communicate with one or more networks such as a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and/or a public network such as the internet via network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, disk array (Redundant Arrays of Independent Disks, RAID) systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the data processing method provided by the embodiment of the present invention:
Acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field;
inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, wherein the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples.
Fig. 4 is a schematic structural diagram of a computer-readable storage medium containing a computer program according to an embodiment of the present application. The present application provides a computer readable storage medium 61 having stored thereon a computer program 610 which, when executed by one or more processors, implements a data processing method as provided by all the inventive embodiments of the present application:
Acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field;
inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, wherein the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (Hyper Text Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (10)

1. A method of data processing, comprising:
Acquiring a data field requirement table, wherein the data field requirement table comprises Chinese definition of a field and attribute description information of the field;
Inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields, wherein the fully-connected neural network model is obtained by iteratively training the neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples and attribute description information of the field samples;
the step of inputting the data field requirement table into a fully-connected neural network model to obtain English definitions of the fields comprises the following steps:
Performing semantic segmentation on the Chinese definition of the field and the attribute description information of the field in the data field requirement table to obtain English word combinations corresponding to the Chinese definition of the field and English word combinations corresponding to the attribute description information of the field;
determining a target field description matrix according to English word combinations corresponding to Chinese definitions and English word combinations corresponding to attribute description information of the fields;
And inputting the target field description matrix into a fully-connected neural network model to obtain English definition of the field.
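As an illustrative, non-claimed sketch of the flow recited in claim 1: a hypothetical bilingual dictionary segments the Chinese definition and attribute description into English word combinations, the word vectors are stacked into a target field description matrix, and the flattened matrix is passed through a small fully-connected layer to select an English definition. The dictionary, the hash-seeded word vectors, the fixed input length, and the candidate name list are all assumptions for illustration and are not fixed by the claim.

```python
import numpy as np

# Hypothetical bilingual dictionary mapping Chinese terms to English words
# (an assumption; the claim does not specify how segmentation is performed).
CN_EN = {"客户": "customer", "编号": "id", "字符": "char", "类型": "type"}

def to_english_words(chinese_text: str) -> list[str]:
    """Greedy longest-match segmentation into an English word combination."""
    words, i = [], 0
    while i < len(chinese_text):
        for j in range(len(chinese_text), i, -1):
            if chinese_text[i:j] in CN_EN:
                words.append(CN_EN[chinese_text[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters with no dictionary entry
    return words

def field_description_matrix(definition_words, attribute_words, dim=16):
    """Stack one hash-seeded vector per English word (toy representation)."""
    rows = []
    for w in definition_words + attribute_words:
        rng = np.random.default_rng(abs(hash(w)) % (2**32))
        rows.append(rng.standard_normal(dim))
    return np.vstack(rows) if rows else np.zeros((1, dim))

# Toy fully-connected layer: flattened matrix -> scores over candidate names.
CANDIDATE_NAMES = ["cust_id", "cust_type", "acct_no"]  # assumed label space
W = np.random.default_rng(0).standard_normal((3, 2 * 16))

def predict_english_definition(chinese_definition, attribute_description):
    m = field_description_matrix(to_english_words(chinese_definition),
                                 to_english_words(attribute_description))
    x = np.resize(m.flatten(), 2 * 16)  # pad/truncate to a fixed input length
    return CANDIDATE_NAMES[int(np.argmax(W @ x))]

print(predict_english_definition("客户编号", "字符类型"))
```

In practice the weight matrix would come from the trained fully-connected model rather than a random seed; only the data flow is meant to mirror the claim.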
2. The method of claim 1, wherein iteratively training the neural network model through the target sample set comprises:
inputting the Chinese definitions of the field samples in the target sample set and the attribute description information of the field samples into the neural network model to obtain predicted English definitions;
training the parameters of the neural network model according to an objective function formed from the predicted English definitions and the English definitions of the field samples; and
returning to the operation of inputting the Chinese definitions of the field samples in the target sample set and the attribute description information of the field samples into the neural network model to obtain the predicted English definitions, until the fully-connected neural network model is obtained.
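A minimal sketch of the iterative training loop in claim 2, assuming a cross-entropy loss as the objective function, an Adam optimiser, and a fixed epoch count as the stopping rule; none of these choices, nor the layer sizes, are specified by the claim.

```python
import torch
import torch.nn as nn

IN_DIM, HIDDEN, N_NAMES = 32, 64, 3  # assumed dimensions

model = nn.Sequential(nn.Linear(IN_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, N_NAMES))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(sample_matrices: torch.Tensor, english_labels: torch.Tensor, epochs=50):
    """sample_matrices: (N, IN_DIM) flattened field description matrices;
    english_labels: (N,) indices of the correct English definitions."""
    for _ in range(epochs):                      # the "returning to execute" step
        logits = model(sample_matrices)          # predicted English definitions
        loss = loss_fn(logits, english_labels)   # objective function
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return model
```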
3. The method of claim 1, further comprising, after inputting the data field requirement table into the fully-connected neural network model to obtain the English definition of the field:
determining a precision of the field and a constraint limit of the field according to the English definition of the field; and
creating a database according to the Chinese definition of the field, the English definition of the field, the precision of the field, and the constraint limit of the field.
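Purely as an illustration of the database-creation step in claim 3, the sketch below turns field metadata into a CREATE TABLE statement. The table name, column names, types, and constraints are placeholder values, not details taken from the patent.

```python
def build_ddl(table_name, fields):
    """fields: dicts with the Chinese definition, English definition,
    precision (mapped to a column type), and constraint limit of each field."""
    cols = [
        f'  {f["english"]} {f["precision"]} {f["constraint"]}  -- {f["chinese"]}'
        for f in fields
    ]
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(cols) + "\n);"

print(build_ddl("customer_info", [
    {"chinese": "客户编号", "english": "cust_id", "precision": "VARCHAR(32)", "constraint": "NOT NULL"},
    {"chinese": "开户日期", "english": "open_date", "precision": "DATE", "constraint": ""},
]))
```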
4. The method of claim 3, further comprising, after creating the database according to the Chinese definition of the field, the English definition of the field, the precision of the field, and the constraint limit of the field:
generating an index design table according to the English definitions of the fields;
determining a target item set according to the support degree of each field combination in the index design table; and
determining the field combination with the maximum confidence in the target item set as a target joint index.
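A sketch of the selection logic in claim 4 in the style of association-rule mining: field combinations whose support clears a threshold form the target item set, and the combination with the highest confidence becomes the joint index. The query-log format, the two-field restriction, and the 0.3 support threshold are assumptions for illustration.

```python
from itertools import combinations

def mine_joint_index(query_field_sets, min_support=0.3):
    """query_field_sets: list of sets of English field names used together."""
    n = len(query_field_sets)
    support = {}
    for fields in query_field_sets:
        for combo in combinations(sorted(fields), 2):
            support[combo] = support.get(combo, 0) + 1
    # target item set: combinations whose support degree meets the threshold
    target_items = {c: s / n for c, s in support.items() if s / n >= min_support}

    def confidence(combo):
        a_count = sum(1 for f in query_field_sets if combo[0] in f)
        return support[combo] / a_count if a_count else 0.0

    # field combination with the maximum confidence becomes the joint index
    return max(target_items, key=confidence) if target_items else None

logs = [{"cust_id", "open_date"}, {"cust_id", "open_date", "branch"}, {"cust_id", "branch"}]
print(mine_joint_index(logs))  # ('branch', 'cust_id') under these toy logs
```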
5. The method of claim 4, further comprising, after determining the field combination with the maximum confidence in the target item set as the target joint index:
generating a data table according to the English definition of the field and the target joint index; and
importing the data corresponding to the English definition of the field into the database through a Spark framework.
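An illustrative sketch of the import step in claim 5, reading records whose columns match the English field definitions and writing them into the database through Spark. The file path, JDBC URL, credentials, and table name are placeholders, not values from the patent, and a suitable JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("field-import").getOrCreate()

# Columns of the source file are assumed to match the English field definitions.
df = spark.read.csv("/data/customer_info.csv", header=True)
df = df.select("cust_id", "open_date")  # keep only the defined fields

(df.write.format("jdbc")
   .option("url", "jdbc:mysql://db-host:3306/warehouse")
   .option("dbtable", "customer_info")
   .option("user", "etl").option("password", "***")
   .mode("append")
   .save())
```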
6. The method of claim 1, wherein iteratively training the neural network model through the target sample set comprises:
obtaining the target sample set, wherein the target sample set comprises the Chinese definition of a field sample, the English definition of the field sample, and the attribute description information of the field sample;
performing semantic segmentation on the Chinese definition of the field sample and the attribute description information of the field sample to obtain an English word combination corresponding to the Chinese definition of the field sample and an English word combination corresponding to the attribute description information of the field sample;
determining a field description matrix according to the English word combination corresponding to the Chinese definition of the field sample and the English word combination corresponding to the attribute description information of the field sample;
acquiring a first eigenvalue of the field description matrix;
inputting the first eigenvalue into the neural network model to obtain a first vector;
acquiring a predicted English definition corresponding to the first vector;
if the predicted English definition is the same as the English definition of the field sample, judging that the predicted English definition is correct; and
when the accuracy is greater than a set threshold, obtaining the fully-connected neural network model.
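A sketch of the evaluation loop in claim 6: derive a first eigenvalue from the field description matrix, map it through the model, count a sample as correct when the predicted and actual English definitions match, and stop once accuracy clears a threshold. Taking the largest eigenvalue of the Gram matrix M @ M.T and using a 0.95 threshold are assumptions made here for illustration.

```python
import numpy as np

def first_eigenvalue(description_matrix: np.ndarray) -> float:
    """Largest eigenvalue of the (square, symmetric) Gram matrix of the field
    description matrix; one possible reading of the claim's 'first eigenvalue'."""
    gram = description_matrix @ description_matrix.T
    return float(np.max(np.linalg.eigvalsh(gram)))

def accuracy(model_fn, samples, threshold=0.95):
    """samples: list of (field_description_matrix, english_definition) pairs;
    model_fn maps an eigenvalue to a predicted English definition."""
    correct = sum(
        1 for matrix, english in samples
        if model_fn(first_eigenvalue(matrix)) == english  # "judged correct"
    )
    acc = correct / len(samples)
    return acc, acc > threshold  # True once the set threshold is exceeded
```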
7. A data processing apparatus, comprising:
an acquisition module configured to acquire a data field requirement table, wherein the data field requirement table comprises a Chinese definition of a field and attribute description information of the field; and
a determining module configured to input the data field requirement table into a fully-connected neural network model to obtain an English definition of the field, wherein the fully-connected neural network model is obtained by iteratively training a neural network model through a target sample set, and the target sample set comprises Chinese definitions of field samples, English definitions of the field samples, and attribute description information of the field samples;
wherein the determining module is configured to perform semantic segmentation on the Chinese definition of the field and the attribute description information of the field in the data field requirement table to obtain an English word combination corresponding to the Chinese definition of the field and an English word combination corresponding to the attribute description information of the field, determine a target field description matrix according to the English word combination corresponding to the Chinese definition of the field and the English word combination corresponding to the attribute description information of the field, and input the target field description matrix into the fully-connected neural network model to obtain the English definition of the field.
8. The apparatus of claim 7, wherein the determining module is specifically configured to:
inputting the Chinese definitions of the field samples in the target sample set and the attribute description information of the field samples into the neural network model to obtain predicted English definitions;
training the parameters of the neural network model according to an objective function formed from the predicted English definitions and the English definitions of the field samples; and
returning to the operation of inputting the Chinese definitions of the field samples in the target sample set and the attribute description information of the field samples into the neural network model to obtain the predicted English definitions, until the fully-connected neural network model is obtained.
9. An electronic device, comprising:
one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by one or more processors, implements the method of any one of claims 1-6.
CN202110863400.5A 2021-07-29 2021-07-29 A data processing method, device, equipment and storage medium Active CN113568914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110863400.5A CN113568914B (en) 2021-07-29 2021-07-29 A data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110863400.5A CN113568914B (en) 2021-07-29 2021-07-29 A data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113568914A CN113568914A (en) 2021-10-29
CN113568914B true CN113568914B (en) 2025-06-24

Family

ID=78168910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110863400.5A Active CN113568914B (en) 2021-07-29 2021-07-29 A data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113568914B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817243B (en) * 2022-03-29 2025-07-18 平安国际智慧城市科技股份有限公司 Database joint index establishing method, device, equipment and storage medium
CN115878662B (en) * 2022-11-01 2025-06-24 国家电网有限公司大数据中心 Statement generation method and device, electronic equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800332A (en) * 2018-12-04 2019-05-24 北京明略软件系统有限公司 Method, apparatus, computer storage medium and the terminal of processing field name

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955781A (en) * 2019-11-14 2020-04-03 北京明略软件系统有限公司 Model training method and device, and method and device for realizing benchmarking
CN112199372A (en) * 2020-09-24 2021-01-08 中国建设银行股份有限公司 Mapping relation matching method and device and computer readable medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800332A (en) * 2018-12-04 2019-05-24 北京明略软件系统有限公司 Method, apparatus, computer storage medium and the terminal of processing field name

Also Published As

Publication number Publication date
CN113568914A (en) 2021-10-29

Similar Documents

Publication Publication Date Title
JP7223785B2 (en) TIME-SERIES KNOWLEDGE GRAPH GENERATION METHOD, APPARATUS, DEVICE AND MEDIUM
US10963794B2 (en) Concept analysis operations utilizing accelerators
US11797281B2 (en) Multi-language source code search engine
US20240338414A1 (en) Inter-document attention mechanism
US20220138240A1 (en) Source code retrieval
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US11216739B2 (en) System and method for automated analysis of ground truth using confidence model to prioritize correction options
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
US20200175390A1 (en) Word embedding model parameter advisor
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN110750297B (en) A Python code reference information generation method based on program analysis and text analysis
CN114090620B (en) Method and device for processing query request
Sellam et al. Deepbase: Deep inspection of neural networks
CN113568914B (en) A data processing method, device, equipment and storage medium
US20240370779A1 (en) Systems and methods for using contrastive pre-training to generate text and code embeddings
CN113705207A (en) Grammar error recognition method and device
CN114722833A (en) Semantic classification method and device
CN115146070A (en) Key value generation method, knowledge graph generation method, device, equipment and medium
CN119669317B (en) Information display method and device, electronic device, storage medium and program product
CN112905752A (en) Intelligent interaction method, device, equipment and storage medium
CN112364053A (en) Search optimization method and device, electronic equipment and storage medium
CN117251777A (en) Data processing method, device, computer equipment and storage medium
CN114997178A (en) Natural language processing method and system based on artificial intelligence
CN114842246A (en) Social media pressure category detection method and device
CN107402914B (en) Natural language deep learning systems and methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant