CN118332407B - Method and system for automatic data identification, classification and grading - Google Patents

Method and system for automatic data identification, classification and grading

Info

Publication number
CN118332407B
Authority
CN
China
Prior art keywords
data
classification
identification
label
met
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410521012.2A
Other languages
Chinese (zh)
Other versions
CN118332407A (en)
Inventor
刘晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lianshi Networks Technology Co ltd
Original Assignee
Beijing Lianshi Networks Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lianshi Networks Technology Co ltd
Priority to CN202410521012.2A
Publication of CN118332407A
Application granted
Publication of CN118332407B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a system for automatic data identification, classification and grading. The method comprises: inputting a data source in response to an input request of the data source; preprocessing the data source to determine data resources; identifying the data resources based on an identification model to obtain identification data; classifying the identification data based on a classification model to obtain classified data; grading the classified data based on a grading model to obtain graded data; establishing a label for each item of graded data to obtain label data; and performing catalog management on the label data. The method solves the problem that the prior art cannot accurately identify, classify and grade data automatically.

Description

Method and system for automatic data identification, classification and grading
Technical Field
The invention relates to the technical field of computers, and in particular to a method, a system, an electronic device and a storage medium for automatic data identification, classification and grading.
Background
Data is one of the important means of production in social activity, and gaining a comprehensive grasp of an organization's data assets has become a primary task. Data identification, data classification and data grading are the basic prerequisites for doing so.
However, manually combing through asset management ledgers is inefficient and copes poorly with personnel and post changes; interfacing with information systems yields only coarse-grained asset collection and easily leads to incomplete data resources or a complicated data-use process; and combining the two approaches easily makes the effort and cost of data identification, classification and grading excessive.
Identifying trade secret data further requires integrating industry characteristics, for example comprehensively considering data with multi-industry attributes such as technical patents, research and development results, product designs, business plans, sales strategies, market research reports, customer lists and supply chain information.
Trade secret data exhibits more complex data attributes than personal information, which makes data identification, classification and grading inefficient. As a result, management of an organization's data assets can become a castle in the air, the organization can lack an overall picture of its data assets, and even preserving and increasing the value of those assets can run into difficulty.
Therefore, a method capable of accurately and automatically performing data identification, classification and grading is needed.
Disclosure of Invention
The embodiments of the invention aim to provide a method, a system, an electronic device and a storage medium for automatic data identification, classification and grading, so as to solve the problem that the prior art cannot accurately identify, classify and grade data automatically.
To achieve the above object, an embodiment of the invention provides a method for automatic data identification, classification and grading, which specifically includes:
Inputting a data source in response to an input request of the data source;
Preprocessing the data source to determine a data resource;
Identifying the data resource based on an identification model to obtain identification data;
Classifying the identification data based on a classification model to obtain classified data;
Grading the classified data based on a grading model to obtain graded data;
Establishing a label for each item of graded data to obtain label data;
And performing catalog management on the label data.
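By way of non-limiting illustration, the steps above can be read as a linear pipeline. The following Python sketch rests on that reading only: the `Record` structure, every function name and the toy logic inside each stage are hypothetical placeholders and do not represent the identification, classification or grading models of the invention.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """One data item as it moves through the stages (hypothetical structure)."""
    content: str
    identified: bool = False
    category: Optional[str] = None
    grade: Optional[int] = None
    labels: Dict[str, object] = field(default_factory=dict)


def preprocess(raw: str) -> Record:
    # Placeholder for extraction, conversion and storage: only trims whitespace.
    return Record(content=raw.strip())


def identify(rec: Record) -> Record:
    # Placeholder identification model: flags records that mention "customer".
    rec.identified = "customer" in rec.content.lower()
    return rec


def classify(rec: Record) -> Record:
    # Placeholder classification model.
    rec.category = "customer_list" if rec.identified else "general"
    return rec


def grade(rec: Record) -> Record:
    # Placeholder grading model: identified data receives a higher grade.
    rec.grade = 3 if rec.identified else 1
    return rec


def label(rec: Record) -> Record:
    rec.labels = {"category": rec.category, "grade": rec.grade}
    return rec


def run_pipeline(raw: str, catalog: List[Record]) -> Record:
    """Apply the stages in the order of the method, then catalog the result."""
    stages: List[Callable[[Record], Record]] = [identify, classify, grade, label]
    rec = preprocess(raw)
    for stage in stages:
        rec = stage(rec)
    catalog.append(rec)  # catalog management reduced to an in-memory list
    return rec
```

Keeping each stage as a function from `Record` to `Record` means a stage can be replaced by a real model without changing the order of the flow.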
Based on the technical scheme, the invention can also be improved as follows:
Further, preprocessing the data source to determine data resources includes:
Extracting one or more data sources from among real-time business event-tracking (buried point) data, offline imported data and batch-aggregated data;
Converting voice data into text data, and converting video data into image data;
Storing one or more of the original file and the index value of the data source.
Further, the identification model is a deep learning model that integrates a convolutional neural network and a recurrent neural network;
Overall features and local features of the data resources are extracted by the identification model;
The data resources are identified based on automatic data identification rules, automatic data identification strategies and samples;
The automatic data identification rules include keyword identification, regular expressions, digital fingerprinting and natural language processing;
The automatic data identification strategies include exact data identification, fingerprint document identification and support vector machine classification;
The samples include structured data, unstructured data and microfeature data.
Further, classifying the identification data based on the classification model to obtain classified data includes:
Dividing the identification data into static data and dynamic data based on data attributes;
Injecting the static data into a tree structure area to check constraint conditions, discarding the static data when the constraint conditions are not met, and passing it on to the grading process when they are met;
And injecting the dynamic data into a graph structure area to check attribute conditions, discarding the dynamic data when the attribute conditions are not met, and passing it on to the grading process when they are met.
Further, grading the classified data based on the grading model to obtain graded data includes:
Judging whether the classified data meets the strategy judgment conditions, discarding the data when the conditions are not met, and, when they are met, executing one or more grading strategies to complete deviation analysis, wherein the grading strategies include a comparison method, a benchmarking method and a calculation method.
Further, grading the classified data based on the grading model to obtain graded data also includes:
Calculating K by formula (1):
where i is the proportion (weight) of a sub-item element, j is the score of that sub-item, and K is the data grade.
Further, establishing a label for each item of graded data to obtain label data includes:
Extracting data attributes of the data resources through the classification model and the grading model;
Predicting data labels for the data attributes, wherein the data labels include common labels, attribute labels and trade secret labels;
And establishing a label for each item of graded data through at least a common label and an attribute label, and selecting a marking field or a mapping table according to the business scenario to obtain the label data.
A system for automatic data identification, classification and grading, comprising:
A data source acquisition module, configured to input a data source in response to an input request of the data source;
A preprocessing module, configured to preprocess the data source to determine a data resource;
A data identification module, configured to identify the data resource based on an identification model to obtain identification data;
A data classification module, configured to classify the identification data based on a classification model to obtain classified data;
A data grading module, configured to grade the classified data based on a grading model to obtain graded data;
A label establishing module, configured to establish a label for each item of graded data to obtain label data;
And a catalog management module, configured to perform catalog management on the label data.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when the computer program is executed.
A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.
The embodiment of the invention has the following advantages:
The method for automatic data identification, classification and grading of the invention inputs a data source in response to an input request of the data source; preprocesses the data source to determine a data resource; identifies the data resource based on an identification model to obtain identification data; classifies the identification data based on a classification model to obtain classified data; grades the classified data based on a grading model to obtain graded data; establishes a label for each item of graded data to obtain label data; and performs catalog management on the label data. It thereby solves the problem that the prior art cannot accurately identify, classify and grade data automatically.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those skilled in the art from this disclosure that the drawings described below are merely exemplary and that other embodiments may be derived from the drawings provided without undue effort.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.
FIG. 1 is a flowchart of the method for automatic data identification, classification and grading of the present invention;
FIG. 2 is a block diagram of the system for automatic data identification, classification and grading of the present invention;
FIG. 3 is a block diagram of the preprocessing module of the present invention;
FIG. 4 is a diagram of the feature extraction process of the identification model of the present invention;
FIG. 5 is a flowchart of data classification in the present invention;
FIG. 6 is a flowchart of data grading in the present invention;
FIG. 7 is a diagram of the implementation of the grading model of the present invention;
FIG. 8 is a working diagram of data label generation by the classification model and the grading model of the present invention;
FIG. 9 is a schematic diagram of the physical structure of an electronic device according to the present invention.
Wherein the reference numerals are as follows:
Data source acquisition module 10, preprocessing module 20, extraction module 201, conversion module 202, storage module 203, data identification module 30, data classification module 40, data grading module 50, label establishing module 60, catalog management module 70, electronic device 80, processor 801, memory 802 and bus 803.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description, which illustrates the invention through specific embodiments. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
FIG. 1 is a flowchart of an embodiment of the method for automatic data identification, classification and grading according to the present invention. As shown in FIG. 1, the method includes the following steps:
S101, inputting a data source in response to an input request of the data source;
S102, preprocessing the data source to determine a data resource;
Specifically, one or more data sources are extracted from among real-time business event-tracking (buried point) data, offline imported data and batch-aggregated data;
The batch-aggregated data is collected by one or more of traffic analysis, port scanning and interface transmission;
The business tracking points include page tracking points, function tracking points and resident tracking points; the business system customizes their trigger conditions so that data resources on application behavior or user behavior are collected and managed;
Offline import includes import through big data components, database table import, automatic script import and data file import, so that data resources on application behavior or user behavior are collected and managed;
Batch aggregation uses a database or big data components to aggregate, organize, parse and import heterogeneous data, so that data resources on application behavior or user behavior are collected and managed;
Voice data is converted into text data by a speech recognition method based on PAI model compression and the MNN inference engine, and video data is converted into image data by video stripping and video frame extraction;
One or more of the original file and the index value of the data source are stored.
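The description does not specify how the index value is derived. As a minimal sketch, assuming the index value is a content hash (here the SHA-256 hex digest, an assumption not stated in the source), storing an original file under its index value could look as follows; the class `OriginalStore` and its methods are hypothetical, and an in-memory dictionary stands in for real storage.

```python
import hashlib
from typing import Dict


class OriginalStore:
    """In-memory stand-in for storing original files keyed by an index value."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumed index value: SHA-256 hex digest of the original bytes.
        index_value = hashlib.sha256(data).hexdigest()
        self._files[index_value] = data
        return index_value

    def get(self, index_value: str) -> bytes:
        # Retrieve the original file by its index value.
        return self._files[index_value]
```

With a content hash as the index value, storing the same file twice yields the same key, so duplicates collapse naturally; a sequential identifier would not have that property.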
S103, identifying the data resource based on the identification model to obtain identification data;
Specifically, automatic data identification is carried out by the identification model, wherein the identification model includes at least one or more of external parameters, internal training and dedicated identification, and the automatic data identification flow includes at least rules, strategies and samples;
As shown in FIG. 4, the identification model is a deep learning model in which a Transformer is integrated into a convolutional neural network (CNN) and a recurrent neural network (RNN). The Transformer is used for bit filling and padding and is integrated into the CNN (for database tables) or the RNN (for text files), where I is the training data, Y is the predefined label and X is the weighting parameter, so that overall features and local features are extracted.
Overall features and local features of the data resources are extracted by the identification model;
The data resources are identified based on automatic data identification rules, automatic data identification strategies and samples;
The automatic data identification rules include keyword identification, regular expressions, digital fingerprinting and natural language processing;
The automatic data identification strategies include exact data identification, fingerprint document identification and support vector machine classification;
The samples include structured data, unstructured data and microfeature data.
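As an illustration of three of the rule types named above (keyword identification, regular expressions and digital fingerprinting), the following Python sketch may help. The keyword list, the two patterns, the normalization applied before hashing and the known-fingerprint set are all hypothetical examples, and SHA-256 is an assumed fingerprint function; natural language processing and the deep learning model are not covered.

```python
import hashlib
import re
from typing import Dict

# Hypothetical keyword list and patterns, for illustration only.
KEYWORDS = {"confidential", "trade secret"}
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "cn_mobile": re.compile(r"\b1[3-9]\d{9}\b"),
}


def fingerprint(text: str) -> str:
    """Assumed fingerprint: SHA-256 of the text after case and whitespace normalization."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical set of fingerprints of documents already known to be sensitive.
KNOWN_FINGERPRINTS = {fingerprint("Project Falcon pricing sheet")}


def identify_by_rules(text: str) -> Dict[str, object]:
    """Report which keyword, pattern and fingerprint rules the text triggers."""
    lowered = text.lower()
    return {
        "keywords": sorted(k for k in KEYWORDS if k in lowered),
        "patterns": sorted(name for name, rx in PATTERNS.items() if rx.search(text)),
        "fingerprint_match": fingerprint(text) in KNOWN_FINGERPRINTS,
    }
```

A whole-document hash of this kind only matches exact copies after normalization; detecting partial copies would need a different fingerprinting scheme.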
S104, classifying the identification data based on the classification model to obtain classified data;
Specifically, the classification model includes at least sample data, a learning algorithm and data abstraction, wherein the data abstraction includes at least one or more of attribute cluster analysis and similarity analysis;
The learning algorithm is a preset regression algorithm and a decision tree algorithm;
Features of the target data are measured by a preset Euclidean distance and cosine similarity to realize the attribute cluster analysis.
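The two measures named above have standard definitions, shown here in a short Python sketch. The helper `nearest_cluster` and its centroid dictionary are hypothetical and only illustrate how a distance measure could assign a feature vector to an attribute cluster; the description does not state how the clusters are formed.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cluster(vec: Sequence[float], centroids: Dict[str, Sequence[float]]) -> str:
    """Hypothetical helper: name of the centroid closest to vec by Euclidean distance."""
    return min(centroids, key=lambda name: euclidean_distance(vec, centroids[name]))
```

Euclidean distance is sensitive to vector magnitude while cosine similarity compares direction only, which is one reason the two are often used together.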
As shown in FIG. 5, the identification data is divided into static data and dynamic data based on data attributes;
The static data is injected into a tree structure area to check constraint conditions; when the constraint conditions are not met the static data is discarded, and when they are met it passes on to the grading process;
The dynamic data is injected into a graph structure area to check attribute conditions; when the attribute conditions are not met the dynamic data is discarded, and when they are met it passes on to the grading process.
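The description does not define the constraint conditions or the attribute conditions. The following Python sketch therefore assumes one simple reading: static data passes if its category can be traced to the root of a category tree, and dynamic data passes if its attribute is connected to its source in an attribute graph. The tree, the graph, the item fields and the function names are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical category tree for static data, stored as child -> parent.
CATEGORY_TREE: Dict[str, str] = {
    "customer_list": "business",
    "business": "root",
    "patent": "technical",
    "technical": "root",
}

# Hypothetical attribute graph for dynamic data, stored as node -> neighbours.
ATTRIBUTE_GRAPH: Dict[str, Set[str]] = {
    "order_stream": {"customer", "payment"},
    "click_stream": {"session"},
}


def path_to_root(category: str) -> List[str]:
    """Follow parent links upward until no parent is recorded."""
    path = [category]
    while path[-1] in CATEGORY_TREE:
        path.append(CATEGORY_TREE[path[-1]])
    return path


def route(item: dict) -> Optional[dict]:
    """Return the item enriched for grading, or None if it is discarded."""
    if item["kind"] == "static":
        # Assumed constraint: the category must be a tree node that reaches the root.
        path = path_to_root(item["category"])
        if len(path) == 1 or path[-1] != "root":
            return None
        return {**item, "path": path}
    # Assumed attribute condition: the attribute is a neighbour of the source node.
    neighbours = ATTRIBUTE_GRAPH.get(item["source"], set())
    if item["attribute"] not in neighbours:
        return None
    return {**item, "related": sorted(neighbours)}
```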
S105, grading the classified data based on the grading model to obtain graded data;
Specifically, as shown in FIG. 6 and FIG. 7, it is judged whether the classified data meets the strategy judgment conditions; when the conditions are not met the data is discarded, and when they are met one or more grading strategies are executed to complete deviation analysis, wherein the grading strategies include a comparison method, a benchmarking method and a calculation method.
In combination with the business scenario, the comparison method grades data against a preset checklist; the benchmarking method determines the data grade by a preset discrimination method according to industry specifications or regulatory requirements; and the calculation method grades data by a key-parameter method during data identification and classification, with K calculated by formula (1):
In the formula, i is the proportion (weight) of a sub-item element, j is the score of that sub-item, and K is the data grade; in practical application, K may be rounded or otherwise reduced according to the predefined number of grades.
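Formula (1) itself is not reproduced in this text. Given that i is described as a proportion and j as a score, one plausible reading is a weighted sum, K = Σ (i × j) over the sub-items; this is an assumption and not the patent's confirmed formula. Under that assumption, the calculation and one possible reduction of K to a predefined number of grades can be sketched in Python as follows, with `grade_score`, `to_level` and the proportional rounding rule all being hypothetical.

```python
import math
from typing import Iterable, Tuple


def grade_score(items: Iterable[Tuple[float, float]]) -> float:
    """Assumed form of formula (1): K = sum of weight i times score j per sub-item."""
    return sum(i * j for i, j in items)


def to_level(k: float, max_score: float, num_levels: int) -> int:
    """One possible reduction: map K in [0, max_score] onto grades 1..num_levels."""
    if max_score <= 0 or num_levels < 1:
        raise ValueError("max_score must be positive and num_levels at least 1")
    if not 0 <= k <= max_score:
        raise ValueError("k must lie between 0 and max_score")
    # Proportional scaling, rounded up, with a floor of grade 1.
    return max(1, math.ceil(k / max_score * num_levels))
```

For example, sub-items with weights 0.5, 0.3 and 0.2 and scores 80, 60 and 100 give K of about 78 on a 100-point scale, which this reduction maps to the highest of four grades.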
S106, establishing a label for each item of graded data to obtain label data;
Specifically, as shown in FIG. 8, data attributes of the data resources are extracted through the classification model and the grading model;
Data labels are predicted for the data attributes, wherein the data labels include common labels, attribute labels and trade secret labels;
A label is established for each item of graded data through at least a common label and an attribute label, and a marking field or a mapping table is selected according to the business scenario to obtain the label data.
The common label includes the data owner, data size and data type. The attribute label includes the data access permission, data sensitivity level, physical location of the data, logical location of the data, data meta-information, data state and data security protection requirements. The trade secret label includes the data type, secrecy class, secrecy period and declassification mark. The marking field includes a file attribute mark and a database extension field mark.
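The label groups listed above map naturally onto a small record structure. The following Python sketch is illustrative only: the field names are English renderings chosen for this example, only a subset of the listed fields is modelled, and `to_mapping_row` shows one hypothetical way to flatten a label into a row of a mapping table.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class CommonLabel:
    owner: str
    size_bytes: int
    data_type: str


@dataclass
class AttributeLabel:
    access_permission: str
    sensitivity_level: int
    physical_location: str
    logical_location: str
    state: str


@dataclass
class TradeSecretLabel:
    secrecy_class: str
    secrecy_period_years: int
    declassified: bool


@dataclass
class LabelData:
    # A common label and an attribute label are always present;
    # the trade secret label is optional.
    common: CommonLabel
    attribute: AttributeLabel
    trade_secret: Optional[TradeSecretLabel] = None


def to_mapping_row(data_id: str, label: LabelData) -> Dict[str, object]:
    """Flatten a label into one mapping-table row keyed as 'group.field'."""
    row: Dict[str, object] = {"data_id": data_id}
    for group, values in asdict(label).items():
        if values is None:
            continue  # no trade secret label was set
        for key, value in values.items():
            row[f"{group}.{key}"] = value
    return row
```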
S107, performing catalog management on the label data;
Specifically, catalog management of the label data provides data input for the organization's data asset management platform, data security management platform, data risk monitoring system and data security audit system, and external data services are provided through an external service interface, presentation layer integration and data integration.
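The description does not detail the catalog's internal structure. As a minimal sketch, a catalog could keep one entry per data item plus an index by grade, and export its entries as plain records for downstream platforms. The class `LabelCatalog`, its methods and the label keys `grade` and `category` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class LabelCatalog:
    """In-memory stand-in for a label data catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}
        self._by_grade: Dict[int, Set[str]] = defaultdict(set)

    def register(self, data_id: str, labels: dict) -> None:
        # Re-registering an item first removes it from its previous grade index.
        old = self._entries.get(data_id)
        if old is not None:
            self._by_grade[old["grade"]].discard(data_id)
        self._entries[data_id] = dict(labels)
        self._by_grade[labels["grade"]].add(data_id)

    def find_by_grade(self, min_grade: int) -> List[str]:
        """Identifiers of all items whose grade is at least min_grade."""
        return sorted(
            data_id
            for grade, ids in self._by_grade.items()
            if grade >= min_grade
            for data_id in ids
        )

    def export(self) -> List[dict]:
        """Plain records, sorted by identifier, for a downstream consumer."""
        return [{"data_id": d, **labels} for d, labels in sorted(self._entries.items())]
```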
The method for automatic data identification, classification and grading includes: inputting a data source in response to an input request of the data source; preprocessing the data source to determine data resources; identifying the data resources based on an identification model to obtain identification data; classifying the identification data based on a classification model to obtain classified data; grading the classified data based on a grading model to obtain graded data; establishing a label for each item of graded data to obtain label data; and performing catalog management on the label data. The method solves the problem that the prior art cannot accurately identify, classify and grade data automatically, can provide data input for other important asset management systems, improves the precision of data identification and the degree of intelligence of classification and grading, and achieves scientific and reasonable management of data identification, classification and grading.
FIG. 2 is a block diagram of an embodiment of the system for automatic data identification, classification and grading according to the present invention. As shown in FIG. 2, the system includes the following modules:
A data source acquisition module 10, configured to input a data source in response to an input request of the data source;
A preprocessing module 20, configured to preprocess the data source to determine a data resource;
As shown in FIG. 3, the preprocessing module 20 includes an extraction module 201, a conversion module 202 and a storage module 203;
The extraction module 201 is configured to extract one or more data sources from among real-time business event-tracking (buried point) data, offline imported data and batch-aggregated data;
The conversion module 202 is configured to convert voice data into text data and to convert video data into image data;
The storage module 203 is configured to store one or more of the original file and the index value of the data source;
A data identification module 30, configured to identify the data resource based on an identification model to obtain identification data;
The identification model is a deep learning model that integrates a convolutional neural network and a recurrent neural network;
Overall features and local features of the data resources are extracted by the identification model;
The data resources are identified based on automatic data identification rules, automatic data identification strategies and samples;
The automatic data identification rules include keyword identification, regular expressions, digital fingerprinting and natural language processing;
The automatic data identification strategies include exact data identification, fingerprint document identification and support vector machine classification;
The samples include structured data, unstructured data and microfeature data.
A data classification module 40, configured to classify the identification data based on a classification model to obtain classified data;
The data classification module 40 is further configured to:
Divide the identification data into static data and dynamic data based on data attributes;
Inject the static data into a tree structure area to check constraint conditions, discard the static data when the constraint conditions are not met, and pass it on to the grading process when they are met;
And inject the dynamic data into a graph structure area to check attribute conditions, discard the dynamic data when the attribute conditions are not met, and pass it on to the grading process when they are met.
A data grading module 50, configured to grade the classified data based on a grading model to obtain graded data;
The data grading module 50 is further configured to:
Judge whether the classified data meets the strategy judgment conditions, discard the data when the conditions are not met, and, when they are met, execute one or more grading strategies to complete deviation analysis, wherein the grading strategies include a comparison method, a benchmarking method and a calculation method.
K is calculated by formula (1):
where i is the proportion (weight) of a sub-item element, j is the score of that sub-item, and K is the data grade; in practical application, K is rounded or otherwise reduced according to the predefined number of grades.
A label establishing module 60, configured to establish a label for each item of graded data to obtain label data;
The label establishing module 60 is further configured to:
Extract data attributes of the data resources through the classification model and the grading model;
Predict data labels for the data attributes, wherein the data labels include common labels, attribute labels and trade secret labels;
And establish a label for each item of graded data through at least a common label and an attribute label, and select a marking field or a mapping table according to the business scenario to obtain the label data.
A catalog management module 70, configured to perform catalog management on the label data.
In the system for automatic data identification, classification and grading, the data source acquisition module 10 inputs a data source in response to an input request of the data source; the preprocessing module 20 preprocesses the data source to determine a data resource; the data identification module 30 identifies the data resource based on the identification model to obtain identification data; the data classification module 40 classifies the identification data based on the classification model to obtain classified data; the data grading module 50 grades the classified data based on the grading model to obtain graded data; the label establishing module 60 establishes a label for each item of graded data to obtain label data; and the catalog management module 70 performs catalog management on the label data. The system thereby solves the problem that the prior art cannot accurately identify, classify and grade data automatically.
FIG. 9 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in FIG. 9, the electronic device 80 includes a processor 801, a memory 802 and a bus 803;
The processor 801 and the memory 802 communicate with each other through the bus 803;
The processor 801 is configured to invoke program instructions in the memory 802 to perform the method provided in the above method embodiments, which includes, for example: inputting a data source in response to an input request of the data source; preprocessing the data source to determine a data resource; identifying the data resource based on an identification model to obtain identification data; classifying the identification data based on a classification model to obtain classified data; grading the classified data based on a grading model to obtain graded data; establishing a label for each item of graded data to obtain label data; and performing catalog management on the label data.
This embodiment provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to execute the method provided by the above method embodiments, which includes, for example: inputting a data source in response to an input request of the data source; preprocessing the data source to determine a data resource; identifying the data resource based on an identification model to obtain identification data; classifying the identification data based on a classification model to obtain classified data; grading the classified data based on a grading model to obtain graded data; establishing a label for each item of graded data to obtain label data; and performing catalog management on the label data.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium may be any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be clear to those skilled in the art that the embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone. On this understanding, the essence of the foregoing technical solution, or the part of it that contributes over the prior art, may be embodied as a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions that cause a computer device (for example, a personal computer, a server, or a network device) to execute the methods of the embodiments or of parts of the embodiments.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made to it. Accordingly, modifications or improvements made without departing from the spirit of the invention are intended to fall within the scope of the invention as claimed.

Claims (6)

1. A method for automatically identifying, classifying and grading data, characterized in that the method specifically comprises:
inputting a data source in response to an input request for the data source;
preprocessing the data source to determine a data resource, comprising:
extracting one or more data sources from among real-time business tracking data, offline imported data, and batch-aggregated data;
converting voice data into text data, and converting video data into image data;
storing one or more data sources from among original files and index values;
wherein business tracking includes page tracking, function tracking, and dwell tracking; offline import includes import processed by big data components, database table import, automatic script import, and data file import; and batch aggregation includes using a database or big data components to aggregate, sort and analyze heterogeneous data before import;
identifying the data resource based on an identification model to obtain identification data;
classifying the identification data based on a classification model to obtain classified data, comprising:
dividing the identification data into static data and dynamic data based on data attributes;
injecting the static data into a tree structure area for constraint condition judgment, discarding the data when the constraint condition is not met, and passing the data into the grading process when the constraint condition is met;
injecting the dynamic data into a graph structure area for attribute condition judgment, discarding the data when the attribute condition is not met, and passing the data into the grading process when the attribute condition is met;
grading the classified data based on a grading model to obtain graded data, comprising:
judging whether the classified data meets a policy judgment condition, discarding the data when the policy judgment condition is not met, and, when the policy judgment condition is met, executing one or more grading strategies to complete a deviation analysis; wherein the grading strategies include a comparison method, a benchmarking method and a calculation method; the comparison method grades data based on a preset checklist; the benchmarking method grades data based on a preset discrimination method; and the calculation method grades data based on a key parameter method, in which K is calculated by formula (1), where i is the proportion of each sub-item element, j is the sub-item score, and K is the data level;
creating a label for each piece of graded data to obtain label data;
performing catalog management on the label data.

2. The method for automatically identifying, classifying and grading data according to claim 1, characterized in that the identification model is a deep learning model incorporating a convolutional neural network and a recurrent neural network;
overall features and local features of the data resource are extracted through the identification model;
the data resource is identified based on automatic data identification rules, automatic data identification strategies and samples;
the automatic data identification rules include keyword recognition, regular expressions, digital fingerprinting and natural language processing;
the automatic data identification strategies include exact data identification, fingerprint document identification and vector machine classification identification;
the samples include structured data, unstructured data and micro-feature data.

3. The method for automatically identifying, classifying and grading data according to claim 1, characterized in that creating a label for each piece of graded data to obtain label data comprises:
extracting data attributes of the data resource through the classification model and the grading model;
performing data label prediction on the data attributes, wherein the data labels include common labels, attribute labels and trade secret labels;
creating a label for each piece of graded data through at least the common labels and the attribute labels, and selecting a marker field or a mapping table according to the business scenario, to obtain the label data.

4. A system for automatically identifying, classifying and grading data, characterized by comprising:
a data source acquisition module, configured to input a data source in response to an input request for the data source;
a preprocessing module, configured to preprocess the data source to determine a data resource, including:
extracting one or more data sources from among real-time business tracking data, offline imported data, and batch-aggregated data;
converting voice data into text data, and converting video data into image data;
storing one or more data sources from among original files and index values;
wherein business tracking includes page tracking, function tracking, and dwell tracking; offline import includes import processed by big data components, database table import, automatic script import, and data file import; and batch aggregation includes using a database or big data components to aggregate, sort and analyze heterogeneous data before import;
a data identification module, configured to identify the data resource based on an identification model to obtain identification data;
a data classification module, configured to classify the identification data based on a classification model to obtain classified data, including:
dividing the identification data into static data and dynamic data based on data attributes;
injecting the static data into a tree structure area for constraint condition judgment, discarding the data when the constraint condition is not met, and passing the data into the grading process when the constraint condition is met;
injecting the dynamic data into a graph structure area for attribute condition judgment, discarding the data when the attribute condition is not met, and passing the data into the grading process when the attribute condition is met;
a data grading module, configured to grade the classified data based on a grading model to obtain graded data, including:
judging whether the classified data meets a policy judgment condition, discarding the data when the policy judgment condition is not met, and, when the policy judgment condition is met, executing one or more grading strategies to complete a deviation analysis; wherein the grading strategies include a comparison method, a benchmarking method and a calculation method; the comparison method grades data based on a preset checklist; the benchmarking method grades data based on a preset discrimination method; and the calculation method grades data based on a key parameter method, in which K is calculated by formula (1), where i is the proportion of each sub-item element, j is the sub-item score, and K is the data level;
a label establishing module, configured to create a label for each piece of graded data to obtain label data;
a catalog management module, configured to perform catalog management on the label data.

5. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 3.

6. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
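The classification and grading steps recited in claims 1 and 4 can be illustrated with a short Python sketch. Everything concrete in it is an assumption made for illustration: the category tree, the flow graph, the checklist, the record fields and the way the policy judgment condition is tested are all hypothetical. In particular, formula (1) is not reproduced in this text, so the weighted sum of proportion i times score j used for the calculation method is only a guess suggested by the variable definitions, not the claimed formula.

```python
from typing import Optional

# Hypothetical tree of permitted static categories: parent -> children.
CATEGORY_TREE = {"root": ["customer", "finance"], "customer": ["contact"], "finance": []}

# Hypothetical graph of permitted dynamic data flows: node -> neighbours.
FLOW_GRAPH = {"app": {"gateway"}, "gateway": {"db"}, "db": set()}

# Hypothetical preset checklist for the comparison method: category -> level.
CHECKLIST = {"contact": 3, "finance": 4}


def in_tree(category: str, node: str = "root") -> bool:
    """Constraint check for static data: the category must be reachable from the root."""
    if node == category:
        return True
    return any(in_tree(category, child) for child in CATEGORY_TREE.get(node, []))


def classify(record: dict) -> Optional[dict]:
    """Route static data to the tree check and dynamic data to the graph check.

    Returns the record if its condition is met, or None if it is discarded.
    """
    if record["kind"] == "static":
        ok = in_tree(record["category"])
    else:
        ok = record["dst"] in FLOW_GRAPH.get(record["src"], set())
    return record if ok else None


def grade(record: dict) -> Optional[int]:
    """Grade a classified record; None means the policy condition was not met."""
    category = record.get("category")
    if category in CHECKLIST:
        # Comparison method: look the level up in the preset checklist.
        return CHECKLIST[category]
    factors = record.get("factors")
    if factors:
        # Calculation method, ASSUMED form: K = sum(i * j) over (proportion, score) pairs.
        return round(sum(i * j for i, j in factors))
    return None
```

A record that fails either check simply drops out of the flow, mirroring the discard branches in the claims; the benchmarking method is omitted because the text gives no detail on the preset discrimination method.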
CN202410521012.2A 2024-04-28 2024-04-28 Method and system for automatically carrying out data identification, classification and classification Active CN118332407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410521012.2A CN118332407B (en) 2024-04-28 2024-04-28 Method and system for automatically carrying out data identification, classification and classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410521012.2A CN118332407B (en) 2024-04-28 2024-04-28 Method and system for automatically carrying out data identification, classification and classification

Publications (2)

Publication Number Publication Date
CN118332407A CN118332407A (en) 2024-07-12
CN118332407B true CN118332407B (en) 2025-03-07

Family

ID=91776938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410521012.2A Active CN118332407B (en) 2024-04-28 2024-04-28 Method and system for automatically carrying out data identification, classification and classification

Country Status (1)

Country Link
CN (1) CN118332407B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590698A (en) * 2021-06-29 2021-11-02 中国电子科技集团公司第三十研究所 Artificial intelligence technology-based data asset classification modeling and hierarchical protection method
CN114969467A (en) * 2022-04-15 2022-08-30 杭州美创科技有限公司 Data analysis and classification method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962091B2 (en) * 2008-03-14 2011-06-14 Intel Corporation Resource management and interference mitigation techniques for relay-based wireless networks
CN116257877A (en) * 2022-12-27 2023-06-13 北京航空航天大学 Data classification grading method for privacy calculation
CN116415180A (en) * 2023-03-23 2023-07-11 中国电子科技集团公司第三十研究所 Automatic data classification and classification method and device based on identification
CN116737726A (en) * 2023-07-03 2023-09-12 中电数创(北京)科技有限公司 A method and device for grading data resources based on data fingerprints
CN117454284A (en) * 2023-08-22 2024-01-26 国信证券股份有限公司 Data automatic classification and classification method, device, equipment and computer storage medium
CN117573876A (en) * 2023-12-13 2024-02-20 中国民航信息网络股份有限公司 Service data classification and classification method and device

Also Published As

Publication number Publication date
CN118332407A (en) 2024-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant