CN115587132B - Method and system for identifying abnormal access of database based on session clustering - Google Patents
Method and system for identifying abnormal access of database based on session clustering Download PDFInfo
- Publication number
- CN115587132B CN115587132B CN202211408695.8A CN202211408695A CN115587132B CN 115587132 B CN115587132 B CN 115587132B CN 202211408695 A CN202211408695 A CN 202211408695A CN 115587132 B CN115587132 B CN 115587132B
- Authority
- CN
- China
- Prior art keywords
- session
- sql
- similarity
- template
- conversation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and a system for identifying abnormal access of a database based on session clustering, and relates to the technical field of database security. According to the method and the device, multi-layer similarity calculation is carried out based on historical conversation information of the user, so that clustering analysis is carried out on the database access behaviors of the user, and all the access behaviors are aggregated into a plurality of conversation types. And then, the new session is subjected to similarity matching with all session types, so that whether the new session belongs to abnormal access can be judged and identified, a rule is not required to be designed for each access behavior for independent judgment, the identification efficiency is improved, and the obtained identification result is more accurate. In addition, the group number of the aggregated session types can be adjusted manually to improve the grouping effect, so that the method has higher flexibility.
Description
Technical Field
The invention relates to the technical field of database security, in particular to a method and a system for identifying abnormal access of a database based on session clustering.
Background
An important function in a database auditing system is to identify anomalous database access behavior. Conventional implementations are typically based on rule settings. For example, a rule is matched for each abnormal behavior according to the conditions of log time, frequency of sql (Structured Query Language), sql content, client IP and the like. However, the audit analysis is performed based on the manually set audit rule strategy, the algorithm design is complex, the detection result is mechanized (non-black or white), the flexibility is poor, the correlation analysis among logs is lacked, the detection effect is poor, the conditions of erroneous judgment and missed judgment exist, and the false alarm rate is high.
Disclosure of Invention
The invention aims to provide a method and a system for identifying abnormal access of a database based on session clustering, which can efficiently and accurately analyze the database access behavior of a user to judge and identify the abnormal access behavior, and have simple algorithm design and good flexibility.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present application provides a method for identifying abnormal access to a database based on session clustering, which includes:
acquiring a historical audit log, obtaining session information from the historical audit log, and analyzing the session information to obtain a corresponding SQL template;
calculating to obtain a weight value corresponding to the SQL template contained in each session in the session by using a TF-IDF algorithm, and forming a session vector corresponding to the session based on the weight value;
calculating the similarity between any two/two groups of conversations based on the conversation vectors, marking the two/two groups of conversations with the similarity larger than a preset threshold as a conversation type, calculating to obtain the conversation vector corresponding to the conversation type, and recording the maximum weight value and the minimum weight value of each SQL template under the conversation type;
acquiring new session information, and calculating to obtain a session vector corresponding to the new session;
and calculating the similarity of the new session and all session types based on the session vector, and identifying abnormal access according to the similarity and the weight value of the SQL template.
Based on the first aspect, in some embodiments of the present invention, the step of obtaining the historical audit log, obtaining session information therefrom, and analyzing the session information to obtain a corresponding SQL template includes:
acquiring historical audit logs of all users accessing a database, and obtaining session information from the historical audit logs;
filtering SQL sentences in the session information based on a preset SQL white list to obtain filtered SQL sentences;
and removing the parameters in the SQL sentences to obtain the corresponding SQL template.
Based on the first aspect, in some embodiments of the present invention, the step of calculating, by using a TF-IDF algorithm, a weight value corresponding to the SQL template included in each session in the session, and forming a session vector corresponding to the session based on the weight value includes:
counting the SQL template contained in each conversation and the occurrence frequency of each SQL template in the conversation;
calculating the IDF value and the TF value of each SQL template in each session according to the quantity and the occurrence frequency of the SQL templates, and calculating the weight value of the SQL template in the session according to the IDF value and the TF value;
and counting and recording the weight values corresponding to each SQL template in each session to form a multidimensional vector corresponding to the session, wherein the multidimensional vector is the session vector.
Based on the first aspect, in some embodiments of the present invention, the calculating, based on the session vector, a similarity between any two/two groups of sessions, marking two/two groups of sessions with a similarity greater than a preset threshold as a session type, and calculating to obtain a session vector corresponding to the session type, and the recording the maximum weight value and the minimum weight value of each SQL template under the session type includes:
calculating the similarity between any two/two groups of conversations by a cosine similarity algorithm based on the conversation vectors;
judging whether the similarity between the two/two groups of conversations is larger than a preset threshold value, if so, marking the two/two groups of conversations as a conversation type; if the similarity between one session and the rest sessions is smaller than a preset threshold value, independently marking the session as a session type;
calculating the average value of the weighted values of all SQL templates under each session type to obtain a session vector corresponding to the session type;
and recording the maximum weight value and the minimum weight value of each SQL template under each session type.
Based on the first aspect, in some embodiments of the present invention, the step of calculating similarities between the new session and all session types based on the session vector, and identifying abnormal access according to the similarities and the weight values of the SQL template includes:
calculating the similarity of the new session and all session types based on the session vector, and judging whether the new session is similar to any session type according to the similarity;
if the two types of the SQL templates are similar, identifying abnormal access by combining the weight value of the SQL template in the new session and the maximum weight value and the minimum weight value of the SQL template in the session type similar to the new session;
if not, combining the weighted values of the SQL templates in the new session and all session types, and manually marking whether the new session is abnormal or not by the user; if the flag is normal, the new session is individually marked as a session type.
In some embodiments of the invention based on the first aspect, the method further includes:
carrying out dimensionality reduction on the session vector by utilizing a PCA dimensionality reduction algorithm to obtain a low-dimensional vector;
drawing a PCA image based on the low-dimensional vector, and checking the classification condition of the conversation through the PCA image.
Based on the first aspect, in some embodiments of the present invention, the method further includes normalizing the session vector.
In a second aspect, an embodiment of the present application provides a system for identifying abnormal access to a database based on session clustering, which includes:
the session information acquisition module is used for acquiring the historical audit logs, acquiring session information from the historical audit logs, and analyzing the session information to obtain a corresponding SQL template;
the session vector calculation module is used for calculating a weight value corresponding to the SQL template contained in each session in the session by using a TF-IDF algorithm and forming a session vector corresponding to the session based on the weight value;
the conversation clustering module is used for calculating the similarity between any two/two groups of conversations based on the conversation vector, marking the two/two groups of conversations of which the similarity is greater than a preset threshold value as a conversation type, calculating to obtain a conversation vector corresponding to the conversation type, and recording the maximum weight value and the minimum weight value of each SQL template under the conversation type;
the new session acquisition module is used for acquiring new session information and calculating to obtain a session vector corresponding to a new session;
and the new session abnormal access identification module is used for calculating the similarity between the new session and all session types based on the session vector and identifying abnormal access according to the similarity and the weight value of the SQL template.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory for storing one or more programs; a processor. The one or more programs, when executed by the processor, implement the method as described in any of the first aspects above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method as described in any one of the above first aspects.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
the embodiment of the application provides a method and a system for identifying abnormal access of a database based on session clustering. Then, a weight value corresponding to the SQL template contained in each session in the session is calculated and obtained by using a TF-IDF algorithm, and a session vector corresponding to the session is formed based on the weight value. Then, calculating the similarity between any two/two groups of conversations based on the conversation vectors, marking the two/two groups of conversations with the similarity larger than a preset threshold value as a conversation type, calculating to obtain the conversation vector corresponding to the conversation type, and recording the maximum weight value and the minimum weight value of each SQL template under the conversation type. And finally, acquiring new session information, and calculating the similarity of the new session information and all session types, thereby identifying whether the new session belongs to abnormal access. In an integral way, the multi-layer similarity calculation is carried out on the basis of the historical conversation information of the user, so that the clustering analysis is carried out on the database access behaviors of the user, and all the access behaviors are aggregated into a plurality of conversation types. And then, the new session is subjected to similarity matching with all session types, so that whether the new session belongs to abnormal access can be judged and identified, a rule does not need to be designed for each access behavior for independent judgment, the identification efficiency is improved, and the obtained identification result is more accurate. In addition, the group number of the aggregated session types can be adjusted manually to improve the grouping effect, so that the method has higher flexibility.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic flowchart illustrating an embodiment of a method for identifying abnormal access to a database based on session clustering according to the present invention;
FIG. 2 is a flowchart illustrating steps of obtaining a historical audit log, obtaining session information therefrom, and analyzing the session information to obtain a corresponding SQL template according to the method for identifying abnormal access to a database based on session clustering according to the present invention;
fig. 3 is a specific flowchart of the steps of calculating a weight value corresponding to the SQL template included in each session in the session by using a TF-IDF algorithm and constructing a session vector corresponding to the session based on the weight value in an embodiment of the method for identifying abnormal access to a database based on session clustering provided by the present invention;
FIG. 4 is a block diagram of a system for identifying abnormal database access based on session clustering according to the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Icon: 1. a memory; 2. a processor; 3. a communication interface; 11. a session information acquisition module; 12. a session vector calculation module; 13. a session clustering module; 14. a new session acquisition module; 15. the new session exception accesses the identification module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with one another without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for identifying abnormal access to a database based on session clustering according to an embodiment of the present application, where the method includes the following steps:
step S1: and acquiring a historical audit log, obtaining session information from the historical audit log, and analyzing the session information to obtain a corresponding SQL template.
In the above steps, different use scenarios may be possible when the same user accesses the database, for example, the use scenarios may be used in different applications or batch processing tasks, and may also be used by different operation and maintenance personnel. Therefore, the session information can be obtained from the historical audit log of the database accessed by the user, and then the session information is analyzed according to the session, and similar sessions are gathered into one session type, so that the behavior characteristics of the user can be more accurately described through the session type, and meanwhile, the subsequent matching and comparison are facilitated, so that the abnormal access behavior can be identified.
Specifically, referring to fig. 2, the above steps specifically include:
step S1-1: and acquiring historical audit logs of all users accessing the database, and obtaining session information from the historical audit logs.
In the above steps, the historical audit logs of all users accessing the database can be obtained through the database audit system, so as to obtain the session information. The database audit system may typically capture the creation and destruction logs of a database connection, but may have uncaptured cases in special cases. So for the completeness of the method, the session can be partitioned with assistance according to the client IP, port, target IP, port, and the four. For example, if there is no log for an hour in a session, i.e., no log was captured for creation and destruction within the hour, then the session is split into two sessions. When the database connection is a long connection, the sessions may be split on a daily basis for statistical analysis.
Exemplary, log content includes, but is not limited to: access time, client IP, port, database IP, port, sql statement, number of bytes of result returned by executing sql statement instruction, and the like. Moreover, when data is collected, it is generally necessary to collect historical data for analysis for more than one week to be representative.
Step S1-2: and filtering the SQL sentences in the session information based on a preset SQL white list to obtain the filtered SQL sentences.
In the above steps, firstly, an SQL white list library of the application needs to be created, so that only SQL statements deemed safe in the white list can be released, and dangerous SQL statements can be blocked. By setting the SQL white list, SQL sentences which are uninterested and careless in the session information and have no effect on session classification can be filtered, the workload of subsequent analysis is reduced, and the processing efficiency is improved.
Step S1-3: and removing the parameters in the SQL statements to obtain the corresponding SQL template.
In the above steps, since the number of SQL statements in a session can reach hundreds of millions, but the number of SQL templates is usually no more than thousands, the SQL statements can be parsed to obtain the corresponding SQL templatesAnd the SQL template is used for describing the characteristics of the conversation, so that the analysis efficiency is improved. Specifically, the SQL template refers to an SQL statement from which the SQL parameters are removed. For example: "select from users 'where id =1" is the SQL template corresponding to the SQL statement "select from users' where id =". After the SQL template in each conversation is analyzed, the similarity between the conversations can be calculated according to the weight of the SQL template, and the conversations are classified.
Step S2: and calculating to obtain a weight value corresponding to the SQL template contained in each session in the session by using a TF-IDF algorithm, and forming a session vector corresponding to the session based on the weight value.
In the above steps, TF-IDF (term frequency-inverse document frequency) is a statistical method, which is commonly used in text retrieval technology to evaluate the importance degree of a word to one of a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in the corpus. Where TF (Term Frequency) refers to the number of times or Frequency that a word appears in a Document, and IDF (Inverse Document Frequency) refers to the ability of a word to express semantics, where the fewer articles in which the word appears, the higher the IDF value.
In view of the fact that in step S1, the SQL statement in each session has been parsed to obtain a corresponding SQL template, that is, the SQL template has been used to describe the characteristics of the session, the session can be regarded as a document, the SQL template included in the document can be regarded as a word in the document, so that a weight value corresponding to the SQL template included in each session in the session can be calculated by using the TF-IDF algorithm, and finally, a session is represented as a TF-IDF value corresponding to different SQL templates included in the session. For example, if a session contains n SQL templates, the TF-IDF value (i.e. weight) of each SQL template in the session can be calculated by using the TF-IDF algorithm, and each TF-IDF value is a one-dimensional vector, so that the session can be represented as an n-dimensional vector for the subsequent similarity calculation.
Specifically, referring to fig. 3, the steps specifically include:
step S2-1: and counting the SQL template contained in each session and the occurrence times of each SQL template in the session.
In the above step, if there are three session messages a, b, and c in a certain period of time, the SQL template and the corresponding times thereof included in each session are shown in the following table:
TABLE 1 Session-SQL template times correspondence Table
Step S2-2: and calculating the IDF value and the TF value of each SQL template in each session according to the quantity and the occurrence frequency of the SQL templates, and calculating the weight value of the SQL template in the session according to the IDF value and the TF value.
In the above steps, firstly, the TF value corresponding to the SQL template included in each session is calculated according to the session, and the calculation method is: TF = number of occurrences of a certain SQL template in the current session/number of occurrences of all SQL templates in the session. If the number of times of the SQL template appearing in one session is larger, the corresponding TF value is larger. Exemplary TF values calculated according to the data in table 1 above are shown in the following table:
TABLE 2 Session-SQL template TF value correspondence table
Then, calculating the IDF value of each SQL template in the following way: IDF = log (number of sessions/number of sessions containing this SQL template). If an SQL template appears in more sessions, the lower its corresponding IDF value will be. Exemplary IDF values calculated according to the data in table 1 above are shown in the following table:
table 3 conversation-SQL template IDF value correspondence table
And finally, calculating the TF-IDF value of the SQL template contained in each session according to the TF value and the IDF value, wherein the calculation mode is as follows: TF-IDF = TF x IDF. Exemplary IDF values calculated according to the correlation data of tables 1-3 above are shown in the following table:
TABLE 4 Session-SQL template TF-IDF value correspondence table
Through the calculation process, the TF-IDF value of the SQL template reflects the importance degree of the SQL template in the session, so that the TF-IDF value of the SQL template can be used as the corresponding weight value of the SQL template in the session. I.e. to reflect the composition of a session by the TF-IDF value.
Step S2-3: and counting and recording the weight values corresponding to each SQL template in each session to form a multidimensional vector corresponding to the session, wherein the multidimensional vector is the session vector.
From the data in table 4 above, we can get a vector representation of the session. For example, session a can be represented as a 6-dimensional vector of [0,0.09,0.036,0.048,0 ]. Therefore, each conversation can be expressed into a multidimensional vector by calculating the TF-IDF value of each SQL template in each conversation, and the conversation vector corresponding to each conversation is obtained.
Further, the above steps further include normalizing the session vector. For the convenience of the following calculation, the session vector may be normalized so that the modulus (length) of the vector is 1. The normalized calculation mode is as follows: each element in the vector is divided by the sum of the squares of all elements. For example, the results obtained after the data in table 4 are normalized are shown in the following table:
table 5 session vector table
Thus, the session a can be expressed as a multidimensional vector [0,0.83,0.33,0.44, 0] with a modulo length of 1, which facilitates subsequent similarity calculation.
And step S3: calculating the similarity between any two/two groups of conversations based on the conversation vectors, marking the two/two groups of conversations with the similarity larger than a preset threshold as a conversation type, calculating the conversation vector corresponding to the conversation type, and recording the maximum weight value and the minimum weight value of each SQL template under the conversation type.
In the above steps, firstly, based on the session vector, the similarity between any two sessions is calculated by a cosine similarity algorithm. Then, judging whether the similarity between the two conversations is greater than a preset threshold value, if so, marking the two conversations as a conversation type, namely, dividing the two conversations into a group; and if the similarity between one session and the rest sessions is smaller than a preset threshold, the session is marked as a session type independently, namely the session is divided into a group independently. Illustratively, the similarity threshold may be set between 0.7-0.75, and if the similarity of session a and session b is 0.8, then session a and session b are classified into one category. Therefore, through the similarity calculation of the first layer, two conversations with larger similarity can be grouped into one group. Then, for the classified conversation types, calculating an average value of weight values (vector elements) of each SQL template under each conversation type, and performing normalization processing to obtain conversation vectors corresponding to the conversation types. And then, performing second-layer similarity calculation, namely calculating the similarity between any two session types, and performing second aggregation classification on the session types to reduce the classification quantity. And circulating the steps, and calculating the multilayer similarity until the times of the aggregation classification reaches the upper limit, or the number of groups of the session types reaches a preset standard, and stopping the calculation. And finally, recording the maximum weight value and the minimum weight value of each SQL template under each session type, so as to facilitate subsequent abnormal access judgment.
Through the multilayer similarity calculation process, all the conversations can be aggregated and classified, the classification condition can be adjusted according to the actual condition, and the method has strong flexibility. It should be noted that, the above calculating the similarity by the cosine similarity algorithm belongs to the prior art, and is not described herein again.
Based on the first aspect, in some embodiments of the present invention, when calculating the similarity between sessions and performing classification according to the similarity, a PCA dimension reduction method may be further used to perform auxiliary judgment.
Specifically, firstly, a PCA (principal component analysis) dimension reduction algorithm is used for carrying out dimension reduction processing on the session vector to obtain a low-dimensional vector. For example, the session a obtained from table 5 is [0,0.83,0.33,0.44, 0], which is a 6-dimensional vector, and can be reduced to a two-dimensional or three-dimensional vector by the PCA dimension reduction method, thereby reducing the amount of data computation. Then, drawing a PCA image based on the low-dimensional vector, checking the classification condition of the conversation through the PCA image, and adaptively adjusting the similarity threshold according to the classification condition so as to achieve a better classification effect.
And step S4: and acquiring new session information, and calculating to obtain a session vector corresponding to the new session.
In the above step, please refer to step S2 above in the vector representation process of the new session, and still use the TF-IDF algorithm to calculate the session vector corresponding to the new session.
Step S5: and calculating the similarity of the new session and all session types based on the session vector, and identifying abnormal access according to the similarity and the weight value of the SQL template.
In the above steps, the similarity between the new session and all session types is calculated based on the session vector, and whether the new session is similar to any session type is judged according to the similarity.
Specifically, if the new session is not similar to all session types, further comparing deviations of weighted values of SQL templates in the new session and all session types, displaying the SQL template with the largest deviation in the new session and the session types to the user, helping the user to judge whether the new session is accessed abnormally, and then manually marking whether the new session is accessed abnormally by the user. If the new session is marked as normal, the new session is singly marked as a session type, so that the abnormal recognition of the new session at the next time is facilitated. And if the new session is similar to a certain session type, namely a similar session type is matched, combining the weight value of the SQL template in the new session and the maximum weight value and the minimum weight value of the SQL template in the session type similar to the new session to identify abnormal access. Illustratively, the variance is calculated according to the weight values of all SQL templates contained in similar session types, then the variance is added to the maximum weight value of each SQL template as an upper limit, and the variance is subtracted from the minimum weight value as a lower limit, so as to obtain the range of the SQL template. The new session's SQL template weight value is then compared to the corresponding SQL template range in the similar session type. If the session is within the range, the new session is indicated as normal access; if the current session is not within the range, the new session is indicated to be abnormal access, and the user is reminded of finding the abnormality. By the method, the relatively fine abnormal behaviors in the conversation can be found, and the accuracy of abnormal recognition is improved.
In conclusion, the method performs multilayer similarity calculation based on the historical conversation information of the user to perform cluster analysis on the database access behaviors of the user, so that all the access behaviors are aggregated into a plurality of conversation types. And then, the new session is subjected to similarity matching with all session types, so that whether the new session belongs to abnormal access can be judged and identified, a rule is not required to be designed for each access behavior for independent judgment, the identification efficiency is improved, and the obtained identification result is more accurate. In addition, the group number of the aggregated session types can be adjusted manually to improve the grouping effect, so that the method has higher flexibility.
Based on the same inventive concept, the invention further provides a system for identifying abnormal access of a database based on session clustering, and please refer to fig. 4, where fig. 4 is a structural block diagram of the system for identifying abnormal access of a database based on session clustering provided in the embodiment of the present application. The system comprises:
the session information acquisition module 11 is configured to acquire a historical audit log, obtain session information from the historical audit log, and analyze the session information to obtain a corresponding SQL template;
the session vector calculation module 12 is configured to calculate, by using a TF-IDF algorithm, a weight value corresponding to the SQL template included in each session in the session, and form a session vector corresponding to the session based on the weight value;
the session clustering module 13 is used for calculating the similarity between any two/two groups of sessions based on the session vectors, marking the two/two groups of sessions with the similarity larger than a preset threshold as a session type, calculating the session vector corresponding to the session type, and recording the maximum weight value and the minimum weight value of each SQL template under the session type;
a new session obtaining module 14, configured to obtain new session information, and calculate to obtain a session vector corresponding to the new session;
and the new session abnormal access identification module 15 is used for calculating the similarity between the new session and all session types based on the session vector, and identifying abnormal access according to the similarity and the weight value of the SQL template.
Referring to fig. 5, fig. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device comprises a memory 1, a processor 2 and a communication interface 3, wherein the memory 1, the processor 2 and the communication interface 3 are electrically connected with each other directly or indirectly to realize the transmission or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 1 may be used to store software programs and modules, such as program instructions/modules corresponding to a system for identifying abnormal database access based on session clustering provided in an embodiment of the present application, and the processor 2 executes various functional applications and data processing by executing the software programs and modules stored in the memory 1. The communication interface 3 may be used for communication of signaling or data with other node devices.
The Memory 1 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Read-Only Memory (EEPROM), and the like.
The processor 2 may be an integrated circuit chip having signal processing capabilities. The Processor 2 may be a general-purpose Processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
It will be appreciated that the configuration shown in fig. 5 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 5 or have a different configuration than shown in fig. 1. The components shown in fig. 5 may be implemented in hardware, software, or a combination thereof.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (9)
1. A method for identifying abnormal access of a database based on session clustering is characterized by comprising the following steps:
obtaining historical audit logs, obtaining session information from the historical audit logs, analyzing the session information to obtain corresponding session information SQL A template;
calculating to obtain the content of each session by using TF-IDF algorithm SQL The template forms a corresponding weight value in the session, and a session vector corresponding to the session is formed based on the weight value; the method comprises the following steps:
counting the number of sessions each contains SQL Form and each SQL The number of times the template appears in the session; according to SQL The number of templates and the number of occurrences count the individual in each session SQL IDF value and TF value of the template, and calculating the IDF value and TF value SQL The weight value of the template corresponding to the conversation; counting and recording each individual in each session SQL Forming a multidimensional vector corresponding to the conversation by the weighted value corresponding to the template, wherein the multidimensional vector is the conversation vector;
calculating the similarity between any two/two groups of conversations based on the conversation vectors, marking the two/two groups of conversations with the similarity larger than a preset threshold as a conversation type, calculating the conversation vector corresponding to the conversation type, and recording each conversation type under the conversation type SQL The maximum weight value and the minimum weight value of the template;
acquiring new session information, and calculating to obtain a session vector corresponding to the new session;
and calculating the similarity of the new session and all session types based on the session vector, and identifying abnormal access according to the similarity and the weight value of the SQL template.
2. The method of claim 1, wherein the obtaining of the historical audit log, obtaining session information therefrom, and parsing the session information to obtain corresponding session information SQL The template comprises the following steps:
acquiring historical audit logs of all users accessing a database, and obtaining session information from the historical audit logs;
based on presettings SQL In white list vs. session information SQL Filtering the sentences to obtain filtered sentences SQL A sentence;
will be provided with SQL Removing parameters in the statement to obtain corresponding parameters SQL And (5) template.
3. The method for identifying abnormal access to database based on session clustering as claimed in claim 1, wherein the method for identifying abnormal access to database based on session vector comprises calculating similarity between any two or two groups of sessions, marking the two or two groups of sessions with similarity greater than a preset threshold as a session type, calculating to obtain the session vector corresponding to the session type, and recording each session vector under the session type SQL The steps of maximum and minimum weight values of a template include:
calculating the similarity between any two/two groups of conversations by a cosine similarity algorithm based on the conversation vectors;
judging whether the similarity between the two/two groups of conversations is larger than a preset threshold value, if so, marking the two/two groups of conversations as a conversation type; if the similarity between one session and the rest sessions is smaller than a preset threshold value, independently marking the session as a session type;
calculate each under each conversation type SQL Obtaining a session vector corresponding to the session type by the average value of the weight values of the templates;
record each under each session type SQL The maximum weight value and the minimum weight value of the template.
4. The method for identifying abnormal database access based on session clustering as claimed in claim 1, wherein the similarity between the new session and all session types is calculated based on the session vector, and the sum of the similarities is used as the basis SQL The step of identifying abnormal access by the weight value of the template comprises the following steps:
calculating the similarity between the new session and all session types based on the session vector, and judging whether the new session is similar to any session type according to the similarity;
if so, then join in the new session SQL Weight value of template and in the type of session similar to the new session SQL Identifying abnormal access by the maximum weight value and the minimum weight value of the template;
if none are similar, combine the new session with all session types SQL The weighted value of the template is used for manually marking whether the new session is abnormal or not by a user; if the flag is normal, the new session is individually marked as a session type.
5. The method for identifying abnormal database access based on session clustering according to claim 1, further comprising:
carrying out dimensionality reduction on the session vector by utilizing a PCA dimensionality reduction algorithm to obtain a low-dimensional vector;
drawing a PCA image based on the low-dimensional vector, and checking the classification condition of the conversation through the PCA image.
6. The method for identifying abnormal access to a database based on session clustering as claimed in claim 1, further comprising normalizing the session vectors.
7. A system for identifying abnormal database access based on session clustering, comprising:
a session information acquisition module for acquiring the historical audit log, obtaining the session information therefrom, and analyzing the session information to obtain the corresponding session information SQL A template;
a session vector calculation module for calculating the parameters contained in each session by using TF-IDF algorithm SQL The template forms a corresponding weight value in the session, and a session vector corresponding to the session is formed based on the weight value;
the conversation clustering module is used for calculating the similarity between any two or two groups of conversations based on the conversation vector, marking the two or two groups of conversations with the similarity larger than a preset threshold as a conversation type, and calculating to obtain the conversation direction corresponding to the conversation typeAmount, record each under the session type SQL The maximum weight value and the minimum weight value of the template;
the new session acquisition module is used for acquiring new session information and calculating to obtain a session vector corresponding to the new session;
a new session abnormal access identification module for calculating the similarity between the new session and all session types based on the session vector and according to the similarity sum SQL The weight value of the template identifies an abnormal access.
8. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
the one or more programs, when executed by the processor, implement the method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211408695.8A CN115587132B (en) | 2022-11-11 | 2022-11-11 | Method and system for identifying abnormal access of database based on session clustering |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211408695.8A CN115587132B (en) | 2022-11-11 | 2022-11-11 | Method and system for identifying abnormal access of database based on session clustering |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115587132A CN115587132A (en) | 2023-01-10 |
| CN115587132B true CN115587132B (en) | 2023-03-10 |
Family
ID=84781218
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211408695.8A Active CN115587132B (en) | 2022-11-11 | 2022-11-11 | Method and system for identifying abnormal access of database based on session clustering |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115587132B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116112265B (en) * | 2023-02-13 | 2023-07-28 | 山东云天安全技术有限公司 | Abnormal session determining method, electronic equipment and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9043355B1 (en) * | 2009-10-16 | 2015-05-26 | Iqor U.S. Inc. | Apparatuses, methods and systems for a journal entry automator |
| CN108388508A (en) * | 2018-01-29 | 2018-08-10 | 华南理工大学 | A kind of test cases selection method based on user conversation and hierarchical clustering algorithm |
| CN108595655A (en) * | 2018-04-27 | 2018-09-28 | 福建师范大学 | A kind of abnormal user detection method of dialogue-based characteristic similarity fuzzy clustering |
| CN114610754A (en) * | 2022-03-10 | 2022-06-10 | 中国电信股份有限公司 | SQL log anomaly detection method and device, storage medium and electronic equipment |
| CN114640499A (en) * | 2022-02-11 | 2022-06-17 | 深圳昂楷科技有限公司 | Method and device for carrying out abnormity identification on user behavior |
-
2022
- 2022-11-11 CN CN202211408695.8A patent/CN115587132B/en active Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9043355B1 (en) * | 2009-10-16 | 2015-05-26 | Iqor U.S. Inc. | Apparatuses, methods and systems for a journal entry automator |
| CN108388508A (en) * | 2018-01-29 | 2018-08-10 | 华南理工大学 | A kind of test cases selection method based on user conversation and hierarchical clustering algorithm |
| CN108595655A (en) * | 2018-04-27 | 2018-09-28 | 福建师范大学 | A kind of abnormal user detection method of dialogue-based characteristic similarity fuzzy clustering |
| CN114640499A (en) * | 2022-02-11 | 2022-06-17 | 深圳昂楷科技有限公司 | Method and device for carrying out abnormity identification on user behavior |
| CN114610754A (en) * | 2022-03-10 | 2022-06-10 | 中国电信股份有限公司 | SQL log anomaly detection method and device, storage medium and electronic equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115587132A (en) | 2023-01-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10459971B2 (en) | Method and apparatus of generating image characteristic representation of query, and image search method and apparatus | |
| CN113254255B (en) | A cloud platform log analysis method, system, device and medium | |
| US20110078106A1 (en) | Method and system for it resources performance analysis | |
| CN110457405B (en) | Database auditing method based on blood relationship | |
| CN109656928B (en) | Method and device for obtaining relationships between tables | |
| CN112306820A (en) | Log operation and maintenance root cause analysis method and device, electronic equipment and storage medium | |
| CN111143433B (en) | Method and device for counting data in data bin | |
| CN111597550A (en) | A kind of log information analysis method and related device | |
| CN111914294A (en) | Database sensitive data identification method and system | |
| CN115587132B (en) | Method and system for identifying abnormal access of database based on session clustering | |
| CN110737650A (en) | Data quality detection method and device | |
| Tsytsarau et al. | Managing diverse sentiments at large scale | |
| CN107729337B (en) | Event monitoring method and device | |
| CN112860808A (en) | User portrait analysis method, device, medium and equipment based on data tag | |
| CN114647636B (en) | Big data anomaly detection method and system | |
| WO2023198284A1 (en) | Method for summarizing operational log data | |
| CN119990322A (en) | A table question answering method and system based on large language model | |
| CN107169356B (en) | Statistical analysis method and device | |
| CN111667023B (en) | Method and device for acquiring articles of target category | |
| CN114064757A (en) | Application program optimization method, device, equipment and medium | |
| US20220207049A1 (en) | Methods, devices and systems for processing and analysing data from multiple sources | |
| CN113705625A (en) | Method and device for identifying abnormal life guarantee application families and electronic equipment | |
| CN118210680A (en) | Database health evaluation system | |
| CN115510289B (en) | Data cube configuration method and device, electronic equipment and storage medium | |
| CN112100229B (en) | Data state multi-dimensional monitoring method, system, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |