CN118445416A - Topic cluster analysis method and device for unstructured data - Google Patents
- Publication number: CN118445416A (application CN202410668802.3A)
- Authority: CN (China)
- Prior art keywords: topic, time, data, unstructured data, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35: Clustering; Classification (G: Physics; G06: Computing or calculating; counting; G06F: Electric digital data processing; G06F16/00: Information retrieval; database structures and file system structures therefor; G06F16/30: information retrieval of unstructured textual data)
- G06Q40/02: Banking, e.g. interest calculation or account maintenance (G06Q: information and communication technology specially adapted for administrative, commercial, financial, managerial or supervisory purposes; G06Q40/00: Finance; Insurance; Tax strategies; processing of corporate or income taxes)
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- General Physics & Mathematics (AREA)
- Development Economics (AREA)
- Technology Law (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Economics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a topic cluster analysis method and device for unstructured data, relating to the technical fields of artificial intelligence and finance. The method comprises the following steps: performing data preprocessing on unstructured data to be analyzed; extracting global features and time sequence features from the preprocessed unstructured data; introducing the time dimension into the LDA model to form a time LDA topic clustering model; and taking the global features and time sequence features as inputs of the time LDA topic clustering model to output the topic clustering result of the unstructured data. By introducing the time dimension into the LDA model to form the time LDA topic clustering model, the clustering result can reflect the temporal trend of the unstructured data. The method not only improves the efficiency and accuracy of data analysis, but also comprehensively utilizes global features and time sequence features, providing a more comprehensive data representation.
Description
Technical Field
The invention relates to the technical field of artificial intelligence and finance, in particular to a topic cluster analysis method and device for unstructured data.
Background
In banking, unstructured data is often characterized by high dimensionality, diversity, and noise. Such data typically contains a large amount of information, but because of its irregular structure it is difficult to analyze and visualize effectively. For example, the off-site data analysis that internal bank audit performs when evaluating the operation of an operational risk management system relies mainly on the operational risk loss event list provided by the audited unit; the data tables are large, the unstructured description texts are complex and varied, and identifying risk topics by manual means is inefficient. How to efficiently cluster and categorize topics in unstructured data is therefore an important challenge.
Common topic clustering methods include K-Means, Isolation Forest, and Latent Dirichlet Allocation (LDA). The main advantage of the K-Means clustering algorithm is its simplicity and ease of use, but it is not applicable to clusters that are not convex in shape and is relatively sensitive to the choice of initialization points. The main advantage of Isolation Forest clustering is its ability to handle high-dimensional data and avoid the "hot spot" problem of K-Means clustering, but it requires a large amount of training data and computational resources and is relatively sensitive to cluster shape and size. LDA performs topic analysis of documents: it gives the topic of each document in a document set in the form of a probability distribution, and after extracting the topic distributions it can cluster topics or classify texts accordingly. LDA has been successfully applied in many fields and can help users automatically identify the topic content of texts and extract useful information from them. However, the LDA model alone cannot analyze trends in text over time.
Disclosure of Invention
In view of the foregoing, the present invention provides a method and apparatus for topic cluster analysis of unstructured data to solve at least one of the problems mentioned above.
In order to achieve the above purpose, the present invention adopts the following scheme:
According to a first aspect of the present invention, there is provided a topic cluster analysis method for unstructured data, the method comprising: performing data preprocessing on unstructured data to be analyzed; extracting global features and time sequence features from the preprocessed unstructured data; introducing the time dimension into the LDA model to form a time LDA topic clustering model; and taking the global features and time sequence features as inputs of the time LDA topic clustering model to output the topic clustering result of the unstructured data.
As one embodiment of the present invention, introducing the time dimension into the LDA model to form the time LDA topic clustering model includes: collecting unstructured data containing timestamps as pre-training data; preprocessing the pre-training data; extracting global features from the preprocessed unstructured data using the TF-IDF method; dividing the preprocessed unstructured data into different time windows; defining the model parameters of the time LDA topic clustering model and initializing the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window, wherein the model parameters include the number of topics, the number of documents, the number of time windows, the Dirichlet prior of the topic distribution, the Dirichlet prior of the word distribution, and the Dirichlet prior of the time distribution; iteratively updating the time LDA topic clustering model using Gibbs sampling until the model converges; and evaluating the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window output by the model to check the accuracy and stability of the model.
As an embodiment of the present invention, iteratively updating the time LDA topic clustering model using Gibbs sampling includes: for each word of each document, sampling its topic assignment using:

p(z_di = k | w_di = w, t_d = t, rest) ∝ (n_{d,k}^{¬i} + α) / (n_d^{¬i} + K·α) × (n_{k,w}^{¬i} + β) / (n_k^{¬i} + V·β) × (n_{t,k}^{¬i} + γ) / (n_k^{¬i} + T·γ)

In the above formula, p denotes a conditional probability; z_di is a random variable representing the topic assignment of the i-th word in document d; w_di denotes the i-th word in document d; t_d denotes the timestamp (time window) of document d; α (alpha) is the Dirichlet prior of the topic distribution; β (beta) is the Dirichlet prior of the word distribution; γ (gamma) is the Dirichlet prior of the time distribution; ∝ means "proportional to"; n_{d,k}^{¬i} is the count of topic k in document d excluding the current word i; n_d^{¬i} is the total count of all words in document d excluding the current word i; n_{k,w}^{¬i} is the count of word w in topic k excluding the current word i; n_k^{¬i} is the total count of all words in topic k excluding the current word i; n_{t,k}^{¬i} is the count of topic k in time window t excluding the current word i; K is the total number of topics; V is the size of the vocabulary; and T is the total number of time windows. The topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window are then updated according to the sampling result.
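As an illustration only, the per-word sampling step can be sketched in pure Python on toy counts. Every value below (topic/vocabulary/window sizes, priors, count arrays) is hypothetical, and the time term follows a Topics-over-Time-style normalization; this is a sketch, not the patent's implementation.

```python
# Toy collapsed-Gibbs step for a temporal LDA model. All count arguments
# assume the current word i has already been decremented (the "¬i" counts).
K, V, T = 2, 4, 2                    # hypothetical: topics, vocabulary, windows
alpha, beta, gamma = 0.5, 0.1, 0.5   # hypothetical Dirichlet priors

def topic_probabilities(n_dk, n_d, n_kw, n_k, n_tk, w, t):
    """Normalized p(z_di = k | w_di = w, t_d = t, rest) for each topic k."""
    probs = []
    for k in range(K):
        doc_term  = (n_dk[k] + alpha) / (n_d + K * alpha)        # topic in document
        word_term = (n_kw[k][w] + beta) / (n_k[k] + V * beta)    # word in topic
        time_term = (n_tk[t][k] + gamma) / (n_k[k] + T * gamma)  # topic in window
        probs.append(doc_term * word_term * time_term)
    total = sum(probs)
    return [p / total for p in probs]
```

A full sampler would draw a topic from these probabilities for every word of every document, re-increment the counts with the drawn topic, and repeat until convergence.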
As one embodiment of the present invention, introducing the time dimension into the LDA model to form the time LDA topic clustering model includes: collecting unstructured data containing timestamps as pre-training data; calculating the similarity between time windows using JS (Jensen-Shannon) divergence, and selecting for slicing the window size that gives a large difference in topic content between time windows and a small topic difference within each window; and identifying a set of temporally continuous topics based on the slicing result.
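A minimal sketch of the JS-divergence comparison between the topic distributions of two time windows (pure Python; the example distributions in the usage are made up):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence (base 2), skipping zero-probability terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions; it is
    symmetric and bounded in [0, 1], so 0 means identical topic content."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Window sizes for which adjacent windows show large JS divergence (distinct topic content between windows) while documents inside a window stay similar would then be preferred for slicing.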
As an embodiment of the present invention, extracting global features from the preprocessed unstructured data includes: extracting the word frequency, inverse document frequency, and TF-IDF vector of each word from the preprocessed unstructured data based on the TF-IDF method.
As an embodiment of the present invention, extracting time sequence features from the preprocessed unstructured data includes: dividing the preprocessed unstructured data into different time windows according to the timestamps; and, within each time window, extracting keywords and phrases based on the TF-IDF method and generating corresponding feature vectors.
As an embodiment of the present invention, after extracting the global features and time sequence features from the preprocessed unstructured data, the method further includes: combining the global features and time sequence features into a data matrix; performing data standardization on the data matrix; calculating a covariance matrix from the standardized data matrix; performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors, wherein the eigenvalues represent the variance of each principal component and the eigenvectors represent the directions of the principal components; selecting the first k principal components according to the magnitude of the eigenvalues; and projecting the combined data matrix onto the selected principal components to obtain dimension-reduced feature data.
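The standardize / covariance / eigendecompose / project pipeline described above is standard PCA; it can be sketched with NumPy as follows. This is a generic sketch of those steps, not the patent's implementation, and the sample matrix in the usage is made up.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce an (n_samples, n_features) matrix to its first k principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # data standardization
    cov = np.cov(Xs, rowvar=False)              # covariance of standardized data
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalue decomposition
    order = np.argsort(eigvals)[::-1]           # largest variance first
    components = eigvecs[:, order[:k]]          # top-k principal directions
    return Xs @ components                      # projection onto the components
```

`numpy.linalg.eigh` is used because a covariance matrix is symmetric; the component with the largest eigenvalue carries the most variance, so the first output column is the most informative.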
According to a second aspect of the present invention, there is provided a topic cluster analysis apparatus for unstructured data, the apparatus comprising: a preprocessing unit for performing data preprocessing on unstructured data to be analyzed; a feature extraction unit for extracting global features and time sequence features from the preprocessed unstructured data; a time model generation unit for introducing the time dimension into the LDA model to form a time LDA topic clustering model; and a topic clustering unit for taking the global features and time sequence features as inputs of the time LDA topic clustering model to output the topic clustering result of the unstructured data.
As an embodiment of the present invention, the time model generation unit includes: a data collection module for collecting unstructured data containing timestamps as pre-training data; a preprocessing module for preprocessing the pre-training data; a global feature extraction module for extracting global features from the preprocessed unstructured data using the TF-IDF method; a time window division module for dividing the preprocessed unstructured data into different time windows; an initialization module for defining the model parameters of the time LDA topic clustering model and initializing the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window, wherein the model parameters include the number of topics, the number of documents, the number of time windows, the Dirichlet prior of the topic distribution, the Dirichlet prior of the word distribution, and the Dirichlet prior of the time distribution; an iterative update module for iteratively updating the time LDA topic clustering model using Gibbs sampling until the model converges; and a model evaluation module for evaluating the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window output by the model to check the accuracy and stability of the model.
As an embodiment of the present invention, the iterative update module iteratively updates the time LDA topic clustering model using Gibbs sampling by, for each word of each document, sampling its topic assignment using:

p(z_di = k | w_di = w, t_d = t, rest) ∝ (n_{d,k}^{¬i} + α) / (n_d^{¬i} + K·α) × (n_{k,w}^{¬i} + β) / (n_k^{¬i} + V·β) × (n_{t,k}^{¬i} + γ) / (n_k^{¬i} + T·γ)

In the above formula, p denotes a conditional probability; z_di is a random variable representing the topic assignment of the i-th word in document d; w_di denotes the i-th word in document d; t_d denotes the timestamp (time window) of document d; α (alpha) is the Dirichlet prior of the topic distribution; β (beta) is the Dirichlet prior of the word distribution; γ (gamma) is the Dirichlet prior of the time distribution; ∝ means "proportional to"; n_{d,k}^{¬i} is the count of topic k in document d excluding the current word i; n_d^{¬i} is the total count of all words in document d excluding the current word i; n_{k,w}^{¬i} is the count of word w in topic k excluding the current word i; n_k^{¬i} is the total count of all words in topic k excluding the current word i; n_{t,k}^{¬i} is the count of topic k in time window t excluding the current word i; K is the total number of topics; V is the size of the vocabulary; and T is the total number of time windows. The topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window are then updated according to the sampling result.
As an embodiment of the present invention, the time model generation unit includes: a data collection module for collecting unstructured data containing timestamps as pre-training data; a window determination module for calculating the similarity between time windows using JS divergence and selecting for slicing the window size that gives a large difference in topic content between time windows and a small topic difference within each window; and a topic identification module for identifying a set of temporally continuous topics based on the slicing result.
As an embodiment of the present invention, the feature extraction unit extracts global features from the preprocessed unstructured data by extracting the word frequency, inverse document frequency, and TF-IDF vector of each word based on the TF-IDF method.
As an embodiment of the present invention, the feature extraction unit extracts time sequence features from the preprocessed unstructured data by dividing the preprocessed unstructured data into different time windows according to the timestamps and, within each time window, extracting keywords and phrases based on the TF-IDF method and generating corresponding feature vectors.
As an embodiment of the present invention, the apparatus further includes a dimension reduction unit configured to perform dimension reduction on the global features and time sequence features, the dimension reduction unit including: a combination module for combining the global features and time sequence features into a data matrix; a normalization module for performing data standardization on the data matrix; a covariance calculation module for calculating a covariance matrix from the standardized data matrix; a decomposition module for performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors, wherein the eigenvalues represent the variance of each principal component and the eigenvectors represent the directions of the principal components; an eigenvalue selection module for selecting the first k principal components according to the magnitude of the eigenvalues; and a data projection module for projecting the combined data matrix onto the selected principal components to obtain dimension-reduced feature data.
According to a third aspect of the present invention there is provided an electronic device comprising a memory, a processor and a computer program stored on said memory and executable on said processor, the processor implementing the steps of the above method when executing said computer program.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to the above technical solution, the topic cluster analysis method and apparatus for unstructured data provided by the present invention introduce the time dimension into the LDA model to form a time LDA topic clustering model, so that the clustering result can reflect the temporal trend of the unstructured data. The method not only improves the efficiency and accuracy of data analysis, but also comprehensively utilizes global features and time sequence features, providing a more comprehensive data representation. In banking in particular, the method can effectively cope with the challenges of huge business volume and complex, varied descriptive content, reduce the workload and cost of manual identification, improve the visualization and comprehensibility of the data, help banks identify and respond to operational risks more effectively, and improve business security and stability.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort. In the drawings:
FIG. 1 is a schematic flow chart of a topic cluster analysis method for unstructured data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the training flow of a time LDA topic clustering model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the training flow of a time LDA topic clustering model according to another embodiment of the present invention;
FIG. 4 is a flowchart of a topic cluster analysis method for unstructured data according to another embodiment of the present invention;
FIG. 5 is a schematic flow chart of a dimension reduction process according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a topic cluster analysis apparatus for unstructured data according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a time model generation unit according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a time model generation unit according to another embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a topic cluster analysis apparatus for unstructured data according to another embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a dimension reduction unit according to an embodiment of the present invention;
FIG. 11 is a schematic block diagram of the system configuration of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
The information collected in this technical solution is information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of the relevant data comply with the relevant laws, regulations, and standards of the relevant countries and regions; necessary security measures are taken; the public interest is not violated; and a corresponding operation entrance is provided for the user to choose to authorize or refuse. A corresponding operation entrance is also provided for the user to accept or reject the automated decision result; if the user chooses to reject it, an expert decision flow is entered.
Fig. 1 is a schematic flow chart of a topic clustering analysis method for unstructured data according to an embodiment of the present invention, where the method includes the following steps:
Step S101: and carrying out data preprocessing on unstructured data to be analyzed.
Unstructured data is data without a predefined structure or rules. It is often irregular and difficult to model in conventional ways, and the information it contains can be very extensive and diverse, for example text, images, and audio.
The application first performs data preprocessing on the original unstructured data to be analyzed; specific operations may include stop-word removal, punctuation removal, and digit removal. The raw data sources involved in this step mainly include the unstructured text inventory data to be analyzed, together with stop-word templates and specialized vocabulary templates. The stop words covered by the stop-word templates generally include common abbreviations, auxiliary words, interrogative words, and the like, so that meaningless words are avoided when generating text, improving the model's effectiveness. The specialized vocabulary templates represent technical terms in specific fields; they help the model express knowledge more accurately when generating text and improve the model's domain expertise.
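A minimal sketch of the preprocessing operations named above (the stop-word set below is a hypothetical stand-in for the stop-word template, which in practice would be domain-specific):

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to"}   # hypothetical template content

def preprocess(text):
    """Lower-case the text, then strip punctuation, digits, and stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation removal
    text = re.sub(r"\d+", " ", text)              # digit removal
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

For example, `preprocess("The 3 risks of audit, and review!")` keeps only the content-bearing tokens `risks`, `audit`, `review`.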
Step S102: global features and time series features are extracted from the pre-processed unstructured data, respectively.
After data preprocessing is completed, features are extracted from the cleaned unstructured data; global features and time sequence features are extracted simultaneously. The global features represent characteristics of the entire data set and can be fed directly into the LDA model for topic modeling and cluster analysis. The purpose of time sequence feature extraction is to analyze feature changes across different time periods and capture the evolution of topics over time; feeding the time sequence features into the LDA model helps the model understand topic changes and trends in different periods.
Preferably, extracting the global feature from the pre-processed unstructured data in this step may include: the word frequency, the inverse document frequency and the TF-IDF vector of each word are extracted from the unstructured data after preprocessing based on the TF-IDF method.
Extracting the word frequency of each word means calculating how often the word occurs in a document; extracting the inverse document frequency means calculating the inverse of the word's frequency of occurrence across the whole document set; and the TF-IDF vector is obtained by combining the two: the TF-IDF value of each word is calculated to form a vector in which each element represents the importance of a word in the document. Through these steps, representative global features can be extracted from the preprocessed unstructured data for subsequent topic cluster analysis.
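The word frequency / inverse document frequency / TF-IDF computation just described can be sketched in plain Python over tokenized documents (illustrative only; a production system would typically use a library implementation):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """TF-IDF per word per document: tf(w, d) * log(N / df(w)),
    where df(w) is the number of documents containing w."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        vectors.append({w: (c / total) * math.log(n / df[w])
                        for w, c in counts.items()})
    return vectors
```

Note that a word occurring in every document gets an IDF of log(1) = 0, so its TF-IDF weight vanishes: such words carry no discriminative information.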
Preferably, extracting the time sequence features from the preprocessed unstructured data in this step may include: dividing the preprocessed unstructured data into different time windows according to the timestamps; and, within each time window, extracting keywords and phrases based on the TF-IDF method and generating corresponding feature vectors.
The preprocessed unstructured data is divided according to the timestamps. Specifically, an appropriate time window size may be selected according to business requirements, for example dividing the data into multiple time windows by day, week, or month. Each time window contains all events or records that occurred during that period.
Then, within each time window, the frequency of occurrence of each word across all documents in that window, i.e. the word frequency (TF), is calculated. At the same time, the inverse document frequency (IDF) of each word over the whole document set is calculated; the higher the IDF value, the less common the word is in the document set and the more discriminative it is. The TF-IDF value of each word is then obtained by combining the word frequency and the inverse document frequency: the higher the TF-IDF value, the more important the word is to the documents in the current time window. Finally, the TF-IDF values of all documents in each time window are combined to generate a feature vector, each element of which represents the TF-IDF value of one word in that time window.
Through these steps, time sequence features can be extracted from the preprocessed unstructured data, so that the clustering result can reflect the temporal trend of the data and the time dynamics of the data can be better understood and analyzed.
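The per-window keyword extraction can be sketched as follows, using plain term frequencies as a simplified stand-in for the full TF-IDF weighting described above (the window label and documents in the usage are made up):

```python
from collections import Counter

def window_keywords(windows, top_n=2):
    """For each time window (a dict value holding tokenized documents),
    return the top_n most frequent terms as simple keyword candidates."""
    result = {}
    for label, docs in windows.items():
        counts = Counter(w for doc in docs for w in doc)
        result[label] = [w for w, _ in counts.most_common(top_n)]
    return result
```

In the full scheme, each window's counts would additionally be weighted by IDF over the whole corpus and assembled into a fixed-length feature vector per window.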
Step S103: the time dimension is introduced into the LDA model to form a temporal LDA topic cluster model.
In this step, the time dimension is introduced into the LDA model to form the time LDA topic clustering model. The LDA model estimates the topic distribution of a document by analyzing the word frequency distribution within it; after the time dimension is introduced, the model considers not only the distribution of words in each document but also how that distribution changes over time, so the temporal evolution of topics can be captured more accurately. Because a bank's operational risk events, regulatory rules, and customer service requirements change over time, cluster analysis of the bank's unstructured data must take the temporal variation of words into account, making the final clustering results more accurate.
Step S104: and taking the global feature and the time sequence feature as inputs of the time LDA topic clustering model to output the unstructured data topic clustering result.
In the final step, the extracted global features and time sequence features are input into the time LDA topic clustering model for modeling and analysis. The model uses these features to perform topic modeling and obtain the topic distribution of each document. Documents are divided into different topics by calculating their similarity in the topic space, and the topic clustering result of the unstructured data is output. Finally, the clustering result can be displayed visually, and the specific event descriptions can be analyzed along the dimensions of the risk event's owning institution and business line, so that users (such as auditors) can view the overall analysis results more intuitively.
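Turning per-document topic distributions into a clustering result can be sketched with a simple highest-probability assignment; the patent does not fix a particular similarity rule, so this argmax rule is an assumption for illustration:

```python
def cluster_by_topic(doc_topic_dists):
    """Assign each document to its highest-probability topic; documents
    sharing an index end up in the same topic cluster."""
    return [max(range(len(dist)), key=dist.__getitem__)
            for dist in doc_topic_dists]
```

For example, documents with distributions [0.7, 0.3] and [0.2, 0.8] land in clusters 0 and 1 respectively; ties go to the lower-numbered topic.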
According to the above technical solution, the topic cluster analysis method for unstructured data provided by the present invention introduces the time dimension into the LDA model to form a time LDA topic clustering model, so that the clustering result can reflect the temporal trend of the unstructured data. The method not only improves the efficiency and accuracy of data analysis, but also comprehensively utilizes global features and time sequence features, providing a more comprehensive data representation. In banking in particular, the method can effectively cope with the challenges of huge business volume and complex, varied descriptive content, reduce the workload and cost of manual identification, improve the visualization and comprehensibility of the data, help banks identify and respond to operational risks more effectively, and improve business security and stability.
The training process of the time LDA topic clustering model is further described below. As shown in FIG. 2, the training flow of the time LDA topic clustering model provided by an embodiment of the present invention includes the following steps:
Step S201: unstructured data containing time stamps is collected as pre-training data.
First, the sources of the unstructured data to be analyzed are determined. For a banking system, possible data sources include operational risk event lists, customer service records, audit finding records, regulatory rule documents, and internal regulation documents. Relevant unstructured data, typically in text form, is then collected from the determined sources; it may include descriptive text, reports, mail, logs, and so on.
It must be ensured that each piece of unstructured data contains timestamp information, which may be the date and time the data was generated, for subsequent time window division and time series analysis.
The collected unstructured data may be formatted, ensuring that each formatted data record contains the following fields: (1) text content: the primary content of the unstructured data; (2) timestamp: the date and time the data was generated.
The formatted data is stored in a suitable storage medium, such as a database, file system, or distributed storage system, to ensure the reliability and accessibility of the data storage for subsequent data preprocessing and model training.
Preferably, the collected data can be subjected to a quality inspection to ensure its integrity and accuracy, checking for missing timestamps or text content and for duplicate records. Missing or abnormal data can be handled appropriately, for example by completing missing information or deleting duplicate records.
Through these steps, unstructured data containing timestamps can be collected, providing the underlying data for subsequent preprocessing, feature extraction, and training of the time LDA topic clustering model.
Step S202: and preprocessing the pre-training data.
Preprocessing ensures the quality and consistency of the data and provides high-quality input for subsequent training of the time LDA topic clustering model. Preprocessing operations may include stop-word removal, punctuation removal, digit removal, text normalization, and word segmentation.
In this embodiment, for subsequent model training, the preprocessed data may be divided into a training set, a validation set, and a test set, such as 80% of the data used for training, 10% of the data used for validation, and 10% of the data used for testing.
Preferably, the pre-processed text data may also be converted into a format suitable for model input. For example, text data is converted into a Bag of Words model (Bag of Words) or Word Vectors (Word Vectors) for input into the LDA model for training.
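The bag-of-words conversion described above can be sketched as follows (a pure-Python illustration; the sample documents and the sorted-vocabulary convention are assumptions for demonstration):

```python
from collections import Counter

def build_bow(docs):
    """Build a shared vocabulary and bag-of-words count vectors.

    docs: list of token lists (already pre-processed and segmented).
    Returns (vocab, vectors) where vectors[d][v] is the count of
    vocab[v] in document d.
    """
    vocab = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = [["loan", "overdue", "risk"], ["loan", "fraud", "fraud"]]
vocab, vectors = build_bow(docs)
```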
Step S203: and extracting global features from the unstructured data after preprocessing by using a TF-IDF method.
In this embodiment, the word frequency, the inverse document frequency, and the TF-IDF vector of each word may be extracted from the unstructured data after preprocessing based on the TF-IDF method, and the extracted word frequency, inverse document frequency, and TF-IDF vector may be used as global features.
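A minimal illustration of the TF-IDF extraction in this step. The exact weighting variant is not specified by the scheme, so a common smoothed form, log(N/df) + 1, is assumed here, and the sample documents are invented for demonstration:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute IDF values and per-document TF-IDF feature maps
    for tokenised documents (smoothed variant: log(N/df) + 1)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    features = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # TF is the relative frequency of the word within the document
        features.append({w: (c / total) * idf[w] for w, c in counts.items()})
    return idf, features

docs = [["risk", "loan"], ["loan", "loan", "fraud"]]
idf, features = tf_idf(docs)
```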
Step S204: dividing the unstructured data after preprocessing according to different time windows.
In this embodiment, the length of the time window may be determined according to the analysis requirement; for example, the data may be divided by day, week, month, or quarter. The data is first sorted by timestamp to ensure chronological order, then divided into segments according to the chosen window length; with monthly division, for instance, the data is grouped by each month's time range. For each time window, a new data set is created containing all data records within that window.
In addition, this embodiment can handle data records that fall on a time window boundary, ensuring each record is assigned to exactly one window: if a record's timestamp falls exactly on the boundary between two windows, it is assigned according to a predefined rule.
Finally, in this embodiment, the divided time window data set may also be verified to ensure accuracy and integrity of data division, for example, checking the data amount and time range of each time window, so as to ensure that no data is missing or repeated.
Through the steps, the pre-processed unstructured data can be divided according to different time windows, and time sequence data support is provided for subsequent LDA topic clustering analysis, so that topic changes and trends of the data in different periods can be analyzed, and timeliness and accuracy of the data analysis are improved.
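The monthly-window partition described above can be sketched as below. The boundary rule used, a record exactly on a boundary joins the window it opens, is one possible predefined rule, chosen here only for illustration:

```python
from collections import defaultdict
from datetime import datetime

def partition_by_month(records):
    """Group (timestamp, text) records into monthly time windows.

    Records are sorted by timestamp first so each window preserves
    chronological order; a record on a month boundary belongs to the
    month its timestamp opens.
    """
    windows = defaultdict(list)
    for ts, text in sorted(records, key=lambda r: r[0]):
        windows[(ts.year, ts.month)].append(text)
    return dict(windows)

records = [
    (datetime(2024, 2, 1), "audit finding A"),
    (datetime(2024, 1, 15), "customer complaint"),
    (datetime(2024, 1, 3), "ops risk event"),
]
monthly = partition_by_month(records)
```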
Step S205: model parameters of a time LDA topic clustering model are defined and the topic distribution of each document, the word distribution of each topic and the topic distribution of each time window are initialized, wherein the model parameters comprise the topic number, the document number, the time window number, the Dirichlet prior of the topic distribution, the Dirichlet prior of the word distribution and the Dirichlet prior of the time distribution.
The number of topics (K) is the number of latent topics; the initial number may be determined by grid search or empirically. The number of documents (D) is the total number of documents in the preprocessed data set. The number of time windows (T) is the total number of time periods after the data set is divided by time window. The Dirichlet prior of the topic distribution (α) is a hyper-parameter of the Dirichlet distribution that controls the sparsity of the topic distribution in each document; it can be set to 50/K. The Dirichlet prior of the word distribution (β) is a hyper-parameter of the Dirichlet distribution that controls the sparsity of the word distribution in each topic; it can be set to 0.01. The Dirichlet prior of the time distribution (γ) is a hyper-parameter of the Dirichlet distribution that controls the sparsity of the topic distribution in each time window; it can be determined by cross-validation or from empirical values.
The following describes initializing the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window, respectively:
1. Initializing a topic distribution of each document:
Define a matrix θ of size D × K, representing the distribution of each document over topics. The topic distribution of each document is then initialized by sampling from the Dirichlet distribution with hyper-parameter α:

θ_d ~ Dirichlet(α)

where θ_d represents the topic distribution vector of document d.
2. Initializing word distribution of each topic:
Define a matrix φ of size K × V, representing the distribution of each topic over words, where V is the vocabulary size. The word distribution of each topic is then initialized by sampling from the Dirichlet distribution with hyper-parameter β:

φ_k ~ Dirichlet(β)

where φ_k represents the word distribution vector of topic k.
3. Initializing the topic distribution of each time window:

Define a matrix ω of size T × K, representing the topic distribution within each time window. The topic distribution of each time window is then initialized by sampling from the Dirichlet distribution with hyper-parameter γ:

ω_t ~ Dirichlet(γ)

where ω_t represents the topic distribution vector of time window t.
Through the steps, model parameters and distribution required by the time LDA topic clustering model are successfully defined and initialized, and the parameters and the initialization process provide a basis for subsequent model training and topic inference.
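A sketch of the parameter definitions and Dirichlet initializations above; the concrete sizes K, D, T, V and the value of γ are illustrative assumptions, while α = 50/K and β = 0.01 follow the values suggested in this step:

```python
import numpy as np

# illustrative sizes (assumptions): 5 topics, 100 documents,
# 12 time windows, 500-word vocabulary
K, D, T, V = 5, 100, 12, 500
alpha = 50.0 / K     # Dirichlet prior of the topic distribution
beta = 0.01          # Dirichlet prior of the word distribution
gamma = 0.1          # Dirichlet prior of the time distribution (assumed value)

rng = np.random.default_rng(0)
theta = rng.dirichlet([alpha] * K, size=D)   # D x K: document-topic matrix
phi = rng.dirichlet([beta] * V, size=K)      # K x V: topic-word matrix
omega = rng.dirichlet([gamma] * K, size=T)   # T x K: window-topic matrix
```

Each row of θ, φ, and ω is a probability vector, so every row sums to one by construction.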
Step S206: and performing iterative updating of the time LDA topic clustering model by using Gibbs sampling until the time LDA topic clustering model converges.
Preferably, for each word of each document, its topic assignment is sampled using the following conditional:

p(z_di = k | z_-di, w, t_d, α, β, γ) ∝ ((n_dk^(-i) + α) / (n_d^(-i) + K·α)) × ((n_kw^(-i) + β) / (n_k^(-i) + V·β)) × ((n_tk^(-i) + γ) / (n_t^(-i) + K·γ))

In the above formula, p denotes a conditional probability; z_di is a random variable representing the topic assignment of the i-th word in document d; w_di is the i-th word in document d; t_d is the time window of document d; α is the Dirichlet prior of the topic distribution; β is the Dirichlet prior of the word distribution; γ is the Dirichlet prior of the time distribution; ∝ means "proportional to"; n_dk^(-i) is the count of topic k in document d, excluding the current word i; n_d^(-i) is the total count of all words in document d, excluding the current word i; n_kw^(-i) is the count of word w in topic k, excluding the current word i; n_k^(-i) is the total count of all words in topic k, excluding the current word i; n_tk^(-i) is the count of topic k in time window t, excluding the current word i; n_t^(-i) is the total count of all topic assignments in time window t, excluding the current word i; K is the total number of topics; V is the vocabulary size; T is the total number of time windows.
And then updating the topic distribution of each document, the word distribution of each topic and the topic distribution of each time window according to the sampling result.
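One Gibbs sweep consistent with this sampling step might be sketched as follows. This is a collapsed sampler that maintains the count tables from the conditional; the toy corpus is an assumption, and denominator terms that do not depend on k are dropped, which is valid because the conditional is only defined up to proportionality:

```python
import numpy as np

def gibbs_pass(docs, doc_window, z, counts, K, V, alpha, beta, gamma, rng):
    """One Gibbs sweep over all tokens, updating count tables in place.

    docs[d]      : list of word ids for document d
    doc_window[d]: time window index of document d
    z[d][i]      : current topic of token i of document d
    counts       : (n_dk, n_kw, n_k, n_tk) count tables
    """
    n_dk, n_kw, n_k, n_tk = counts
    for d, words in enumerate(docs):
        t = doc_window[d]
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the current assignment from all counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1; n_tk[t, k] -= 1
            # document, word and time terms of the conditional (up to a constant)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta) \
                * (n_tk[t] + gamma)
            k = int(rng.choice(K, p=p / p.sum()))
            # add the new assignment back
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1; n_tk[t, k] += 1
            z[d][i] = k
    return z

# toy corpus: 2 documents over a 3-word vocabulary, in 2 windows
docs, doc_window = [[0, 1, 1], [2, 0]], [0, 1]
K, V, T = 2, 3, 2
alpha, beta, gamma = 0.5, 0.01, 0.1
rng = np.random.default_rng(1)
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K)); n_kw = np.zeros((K, V))
n_k = np.zeros(K); n_tk = np.zeros((T, K))
for d, words in enumerate(docs):
    for i, w in enumerate(words):
        n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1
        n_k[z[d][i]] += 1; n_tk[doc_window[d], z[d][i]] += 1
z = gibbs_pass(docs, doc_window, z, (n_dk, n_kw, n_k, n_tk),
               K, V, alpha, beta, gamma, rng)
```

After every sweep the count tables still sum to the total number of tokens, which is the invariant the sampler relies on.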
Step S207: and evaluating the topic distribution of each document, the word distribution of each topic and the topic distribution of each time window output by the time LDA topic clustering model to check the accuracy and stability of the model.
In this embodiment, an appropriate evaluation index and method may be selected to measure the performance of the time LDA model; one or a combination of perplexity (Perplexity), log-likelihood (Log-Likelihood), and topic coherence (Topic Coherence) may be used.
1) Perplexity is an index measuring the quality of the data generated by a language model; the smaller its value, the better the model. Perplexity can be calculated on the training set and the validation set respectively to check the model's consistency across different data sets.
In the present embodiment, the perplexity is calculated by the following formula:

Perplexity = exp( − (Σ_{d=1..D} log p(w_d)) / (Σ_{d=1..D} N_d) )

In the above formula, D is the total number of documents; d is the document index; p(w_d) is the probability of generating the words in document d; N_d is the number of words in document d.
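The perplexity computation reduces to a one-liner once the per-document log probabilities log p(w_d) are available; how those are estimated is model-dependent and not shown here. The example value is a sanity check: a model that assigns every one of four words probability 0.5 should have perplexity exactly 2:

```python
import math

def perplexity(doc_log_probs, doc_lengths):
    """Perplexity = exp(-(sum_d log p(w_d)) / (sum_d N_d))."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))

# one document of 4 words, each generated with probability 0.5
value = perplexity([4 * math.log(0.5)], [4])
```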
2) The log likelihood quantifies the model's ability to explain the data, and is calculated by the following formula:

log L = Σ_{d=1..D} log p(θ_d | α) + Σ_{k=1..K} log p(φ_k | β) + Σ_{d=1..D} Σ_{n=1..N_d} log p(w_dn | θ_d, φ)

In the above formula, D is the total number of documents; α and β are hyper-parameters, the Dirichlet priors of the topic distribution and the word distribution respectively; log p(θ_d | α) is the log probability of the topic distribution θ_d of document d given the Dirichlet prior parameter α; log p(φ_k | β) is the log probability of the word distribution φ_k of topic k given the Dirichlet prior parameter β; log p(w_dn | θ_d, φ) is the log generation probability of each word w_dn in each document given the document's topic distribution θ_d and the word distributions φ of all topics.
By summing these log-likelihood terms, the overall formula gives the log likelihood of the document set given the model parameters (the topic distributions of the documents and the word distributions of the topics). The larger this value, the stronger the model's ability to explain the data, so it can be used to gauge model quality.
3) Topic coherence measures the mutual consistency between high-frequency words within the same topic; a topic with higher coherence generally contains words with stronger semantic association.
The topic coherence can be calculated by the following formula:

C(k) = Σ_{m=2..M} Σ_{l=1..m−1} log( (D(w_m, w_l) + ε) / D(w_l) )

In the above formula, C(k) is the coherence score of topic k; φ_k is the word distribution vector of topic k, from which the high-frequency words are taken; M is the number of top high-frequency words selected; w_m and w_l are the m-th and l-th high-frequency words of topic k respectively (where m > l); D(w_m, w_l) is the number of documents containing both words w_m and w_l; D(w_l) is the number of documents containing the word w_l; ε is a small smoothing term, usually a small positive number (e.g., ε = 1), used to avoid taking the logarithm of zero.
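A direct transcription of this UMass-style coherence score, with ε = 1 as suggested; the toy corpus and top-word list are assumptions for demonstration:

```python
import math

def umass_coherence(top_words, docs, eps=1.0):
    """C(k) = sum_{m=2}^{M} sum_{l=1}^{m-1} log((D(wm, wl) + eps) / D(wl))."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            wm, wl = top_words[m], top_words[l]
            d_joint = sum(1 for s in doc_sets if wm in s and wl in s)
            d_wl = sum(1 for s in doc_sets if wl in s)  # wl is a top word, so > 0
            score += math.log((d_joint + eps) / d_wl)
    return score

docs = [["loan", "default"], ["default", "rate"], ["loan", "default"]]
score = umass_coherence(["loan", "default"], docs)
```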
Through the evaluation step, the accuracy and the stability of the time LDA topic clustering model can be comprehensively detected, and a reliable basis is provided for subsequent application and optimization.
The training process of the time LDA topic clustering model is further described below, and as shown in fig. 3, a training flow diagram of the time LDA topic clustering model according to another embodiment of the present application is provided, which includes the following steps:
Step S301: unstructured data containing time stamps is collected as pre-training data.
This step is the same as step S201 described above, and will not be described again here.
Step S302: and calculating the similarity between the time windows by using JS divergence, and selecting the window size with large theme content difference between the time windows and small theme difference in the windows for slicing.
First, based on a preliminarily defined time window, the similarity of the topic distribution between different time windows is calculated. The Jensen-Shannon (JS) divergence is used here to measure the similarity between two probability distributions. The JS divergence formula is as follows:

JS(P ‖ Q) = (1/2) · D_KL(P ‖ M) + (1/2) · D_KL(Q ‖ M), with M = (P + Q) / 2

where M is the average of P and Q; P and Q are two probability distributions, which can here be taken as the document-topic probability distribution vectors of different time windows; D_KL(P ‖ M) is the Kullback-Leibler (KL) divergence, a measure of the difference between distribution P and distribution M; D_KL(Q ‖ M) is the KL divergence between distribution Q and distribution M.
Then, according to the JS divergence results, time points with obvious topic-content changes are located and used as time window boundaries; window sizes with small within-window topic differences and large between-window differences are preferred, to ensure topic consistency within each slice.
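The JS divergence between two window-level topic distributions can be computed as follows; distributions are plain lists, and zero-probability components are skipped in the KL sum, a standard convention:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence; terms with a_i = 0 contribute 0 by convention
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, and disjoint distributions give the maximum value ln 2, so the score is directly comparable across window pairs.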
Step S303: a set of temporally successive topics is identified based on the slicing results.
For the slice data of each time window, a time LDA (Latent Dirichlet Allocation) model is applied to identify and extract the topics within that period. The continuity over time of the topics extracted in each window is then analyzed, and the topics of adjacent time windows are correlated in temporal order to form a time-continuous topic set. The identified time-continuous topic set can be used for further tracking and evolution analysis, to observe how topics change and develop over different time periods.
Through the steps, the time LDA topic clustering model can be effectively trained, and a time-continuous topic set is extracted from unstructured data containing time stamps. The process not only considers the change of the theme content, but also ensures the consistency of the theme in different time windows and the obvious change of the theme among windows through reasonable time slicing and JS divergence calculation, so that the extracted theme has more practical significance and explanatory power.
Fig. 4 is a schematic flow chart of a topic cluster analysis method for unstructured data according to another embodiment of the present invention, where the method includes the following steps:
step S401: and carrying out data preprocessing on unstructured data to be analyzed.
Step S402: global features and time series features are extracted from the pre-processed unstructured data, respectively.
Step S403: and performing dimension reduction processing on the global features and the time sequence features.
Preferably, as shown in fig. 5, the present step may further include the following sub-steps:
Step S4031: combining the global feature and the time series feature into a data matrix. Global features are overall information extracted from the whole data set, including statistical features, overall trends, etc., time series features: the time-varying characteristics reflect the specific behavior of the data at each time node. The features are constructed as a matrix from samples, where each row represents a sample and each column represents a feature.
Step S4032: and carrying out data standardization processing on the data matrix. The purpose of data normalization is to eliminate the difference in the dimensions of different features so that they have the same dimensions, and a common method is to convert the data into a normal distribution with a mean value of 0 and a standard deviation of 1, for example, Z-score normalization can be used.
Step S4033: the covariance matrix is calculated by using the annotated data matrix, and describes the linear relationship between the features, and in this embodiment, the covariance matrix can be calculated by using the following formula:
in the above equation, Σ is the calculated covariance matrix, m is the number of samples, Is a transpose of the normalized data matrix, and X S is the normalized data matrix.
Step S4034: and carrying out eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors, wherein the eigenvalues represent variances of each principal component, and the eigenvectors represent directions of the principal components.
Step S4035: the first k principal components are selected according to the magnitude of the eigenvalue. And sorting the feature values from large to small, selecting feature vectors corresponding to the first k maximum feature values, and forming a new projection base by the selected feature vectors.
Step S4036: and projecting the combined data matrix onto the selected principal component to obtain feature data after dimension reduction.
The k selected principal components (eigenvectors) are taken as the projection basis, and the standardized data matrix is projected onto them to obtain the reduced-dimension data, which can be realized by the following formula:

X_r = X_S · V_k

In the above formula, V_k is the matrix formed by the first k selected eigenvectors; X_r is the reduced-dimension data matrix.
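Steps S4031 through S4036 amount to standard PCA on the z-scored feature matrix; a compact sketch follows, where the sample matrix is an illustrative assumption:

```python
import numpy as np

def pca_reduce(X, k):
    """Z-score X, then project onto the top-k principal components."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # Step S4032: standardize
    cov = (Xs.T @ Xs) / X.shape[0]              # Step S4033: (1/m) Xs^T Xs
    eigvals, eigvecs = np.linalg.eigh(cov)      # Step S4034: eigendecomposition
    order = np.argsort(eigvals)[::-1]           # Step S4035: sort descending
    Vk = eigvecs[:, order[:k]]
    return Xs @ Vk                              # Step S4036: X_r = Xs Vk

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.1],
              [3.0, 6.0, 8.9],
              [4.0, 8.0, 12.0]])
Xr = pca_reduce(X, 2)   # 4 samples reduced from 3 features to 2
```

`numpy.linalg.eigh` is used because the covariance matrix is symmetric; note it returns eigenvalues in ascending order, hence the explicit descending sort.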
Step S404: the time dimension is introduced into the LDA model to form a temporal LDA topic cluster model.
Step S405: and taking the global features and the time sequence features after the dimension reduction processing as the input of the time LDA topic clustering model to output the unstructured data topic clustering result.
According to the technical scheme, the topic cluster analysis method for unstructured data provided by the invention introduces the time dimension into the LDA model to form a time LDA topic clustering model, so that the clustering result can reflect the time change trend of the unstructured data. The method not only improves the efficiency and accuracy of data analysis; it also comprehensively utilizes global features and time series features to provide a more complete data representation, reduces the amount of data to be processed, and improves processing efficiency. Particularly in the banking industry, the method can effectively cope with the challenges of huge back-office traffic and complex, varied descriptive content, reduce manual identification workload and cost, improve the visualization and understanding of data, help banks identify and respond to operational risks more effectively, and improve business safety and stability.
Fig. 6 is a schematic structural diagram of a device for topic cluster analysis of unstructured data according to an embodiment of the present application, where the device includes: the preprocessing unit 610, the feature extraction unit 620, the time model generation unit 630, and the topic clustering unit 640 are sequentially connected therebetween. Wherein:
A preprocessing unit 610, configured to perform data preprocessing on unstructured data to be analyzed.
The feature extraction unit 620 is configured to extract global features and time-series features from the unstructured data after preprocessing, respectively.
A temporal model generation unit 630, configured to introduce a temporal dimension into the LDA model to form a temporal LDA topic cluster model.
And a topic clustering unit 640, configured to take the global feature and the time-series feature as input of the temporal LDA topic clustering model to output the unstructured data topic clustering result.
Preferably, as shown in fig. 7, the time model generating unit 630 includes:
the data collection module 631 is configured to collect unstructured data including a time stamp as pre-training data.
A preprocessing module 632 is configured to preprocess the pre-training data.
The global feature extraction module 633 is configured to extract global features from the unstructured data after preprocessing by using TF-IDF method.
The time window dividing module 634 is configured to divide the unstructured data after preprocessing according to different time windows.
An initialization module 635, configured to define model parameters of the temporal LDA topic clustering model and initialize topic distribution of each document, word distribution of each topic, and topic distribution of each time window, where the model parameters include topic number, document number, time window number, dirichlet a priori of topic distribution, dirichlet a priori of word distribution, and Dirichlet a priori of time distribution.
The iterative updating module 636 is configured to perform iterative updating of the temporal LDA topic cluster model by using Gibbs sampling until the temporal LDA topic cluster model converges.
The model evaluation module 637 is configured to evaluate the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window output by the temporal LDA topic clustering model to verify the accuracy and stability of the model.
Preferably, the iterative updating module 636 performs iterative updating of the time LDA topic clustering model by using Gibbs sampling, including: for each word of each document, sampling its topic assignment using the following conditional:

p(z_di = k | z_-di, w, t_d, α, β, γ) ∝ ((n_dk^(-i) + α) / (n_d^(-i) + K·α)) × ((n_kw^(-i) + β) / (n_k^(-i) + V·β)) × ((n_tk^(-i) + γ) / (n_t^(-i) + K·γ))

In the above formula, p denotes a conditional probability; z_di is a random variable representing the topic assignment of the i-th word in document d; w_di is the i-th word in document d; t_d is the time window of document d; α is the Dirichlet prior of the topic distribution; β is the Dirichlet prior of the word distribution; γ is the Dirichlet prior of the time distribution; ∝ means "proportional to"; n_dk^(-i) is the count of topic k in document d, excluding the current word i; n_d^(-i) is the total count of all words in document d, excluding the current word i; n_kw^(-i) is the count of word w in topic k, excluding the current word i; n_k^(-i) is the total count of all words in topic k, excluding the current word i; n_tk^(-i) is the count of topic k in time window t, excluding the current word i; n_t^(-i) is the total count of all topic assignments in time window t, excluding the current word i; K is the total number of topics; V is the vocabulary size; T is the total number of time windows; and then updating the topic distribution of each document, the word distribution of each topic, and the topic distribution of each time window according to the sampling result.
Preferably, as shown in fig. 8, the time model generating unit 630 includes:
The data collection module 6301 is configured to collect unstructured data including a timestamp as pre-training data.
The window determining module 6302 is configured to calculate the similarity between time windows using JS divergence, and to select for slicing a window size with large topic-content differences between windows and small topic differences within each window.
The topic identification module 6303 is configured to identify a set of topics that are continuous in time based on the slicing result.
Preferably, the feature extraction unit 620 extracts global features from the unstructured data after preprocessing, including: the word frequency, the inverse document frequency and the TF-IDF vector of each word are extracted from the unstructured data after preprocessing based on the TF-IDF method.
Preferably, the extracting unit 620 extracts time-series features from the unstructured data after preprocessing, including: dividing the unstructured data after pretreatment into different time windows according to the time stamp; within each time window, the TF-IDF based method extracts keywords and phrases and generates corresponding feature vectors.
Preferably, as shown in fig. 9, the apparatus further includes a dimension reduction unit 650, configured to perform dimension reduction processing on the global feature and the time series feature, and further, as shown in fig. 10, the dimension reduction unit 650 includes:
a combining module 651 for combining the global feature and the time series feature into a data matrix.
A normalization module 652, configured to perform data normalization processing on the data matrix.
The covariance calculation module 653 is configured to calculate the covariance matrix from the standardized data matrix.
And the decomposition module 654 is configured to decompose the eigenvalue of the covariance matrix to obtain eigenvalues and eigenvectors, where the eigenvalues represent variances of each principal component, and the eigenvectors represent directions of the principal components.
The feature value selection module 655 is configured to select the first k principal components according to the magnitude of the feature value.
And the data projection module 656 is configured to project the combined data matrix onto the selected principal component to obtain feature data after dimension reduction.
According to the technical scheme, the topic cluster analysis device for unstructured data provided by the invention introduces the time dimension into the LDA model to form a time LDA topic clustering model, so that the clustering result can reflect the time change trend of the unstructured data. The device not only improves the efficiency and accuracy of data analysis, but also comprehensively utilizes global features and time series features to provide a more complete data representation. Particularly in the banking industry, it can effectively cope with the challenges of huge back-office traffic and complex, varied descriptive content, reduce manual identification workload and cost, improve the visualization and understanding of data, help banks identify and respond to operational risks more effectively, and improve business safety and stability.
The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method when executing the program.
The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program for executing the method.
As shown in fig. 11, the electronic device 600 may further include: a communication module 110, an input unit 120, an audio processor 130, a display 160, a power supply 170. It is noted that the electronic device 600 need not include all of the components shown in FIG. 11; in addition, the electronic device 600 may further include components as shown in fig. 11, to which reference may be made.
As shown in fig. 11, the central processor 100, sometimes also referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which central processor 100 receives inputs and controls the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable device, and may store the relevant information as well as the programs that process it. The central processor 100 can execute the programs stored in the memory 140 to realize information storage, processing, and the like.
The input unit 120 provides an input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 140 may be a solid-state memory such as a read-only memory (ROM), a random access memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when powered down, that can be selectively erased, and that can be provided with further data, an example of which is sometimes referred to as an EPROM or the like. The memory 140 may also be some other type of device. The memory 140 includes a buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage 142, the application/function storage 142 being used for storing application and function programs or a flow for executing operations of the electronic device 600 by the central processor 100.
The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. A communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and to receive audio input from the microphone 132 to implement usual telecommunication functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 130 is also coupled to the central processor 100 so that sound can be recorded locally through the microphone 132 and so that sound stored locally can be played through the speaker 131.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided to facilitate understanding of the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.
Claims (10)
1. A method of topic cluster analysis of unstructured data, the method comprising:
performing data preprocessing on unstructured data to be analyzed;
extracting global features and time-series features respectively from the preprocessed unstructured data;
introducing the time dimension into the LDA model to form a time LDA topic clustering model;
and taking the global feature and the time sequence feature as inputs of the time LDA topic clustering model to output the unstructured data topic clustering result.
2. The method of topic cluster analysis of unstructured data of claim 1 wherein said introducing a time dimension into an LDA model to form a temporal LDA topic cluster model comprises:
Collecting unstructured data containing a timestamp as pre-training data;
preprocessing the pre-training data;
Extracting global features from the preprocessed unstructured data by using a TF-IDF method;
dividing the preprocessed unstructured data according to different time windows;
defining model parameters of a time LDA topic clustering model and initializing topic distribution of each document, word distribution of each topic and topic distribution of each time window, wherein the model parameters comprise topic number, document number, time window number, dirichlet prior of topic distribution, dirichlet prior of word distribution and Dirichlet prior of time distribution;
Performing iterative updating of the time LDA topic clustering model by using Gibbs sampling until the time LDA topic clustering model converges;
And evaluating the topic distribution of each document, the word distribution of each topic and the topic distribution of each time window output by the time LDA topic clustering model to check the accuracy and stability of the model.
3. The topic cluster analysis method of unstructured data of claim 2, wherein said iterative updating of the temporal LDA topic cluster model with Gibbs sampling comprises:
for each word of each document, its topic assignment is sampled using:

$$p(z_{di}=k \mid w_{di}=w,\; t_d=t,\; \mathbf{z}_{-di},\; \mathbf{w}_{-di},\; \alpha,\beta,\gamma)\;\propto\;\frac{n_{d,k}^{-i}+\alpha}{n_{d,\cdot}^{-i}+K\alpha}\cdot\frac{n_{k,w}^{-i}+\beta}{n_{k,\cdot}^{-i}+V\beta}\cdot\left(n_{t,k}^{-i}+\gamma\right)$$

In the above formula, p denotes a conditional probability; z_{di} is a random variable denoting the topic assignment of the i-th word in document d; w_{di} denotes the i-th word in document d; t_d denotes the timestamp (time window) of document d; α denotes the Dirichlet prior of the topic distribution; β denotes the Dirichlet prior of the word distribution; γ denotes the Dirichlet prior of the time distribution; ∝ denotes "proportional to"; n_{d,k}^{-i} denotes the count of topic k in document d excluding the current word i; n_{d,·}^{-i} denotes the total count of all words in document d excluding the current word i; n_{k,w}^{-i} denotes the count of word w in topic k excluding the current word i; n_{k,·}^{-i} denotes the total count of all words in topic k excluding the current word i; n_{t,k}^{-i} denotes the count of topic k in time window t excluding the current word i; K denotes the total number of topics; V denotes the size of the vocabulary; T denotes the total number of time windows;
and updating the topic distribution of each document, the word distribution of each topic and the topic distribution of each time window according to the sampling result.
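The sampling update of claim 3 can be illustrated with a minimal pure-Python sketch of a collapsed Gibbs sampler; the toy corpus, prior values, and variable names below are illustrative assumptions, not taken from the patent. The document-side denominator is constant over k, so it drops out of the proportionality.

```python
import random

random.seed(0)

# Toy corpus: (time-window index, word-id list); real input would be the
# preprocessed, timestamped documents described in claim 2.
docs = [(0, [0, 1, 0, 2]), (0, [1, 1, 3]), (1, [2, 3, 3, 4]), (1, [4, 4, 0])]
K, V, T = 2, 5, 2                       # topics, vocabulary size, time windows
alpha, beta, gamma = 0.1, 0.01, 0.1     # Dirichlet priors (topic, word, time)

# Count tables used by the sampler.
n_dk = [[0] * K for _ in docs]          # topic k in document d
n_kw = [[0] * V for _ in range(K)]      # word w in topic k
n_k = [0] * K                           # all words in topic k
n_tk = [[0] * K for _ in range(T)]      # topic k in time window t
assignments = []

# Random initialization of per-word topic assignments.
for d, (t, words) in enumerate(docs):
    z = [random.randrange(K) for _ in words]
    assignments.append(z)
    for w, k in zip(words, z):
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1; n_tk[t][k] += 1

def sample_topic(d, t, w):
    # p(z=k|...) ∝ (n_dk+α) · (n_kw+β)/(n_k+Vβ) · (n_tk+γ)
    weights = [(n_dk[d][k] + alpha)
               * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
               * (n_tk[t][k] + gamma) for k in range(K)]
    r = random.random() * sum(weights)
    acc = 0.0
    for k, p in enumerate(weights):
        acc += p
        if r <= acc:
            return k
    return K - 1

for _ in range(50):                      # Gibbs sweeps
    for d, (t, words) in enumerate(docs):
        for i, w in enumerate(words):
            k = assignments[d][i]
            # Remove the current assignment (the "-i" counts in the claim).
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1; n_tk[t][k] -= 1
            k = sample_topic(d, t, w)
            assignments[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1; n_tk[t][k] += 1

# Posterior estimate of each document's topic distribution.
theta = [[(n_dk[d][k] + alpha) / (sum(n_dk[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
print(theta)
```

The same count tables yield the per-topic word distributions and per-window topic distributions that claim 2 passes to the evaluation step.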
4. The method of topic cluster analysis of unstructured data of claim 1 wherein the introducing a time dimension into an LDA model to form a temporal LDA topic cluster model comprises:
Collecting unstructured data containing a timestamp as pre-training data;
calculating the similarity between time windows by using JS divergence, and selecting a window size for slicing such that the topic content differs greatly between time windows while differing little within each window;
a set of temporally successive topics is identified based on the slicing results.
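The window-selection criterion of claim 4 can be sketched with a small Jensen-Shannon divergence computation; the per-window topic distributions below are hypothetical placeholders for the distributions produced by the temporal LDA model.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (base-2 logs, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic distributions for three adjacent time windows.
windows = [[0.70, 0.20, 0.10], [0.65, 0.25, 0.10], [0.10, 0.20, 0.70]]

# Divergence between adjacent windows: a large jump marks a slice boundary
# where topic content changes, matching the selection criterion of claim 4.
gaps = [js_divergence(windows[i], windows[i + 1]) for i in range(len(windows) - 1)]
boundary = max(range(len(gaps)), key=gaps.__getitem__) + 1
print(gaps, boundary)
```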
5. The method of topic cluster analysis of unstructured data of claim 1, wherein extracting global features from the preprocessed unstructured data comprises:
the word frequency, the inverse document frequency and the TF-IDF vector of each word are extracted from the unstructured data after preprocessing based on the TF-IDF method.
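A minimal pure-Python sketch of the TF-IDF extraction named in claim 5; the tokenized corpus and the exact TF/IDF weighting variant (relative term frequency, natural-log IDF) are illustrative assumptions, since the patent does not fix them.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus; real input would come from the
# preprocessing step of claim 1.
docs = [
    ["interest", "rate", "loan", "rate"],
    ["deposit", "rate", "account"],
    ["loan", "deposit", "fraud"],
]

def tf_idf(corpus):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    n = len(corpus)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in corpus for w in set(doc))
    vocab = sorted(df)
    idf = {w: math.log(n / df[w]) for w in vocab}   # inverse document frequency
    vectors = []
    for doc in corpus:
        tf = Counter(doc)                            # raw term counts
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

vocab, vectors = tf_idf(docs)
print(vocab)
```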
6. The method of topic cluster analysis of unstructured data of claim 1, wherein said extracting time series features from the preprocessed unstructured data comprises:
dividing the preprocessed unstructured data into different time windows according to the timestamps;
within each time window, the TF-IDF based method extracts keywords and phrases and generates corresponding feature vectors.
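The timestamp-based windowing in claim 6 can be sketched as grouping records into fixed-length windows; the record format, date strings, and 30-day window length are illustrative assumptions.

```python
from datetime import datetime, timedelta
from collections import defaultdict

# Hypothetical timestamped records: (timestamp string, text).
records = [
    ("2024-01-03", "loan rate inquiry"),
    ("2024-01-20", "account opening"),
    ("2024-02-02", "loan default risk"),
]
window = timedelta(days=30)          # assumed fixed window length
start = datetime(2024, 1, 1)

# Bucket each record into its time-window index; per-window TF-IDF
# feature vectors would then be computed on each bucket.
buckets = defaultdict(list)
for ts, text in records:
    t = datetime.strptime(ts, "%Y-%m-%d")
    idx = (t - start) // window      # integer window index
    buckets[idx].append(text)
print(dict(buckets))
```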
7. The method for topic cluster analysis of unstructured data according to claim 1, wherein after the global features and time-series features are extracted from the preprocessed unstructured data, the method further comprises:
combining the global feature and the time series feature into a data matrix;
Carrying out data standardization processing on the data matrix;
calculating a covariance matrix by using the standardized data matrix;
Performing eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors, wherein the eigenvalues represent variances of each principal component, and the eigenvectors represent directions of the principal components;
selecting the first k main components according to the magnitude of the characteristic value;
And projecting the combined data matrix onto the selected principal component to obtain feature data after dimension reduction.
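The standardize-covariance-eigendecompose-project steps of claim 7 amount to principal component analysis; the random feature matrix and choice of k below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical combined feature matrix: rows = documents,
# columns = concatenated global and time-series features.
X = rng.normal(size=(100, 5))
X[:, 3] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=100)  # a correlated column

# 1. Standardize each feature to zero mean and unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
C = np.cov(Xs, rowvar=False)

# 3. Eigendecomposition: eigenvalues = variance of each principal
#    component, eigenvectors = component directions.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]       # sort components by variance, descending
vals, vecs = vals[order], vecs[:, order]

# 4. Keep the first k components and project the data onto them.
k = 2
Z = Xs @ vecs[:, :k]
print(Z.shape)
```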
8. A topic cluster analysis device for unstructured data, the device comprising:
the pretreatment unit is used for carrying out data pretreatment on unstructured data to be analyzed;
The feature extraction unit is used for respectively extracting global features and time sequence features from the pre-processed unstructured data;
the time model generation unit is used for introducing the time dimension into the LDA model to form a time LDA topic clustering model;
And the topic clustering unit is used for taking the global feature and the time sequence feature as the input of the time LDA topic clustering model to output the unstructured data topic clustering result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410668802.3A CN118445416A (en) | 2024-05-28 | 2024-05-28 | Topic cluster analysis method and device for unstructured data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118445416A true CN118445416A (en) | 2024-08-06 |
Family
ID=92319454
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410668802.3A Pending CN118445416A (en) | 2024-05-28 | 2024-05-28 | Topic cluster analysis method and device for unstructured data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118445416A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119398038A (en) * | 2024-08-29 | 2025-02-07 | 西南交通大学 | A flood situation awareness method and device based on weighted LDA algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |