CN119741156A

CN119741156A - Intellectual property operation management system based on big data

Info

Publication number: CN119741156A
Application number: CN202411807543.4A
Authority: CN
Inventors: 邹华君; 邹尚哲
Original assignee: Wuxi Guoxin Intellectual Property Service Co ltd
Current assignee: Wuxi Guoxin Intellectual Property Service Co ltd
Priority date: 2024-12-10
Filing date: 2024-12-10
Publication date: 2025-04-01

Abstract

The invention belongs to the field of intelligent technology and discloses an intellectual property operation management system based on big data. The system obtains behavior data and basic data of intellectual property; performs missing-value and outlier processing on the behavior data to obtain a preprocessed data set; predicts value index data of the intellectual property from the preprocessed data set using an SVR model; extracts, transforms and loads the data using an ETL tool; processes application names using the TF-IDF technique to extract a keyword group B, which is stored as a new field in a target database; obtains a keyword group A from user input text using the TF-IDF technique and matches it against keyword group B to obtain comprehensive matching degree data; and sets a threshold based on the comprehensive matching degree data to decide whether to recommend. This improves the efficiency and accuracy of intellectual property operation management.

Description

Intellectual property operation management system based on big data

The invention relates to the field of intelligent technology, and in particular to an intellectual property operation management system based on big data.

Background

Existing big-data intellectual property operation management systems expose several problems. First, they do not adopt an intelligent, data-driven prediction model but rely on simple rules and statistics, so they cannot accurately estimate the potential commercial value, technical innovation or market influence of an intellectual property;

Second, most existing systems match keywords on text data alone and ignore the value index of the intellectual property, so even an intellectual property with a high textual matching degree does not necessarily have practical value for the user's needs;

Third, existing systems generally use static keyword matching rules and a fixed matching degree threshold. Once the matching standard is set, the system cannot adapt to changes in the data or in user requirements, and a fixed threshold cannot adjust to nuances among intellectual properties, so matching becomes too loose or too strict and recommendation accuracy suffers.

In view of the above, the present invention proposes an intellectual property operation management system based on big data to solve the above-mentioned problems.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides the following technical solution: an intellectual property operation management system based on big data, comprising:

a data acquisition module, configured to acquire behavior data and basic data of intellectual property;

a data processing module, configured to perform missing-value and outlier processing on the behavior data of the intellectual property to obtain a preprocessed data set;

a prediction module, configured to predict value index data of the intellectual property using an SVR model based on the preprocessed data set;

a data storage module, configured to extract, transform and load the basic data, the preprocessed data set and the value index data of the intellectual property using an ETL tool, to process the application name using the TF-IDF technique to extract a keyword group B, and to store keyword group B as a new field in the target database;

a keyword matching module, configured to obtain a keyword group A from user input text using the TF-IDF technique and to match it against keyword group B to obtain comprehensive matching degree data; and

a recommendation module, configured to set a recommendation threshold based on the comprehensive matching degree data and to determine whether to recommend.

Further, the behavior data of the intellectual property comprises a click count feature, a collection (favorites) count feature, a download count feature and a citation count feature;

the basic data comprises the application name, application number, application date, validity period and applicant name.

Further, the preprocessed data set is obtained as follows:

For the intellectual property behavior data, a stacking method is used to form a data set; forward filling and backward filling are used to process the missing values of the data set, yielding a missing-value-processed data set; an isolation forest model is used to identify and process outliers; and the processed data is standardized to obtain the preprocessed data set;

The data set formed from the click count, collection count, download count and citation count features is traversed with an isnull check to identify missing values. For each missing value, forward filling gives the preceding data value and backward filling gives the following data value; the average of the two values is calculated and used to fill the missing value.
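The filling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name is hypothetical, missing values are represented as None, and the handling of a gap at the start or end of a column (where only one neighbor exists) is an assumption, since the text does not cover that case.

```python
from typing import List, Optional


def fill_missing_with_neighbor_mean(values: List[Optional[float]]) -> List[float]:
    """Fill each None with the mean of its forward-fill and backward-fill values."""
    n = len(values)

    # Forward fill: carry the last observed value forward.
    forward: List[Optional[float]] = [None] * n
    last = None
    for i, v in enumerate(values):
        if v is not None:
            last = v
        forward[i] = last

    # Backward fill: carry the next observed value backward.
    backward: List[Optional[float]] = [None] * n
    nxt = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            nxt = values[i]
        backward[i] = nxt

    filled: List[float] = []
    for v, f, b in zip(values, forward, backward):
        if v is not None:
            filled.append(float(v))
        elif f is not None and b is not None:
            filled.append((f + b) / 2.0)
        elif f is not None or b is not None:
            # Assumption: at a column edge only one neighbor exists, so use it alone.
            filled.append(float(f if f is not None else b))
        else:
            raise ValueError("column has no observed values")
    return filled


# Hypothetical click counts with gaps.
clicks = [10, None, 14, None, None, 20, None]
print(fill_missing_with_neighbor_mean(clicks))
```

Consecutive gaps all receive the same value here, because each of them sees the same preceding and following observation.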

Further, the method for identifying and processing outliers using the isolation forest model comprises:

Initializing hyperparameters: the hyperparameters of the isolation forest are initialized, including the number of trees SL, the maximum depth SD of each tree, and the number of samples mm per tree;

Selecting training subsets: SL trees are set up, the training subset of each tree has size mm, and one subset is drawn from the missing-value-processed data set for each round of training;

Feature selection: the split feature is selected using multiple evaluation indicators;

Split point selection: a split point is chosen at random for the selected split feature, dividing the data into two subsets;

Recursive splitting: feature selection and split point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;

Building the forest: the above process is repeated to construct SL trees, each trained on a different subset of the data;

Calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ of the data point is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ denotes the path length of the $i$-th data point in the $j$-th tree;

Calculating the average path length: for each data point, the average path length over all trees is $E(h(x_i)) = \frac{1}{SL}\sum_{j=1}^{SL} h_j(x_i)$, where $h_j(x_i)$ denotes the path length of data point $x_i$ on the $j$-th tree and $SL$ denotes the number of trees;

Calculating outlier scores: the outlier score is evaluated from the average path length of each data point as $s(x_i) = 2^{-E(h(x_i))/c(m)}$, where $c(m)$ is a constant calculated from the number of samples $m$ and represents the expected path length, given by $c(m) = 2H(m-1) - \frac{2(m-1)}{m}$ with $H(k) \approx \ln k + \gamma$, where $\gamma \approx 0.5772$ is Euler's constant;

Stopping condition: iteration stops when the maximum depth of the tree is reached;

Outputting the result: an outlier threshold is set; when the outlier score of a data point is below the threshold, the data point is normal, and when it is above the threshold, the data point is abnormal; abnormal data points are replaced and filled using the backward filling method to obtain the preprocessed data set.
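As a rough illustration of the scoring step, the sketch below turns per-tree path lengths into an outlier score using the standard isolation forest normalization, which is assumed here to be the form the text intends. Tree construction is not shown; the path lengths, the function names and the threshold value 0.6 are hypothetical.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # Euler's constant, used in the harmonic number approximation


def expected_path_length(m: int) -> float:
    """c(m): expected path length for a tree built on m samples (m >= 2)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    if m == 2:
        return 1.0  # commonly special-cased, since the approximation is poor here
    harmonic = math.log(m - 1) + EULER_GAMMA  # H(m-1) is approximately ln(m-1) + gamma
    return 2.0 * harmonic - 2.0 * (m - 1) / m


def outlier_score(path_lengths: Sequence[float], m: int) -> float:
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / expected_path_length(m))


# Hypothetical path lengths of two data points across three trees, m = 256.
easy_to_isolate = outlier_score([2, 3, 2], 256)
hard_to_isolate = outlier_score([10, 11, 12], 256)
THRESHOLD = 0.6  # assumed value for illustration only
print(easy_to_isolate > THRESHOLD, hard_to_isolate > THRESHOLD)
```

A point whose average path length equals c(m) scores exactly 0.5, which is why thresholds for this kind of score are usually chosen above 0.5.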

Further, the split feature is selected using multiple evaluation indicators as follows:

The information gain, Gini index and variance of each feature are calculated and normalized; weights for the information gain, Gini index and variance are set using the entropy weight method; and the composite score of each feature is calculated as $S_i = w_1 IG_i + w_2 G_i + w_3 V_i$, where $IG_i$, $G_i$ and $V_i$ denote the normalized information gain, Gini index and variance of the $i$-th feature, and $w_1$, $w_2$ and $w_3$ denote the weights of the information gain, Gini index and variance, respectively; the feature with the highest composite score $S_i$ is the selected feature;

The entropy weight method calculates the weights from the information entropy of each evaluation indicator: $E_j = -\frac{1}{\ln n}\sum_{tz=1}^{n} p_{j,tz}\ln p_{j,tz}$, where $E_j$ denotes the entropy of the $j$-th evaluation indicator, $p_{j,tz}$ denotes the probability distribution of the $tz$-th feature value of the $j$-th evaluation indicator, and $n$ denotes the total number of feature values;

The redundancy is calculated from the information entropy: $d_j = 1 - E_j$;

The weight of each evaluation indicator is calculated from the redundancy: $w_j = d_j / \sum_{j'=1}^{3} d_{j'}$, where 3 is the total number of evaluation indicators.
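The entropy weight calculation can be illustrated as follows. This is a sketch under assumptions: the standard entropy weight method (entropy normalized by the logarithm of the number of rows) is used, each row of the input holds the already normalized indicators of one feature oriented so that larger is better, and the sample matrix and function names are invented for the example.

```python
import math
from typing import List


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Return one weight per column (indicator) from the column entropies."""
    n = len(matrix)       # number of features (rows), must be >= 2
    k = len(matrix[0])    # number of evaluation indicators (columns)
    redundancies = []
    for j in range(k):
        column = [row[j] for row in matrix]
        total = sum(column)  # assumed to be positive
        probs = [v / total for v in column]
        # Terms with p == 0 contribute nothing to the entropy.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        redundancies.append(1.0 - entropy)
    denom = sum(redundancies)  # assumed non-zero: not every column is uniform
    return [d / denom for d in redundancies]


def composite_scores(matrix: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the indicators for each feature."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]


# Hypothetical normalized [information gain, inverted Gini index, variance] per feature.
features = [
    [0.9, 0.8, 0.20],
    [0.1, 0.3, 0.30],
    [0.5, 0.5, 0.25],
]
weights = entropy_weights(features)
scores = composite_scores(features, weights)
best = max(range(len(scores)), key=scores.__getitem__)
print(weights, scores, best)
```

Indicators whose values barely differ across features have entropy close to 1 and therefore receive a small weight, which is the point of the method: an indicator that cannot tell features apart should not drive the choice.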

Further, the outlier threshold is set as follows:

The outlier threshold is set using neighborhood distances: the mean $\mu$ and the standard deviation $\sigma$ of the missing-value-processed data set are calculated, the coefficient of variation $CV = \sigma / \mu$ is computed, and the K value is set by a formula whose parameters are the number of samples in the data set, an adjustment coefficient and a smoothing factor;

Once the K value is determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance; the distance threshold is set empirically at 95%, and a data point whose average K-nearest-neighbor distance exceeds this 95% threshold is an outlier.
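The neighborhood-distance check can be sketched as below. Two assumptions are made and should be read as such: the formula for K is not reproduced, so K is passed in as a plain parameter, and the 95% distance threshold is interpreted as the 95th percentile of the average K-nearest-neighbor distances. The function names and the sample points are hypothetical.

```python
import math
from typing import List, Sequence


def mean_knn_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Average Euclidean distance from each point to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def knn_outlier_flags(points: Sequence[Sequence[float]], k: int, q: float = 95.0) -> List[bool]:
    """Flag points whose average k-NN distance exceeds the q-th percentile."""
    avg = mean_knn_distances(points, k)
    cutoff = percentile(avg, q)
    return [d > cutoff for d in avg]


# Hypothetical data: a tight 5 x 4 grid plus one distant point.
grid = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)]
flags = knn_outlier_flags(grid + [(10.0, 10.0)], k=3)
print(flags.index(True), sum(flags))
```

Note that a percentile cutoff always flags roughly the top 5% of points, even in clean data, so in practice it may be combined with the isolation forest score rather than used alone.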

Further, the value index data of the intellectual property is predicted using the SVR model based on the preprocessed data set as follows:

Sample set: the sample set comprises J_o groups of samples, each group comprising a preprocessed data set and the corresponding intellectual property value index data;

Initializing hyperparameters: a radial basis function is selected as the kernel function; the hyperparameter combination of penalty coefficient, epsilon parameter and gamma parameter is determined, with the optimal combination found by grid search;

Calculating the initial loss: the initial loss function value of the model, $L = \frac{1}{2}\|w\|^2 + C\sum_{o=1}^{n2}(\xi_o + \xi_o^*)$, is calculated as the starting point for optimization, where $\frac{1}{2}\|w\|^2$ is the regularization term, $w$ is the weight vector of the input features, $b$ is the bias term, $\xi$ is the error term, $C$ is the penalty coefficient, $n2$ is the total number of training samples, $o$ is the sample index, $\xi_o$ is the upper-deviation error of the $o$-th sample, and $\xi_o^*$ is the lower-deviation error of the $o$-th sample;

Iterative optimization: in each iteration the model parameters $w$ and $b$ are updated by gradient descent, and a new loss function value is calculated during the update;

Checking convergence: after each iteration, the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold.
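The loop of loss calculation, gradient update and convergence check can be illustrated with a deliberately simplified model. The sketch below swaps the radial basis kernel for a linear model so that the weight vector and bias can be updated directly, uses the epsilon-insensitive loss in place of explicit slack variables, and omits grid search. The function names, hyperparameter values and training data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def svr_loss(w: Sequence[float], b: float, X: Sequence[Sequence[float]],
             y: Sequence[float], C: float, eps: float) -> float:
    """0.5 * ||w||^2 + C * sum of epsilon-insensitive errors."""
    reg = 0.5 * sum(wi * wi for wi in w)
    slack = 0.0
    for xo, yo in zip(X, y):
        pred = sum(wi * xi for wi, xi in zip(w, xo)) + b
        slack += max(0.0, abs(yo - pred) - eps)  # upper or lower deviation beyond eps
    return reg + C * slack


def fit_linear_svr(X: Sequence[Sequence[float]], y: Sequence[float], C: float = 1.0,
                   eps: float = 0.1, lr: float = 0.005, tol: float = 1e-6,
                   max_iter: int = 3000) -> Tuple[List[float], float, List[float]]:
    """Subgradient descent on the primal objective; returns w, b and the loss history."""
    w = [0.0] * len(X[0])
    b = 0.0
    losses = [svr_loss(w, b, X, y, C, eps)]  # initial loss, the starting point
    for _ in range(max_iter):
        grad_w = list(w)  # gradient of the regularization term
        grad_b = 0.0
        for xo, yo in zip(X, y):
            residual = yo - (sum(wi * xi for wi, xi in zip(w, xo)) + b)
            if abs(residual) > eps:  # only samples outside the eps tube contribute
                sign = 1.0 if residual > 0 else -1.0
                for d, xi in enumerate(xo):
                    grad_w[d] -= C * sign * xi
                grad_b -= C * sign
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        losses.append(svr_loss(w, b, X, y, C, eps))
        if abs(losses[-2] - losses[-1]) < tol:  # convergence check on the loss change
            break
    return w, b, losses


# Hypothetical one-feature training data following y = 2x + 1.
X_train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_train = [1.0, 3.0, 5.0, 7.0, 9.0]
w_fit, b_fit, history = fit_linear_svr(X_train, y_train)
print(history[0], history[-1])
```

With a fixed learning rate the loss may oscillate slightly near the optimum instead of meeting the tolerance, which is why the loop also carries an iteration cap.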

Further, the basic data, the preprocessed data set and the value index data of the intellectual property are extracted, transformed and loaded using the ETL tool as follows:

Extraction: intellectual property basic data is obtained in real time from L_L data sources through the real-time data streaming technology Apache Kafka and passed directly into the ETL tool for processing; at the same time, the value index data predicted from the intellectual property behavior data and the preprocessed data set are imported into the ETL tool;

Transformation: for the intellectual property basic data, value index data and preprocessed data set imported into the ETL tool, deduplication and missing-value processing are performed, date formats are unified and field names are adjusted, converting the raw data into the standard format of the target database;

Loading: the transformed data is imported into the target database, for which the table structure, field types and indexes are designed.
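The transformation and loading steps might look like the following sketch. It is illustrative only: an in-memory list stands in for the Apache Kafka stream, SQLite stands in for the target database, and the records, field names, table name and date formats are invented for the example.

```python
import sqlite3
from datetime import datetime
from typing import Dict, List

# Hypothetical raw records as they might arrive from different sources.
RAW_RECORDS = [
    {"Application Name": "Smart valve controller", "AppNo": "CN0001", "Application Date": "2024/12/10"},
    {"Application Name": "Smart valve controller", "AppNo": "CN0001", "Application Date": "2024/12/10"},
    {"Application Name": "Battery cooling plate", "AppNo": "CN0002", "Application Date": "10.12.2024"},
]

# Assumed mapping from source field names to target column names.
FIELD_MAP = {
    "Application Name": "application_name",
    "AppNo": "application_number",
    "Application Date": "application_date",
}
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")


def normalize_date(text: str) -> str:
    """Convert any of the known date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename fields, drop duplicate application numbers, unify date formats."""
    seen = set()
    rows = []
    for record in records:
        row = {FIELD_MAP[key]: value for key, value in record.items()}
        if row["application_number"] in seen:
            continue  # deduplication
        seen.add(row["application_number"])
        row["application_date"] = normalize_date(row["application_date"])
        rows.append(row)
    return rows


def load(rows: List[Dict[str, str]], conn: sqlite3.Connection) -> None:
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_basic ("
        "application_number TEXT PRIMARY KEY, "
        "application_name TEXT NOT NULL, "
        "application_date TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO ip_basic VALUES "
        "(:application_number, :application_name, :application_date)",
        rows,
    )
    conn.commit()


connection = sqlite3.connect(":memory:")
load(transform(RAW_RECORDS), connection)
print(connection.execute("SELECT * FROM ip_basic").fetchall())
```

Using the application number as the primary key also gives the database an index on the field most likely to be used for lookups.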

Further, keyword group B is extracted from the application name using the TF-IDF technique as follows:

Application name preprocessing: the HanLP tool is used to segment the application name into subwords, stop words are removed, and uppercase letters in the text are converted to lowercase, yielding the preprocessed text;

Calculating term frequency: the occurrence frequency of each subword in the preprocessed text is $TF_d = n_d / N$, where $TF_d$ denotes the occurrence frequency of subword $d$, $n_d$ denotes the number of times subword $d$ appears in the preprocessed text, and $N$ denotes the total number of words in the preprocessed text;

Calculating inverse text frequency: the inverse text frequency measures how distinctive a subword is within the corpus; the larger the IDF, the better the subword identifies a specific document. The formula is $IDF_d = \log(M / M_d)$, where $M$ denotes the total number of texts, $M_d$ denotes the number of texts containing subword $d$, and $IDF_d$ denotes the inverse text frequency of subword $d$;

Calculating the TF-IDF value: the occurrence frequency TF of each subword is multiplied by its inverse text frequency to obtain the TF-IDF value, $TFIDF_d = TF_d \times IDF_d$, which reflects the importance of subword $d$ in the text; the higher the TF-IDF value of a subword, the more likely it is a keyword;

Keyword selection: the subwords are sorted by TF-IDF value and a threshold A_Q is set; a subword whose TF-IDF value is greater than the threshold is taken as a keyword, yielding keyword group B;
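A compact sketch of the keyword extraction follows. It assumes the text has already been segmented and cleaned (HanLP is not invoked here), that the document being scored is part of the corpus so every subword has a non-zero document count, and that the threshold is an absolute TF-IDF cutoff; how the 70% figure mentioned for A_Q maps onto such a cutoff is not specified, so the value 0.2 below is purely illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(doc_tokens: Sequence[str], corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """TF-IDF of every distinct token in doc_tokens, relative to corpus."""
    counts = Counter(doc_tokens)
    total_words = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)  # texts containing the word
        tf = count / total_words
        idf = math.log(n_docs / doc_freq)
        scores[word] = tf * idf
    return scores


def extract_keywords(doc_tokens: Sequence[str], corpus: Sequence[Sequence[str]],
                     threshold: float) -> List[str]:
    """Tokens whose TF-IDF exceeds the threshold, highest score first."""
    ranked = sorted(tf_idf(doc_tokens, corpus).items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if score > threshold]


# Hypothetical pre-segmented, lowercased application names.
corpus = [
    ["smart", "valve", "controller"],
    ["battery", "cooling", "plate"],
    ["smart", "battery", "charger"],
]
print(extract_keywords(corpus[0], corpus, threshold=0.2))
```

In this toy corpus "smart" appears in two of three names, so its IDF is low and it falls below the cutoff, while the subwords unique to the first name are kept.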

Further, keyword group A is obtained from the user input text using the TF-IDF technique and matched against keyword group B to obtain the comprehensive matching degree data as follows:

The user input text is preprocessed and keyword group A is extracted using the TF-IDF technique;

Keyword group B is retrieved from the database and matched against keyword group A from the user input;

Vectors of keyword group A and keyword group B are obtained, each constructed from the TF-IDF values of the subwords in the keyword group;

Cosine similarity between keyword group A and keyword group B is calculated: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\|\,\|\vec{B}\|}$, where $\vec{A}$ denotes the vector of keyword group A, $\vec{B}$ denotes the vector of keyword group B, $\vec{A} \cdot \vec{B}$ denotes the inner product, and $\|\vec{A}\|$ and $\|\vec{B}\|$ denote the vector norms; the calculated cosine similarity lies between 0 and 1, and the larger the value, the more similar the two keyword groups;

Jaccard similarity is calculated: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ denotes the size of the intersection of keyword group A and keyword group B, and $|A \cup B|$ denotes the size of their union; the closer the value is to 1, the more similar the two keyword groups, and the closer to 0, the less similar;

From the cosine similarity and the Jaccard similarity, combined with the intellectual property value index, the comprehensive matching degree is calculated: $Z = \lambda_1 \cos(\vec{A}, \vec{B}) + \lambda_2 J(A, B) + \lambda_3 V$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ denote the weights of the cosine similarity, the Jaccard similarity and the intellectual property value index, respectively, and $V$ denotes the intellectual property value index.
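The matching calculation can be sketched as follows. The combination is assumed to be a weighted sum, the value index is assumed to be scaled to the range 0 to 1, and the weights (0.4, 0.3, 0.3) and sample keyword groups are hypothetical. Keyword groups are represented as dictionaries mapping each subword to its TF-IDF value.

```python
import math
from typing import Dict, Tuple


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse TF-IDF vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty keyword group matches nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Size of the keyword intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def comprehensive_match(a: Dict[str, float], b: Dict[str, float], value_index: float,
                        weights: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Assumed weighted sum of cosine similarity, Jaccard similarity and value index."""
    w_cos, w_jac, w_val = weights
    return (w_cos * cosine_similarity(a, b)
            + w_jac * jaccard_similarity(a, b)
            + w_val * value_index)


# Hypothetical keyword groups with TF-IDF values.
group_a = {"valve": 0.5, "controller": 0.5}   # from user input
group_b = {"valve": 0.5, "pump": 0.5}         # stored in the database
print(comprehensive_match(group_a, group_b, value_index=0.8))
```

The two similarity measures complement each other: cosine similarity takes the TF-IDF magnitudes into account, while Jaccard similarity only looks at which keywords the two groups share.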

Further, the recommendation threshold is set based on the comprehensive matching degree data, and whether to recommend is determined, as follows:

The threshold is set by a sliding window method, using a window of fixed length C_K;

The mean and standard deviation of the comprehensive matching degrees in the window are calculated, and an initial recommendation threshold is set: $T_0 = \mu_0 + k\sigma_0$, where $\mu_0$ denotes the mean of the comprehensive matching degrees in the window, $\sigma_0$ denotes their standard deviation, and $k$ denotes the adjustment coefficient;

When a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to give an updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T = \mu + k\sigma$, where $\mu$ denotes the mean of the updated window and $\sigma$ denotes the standard deviation of the updated window;

When the comprehensive matching degree is above the recommendation threshold, the intellectual property is recommended: its matching degree is high and it meets the user's needs;

When the comprehensive matching degree is below the recommendation threshold, the intellectual property is not recommended: its matching degree is low and it does not meet the user's needs.
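The sliding window rule can be sketched with a fixed-length deque. Several points are assumptions: the threshold is taken to be the window mean plus an adjustment coefficient times the window standard deviation, the population standard deviation is used, the new score is compared against the threshold of the updated window, and the class name, window length and coefficient value are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingThreshold:
    """Recommendation threshold recomputed over a fixed-length window of scores."""

    def __init__(self, window_size: int, k: float = 0.5) -> None:
        # deque(maxlen=...) drops the oldest score automatically when full.
        self.window = deque(maxlen=window_size)
        self.k = k  # adjustment coefficient (assumed value)

    def threshold(self) -> float:
        """Mean plus k standard deviations of the scores currently in the window."""
        return mean(self.window) + self.k * pstdev(self.window)

    def observe(self, score: float) -> bool:
        """Slide the window forward with the new score and decide whether to recommend."""
        self.window.append(score)
        return score > self.threshold()


# Hypothetical stream of comprehensive matching degrees.
tracker = SlidingThreshold(window_size=5, k=0.5)
for score in [0.50, 0.55, 0.60, 0.52, 0.58]:
    tracker.observe(score)  # fill the initial window

print(tracker.observe(0.90), tracker.observe(0.60))
```

Because the threshold follows the recent scores, a matching degree that would have been recommended during a run of weak matches may be rejected after a run of strong ones, which is the adaptive behavior the text contrasts with a fixed threshold.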

The technical effects and advantages of the intellectual property operation management system based on big data are as follows:

The invention integrates the behavior data, basic data and predicted value index of intellectual property through big data technology, providing a data-driven intellectual property operation management solution. It predicts the value of the intellectual property using an SVR model, extracts keywords using the TF-IDF technique, and calculates the comprehensive matching degree from cosine similarity and Jaccard similarity, so as to provide users with accurate intellectual property recommendations. The recommendation threshold is adjusted dynamically by a sliding window method, so that the system can adapt in real time to changing data and user requirements, which further improves matching accuracy and recommendation effectiveness;

Meanwhile, keyword matching depends not only on text content but also on the value index of the intellectual property, which improves the relevance of the matching results. The dynamic threshold adjustment mechanism allows the system to adjust its recommendation standard flexibly as the actual data changes, avoiding the matching deviation caused by static rules and further optimizing user experience and recommendation efficiency.

Drawings

FIG. 1 is a schematic diagram of an intellectual property operation management system based on big data according to the present invention;

fig. 2 is a schematic diagram of an intellectual property operation management method based on big data in the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to FIG. 1, the intellectual property operation management system based on big data according to this embodiment comprises:

a data acquisition module, configured to acquire behavior data and basic data of intellectual property;

the basic data comprises the application name, application number, application date, validity period and applicant name;

The behavior data of the intellectual property comprises click count, collection count, download count and citation count features, and is obtained by periodically crawling intellectual property websites using crawler technology;

The application name, application number, application date, validity period and applicant name are obtained through a public query interface or an API provided by an intellectual property database;

The acquisition mode of the preprocessing data set comprises the following steps:

The method for identifying and processing outliers using the isolation forest model comprises:

Stopping condition: iteration stops when the maximum depth of the tree is reached;

The specific way of selecting split features using multiple evaluation indicators includes:

Here, a smaller Gini index is better whereas a larger information gain is better, so the two indicators point in opposite directions; the Gini index therefore needs to be inverted so that both become "larger is better" indicators;

The redundancy is calculated from the information entropy;

The outlier threshold is set as follows:

Based on the preprocessed data set, specific ways of predicting value index data of intellectual property using the SVR model include:

Calculating the initial loss: the initial loss function value of the model, $L = \frac{1}{2}\|w\|^2 + C\sum_{o=1}^{n2}(\xi_o + \xi_o^*)$, is calculated as the starting point for optimization, where $\frac{1}{2}\|w\|^2$ is the regularization term, $w$ is the weight vector of the input features, $b$ is the bias term, $\xi$ is the error term, $C$ is the penalty coefficient, $n2$ is the total number of training samples, $o$ is the sample index, $\xi_o$ is the upper-deviation error of the $o$-th sample (the predicted value is greater than the true value), and $\xi_o^*$ is the lower-deviation error of the $o$-th sample (the predicted value is smaller than the true value);

Checking convergence: after each iteration, the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold;

In the SVR model, the convergence error threshold is set empirically to a fixed value; when the change in the loss function between adjacent iterations is less than this value, the model is considered to have converged.

Specific ways of extracting, converting and loading the base data, the preprocessed data set and the value index data of the intellectual property right by using the ETL tool include:

the data sources include intellectual property databases, scientific paper databases, business analysis platforms, user behavior data, social media, and enterprise internal systems.

The specific method for extracting the keyword group B from the application name by using the TF-IDF technology comprises the following steps:

wherein the threshold A_Q is empirically set to 70%;

HanLP is an efficient, open-source natural language processing toolkit focused mainly on Chinese text. It provides word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, sentiment analysis and other functions, can handle complex language features, and is suitable for application scenarios such as text mining, search engines and information extraction. It offers high accuracy and flexibility and supports multiple languages and customized training.

Based on the text input by the user, keyword group A is obtained using the TF-IDF technique and matched against keyword group B; the method for obtaining the comprehensive matching degree data comprises:

Extracting keywords with TF-IDF and calculating similarity ensures that the user input is understood accurately. Cosine similarity and Jaccard similarity evaluate keyword similarity from different angles, making the matching more comprehensive. Combining these with the value index of the intellectual property further screens for intellectual property with practical value on top of text similarity, which improves the accuracy and practicality of the recommendations;

A user input text box is provided in which the user enters text according to their needs; the user input text is preprocessed, and keyword group A is extracted using the TF-IDF technique;

Keyword group B is retrieved from the database and matched against keyword group A from the user input; the vectors of keyword group A and keyword group B are obtained, each constructed from the TF-IDF values of the subwords in the keyword group;

Based on the comprehensive matching degree data, a recommendation threshold is set, and the specific mode for judging whether to recommend comprises the following steps:

When the comprehensive matching degree is above the recommendation threshold, the intellectual property is recommended: its matching degree is high, it meets the user's needs, and it has value for the user;

When the comprehensive matching degree is below the recommendation threshold, the intellectual property is not recommended: its matching degree is low, it does not meet the user's needs, and it has no value for the user.

According to the embodiment, behavioral data, basic data and predicted value indexes of intellectual property are integrated through a big data technology, a data-driven intellectual property operation management solution is provided, the intellectual property value prediction is carried out through an SVR model, the key word extraction is realized through a TF-IDF technology, the comprehensive matching degree is calculated through cosine similarity and Jaccard similarity, accurate intellectual property recommendation is provided for users, a recommendation threshold is dynamically adjusted through a sliding window method, and the system is ensured to adapt to changed data and user requirements in real time, so that the matching accuracy and recommendation effectiveness are improved;

In addition, the keyword matching not only depends on text content, but also combines value indexes of the intellectual property, thereby improving the correlation of matching results, and the adjustment mechanism of the dynamic threshold ensures that the system can flexibly adjust recommendation standards according to actual data changes, avoids matching deviation caused by static rules, and further optimizes user experience and recommendation efficiency.

Example 2

Referring to FIG. 2, for details not described in this embodiment, see the description of Example 1. This embodiment provides an intellectual property operation management method based on big data, which includes:

Step S1, collecting behavior data and basic data of intellectual property rights;

S2, carrying out missing value and abnormal value processing on behavior data of intellectual property rights to obtain a preprocessing data set;

S3, inputting the preprocessed data set into the SVR model, and predicting to obtain value index data of intellectual property;

S4, extracting, converting and loading intellectual property basic data, a processing data set and value index data by utilizing an ETL tool, extracting a keyword group B from an application name by utilizing a TF-IDF technology, and storing the keyword group B as a new field in a target database;

Step S5, acquiring a keyword A by using a TF-IDF technology based on a text input by a user, and matching the keyword A with a keyword group B to obtain comprehensive matching degree data;

and S6, setting a recommendation threshold based on the comprehensive matching degree data, and judging whether to recommend.

Example 3

This embodiment discloses an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the operation of the intellectual property operation management system based on big data.

Since the electronic device described in this embodiment is the electronic device used to implement the intellectual property operation management system based on big data of the embodiments of the present application, a person skilled in the art can, on the basis of that system, understand the specific implementation of the electronic device and its various modifications; how the electronic device implements the method of the embodiments is therefore not described in detail here. Any electronic device that a person skilled in the art uses to implement the intellectual property operation management system based on big data of the embodiments of the present application falls within the scope of protection of the present application.

The above formulas are all dimensionless and operate on numerical values. They were obtained by collecting a large amount of data and running software simulations to approximate the real situation as closely as possible, and the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to those skilled in the art without departing from the principles of the present invention are intended to be comprehended within the scope of the present invention.

Claims

1. A big data-based intellectual property operation and management system, characterized by comprising:

a data acquisition module, configured to obtain behavior data and basic data of intellectual property;

a data processing module, configured to process missing values and outliers in the behavior data of the intellectual property to obtain a preprocessed data set;

a prediction module, configured to use an SVR model, based on the preprocessed data set, to predict value indicator data of the intellectual property;

a data storage module, configured to use an ETL tool to extract, transform and load the basic data, the preprocessed data set and the value indicator data of the intellectual property, to process the application name with TF-IDF technology and extract keyword group B, and to store keyword group B as a new field in the target database;

a keyword matching module, configured to use TF-IDF technology to obtain keyword group A from the user input text, and to match keyword group A against keyword group B to obtain comprehensive matching degree data;

a recommendation module, configured to set a recommendation threshold based on the comprehensive matching degree data and to determine whether to make a recommendation.

2. The big data-based intellectual property operation and management system according to claim 1, characterized in that the behavior data of the intellectual property includes click count features, favorite count features, download count features and citation count features;

the basic data includes the application name, application number, application date, validity period and applicant name.

3. The big data-based intellectual property operation and management system according to claim 2, characterized in that the method of obtaining the preprocessed data set includes:

stacking the behavior data of the intellectual property to form a data set; processing the missing values of the data set with a forward filling method and a backward filling method to obtain a missing-value-processed data set; then identifying and processing outliers with an isolation forest model, and standardizing the processed data to obtain the preprocessed data set;

wherein an isnull check is used to traverse the data set composed of the click count features, favorite count features, download count features and citation count features and to identify missing values; for each missing value, the forward filling method is used to obtain the forward data value of the missing value and the backward filling method is used to obtain the backward data value; the average of the forward data value and the backward data value is calculated, and the missing value is filled with that average.
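
By way of illustration only, the filling rule described above can be sketched as follows. The function name `fill_missing_with_neighbor_mean` is hypothetical, missing values are represented as `None`, and the handling of a gap at the start or end of the series (where only one neighbor exists) is an assumption that the claim does not address.

```python
from typing import List, Optional


def fill_missing_with_neighbor_mean(values: List[Optional[float]]) -> List[float]:
    """Fill each None with the mean of the nearest valid value before and after it."""
    n = len(values)

    # Forward fill: the last valid value seen at or before each position.
    forward: List[Optional[float]] = [None] * n
    last = None
    for i, v in enumerate(values):
        if v is not None:
            last = v
        forward[i] = last

    # Backward fill: the next valid value seen at or after each position.
    backward: List[Optional[float]] = [None] * n
    nxt = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            nxt = values[i]
        backward[i] = nxt

    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        # Assumption: at a series boundary only one neighbor exists, so use it alone.
        candidates = [c for c in (forward[i], backward[i]) if c is not None]
        if not candidates:
            raise ValueError("series contains no valid values")
        filled.append(sum(candidates) / len(candidates))
    return filled
```

For example, a click-count series with one gap between 1.0 and 3.0 would have the gap filled with 2.0.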

4. The big data-based intellectual property operation and management system according to claim 3, characterized in that the method of using the isolation forest model to identify and process outliers includes:

Initialize hyperparameters: Initialize the hyperparameters of isolation forest;

Select training subset: set SL trees, set the training subset size of each tree to mm, and select a subset from the missing value processing data set for training each time;

Feature selection: Use multiple evaluation metrics to select split features;

Split point selection: For the selected split feature, a split point is randomly selected to divide the data into two subsets;

Recursive splitting: For each subset, repeat feature selection and split point selection, and continue to split the data into smaller subsets until the maximum depth SD of the tree is reached;

Repeat the above process: build M_L trees, each trained on a different subset of the data;

Calculate the path length of each data point: in each tree, calculate the path length $h_j(x_i)$ of the data point $x_i$, that is, the number of split nodes from the root node to the data point, where $h_j(x_i)$ represents the path length of the i-th data point in the j-th tree;

Calculate the average path length over all trees: for each data point $x_i$, calculate the average path length of the data point over all trees, $E[h(x_i)] = \frac{1}{T}\sum_{j=1}^{T} h_j(x_i)$, where $h_j(x_i)$ represents the path length of data point $x_i$ in the j-th tree and $T$ represents the number of trees;

Calculate outlier score: evaluate the outlier score of each data point from its average path length, $s(x_i) = 2^{-E[h(x_i)]/c(m)}$, where $c(m)$ is a constant calculated from the number of samples m and represents the expected path length, with the formula $c(m) = 2\left(\ln(m-1) + \gamma\right) - \frac{2(m-1)}{m}$, where $\gamma$ is a constant;

Stop condition: stop iteration when the maximum depth of the tree is reached;

Output: set the outlier threshold $\theta$; when $s(x_i) \le \theta$, the data point is normal; when $s(x_i) > \theta$, the data point is abnormal. The backward filling method is used to replace the abnormal data points to obtain the preprocessed data set.
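
By way of illustration only, the scoring step can be sketched as follows, assuming the standard isolation forest score in which the average path length is normalized by the expected path length for m samples, and assuming that the constant in that expected path length is Euler's constant. The function names and the convention that a score above the threshold marks an abnormal point are assumptions; tree construction is not shown, and the per-tree path lengths are taken as given.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649015329  # assumed value of the constant in c(m)


def expected_path_length(m: int) -> float:
    """Expected path length c(m) for a subsample of m points (requires m >= 2)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    return 2.0 * (math.log(m - 1) + EULER_GAMMA) - 2.0 * (m - 1) / m


def outlier_score(path_lengths: Sequence[float], m: int) -> float:
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    average = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-average / expected_path_length(m))


def is_outlier(path_lengths: Sequence[float], m: int, threshold: float) -> bool:
    """Assumption: a point is abnormal when its score exceeds the threshold."""
    return outlier_score(path_lengths, m) > threshold
```

A point whose average path length equals the expected path length scores 0.5, so a threshold above 0.5 flags only points that are isolated in fewer splits than expected.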

5. The big data-based intellectual property operation and management system according to claim 4, characterized in that the specific method of using multiple evaluation metrics to select split features includes:

calculating and standardizing the information gain, Gini index and variance of each feature; using the entropy weight method to set weights for the information gain, Gini index and variance; calculating the comprehensive score of each feature, $S_f = w_1\,IG_f + w_2\,G_f + w_3\,V_f$; and selecting the feature with the highest comprehensive score $S_f$ as the split feature;

where $IG_f$, $G_f$ and $V_f$ respectively represent the standardized information gain, Gini index and variance of the f-th feature, and $w_1$, $w_2$ and $w_3$ represent the weights of the information gain, Gini index and variance respectively;

The entropy weight method calculates the weights from the information entropy of each evaluation metric, $e_k = -\frac{1}{\ln n}\sum_{tz=1}^{n} p_{k,tz}\ln p_{k,tz}$, where $e_k$ represents the entropy of the k-th evaluation metric, $p_{k,tz}$ represents the probability distribution of the tz-th feature value of the k-th evaluation metric, and $n$ represents the total number of feature values;

Calculate the redundancy from the information entropy $e_k$: $d_k = 1 - e_k$;

Calculate the weight of each evaluation metric from the redundancy: $w_k = \frac{d_k}{\sum_{k=1}^{3} d_k}$, where 3 represents the total number of evaluation metrics.
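
By way of illustration only, the entropy weight calculation and the comprehensive score can be sketched as follows. The function names are hypothetical, each metric is passed as a list of non-negative standardized values (one per feature), and dividing the entropy by ln n so that it lies in [0, 1] is an assumption taken from the usual form of the entropy weight method.

```python
import math
from typing import List


def entropy_weights(columns: List[List[float]]) -> List[float]:
    """columns[k] holds the standardized, non-negative values of metric k, one per feature."""
    redundancies = []
    for col in columns:
        n = len(col)
        total = sum(col)
        probs = [v / total for v in col]
        # Information entropy of the metric, normalized by ln(n) (assumption).
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        redundancies.append(1.0 - entropy)
    total_redundancy = sum(redundancies)
    if total_redundancy == 0:
        # All metrics are uniform: fall back to equal weights (assumption).
        return [1.0 / len(columns)] * len(columns)
    return [d / total_redundancy for d in redundancies]


def comprehensive_scores(columns: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the metric values for each feature."""
    n_features = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n_features)]


# Hypothetical standardized values for three features:
# information gain, Gini index, variance.
metrics = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
weights = entropy_weights(metrics)
scores = comprehensive_scores(metrics, weights)
best_feature = scores.index(max(scores))
```

The metric whose values differ most across features has the lowest entropy and therefore receives the largest weight.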

6. The big data-based intellectual property operation and management system according to claim 5, characterized in that the specific method of setting the outlier threshold includes:

setting the outlier threshold using neighborhood distance: calculate the mean $\mu$ and standard deviation $\sigma$ of the missing-value-processed data set, calculate the coefficient of variation $CV = \sigma / \mu$, and set the K value, where the quantities used in setting the K value include the number of samples in the data set, an adjustment coefficient and a smoothing factor;

after the K value is determined, for each data point, use the Euclidean distance to calculate the distances between the data point and its K nearest neighbors and average them to obtain the average K-nearest-neighbor distance; set the distance threshold according to a rule of thumb; when the average K-nearest-neighbor distance of a data point is greater than the distance threshold, the data point is an outlier.
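
By way of illustration only, the neighborhood-distance check can be sketched as follows. The function names are hypothetical, and both K and the distance threshold are passed in as parameters because the rules for deriving them are not reproduced here.

```python
import math
from typing import List, Sequence


def mean_knn_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Average Euclidean distance from each point to its k nearest neighbors."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result


def flag_outliers(points: Sequence[Sequence[float]], k: int,
                  distance_threshold: float) -> List[bool]:
    """True where the average k-nearest-neighbor distance exceeds the threshold."""
    return [d > distance_threshold for d in mean_knn_distances(points, k)]
```

With four points on a unit square and a fifth point far away, only the distant point has an average neighbor distance above a threshold of 3.0.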

7. The big data-based intellectual property operation and management system according to claim 6, characterized in that the specific method of using the SVR model to predict the value indicator data of the intellectual property based on the preprocessed data set includes:

The sample set includes J_o groups of samples, each of which includes a preprocessed data set and the corresponding intellectual property value indicator data;

Initialize hyperparameters: select the radial basis function as the kernel function, form hyperparameter combinations of the penalty coefficient, the epsilon parameter and the gamma parameter, and use the grid search method to determine the optimal hyperparameter combination;

Calculate the initial loss function: calculate the initial loss function value of the model, $L = \frac{1}{2}\lVert w\rVert^2 + C\sum_{o=1}^{n2}\left(\xi_o + \xi_o^*\right)$, as the starting point of optimization, where $\frac{1}{2}\lVert w\rVert^2$ represents regularization, $w$ represents the weight vector of the input features, $b$ represents the bias term, $\sum_{o=1}^{n2}(\xi_o + \xi_o^*)$ represents the error term, $C$ is the penalty coefficient, n2 represents the total number of training samples, o represents the sample index, $\xi_o$ represents the error of the o-th sample on the upper deviation, and $\xi_o^*$ represents the error of the o-th sample on the lower deviation;

Iterative optimization: use gradient descent to update the model parameters, the weight vector $w$ and the bias term $b$, in each iteration, and calculate the new loss function value during the update;

Check convergence conditions: After each round of iteration, compare the change in the current loss function value with that of the previous round. When the loss change between two adjacent iterations is less than the preset error threshold, stop the iteration.
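
By way of illustration only, the loss evaluation and the convergence check can be sketched as follows. For brevity the sketch assumes a linear model rather than the radial basis function kernel named above, and it assumes the standard epsilon-insensitive form of the SVR loss; the function names are hypothetical.

```python
from typing import Sequence


def svr_objective(w: Sequence[float], b: float,
                  X: Sequence[Sequence[float]], y: Sequence[float],
                  C: float, epsilon: float) -> float:
    """Regularization term plus C times the summed upper and lower deviation errors."""
    regularization = 0.5 * sum(wi * wi for wi in w)
    slack_total = 0.0
    for x, target in zip(X, y):
        prediction = sum(wi * xi for wi, xi in zip(w, x)) + b
        residual = target - prediction
        upper = max(0.0, residual - epsilon)   # target above the epsilon tube
        lower = max(0.0, -residual - epsilon)  # target below the epsilon tube
        slack_total += upper + lower
    return regularization + C * slack_total


def has_converged(previous_loss: float, current_loss: float, tolerance: float) -> bool:
    """Stop when the loss change between two adjacent iterations is below the tolerance."""
    return abs(previous_loss - current_loss) < tolerance
```

Samples whose prediction error is within epsilon contribute nothing to the loss, which is why only the second sample in the test below adds to the regularization term.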

8. The big data-based intellectual property operation and management system according to claim 7, characterized in that the specific method of using TF-IDF technology to process the application name and extract keyword group B includes:

Application name preprocessing: use the HanLP tool to segment the application name into subwords; remove stop words and convert uppercase letters in the text to lowercase to obtain the preprocessed text;

Count the frequency of each subword in the preprocessed text, $TF_d = \frac{n_d}{N_w}$, where $TF_d$ represents the frequency of occurrence of subword d, $n_d$ represents the number of times subword d appears in the preprocessed text, and $N_w$ represents the total number of words in the preprocessed text;

Calculate the inverse text frequency with the formula $IDF_d = \log\frac{N}{N_d}$, where $N$ represents the total number of texts, $N_d$ represents the number of texts containing subword d, and $IDF_d$ represents the inverse text frequency of subword d;

Calculate the TF-IDF value: multiply the occurrence frequency of each subword by its inverse text frequency, $\text{TF-IDF}_d = TF_d \times IDF_d$;

Sort the subwords by TF-IDF value and set a threshold A_Q; when the TF-IDF value of a subword is greater than the threshold, the subword is used as a keyword, giving keyword group B.
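
By way of illustration only, the keyword extraction can be sketched as follows. The sketch assumes the application names have already been segmented into subwords (the HanLP call is not shown), that the inverse text frequency is the logarithm of the total number of texts divided by the number of texts containing the subword, and that the text being processed is itself part of the corpus. The function names and the sample corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def preprocess(tokens: Sequence[str], stop_words: Set[str]) -> List[str]:
    """Lowercase the subwords and drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words]


def tf_idf_scores(doc: Sequence[str], corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """TF-IDF value of every subword in doc, measured against the corpus."""
    counts = Counter(doc)
    total_words = len(doc)
    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        containing = sum(1 for text in corpus if word in text)
        # Assumption: a subword absent from the corpus gets an IDF of 0.
        idf = math.log(len(corpus) / containing) if containing else 0.0
        scores[word] = tf * idf
    return scores


def extract_keywords(doc: Sequence[str], corpus: Sequence[Sequence[str]],
                     threshold: float) -> List[str]:
    """Subwords whose TF-IDF value exceeds the threshold, highest value first."""
    ranked = sorted(tf_idf_scores(doc, corpus).items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if score > threshold]


corpus = [
    ["battery", "charging", "system"],
    ["battery", "cooling", "system"],
    ["image", "recognition", "system"],
]
keywords = extract_keywords(corpus[0], corpus, threshold=0.1)
```

A subword that appears in every text, such as "system" in the sample corpus, has an inverse text frequency of zero and is never selected.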

9. The big data-based intellectual property operation and management system according to claim 8, characterized in that the specific method of using TF-IDF technology to obtain keyword group A from the user input text and matching it with keyword group B to obtain comprehensive matching degree data includes:

Use TF-IDF technology to extract keyword group A from the user input text;

Obtain the vectors of keyword group A and keyword group B, and use cosine similarity to calculate the similarity between keyword group A and keyword group B: $\cos(\vec{a}, \vec{b}) = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$, where $\vec{a}$ represents the vector of keyword group A, $\vec{b}$ represents the vector of keyword group B, $\cdot$ represents the inner product, and $\lVert\cdot\rVert$ represents the Euclidean norm;

Calculate the Jaccard similarity with the formula $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ represents the size of the intersection of keyword group A and keyword group B, and $|A \cup B|$ represents the size of their union;

Combine the cosine similarity and the Jaccard similarity with the intellectual property value indicator to calculate the comprehensive matching degree: $M = \alpha\cos(\vec{a}, \vec{b}) + \beta J(A, B) + \lambda V$, where $\alpha$, $\beta$ and $\lambda$ respectively represent the weights of the cosine similarity, the Jaccard similarity and the intellectual property value indicator, and $V$ represents the intellectual property value indicator.
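
By way of illustration only, the matching step can be sketched as follows. The sketch assumes that each keyword group is turned into a binary vector over the combined vocabulary of the two groups, and that the comprehensive matching degree is a weighted sum of the two similarities and the value indicator; the function names and the weight values are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def keyword_vectors(group_a: Set[str], group_b: Set[str]) -> Tuple[List[int], List[int]]:
    """Binary vectors over the sorted union of the two keyword groups (assumption)."""
    vocabulary = sorted(group_a | group_b)
    return ([1 if w in group_a else 0 for w in vocabulary],
            [1 if w in group_b else 0 for w in vocabulary])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(group_a: Set[str], group_b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = group_a | group_b
    if not union:
        return 0.0
    return len(group_a & group_b) / len(union)


def comprehensive_match(group_a: Set[str], group_b: Set[str], value_indicator: float,
                        weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of cosine similarity, Jaccard similarity and the value indicator."""
    vec_a, vec_b = keyword_vectors(group_a, group_b)
    w_cos, w_jac, w_val = weights  # hypothetical weight values
    return (w_cos * cosine_similarity(vec_a, vec_b)
            + w_jac * jaccard_similarity(group_a, group_b)
            + w_val * value_indicator)
```

With two shared keywords out of five distinct keywords in total, the Jaccard similarity is 0.4, and the value indicator shifts the final score without changing either similarity.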

10. The big data-based intellectual property operation and management system according to claim 9, characterized in that the specific method of setting a recommendation threshold based on the comprehensive matching degree data and determining whether to make a recommendation includes:

Use the sliding window method to set the threshold, with a window of fixed length C_K;

Calculate the mean and standard deviation of the comprehensive matching degree within the window and set the initial recommendation threshold: $T_0 = \mu_0 + k\,\sigma_0$, where $\mu_0$ represents the mean of the comprehensive matching degree in the window, $\sigma_0$ represents the standard deviation of the comprehensive matching degree in the window, and $k$ represents the adjustment factor;

When a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T = \mu + k\,\sigma$, where $\mu$ represents the mean of the updated window and $\sigma$ represents the standard deviation of the updated window;

when $M \ge T$, where $M$ is the comprehensive matching degree, the intellectual property is recommended: its matching degree is high and it meets the user's needs;

when $M < T$, the intellectual property is not recommended: its matching degree is low and it does not meet the user's needs.
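
By way of illustration only, the sliding-window decision can be sketched as follows. The sketch assumes that the threshold is the window mean plus an adjustment factor times the window standard deviation, that the population standard deviation is used, and that a matching degree equal to the threshold counts as recommended; the class name and the sample values are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Sequence


class SlidingThreshold:
    """Fixed-length window of recent comprehensive matching degrees."""

    def __init__(self, initial_scores: Sequence[float], window_size: int, k: float):
        if window_size < 1 or len(initial_scores) < window_size:
            raise ValueError("need at least window_size initial scores")
        # A deque with maxlen drops the oldest entry automatically on append.
        self.window = deque(list(initial_scores)[-window_size:], maxlen=window_size)
        self.k = k

    def threshold(self) -> float:
        """Window mean plus k times the window standard deviation (assumed form)."""
        return mean(self.window) + self.k * pstdev(self.window)

    def update_and_decide(self, score: float) -> bool:
        """Slide the window, recompute the threshold, and decide on the new score."""
        self.window.append(score)
        return score >= self.threshold()


recommender = SlidingThreshold([0.4, 0.5, 0.6], window_size=3, k=0.5)
first = recommender.update_and_decide(0.9)    # well above the recent scores
second = recommender.update_and_decide(0.55)  # below the updated threshold
```

Because the threshold is recomputed after each new score enters the window, a run of high matching degrees raises the bar for later candidates.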