CN119741156A - Intellectual property operation management system based on big data - Google Patents
- Publication number
- CN119741156A (application CN202411807543.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- intellectual property
- value
- keyword group
- calculate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
- G06Q50/184—Intellectual property management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Technology Law (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Tourism & Hospitality (AREA)
- Entrepreneurship & Innovation (AREA)
- Operations Research (AREA)
- Probability & Statistics with Applications (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of intelligence and discloses an intellectual property operation management system based on big data. The system acquires behavior data and basic data of intellectual property; performs missing value and outlier processing on the behavior data to obtain a preprocessed data set; predicts value index data of the intellectual property with an SVR model based on the preprocessed data set; extracts, transforms and loads the data with an ETL tool; processes application names with the TF-IDF technique to extract a keyword group B, which is stored as a new field in a target database; obtains a keyword group A from user input text with the TF-IDF technique and matches it against the keyword group B to obtain comprehensive matching degree data; and sets a threshold based on the comprehensive matching degree data to decide whether to recommend, thereby improving the efficiency and accuracy of intellectual property operation management.
Description
The invention relates to the technical field of intelligence, in particular to an intellectual property operation management system based on big data.
Background
Existing big-data intellectual property operation management systems expose several problems. First, they do not adopt an intelligent, data-driven prediction model but rely on simple rules and statistics, so the potential commercial value, technical innovation or market influence of an intellectual property cannot be estimated accurately;
Second, most existing systems match keywords based on text data alone and ignore the value index of the intellectual property, so a high textual matching degree does not necessarily mean that the intellectual property has practical value for the user's needs;
Third, existing systems generally use static keyword matching rules and a fixed matching degree threshold. Once the matching standard is set, the system cannot adapt to changes in the data or in user requirements, and a fixed threshold cannot adjust to the nuances among intellectual properties, so matching becomes too loose or too strict and recommendation accuracy suffers.
In view of the above, the present invention proposes an intellectual property operation management system based on big data to solve the above-mentioned problems.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides the following technical solution: an intellectual property operation management system based on big data, comprising:
a data acquisition module, used for acquiring behavior data and basic data of intellectual property;
a data processing module, used for performing missing value and outlier processing on the behavior data of the intellectual property to obtain a preprocessed data set;
a prediction module, used for predicting value index data of the intellectual property with an SVR model based on the preprocessed data set;
a data storage module, used for extracting, transforming and loading the basic data, the preprocessed data set and the value index data of the intellectual property with an ETL tool, processing the application name with the TF-IDF technique to extract a keyword group B, and storing the keyword group B as a new field in a target database;
a keyword matching module, used for obtaining a keyword group A from text input by a user with the TF-IDF technique, and matching the keyword group A against the keyword group B to obtain comprehensive matching degree data;
and a recommendation module, used for setting a recommendation threshold based on the comprehensive matching degree data and deciding whether to recommend.
Further, the behavior data of the intellectual property comprises a click frequency feature, a collection frequency feature, a download frequency feature and a citation frequency feature;
the basic data comprises the application name, application number, application date, validity period and applicant name.
Further, the preprocessed data set is obtained as follows:
the intellectual property behavior data is stacked into a data set; missing values in the data set are handled with forward filling and backward filling to obtain a missing-value-processed data set; outliers are identified and handled with an isolation forest model; and the processed data is standardized to obtain the preprocessed data set;
the data set formed from the click frequency, collection frequency, download frequency and citation frequency features is traversed with the isnull method to identify missing values; for each missing value, the forward data value is obtained by forward filling and the backward data value by backward filling, the average of the two values is calculated, and the missing value is filled with that average.
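A minimal Python sketch of the averaged forward/backward fill described above. The function name and the sample click counts are hypothetical, and the handling of a gap at the start or end of a series (where only one neighbor exists) is an assumption, since the text does not cover that case.

```python
from typing import List, Optional


def fill_missing_with_neighbor_mean(values: List[Optional[float]]) -> List[float]:
    """Fill each None with the mean of its forward-filled and backward-filled values."""
    n = len(values)
    forward: List[Optional[float]] = [None] * n
    backward: List[Optional[float]] = [None] * n

    last = None
    for i in range(n):  # forward fill: carry the last observed value forward
        if values[i] is not None:
            last = values[i]
        forward[i] = last

    nxt = None
    for i in range(n - 1, -1, -1):  # backward fill: carry the next observed value back
        if values[i] is not None:
            nxt = values[i]
        backward[i] = nxt

    filled: List[float] = []
    for v, f, b in zip(values, forward, backward):
        if v is not None:
            filled.append(v)
        elif f is not None and b is not None:
            filled.append((f + b) / 2)
        elif f is not None or b is not None:
            # Assumption: a gap at either end uses the single available neighbor.
            filled.append(f if f is not None else b)
        else:
            raise ValueError("series contains no observed values")
    return filled


# Hypothetical click-frequency series with gaps.
clicks = [10.0, None, 14.0, None, None, 20.0, None]
print(fill_missing_with_neighbor_mean(clicks))
```

The same function would be applied to each of the four behavior features in turn.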
Further, the method for identifying and processing outliers with the isolation forest model comprises the following steps:
initializing hyperparameters: the hyperparameters of the isolation forest are initialized, including the number of trees SL, the maximum depth SD of each tree and the number of samples mm for each tree;
selecting training subsets: SL trees are set, the training subset of each tree has size mm, and one subset is drawn from the missing-value-processed data set for each round of training;
feature selection: the split feature is selected using multiple evaluation indexes;
split point selection: one split point is chosen at random for the selected split feature, dividing the data into two subsets;
recursive splitting: feature selection and split point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;
the above process is repeated to construct the SL trees, each trained on a different subset of the data;
calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ represents the path length of the i-th data point in the j-th tree;
calculating the average path length of each data point over all trees: $\bar{h}(x_i) = \frac{1}{SL}\sum_{j=1}^{SL} h_j(x_i)$, where $h_j(x_i)$ represents the path length of data point $x_i$ on the j-th tree and $SL$ represents the number of trees;
calculating the outlier score from the average path length of each data point: $s(x_i) = 2^{-\bar{h}(x_i)/c(m)}$, where $c(m)$ is a constant calculated from the number of samples m and represents the expected path length: $c(m) = 2\left(\ln(m-1) + \gamma\right) - \frac{2(m-1)}{m}$, where $\gamma$ is a constant (the Euler constant, approximately 0.5772);
stopping condition: iteration stops when the maximum depth of the tree is reached;
outputting the result: an outlier threshold $\theta$ is set; when $s(x_i) < \theta$ the data point is normal, and when $s(x_i) \ge \theta$ the data point is abnormal; abnormal data points are replaced using backward filling to obtain the preprocessed data set.
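The scoring steps above can be sketched in Python as follows. The sketch assumes the standard isolation forest normalization for the expected path length, with the Euler constant; the function names and the example path lengths are hypothetical, and tree construction itself is not shown.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # Euler constant used in the expected path length


def expected_path_length(m: int) -> float:
    """Expected path length c(m) for a tree built on m samples (requires m >= 2)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    return 2.0 * (math.log(m - 1) + EULER_GAMMA) - 2.0 * (m - 1) / m


def outlier_score(path_lengths: Sequence[float], m: int) -> float:
    """Average the per-tree path lengths, then normalize by c(m)."""
    avg = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-avg / expected_path_length(m))


# Hypothetical path lengths of two points across four trees of 256 samples each.
isolated_quickly = outlier_score([2, 3, 2, 3], m=256)
isolated_slowly = outlier_score([11, 12, 10, 13], m=256)
print(isolated_quickly > isolated_slowly)
```

A point that is separated after only a few splits has a short average path and therefore a score closer to 1, which is why short paths indicate outliers.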
Further, the split feature is selected using multiple evaluation indexes as follows:
the information gain, Gini index and variance of each feature are calculated and normalized, the weights of the three indexes are set with the entropy weight method, and the composite score of each feature is calculated: $S_i = w_1 \cdot IG_i + w_2 \cdot (1 - G_i) + w_3 \cdot V_i$, where $IG_i$, $G_i$ and $V_i$ respectively represent the normalized information gain, Gini index and variance of the i-th feature, and $w_1$, $w_2$ and $w_3$ respectively represent the weights of the information gain, Gini index and variance; the feature with the highest composite score $S_i$ is the selected feature;
the entropy weight method calculates the weights from the information entropy of each evaluation index: $E_k = -\frac{1}{\ln n}\sum_{t=1}^{n} p_{k,t}\ln p_{k,t}$, where $E_k$ represents the entropy of the k-th evaluation index, $p_{k,t}$ represents the probability distribution of the t-th feature value of the k-th evaluation index, and $n$ represents the total number of feature values;
the redundancy is calculated from the information entropy $E_k$: $d_k = 1 - E_k$;
the weight of each evaluation index is calculated from the redundancy: $w_k = \frac{d_k}{\sum_{l=1}^{3} d_l}$, where 3 represents the total number of evaluation indexes.
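A minimal Python sketch of the entropy weight method and the composite score, under stated assumptions: each row of the input holds one feature's already-normalized information gain, inverted Gini index and variance; the probability of a value is taken as its share of the column total; and equal weights are used when every index is uniform. The function names and the sample matrix are hypothetical.

```python
import math
from typing import List, Sequence


def entropy_weights(matrix: Sequence[Sequence[float]]) -> List[float]:
    """matrix[t][k] is the normalized, non-negative value of index k for feature t."""
    n = len(matrix)
    if n < 2:
        raise ValueError("at least two features are required")
    n_indexes = len(matrix[0])
    redundancy = []
    for k in range(n_indexes):
        column = [row[k] for row in matrix]
        total = sum(column)
        # Assumption: the probability of each value is its share of the column total.
        probs = [v / total for v in column] if total > 0 else [1.0 / n] * n
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        redundancy.append(max(0.0, 1.0 - entropy))  # clamp rounding noise
    total_d = sum(redundancy)
    if total_d == 0:
        return [1.0 / n_indexes] * n_indexes  # assumption: equal weights if all uniform
    return [d / total_d for d in redundancy]


def composite_scores(matrix: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of the index values for each feature."""
    return [sum(w * x for w, x in zip(weights, row)) for row in matrix]


# Hypothetical rows: [information gain, 1 - Gini index, variance] per feature.
indexes = [[0.9, 0.5, 0.2], [0.1, 0.5, 0.8], [0.5, 0.5, 0.5]]
weights = entropy_weights(indexes)
scores = composite_scores(indexes, weights)
best_feature = max(range(len(scores)), key=scores.__getitem__)
print(weights, best_feature)
```

An index whose values are identical across features carries no discriminating information, so its entropy is maximal and its weight drops to zero.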
Further, the outlier threshold $\theta$ is set as follows:
the outlier threshold $\theta$ is set using neighborhood distances: the mean $\mu$ and standard deviation $\sigma$ of the missing-value-processed data set are calculated, the coefficient of variation $CV = \sigma/\mu$ is calculated, and the K value is set from the number of samples in the data set, an adjustment coefficient and a smoothing factor [expression missing in source];
once the K value is determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance; the distance threshold is set empirically at the 95% level, and if the average K-nearest-neighbor distance of a data point exceeds this 95% distance threshold, the data point is an outlier.
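The neighbor-distance check can be sketched in Python as below. Two assumptions are made: K is passed in directly rather than derived from the coefficient of variation, and the 95% distance threshold is read as the 95th percentile of the average K-nearest-neighbor distances. The function names and sample points are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def mean_knn_distances(points: Sequence[Point], k: int) -> List[float]:
    """Average Euclidean distance from each point to its k nearest neighbors."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    averages = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        averages.append(sum(dists[:k]) / k)
    return averages


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def flag_outliers(points: Sequence[Point], k: int, q: float = 95.0) -> List[bool]:
    """True where a point's average k-NN distance exceeds the q-th percentile."""
    averages = mean_knn_distances(points, k)
    threshold = percentile(averages, q)
    return [d > threshold for d in averages]


# Hypothetical data: a 5 x 4 grid of points plus one distant point.
grid = [(float(x), float(y)) for x in range(5) for y in range(4)]
flags = flag_outliers(grid + [(100.0, 100.0)], k=3)
print(flags[-1])
```

The brute-force neighbor search is quadratic in the number of points; a spatial index would be the usual replacement for larger data sets.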
Further, the value index data of the intellectual property is predicted with the SVR model based on the preprocessed data set as follows:
the sample set comprises J_o groups of samples, each group comprising a preprocessed data set and the corresponding intellectual property value index data;
initializing hyperparameters: a radial basis function is selected as the kernel function, candidate combinations of the penalty coefficient, the epsilon parameter and the gamma parameter are determined, and the optimal hyperparameter combination is found by grid search;
calculating the initial loss: the initial loss function value of the model, $L = \frac{1}{2}\lVert w \rVert^2 + C\sum_{o=1}^{n_2}\left(\xi_o + \xi_o^{*}\right)$, is calculated as the starting point of the optimization, where $\frac{1}{2}\lVert w \rVert^2$ represents the regularization term, $w$ represents the weight vector of the input features, $b$ represents the bias term, $\xi$ represents the error term, $C$ is the penalty coefficient, $n_2$ represents the total number of training samples, $o$ represents the sample index, $\xi_o$ represents the upper-deviation error of the o-th sample, and $\xi_o^{*}$ represents the lower-deviation error of the o-th sample;
iterative optimization: in each iteration the model parameters $w$ and $b$ are updated by gradient descent, and a new loss function value is calculated during the update;
checking the convergence condition: after each iteration the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold.
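To make the loss function concrete, the following Python sketch evaluates it for one candidate pair of weight vector and bias. It is a simplification under stated assumptions: a linear predictor stands in for the radial basis function kernel, the error terms are computed with the usual epsilon-insensitive rule, and grid search and gradient descent are not shown. The function name and sample values are hypothetical.

```python
from typing import Sequence


def svr_loss(
    w: Sequence[float],
    b: float,
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    C: float,
    epsilon: float,
) -> float:
    """Regularization term plus C times the summed upper and lower deviation errors."""
    regularization = 0.5 * sum(wi * wi for wi in w)
    slack_total = 0.0
    for features, target in zip(X, y):
        prediction = sum(wi * xi for wi, xi in zip(w, features)) + b
        # Errors count only the part of the deviation outside the epsilon tube.
        upper = max(0.0, prediction - target - epsilon)  # prediction too high
        lower = max(0.0, target - prediction - epsilon)  # prediction too low
        slack_total += upper + lower
    return regularization + C * slack_total


# Hypothetical one-feature samples.
loss = svr_loss(w=[1.0], b=0.0, X=[[1.0], [2.0]], y=[1.0, 3.0], C=2.0, epsilon=0.5)
print(loss)
```

Because deviations inside the epsilon tube cost nothing, only the second sample contributes to the error sum here.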
Further, the basic data, the preprocessed data set and the value index data of the intellectual property are extracted, transformed and loaded with the ETL tool as follows:
extraction: the basic intellectual property data is obtained from L_L data sources in real time through the real-time data streaming technology Apache Kafka and passed directly into the ETL tool for processing, while the value index data predicted from the intellectual property behavior data and the preprocessed data set are imported into the ETL tool;
transformation: the basic data, value index data and preprocessed data set imported into the ETL tool are deduplicated, their missing values are handled, date formats are unified and field names are adjusted, converting the raw data into the standard format of the target database;
loading: the table structure, field types and indexes of the target database are designed, and the transformed data is imported into the target database.
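The transformation step can be illustrated with a small Python sketch. Everything specific in it is an assumption for illustration: the source and target field names, the accepted date layouts, and the choice to deduplicate on the application number while dropping records that lack one. Extraction through Kafka and loading into the database are not shown.

```python
from datetime import datetime
from typing import Dict, Iterable, List

# Hypothetical source-to-target field names and accepted date layouts.
FIELD_MAP = {"name": "application_name", "app_no": "application_number", "date": "application_date"}
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def normalize_date(text: str) -> str:
    """Convert any accepted layout to ISO format (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename fields, unify dates, and keep the first record per application number."""
    seen = set()
    output = []
    for record in records:
        row = {FIELD_MAP.get(key, key): value for key, value in record.items()}
        number = row.get("application_number")
        if not number or number in seen:
            continue  # drop records without a key and later duplicates
        seen.add(number)
        if row.get("application_date"):
            row["application_date"] = normalize_date(row["application_date"])
        output.append(row)
    return output


raw = [
    {"name": "A", "app_no": "CN1", "date": "2024/12/10"},
    {"name": "A", "app_no": "CN1", "date": "2024-12-10"},
    {"name": "B", "app_no": "CN2", "date": "2024.01.05"},
]
print(transform(raw))
```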
Further, the keyword group B is extracted from the application name with the TF-IDF technique as follows:
preprocessing the application name: the HanLP tool is used to segment the application name into subwords, stop words are removed, and uppercase letters in the text are converted to lowercase to obtain the preprocessed text;
calculating the term frequency of each subword in the preprocessed text: $TF_d = \frac{n_d}{N}$, where $TF_d$ represents the frequency of occurrence of subword d, $n_d$ represents the number of times subword d appears in the preprocessed text, and $N$ represents the total number of words in the preprocessed text;
calculating the inverse text frequency, which measures how distinctive a subword is within the corpus; the larger the IDF, the better the subword identifies a specific text: $IDF_d = \log\frac{M}{M_d}$, where $M$ represents the total number of texts, $M_d$ represents the number of texts containing subword d, and $IDF_d$ represents the inverse text frequency of subword d;
calculating the TF-IDF value: the term frequency of each subword is multiplied by its inverse text frequency, $TFIDF_d = TF_d \times IDF_d$; the TF-IDF value reflects the importance of subword d in the text, and the higher the TF-IDF value, the more likely the subword is a keyword;
the subwords are sorted by TF-IDF value and a threshold A_Q is set; a subword whose TF-IDF value is greater than the threshold is taken as a keyword, yielding the keyword group B.
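The keyword extraction steps can be sketched in Python as follows. The sketch assumes the texts have already been segmented and cleaned (HanLP is not invoked), that the text being scored is itself part of the corpus, and that the inverse text frequency is the plain logarithm of total texts over texts containing the subword. The function names, corpus and threshold value are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tf_idf(corpus: Sequence[Sequence[str]], index: int) -> Dict[str, float]:
    """TF-IDF value of every subword in corpus[index]."""
    tokens = corpus[index]
    counts = Counter(tokens)
    total_words = len(tokens)
    total_texts = len(corpus)
    scores = {}
    for word, n_d in counts.items():
        # The scored text is in the corpus, so texts_with_word is at least 1.
        texts_with_word = sum(1 for text in corpus if word in text)
        scores[word] = (n_d / total_words) * math.log(total_texts / texts_with_word)
    return scores


def extract_keywords(
    corpus: Sequence[Sequence[str]], index: int, threshold: float
) -> List[Tuple[str, float]]:
    """Subwords whose TF-IDF value exceeds the threshold, highest first."""
    scores = tf_idf(corpus, index)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(word, score) for word, score in ranked if score > threshold]


# Hypothetical segmented application names.
names = [
    ["big", "data", "patent", "system"],
    ["data", "storage", "system"],
    ["patent", "search", "engine"],
]
print(extract_keywords(names, index=0, threshold=0.2))
```

Here "big" appears in only one of the three texts, so it is the only subword of the first name that clears the threshold.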
Further, the comprehensive matching degree data is obtained from the text input by the user and the keyword group B as follows:
the text input by the user is preprocessed, and a keyword group A is extracted with the TF-IDF technique;
the keyword group B in the database is retrieved and matched against the keyword group A input by the user;
vectors of the keyword group A and the keyword group B are obtained, each constructed from the TF-IDF values of the subwords in the keyword group;
the cosine similarity of the keyword group A and the keyword group B is calculated: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A}\cdot\vec{B}}{\lVert\vec{A}\rVert\,\lVert\vec{B}\rVert}$, where $\vec{A}$ represents the vector of keyword group A, $\vec{B}$ represents the vector of keyword group B, $\vec{A}\cdot\vec{B}$ represents the inner product, and $\lVert\vec{A}\rVert$ and $\lVert\vec{B}\rVert$ represent the norms of the two vectors; the calculated cosine similarity lies between 0 and 1, and the larger the value, the more similar the two keyword groups;
the Jaccard similarity is calculated: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ represents the size of the intersection of keyword group A and keyword group B and $|A \cup B|$ represents the size of their union; the closer the value is to 1, the more similar the two keyword groups, and the closer to 0, the less similar;
from the cosine similarity and the Jaccard similarity, combined with the intellectual property value index, the comprehensive matching degree is calculated: $P = \lambda_1 \cos(\vec{A}, \vec{B}) + \lambda_2 J(A, B) + \lambda_3 V$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ respectively represent the weights of the cosine similarity, the Jaccard similarity and the intellectual property value index, and $V$ represents the intellectual property value index.
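A minimal Python sketch of the comprehensive matching degree. Each keyword group is represented as a mapping from subword to TF-IDF value; the weight values, function names and sample groups are hypothetical, and the value index is assumed to be scaled to the same 0 to 1 range as the two similarities.

```python
import math
from typing import Dict, Tuple


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse TF-IDF vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Shared keywords divided by all distinct keywords."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def matching_degree(
    a: Dict[str, float],
    b: Dict[str, float],
    value_index: float,
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # hypothetical weights
) -> float:
    """Weighted sum of cosine similarity, Jaccard similarity and the value index."""
    w_cos, w_jac, w_val = weights
    return (
        w_cos * cosine_similarity(a, b)
        + w_jac * jaccard_similarity(a, b)
        + w_val * value_index
    )


# Hypothetical keyword groups: A from the user's text, B from the database.
group_a = {"battery": 0.6, "cooling": 0.3}
group_b = {"battery": 0.5, "electrode": 0.4}
print(matching_degree(group_a, group_b, value_index=0.7))
```

Combining the two similarities with the value index means a textually close record with a low predicted value can still rank below a slightly less similar but more valuable one.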
Further, the recommendation threshold is set based on the comprehensive matching degree data, and the recommendation decision is made, as follows:
the threshold is set with a sliding window method, using a window of fixed length C_K;
the mean and standard deviation of the comprehensive matching degrees in the window are calculated, and an initial recommendation threshold is set: $T_0 = \mu_0 + k\,\sigma_0$, where $\mu_0$ represents the mean of the comprehensive matching degrees in the window, $\sigma_0$ represents their standard deviation, and $k$ represents the adjustment coefficient;
when a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T = \mu + k\,\sigma$, where $\mu$ represents the mean of the updated window and $\sigma$ represents the standard deviation of the updated window;
when the comprehensive matching degree $P$ satisfies $P \ge T$, the intellectual property is recommended: its matching degree is high and it meets the user's needs;
when $P < T$, the intellectual property is not recommended: its matching degree is low and it does not meet the user's needs.
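The sliding window procedure can be sketched in Python as below. Several points are assumptions rather than statements of the source: the threshold is taken as the window mean plus the adjustment coefficient times the standard deviation, the population standard deviation is used, and a score equal to the threshold counts as recommended. The class name, window contents and coefficient value are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable


class SlidingThreshold:
    """Recommendation threshold recomputed over a fixed-length window of scores."""

    def __init__(self, initial_scores: Iterable[float], window_size: int, k: float):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.window = deque(initial_scores, maxlen=window_size)
        if not self.window:
            raise ValueError("the window needs at least one initial score")
        self.k = k

    def threshold(self) -> float:
        """Mean of the window plus k standard deviations (assumed form)."""
        return mean(self.window) + self.k * pstdev(self.window)

    def update_and_decide(self, score: float) -> bool:
        """Slide the window to include the new score, then compare against it."""
        self.window.append(score)
        return score >= self.threshold()


# Hypothetical window of four past matching degrees.
recommender = SlidingThreshold([0.5, 0.5, 0.5, 0.5], window_size=4, k=1.0)
print(recommender.update_and_decide(0.9))
print(recommender.update_and_decide(0.5))
```

Because the threshold follows the recent distribution of scores, a run of strong matches raises the bar and a run of weak ones lowers it.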
The intellectual property operation management system based on big data has the following technical effects and advantages:
the invention integrates the behavior data, basic data and predicted value index of intellectual property through big data technology and provides a data-driven solution for intellectual property operation management. The value of the intellectual property is predicted with the SVR model, keywords are extracted with the TF-IDF technique, and the comprehensive matching degree is calculated from the cosine similarity and the Jaccard similarity, giving users accurate intellectual property recommendations. The recommendation threshold is adjusted dynamically with the sliding window method, so the system adapts in real time to changing data and user requirements, further improving matching accuracy and recommendation effectiveness;
meanwhile, keyword matching depends not only on text content but also on the value index of the intellectual property, which improves the relevance of the matching results, and the dynamic threshold adjustment mechanism lets the system adjust its recommendation standard flexibly as the actual data changes, avoiding the matching deviations caused by static rules and further optimizing user experience and recommendation efficiency.
Drawings
FIG. 1 is a schematic diagram of an intellectual property operation management system based on big data according to the present invention;
FIG. 2 is a schematic diagram of an intellectual property operation management method based on big data according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to FIG. 1, the intellectual property operation management system based on big data of this embodiment comprises:
a data acquisition module, used for acquiring behavior data and basic data of intellectual property;
the basic data comprises the application name, application number, application date, validity period and applicant name;
the behavior data of the intellectual property comprises a click frequency feature, a collection frequency feature, a download frequency feature and a citation frequency feature, and is obtained by periodically crawling intellectual property websites with crawler technology;
the application name, application number, application date, validity period and applicant name are obtained through a public query interface or an API provided by an intellectual property database;
a data processing module, used for performing missing value and outlier processing on the behavior data of the intellectual property to obtain a preprocessed data set;
a prediction module, used for predicting value index data of the intellectual property with an SVR model based on the preprocessed data set;
a data storage module, used for extracting, transforming and loading the basic data, the preprocessed data set and the value index data of the intellectual property with an ETL tool, processing the application name with the TF-IDF technique to extract a keyword group B, and storing the keyword group B as a new field in a target database;
a keyword matching module, used for obtaining a keyword group A from text input by a user with the TF-IDF technique, and matching the keyword group A against the keyword group B to obtain comprehensive matching degree data;
and a recommendation module, used for setting a recommendation threshold based on the comprehensive matching degree data and deciding whether to recommend.
The preprocessed data set is obtained as follows:
the intellectual property behavior data is stacked into a data set; missing values in the data set are handled with forward filling and backward filling to obtain a missing-value-processed data set; outliers are identified and handled with an isolation forest model; and the processed data is standardized to obtain the preprocessed data set;
the data set formed from the click frequency, collection frequency, download frequency and citation frequency features is traversed with the isnull method to identify missing values; for each missing value, the forward data value is obtained by forward filling and the backward data value by backward filling, the average of the two values is calculated, and the missing value is filled with that average.
The method for identifying and processing outliers with the isolation forest model comprises the following steps:
initializing hyperparameters: the hyperparameters of the isolation forest are initialized, including the number of trees SL, the maximum depth SD of each tree and the number of samples mm for each tree;
selecting training subsets: SL trees are set, the training subset of each tree has size mm, and one subset is drawn from the missing-value-processed data set for each round of training;
feature selection: the split feature is selected using multiple evaluation indexes;
split point selection: one split point is chosen at random for the selected split feature, dividing the data into two subsets;
recursive splitting: feature selection and split point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;
the above process is repeated to construct the SL trees, each trained on a different subset of the data;
calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ represents the path length of the i-th data point in the j-th tree;
calculating the average path length of each data point over all trees: $\bar{h}(x_i) = \frac{1}{SL}\sum_{j=1}^{SL} h_j(x_i)$, where $h_j(x_i)$ represents the path length of data point $x_i$ on the j-th tree and $SL$ represents the number of trees;
calculating the outlier score from the average path length of each data point: $s(x_i) = 2^{-\bar{h}(x_i)/c(m)}$, where $c(m)$ is a constant calculated from the number of samples m and represents the expected path length: $c(m) = 2\left(\ln(m-1) + \gamma\right) - \frac{2(m-1)}{m}$, where $\gamma$ is a constant (the Euler constant, approximately 0.5772);
stopping condition: iteration stops when the maximum depth of the tree is reached;
outputting the result: an outlier threshold $\theta$ is set; when $s(x_i) < \theta$ the data point is normal, and when $s(x_i) \ge \theta$ the data point is abnormal; abnormal data points are replaced using backward filling to obtain the preprocessed data set.
The specific way of selecting split features using multiple evaluation metrics includes:
The information gain, Gini index, and variance of each feature are calculated and normalized, the weights of the three indexes are set by the entropy weight method, and the composite score of each feature is calculated as $S_i = w_1 \cdot IG_i + w_2 \cdot (1 - G_i) + w_3 \cdot V_i$, where $IG_i$, $G_i$, and $V_i$ respectively represent the normalized information gain, Gini index, and variance of the $i$-th feature, and $w_1$, $w_2$, and $w_3$ respectively represent the weights of the information gain, Gini index, and variance. The feature with the highest composite score $S_i$ is the selected feature;
Here, $1 - G_i$ is used because a smaller Gini index is better while a larger information gain is better. The two indexes point in opposite directions, so the Gini index is inverted to make both "larger is better" indexes;
The entropy weight method derives the weights from the information entropy of each evaluation index, $E_j = -\frac{1}{\ln n} \sum_{tz=1}^{n} p_{j,tz} \ln p_{j,tz}$, where $E_j$ represents the entropy of the $j$-th evaluation index, $p_{j,tz}$ represents the probability distribution of the $tz$-th feature value of the $j$-th evaluation index, and $n$ represents the total number of feature values;
The redundancy is calculated from the information entropy: $d_j = 1 - E_j$;
The weight of each evaluation index is calculated from the redundancy: $w_j = \frac{d_j}{\sum_{j=1}^{3} d_j}$, where 3 represents the total number of evaluation indexes.
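As an illustrative sketch of the entropy weight method and composite score described above, the following Python code assumes min-max normalization, inversion of the Gini index as one minus its normalized value, and equal weights as a fallback when every index has zero redundancy. These choices and all function names are assumptions for illustration.

```python
import math


def min_max_normalize(column):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def entropy_weights(columns):
    """Weights from redundancy d_j = 1 - E_j, normalized over all indexes."""
    n = len(columns[0])
    if n < 2:
        raise ValueError("at least two feature values are required")
    redundancies = []
    for col in columns:
        total = sum(col)
        probs = [v / total for v in col] if total > 0 else [1.0 / n] * n
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        redundancies.append(1.0 - entropy)
    total_d = sum(redundancies)
    if total_d == 0:
        return [1.0 / len(columns)] * len(columns)  # assumed fallback
    return [d / total_d for d in redundancies]


def composite_scores(info_gain, gini, variance):
    """Composite score per feature; the Gini index is inverted so larger is better."""
    ig = min_max_normalize(info_gain)
    gi = [1.0 - g for g in min_max_normalize(gini)]
    var = min_max_normalize(variance)
    w = entropy_weights([ig, gi, var])
    return [w[0] * a + w[1] * b + w[2] * c for a, b, c in zip(ig, gi, var)]


scores = composite_scores([0.1, 0.5, 0.3], [0.4, 0.1, 0.3], [1.0, 3.0, 2.0])
best_feature = scores.index(max(scores))  # index of the selected split feature
```

In this made-up example the second feature has the highest gain, the lowest Gini index, and the highest variance, so it is selected regardless of the weights.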
The specific way of setting the outlier threshold includes:
The outlier threshold is set using neighborhood distances. The mean and standard deviation of the missing-value-processed data set are calculated, the coefficient of variation is calculated from them, and the K value is set from the coefficient of variation, the number of samples in the data set, an adjustment coefficient, and a smoothing factor;
Once the K value is determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance. The distance threshold is set to 95% by rule of thumb; if a data point's average K-nearest-neighbor distance is greater than the 95% distance threshold, the data point is an outlier.
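The neighborhood-distance check above can be sketched as follows. This is a minimal illustration that takes K as an input, since the K formula is not reproduced here, and that interprets the 95% threshold as the 95th percentile of the average K-nearest-neighbor distances. That interpretation, the linear-interpolation percentile, and the function names are assumptions.

```python
import math


def knn_average_distances(points, k):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    averages = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        averages.append(sum(dists[:k]) / k)
    return averages


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def flag_outliers(points, k, pct=95.0):
    """Flag points whose average k-NN distance exceeds the percentile threshold."""
    averages = knn_average_distances(points, k)
    threshold = percentile(averages, pct)
    return [d > threshold for d in averages]


sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
flags = flag_outliers(sample, k=2)  # only the distant point (10, 10) is flagged
```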
Based on the preprocessed data set, the specific way of predicting the value index data of the intellectual property using the SVR model includes:
The sample set comprises J_o groups of samples, each group comprising a preprocessed data set and the corresponding intellectual property value index data;
Initializing hyperparameters: a radial basis function is selected as the kernel function, candidate combinations of the penalty coefficient, the epsilon parameter, and the gamma parameter are determined, and the optimal hyperparameter combination is determined by grid search;
Calculating the initial loss function value of the model, $L = \frac{1}{2}\lVert w \rVert^2 + C \sum_{o=1}^{n_2} (\xi_o + \xi_o^*)$, as the starting point of the optimization, where $\frac{1}{2}\lVert w \rVert^2$ is the regularization term, $w$ is the weight vector of the input features, $b$ is the bias term of the model, $\sum_{o=1}^{n_2} (\xi_o + \xi_o^*)$ is the error term, $C$ is the penalty coefficient, $n_2$ is the total number of training samples, $o$ is the sample index, $\xi_o$ is the error of the $o$-th sample in the upper deviation (predicted value greater than true value), and $\xi_o^*$ is the error of the $o$-th sample in the lower deviation (predicted value smaller than true value);
Iterative optimization: in each iteration the model parameters $w$ and $b$ are updated by gradient descent, and the new loss function value is calculated during the update;
Checking the convergence condition: after each round of iteration the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold;
In the SVR model, the convergence error threshold is set to a fixed value by rule of thumb; when the change in the loss function between adjacent iterations is smaller than this value, the model has converged.
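The loss calculation, gradient update, and convergence check above can be illustrated with the following sketch. For brevity it uses a linear model trained by subgradient descent rather than the radial-basis-function kernel selected above, and it omits the grid search. The learning rate, tolerance, iteration cap, and function names are hypothetical.

```python
def predict(w, b, x):
    """Linear prediction w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svr_loss(w, b, X, y, C, epsilon):
    """0.5 * ||w||^2 + C * sum of slack outside the epsilon tube."""
    reg = 0.5 * sum(wi * wi for wi in w)
    # For each sample only one of the upper/lower slacks is nonzero,
    # so their sum equals max(0, |error| - epsilon).
    slack = sum(max(0.0, abs(predict(w, b, x) - t) - epsilon) for x, t in zip(X, y))
    return reg + C * slack


def fit_linear_svr(X, y, C=1.0, epsilon=0.1, lr=0.01, tol=1e-6, max_iter=2000):
    """Subgradient descent; stops when the loss change falls below tol."""
    w, b = [0.0] * len(X[0]), 0.0
    prev = svr_loss(w, b, X, y, C, epsilon)
    loss = prev
    for _ in range(max_iter):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularization term is w
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            if abs(err) <= epsilon:
                continue  # inside the tube: no contribution
            sign = 1.0 if err > 0 else -1.0  # upper or lower deviation
            grad_w = [g + C * sign * xj for g, xj in zip(grad_w, x)]
            grad_b += C * sign
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        loss = svr_loss(w, b, X, y, C, epsilon)
        if abs(prev - loss) < tol:
            break  # convergence condition met
        prev = loss
    return w, b, loss


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 2.0, 4.0, 6.0]
initial_loss = svr_loss([0.0], 0.0, X_train, y_train, 1.0, 0.1)
w_fit, b_fit, final_loss = fit_linear_svr(X_train, y_train)
```

With a fixed learning rate the subgradient iterates can oscillate near the optimum, so the iteration cap acts as a second stopping rule alongside the error threshold.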
The specific way of extracting, converting, and loading the basic data, the preprocessed data set, and the value index data of the intellectual property using the ETL tool includes:
Extracting: intellectual property basic data is obtained in real time from L_L data sources through the real-time data streaming technology Apache Kafka and transmitted directly into the ETL tool for processing; at the same time, the preprocessed data set and the value index data predicted from the intellectual property behavior data are imported into the ETL tool;
Converting: for the intellectual property basic data, value index data, and preprocessed data set imported into the ETL tool, deduplication, missing value processing, date format unification, and field name adjustment are performed, converting the original data into the standard format of the target database;
Loading: the converted data is imported into the target database, for which the table structure, field types, and indexes are designed;
The data sources include intellectual property databases, scientific paper databases, business analysis platforms, user behavior data, social media, and enterprise internal systems.
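The converting step above can be sketched in plain Python as follows. The field names, the field mapping, the accepted date formats, and the default value used for missing application names are all hypothetical and chosen only for illustration; the extraction from Apache Kafka and the loading into the target database are not shown.

```python
from datetime import datetime

# Hypothetical mapping from source field names to target-database field names.
FIELD_MAP = {
    "app_no": "application_number",
    "app_name": "application_name",
    "filed": "filing_date",
}
# Hypothetical set of date formats seen in the source data.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def transform(records, default_name="unknown"):
    """Rename fields, drop duplicates, fill missing names, unify date formats."""
    seen, rows = set(), []
    for rec in records:
        row = {FIELD_MAP.get(key, key): value for key, value in rec.items()}
        number = row.get("application_number")
        if number in seen:
            continue  # deduplicate on the application number
        seen.add(number)
        if not row.get("application_name"):
            row["application_name"] = default_name  # missing value processing
        raw_date = row.get("filing_date")
        row["filing_date"] = normalize_date(raw_date) if raw_date else None
        rows.append(row)
    return rows


raw_records = [
    {"app_no": "A1", "app_name": "Battery pack", "filed": "2024/12/10"},
    {"app_no": "A1", "app_name": "Battery pack", "filed": "2024/12/10"},
    {"app_no": "A2", "app_name": "", "filed": "2024.01.05"},
]
clean_rows = transform(raw_records)
```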
The specific method for extracting the keyword group B from the application name using the TF-IDF technique comprises the following steps:
Application name preprocessing: the HanLP tool is used to segment the application name into subwords, stop words are removed, and uppercase letters in the text are converted to lowercase to obtain the preprocessed text;
Calculating the occurrence frequency of each subword in the preprocessed text, $TF_d = \frac{c_d}{W}$, where $TF_d$ is the occurrence frequency of subword $d$, $c_d$ is the number of times subword $d$ appears in the preprocessed text, and $W$ is the total number of words in the preprocessed text;
Calculating the inverse text frequency, which measures how rare a subword is across the corpus; the larger the IDF, the better the subword identifies a specific document. The formula is $IDF_d = \log \frac{N}{n_d}$, where $N$ is the total number of texts, $n_d$ is the number of texts containing subword $d$, and $IDF_d$ is the inverse text frequency of subword $d$;
Calculating the TF-IDF value: the occurrence frequency TF of each subword is multiplied by its inverse text frequency to obtain the TF-IDF value, $TFIDF_d = TF_d \times IDF_d$. The TF-IDF value reflects the importance of subword $d$ in the text; the higher a subword's TF-IDF value, the more likely it is a keyword;
Sorting the subwords by TF-IDF value and setting a threshold A_Q: when the TF-IDF value of a subword is greater than the threshold, the subword is taken as a keyword, giving the keyword group B;
The threshold A_Q is set empirically to 70%;
HanLP is an efficient, open-source natural language processing toolkit focused mainly on Chinese text. It provides word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, sentiment analysis, and other functions, can handle complex language features, and suits application scenarios such as text mining, search engines, and information extraction. It offers high accuracy and flexibility and supports multiple languages and customized training.
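The TF-IDF keyword extraction above can be sketched as follows. To stay self-contained, the sketch replaces HanLP word segmentation with whitespace splitting on English text, uses a small hypothetical stop-word list, and interprets the 70% threshold A_Q as 70% of the highest TF-IDF value in the text. All three are assumptions for illustration.

```python
import math

STOP_WORDS = {"a", "an", "the", "of", "for", "and", "on", "based"}  # hypothetical


def preprocess(text):
    """Lowercase, split on whitespace (stand-in for HanLP), drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def term_frequencies(tokens):
    """TF: occurrences of each subword divided by the total word count."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return {tok: c / len(tokens) for tok, c in counts.items()}


def inverse_text_frequencies(corpus_tokens):
    """IDF: log(total texts / texts containing the subword)."""
    doc_freq = {}
    for tokens in corpus_tokens:
        for tok in set(tokens):
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    total = len(corpus_tokens)
    return {tok: math.log(total / df) for tok, df in doc_freq.items()}


def extract_keywords(text, corpus, ratio=0.7):
    """Keep subwords whose TF-IDF exceeds ratio * the maximum TF-IDF in the text."""
    idf = inverse_text_frequencies([preprocess(doc) for doc in corpus])
    tf = term_frequencies(preprocess(text))
    # Subwords absent from the corpus get an IDF of 0 in this sketch.
    scores = {tok: freq * idf.get(tok, 0.0) for tok, freq in tf.items()}
    if not scores:
        return []
    cutoff = ratio * max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tok for tok, score in ranked if score > cutoff]


names = [
    "Intellectual property management system",
    "Battery management system",
    "Battery thermal control system",
]
keyword_group_b = extract_keywords(names[0], names)
```

In this small corpus "system" appears in every name and receives an IDF of zero, so it is never selected as a keyword.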
Based on the text input by the user, the specific way of obtaining the keyword group A using the TF-IDF technique and matching it with the keyword group B to obtain the comprehensive matching degree data includes:
Keywords are extracted through TF-IDF and their similarity is calculated, which ensures that the user input is understood accurately. Cosine similarity and Jaccard similarity evaluate keyword similarity from different angles, making the matching more comprehensive. Combining these with the value index of the intellectual property further screens for intellectual property of practical value on top of text similarity, improving the accuracy and practicality of the recommendation;
A text box is provided in which the user can enter text as needed to obtain the user input text; the user input text is preprocessed, and the keyword group A is extracted using the TF-IDF technique;
The keyword group B in the database is retrieved and matched with the keyword group A entered by the user. Vectors for the keyword group A and the keyword group B are obtained, each constructed from the TF-IDF values of the subwords in the keyword group;
The similarity of the keyword group A and the keyword group B is calculated using cosine similarity: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\lVert \vec{A} \rVert \, \lVert \vec{B} \rVert}$, where $\vec{A}$ is the vector of the keyword group A, $\vec{B}$ is the vector of the keyword group B, $\vec{A} \cdot \vec{B}$ is their inner product, and $\lVert \vec{A} \rVert$ and $\lVert \vec{B} \rVert$ are the norms of the two vectors. The calculated cosine similarity lies between 0 and 1; the larger the value, the more similar the two keyword groups are;
The Jaccard similarity is calculated with the formula $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ is the size of the intersection of the keyword group A and the keyword group B and $|A \cup B|$ is the size of their union. The closer the value is to 1, the more similar the two keyword groups are; the smaller the value, the less similar they are;
After the cosine similarity and the Jaccard similarity are calculated, the comprehensive matching degree is calculated in combination with the intellectual property value index: $M = \alpha \cdot \cos(\vec{A}, \vec{B}) + \beta \cdot J(A, B) + \lambda \cdot V$, where $\alpha$, $\beta$, and $\lambda$ respectively represent the weights of the cosine similarity, the Jaccard similarity, and the intellectual property value index, and $V$ represents the intellectual property value index.
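The matching calculation above can be illustrated with the following sketch, which represents each keyword group as a mapping from subword to TF-IDF value. The weighted-sum form, the example weights (0.4, 0.3, 0.3), and the assumption that the value index is already scaled to the range 0 to 1 are illustrative choices, not values given in the application.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse TF-IDF vectors (dict: subword -> value)."""
    dot = sum(value * vec_b.get(tok, 0.0) for tok, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(group_a, group_b):
    """Size of the intersection divided by size of the union of two keyword groups."""
    set_a, set_b = set(group_a), set(group_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def comprehensive_match(vec_a, vec_b, value_index, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of cosine similarity, Jaccard similarity, and the value index."""
    w_cos, w_jac, w_val = weights  # hypothetical weights
    return (
        w_cos * cosine_similarity(vec_a, vec_b)
        + w_jac * jaccard_similarity(vec_a, vec_b)
        + w_val * value_index
    )


group_a = {"battery": 0.5, "cooling": 0.5}  # keyword group A from user input
group_b = {"battery": 0.5, "charger": 0.5}  # keyword group B from the database
match = comprehensive_match(group_a, group_b, value_index=1.0)
```

Here the two groups share one of three distinct keywords, giving a cosine similarity of 0.5 and a Jaccard similarity of one third.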
Based on the comprehensive matching degree data, the specific way of setting a recommendation threshold and judging whether to recommend includes:
The threshold is set using a sliding window method, with a window of fixed length C_K;
The mean and standard deviation of the comprehensive matching degrees in the window are calculated, and an initial recommendation threshold is set from the window mean, the window standard deviation, and an adjustment coefficient;
When a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the updated recommendation threshold from the mean and standard deviation of the updated window;
When the comprehensive matching degree reaches the recommendation threshold, the intellectual property is recommended: its matching degree is high, it meets the user's needs, and it has value for the user;
When the comprehensive matching degree is below the recommendation threshold, the intellectual property is not recommended: its matching degree is low, it does not meet the user's needs, and it has no value for the user.
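The sliding window procedure above can be sketched as follows. The sketch assumes the threshold takes the form of the window mean plus the adjustment coefficient times the window standard deviation, uses the population standard deviation, and treats a matching degree equal to the threshold as recommended. The exact form, the default coefficient, and the class name are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingThreshold:
    """Recommendation threshold over a fixed-length window of matching degrees."""

    def __init__(self, window_size, k=1.0):
        # deque with maxlen drops the oldest entry automatically when full
        self.window = deque(maxlen=window_size)
        self.k = k  # adjustment coefficient (hypothetical default)

    def threshold(self):
        """Assumed form: window mean + k * window standard deviation."""
        return mean(self.window) + self.k * pstdev(self.window)

    def update(self, match):
        """Slide the window, recompute the threshold, and decide on this match."""
        self.window.append(match)
        return match >= self.threshold()


decider = SlidingThreshold(window_size=3, k=1.0)
decisions = [decider.update(m) for m in (0.5, 0.5, 0.5, 0.2, 0.9)]
```

Because the threshold is recomputed from the most recent matching degrees only, a run of weak matches lowers the bar and a run of strong matches raises it.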
In this embodiment, behavior data, basic data, and predicted value indexes of intellectual property are integrated through big data technology to provide a data-driven intellectual property operation management solution. Intellectual property value is predicted by the SVR model, keywords are extracted by the TF-IDF technique, and the comprehensive matching degree is calculated from cosine similarity and Jaccard similarity, providing users with accurate intellectual property recommendations. The recommendation threshold is adjusted dynamically by the sliding window method so that the system adapts in real time to changing data and user needs, improving matching accuracy and recommendation effectiveness;
In addition, keyword matching relies not only on text content but also on the value index of the intellectual property, which improves the relevance of the matching results. The dynamic threshold adjustment mechanism lets the system adjust its recommendation standard flexibly as the actual data changes, avoiding the matching bias caused by static rules and further optimizing user experience and recommendation efficiency.
Example 2
Referring to Fig. 2, for details not described in this embodiment, see the description of Embodiment 1. This embodiment provides an intellectual property operation management method based on big data, which includes:
Step S1: collecting behavior data and basic data of intellectual property;
Step S2: processing missing values and abnormal values in the behavior data of the intellectual property to obtain a preprocessed data set;
Step S3: inputting the preprocessed data set into the SVR model and predicting the value index data of the intellectual property;
Step S4: extracting, converting, and loading the intellectual property basic data, the preprocessed data set, and the value index data using an ETL tool, extracting a keyword group B from the application name using the TF-IDF technique, and storing the keyword group B as a new field in the target database;
Step S5: obtaining a keyword group A using the TF-IDF technique based on the text input by the user, and matching the keyword group A with the keyword group B to obtain comprehensive matching degree data;
Step S6: setting a recommendation threshold based on the comprehensive matching degree data and judging whether to recommend.
Example 3
This embodiment discloses an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the operation of the intellectual property operation management system based on big data.
Since the electronic device described in this embodiment is the electronic device used to implement the intellectual property operation management system based on big data in the embodiments of the present application, a person skilled in the art can, based on that system, understand the specific implementation of the electronic device and its variations; how the electronic device implements the method in the embodiments of the present application is therefore not described in detail here. Any electronic device used by a person skilled in the art to implement the intellectual property operation management system based on big data in the embodiments of the present application falls within the scope of protection sought by the present application.
The above formulas are all dimensionless formulas used for numerical calculation. They were obtained by software simulation on a large amount of collected data so as to approximate the real situation as closely as possible, and the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples; all technical solutions falling under the concept of the present invention belong to its protection scope. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention shall also be regarded as within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411807543.4A CN119741156A (en) | 2024-12-10 | 2024-12-10 | Intellectual property operation management system based on big data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411807543.4A CN119741156A (en) | 2024-12-10 | 2024-12-10 | Intellectual property operation management system based on big data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119741156A (en) | 2025-04-01 |
Family
ID=95126067
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411807543.4A (Pending), CN119741156A (en) | 2024-12-10 | 2024-12-10 | Intellectual property operation management system based on big data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119741156A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120031046A (en) * | 2025-04-23 | 2025-05-23 | 苏州元脑智能科技有限公司 | Text semantic segmentation method, device, equipment, medium and product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |