
CN119741156A - Intellectual property operation management system based on big data - Google Patents


Publication number
CN119741156A
Authority
CN
China
Prior art keywords
data
intellectual property
value
keyword group
calculate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411807543.4A
Other languages
Chinese (zh)
Inventor
邹华君
邹尚哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Guoxin Intellectual Property Service Co ltd
Original Assignee
Wuxi Guoxin Intellectual Property Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Guoxin Intellectual Property Service Co ltd
Priority to CN202411807543.4A
Publication of CN119741156A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G06Q50/184 Intellectual property management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Technology Law (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of intelligent technology and discloses an intellectual property operation management system based on big data. The system acquires behavior data and basic data of intellectual property; processes missing values and outliers in the behavior data to obtain a preprocessed data set; predicts value index data of the intellectual property from the preprocessed data set using an SVR model; extracts, transforms, and loads the data using an ETL tool; processes application names with TF-IDF to extract a keyword group B, which is stored as a new field in a target database; extracts a keyword group A from user input text with TF-IDF and matches it against keyword group B to obtain comprehensive matching degree data; and sets a threshold based on the comprehensive matching degree data to decide whether to recommend. This improves the efficiency and accuracy of intellectual property operation management.

Description

Intellectual property operation management system based on big data
The invention relates to the field of intelligent technology, and in particular to an intellectual property operation management system based on big data.
Background
Existing big-data intellectual property operation management systems have several problems. First, they do not use an intelligent, data-driven prediction model; they rely on simple rules and statistics and therefore cannot accurately estimate the potential commercial value, technical innovation, or market influence of intellectual property;
Second, most existing systems match keywords on text data alone and ignore the value index of the intellectual property, so a high textual matching degree does not necessarily mean that the intellectual property has practical value for the user's needs;
Third, existing systems generally use static keyword matching rules and a fixed matching degree threshold. Once the threshold is set, the system cannot adapt to changes in the data or in user requirements, and a fixed threshold cannot account for the nuances among intellectual property items, so matching becomes too loose or too strict and recommendation accuracy suffers.
In view of the above, the present invention proposes an intellectual property operation management system based on big data to solve these problems.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides the following technical solution. An intellectual property operation management system based on big data comprises:
a data acquisition module for acquiring behavior data and basic data of intellectual property;
a data processing module for processing missing values and outliers in the behavior data of the intellectual property to obtain a preprocessed data set;
a prediction module for predicting value index data of the intellectual property from the preprocessed data set using an SVR model;
a data storage module for extracting, transforming, and loading the basic data, the preprocessed data set, and the value index data of the intellectual property using an ETL tool, processing the application name with TF-IDF to extract a keyword group B, and storing keyword group B as a new field in the target database;
a keyword matching module for extracting a keyword group A from user input text using TF-IDF and matching it against keyword group B to obtain comprehensive matching degree data;
and a recommendation module for setting a recommendation threshold based on the comprehensive matching degree data and deciding whether to recommend.
Further, the behavior data of the intellectual property comprises a click frequency feature, a collection (bookmark) frequency feature, a download frequency feature, and a citation frequency feature;
the basic data comprises the application name, application number, application date, validity period, and applicant name.
Further, the preprocessed data set is obtained as follows:
the intellectual property behavior data is stacked into a single data set; missing values in the data set are handled with forward filling and backward filling to obtain a missing-value-processed data set; outliers are identified and handled with an isolation forest model; and the processed data is standardized to obtain the preprocessed data set;
Specifically, the click frequency, collection frequency, download frequency, and citation frequency features in the data set are traversed with an isnull check to identify missing values. For each missing value, forward filling yields the preceding data value and backward filling yields the following data value; the mean of the two values is calculated and used to fill the missing value.
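The filling rule described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function name is hypothetical, missing values are represented as `None`, and the handling of a gap that has only one observed neighbor (at the start or end of the series) is an assumption, since the text does not cover that case.

```python
from typing import List, Optional


def fill_missing_with_neighbor_mean(values: List[Optional[float]]) -> List[float]:
    """Fill each missing entry with the mean of its forward-filled and
    backward-filled values."""
    n = len(values)

    # Forward fill: carry the last observed value forward.
    forward: List[Optional[float]] = [None] * n
    last = None
    for i, v in enumerate(values):
        if v is not None:
            last = v
        forward[i] = last

    # Backward fill: carry the next observed value backward.
    backward: List[Optional[float]] = [None] * n
    nxt = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            nxt = values[i]
        backward[i] = nxt

    filled: List[float] = []
    for v, f, b in zip(values, forward, backward):
        if v is not None:
            filled.append(float(v))
        elif f is not None and b is not None:
            filled.append((f + b) / 2.0)
        elif f is not None:  # assumption: trailing gap uses the forward value only
            filled.append(float(f))
        elif b is not None:  # assumption: leading gap uses the backward value only
            filled.append(float(b))
        else:
            raise ValueError("series contains no observed values")
    return filled
```

For a click-frequency series such as `[None, 4, None, None, 10]`, both interior gaps receive the mean of 4 and 10, and the leading gap takes the only available neighbor.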
Further, identifying and handling outliers with the isolation forest model comprises:
Initializing hyperparameters: the hyperparameters of the isolation forest are initialized, including the number of trees SL, the maximum depth SD of each tree, and the number of samples mm per tree;
Selecting training subsets: SL trees are set up, the training subset size of each tree is set to mm, and one subset is drawn from the missing-value-processed data set to train each tree;
Feature selection: the split feature is selected using multiple evaluation indexes;
Selecting a split point: a split point is chosen at random for the selected split feature, dividing the data into two subsets;
Recursive splitting: feature selection and split-point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;
Repeating the above process to build the SL trees, each tree trained on a different subset of the data;
Calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ of the data point is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ is the path length of the i-th data point in the j-th tree;
Calculating the average path length of each data point over all trees: $E(h(x_i)) = \frac{1}{SL}\sum_{j=1}^{SL} h_j(x_i)$, where $h_j(x_i)$ is the path length of data point $x_i$ on the j-th tree and $SL$ is the number of trees;
Calculating the outlier score from the average path length of each data point: $s(x_i) = 2^{-E(h(x_i))/c(mm)}$, where $c(mm)$ is a constant calculated from the number of samples mm that represents the expected path length: $c(mm) = 2H(mm-1) - \frac{2(mm-1)}{mm}$ with $H(k) \approx \ln k + \gamma$, where $\gamma \approx 0.5772$ is a constant (Euler's constant);
Stopping condition: iteration stops when the maximum depth of the tree is reached;
Outputting the result: an outlier threshold $\theta$ is set; a data point whose outlier score is below $\theta$ is normal, and a data point whose outlier score is above $\theta$ is an outlier; outliers are replaced using backward filling to obtain the preprocessed data set.
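The scoring steps above can be illustrated with a short sketch. It assumes the standard isolation forest formulation, in which the expected path length is built from a harmonic-number approximation with Euler's constant; the exact formula in the source text is not legible, so this form and the function names are assumptions. Tree construction is omitted; the sketch only turns per-tree path lengths into a score.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # Euler's constant used in the harmonic-number approximation


def expected_path_length(m: int) -> float:
    """c(m): expected path length for a subsample of m points
    (standard isolation forest form, assumed here)."""
    if m < 2:
        raise ValueError("subsample size must be at least 2")
    if m == 2:
        return 1.0
    harmonic = math.log(m - 1) + EULER_GAMMA  # H(m - 1) ~ ln(m - 1) + gamma
    return 2.0 * harmonic - 2.0 * (m - 1) / m


def outlier_score(path_lengths: Sequence[float], m: int) -> float:
    """Score in (0, 1]; path_lengths holds one path length per tree."""
    average = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-average / expected_path_length(m))
```

A point that is isolated after only a few splits in every tree gets a score near 1, while a point whose average path length equals `expected_path_length(m)` scores 0.5.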
Further, selecting the split feature using multiple evaluation indexes comprises:
The information gain, Gini index, and variance of each feature are calculated and normalized; weights for the three indexes are set using the entropy weight method; and a composite score is calculated for each feature: $S_k = w_1 IG_k + w_2 G_k + w_3 V_k$, where $IG_k$, $G_k$, and $V_k$ are the normalized information gain, Gini index, and variance of the k-th feature, and $w_1$, $w_2$, and $w_3$ are the weights of information gain, Gini index, and variance, respectively; the feature with the highest composite score $S_k$ is selected as the split feature;
The entropy weight method calculates the weights from the information entropy of each evaluation index: $E_z = -\frac{1}{\ln n}\sum_{tz=1}^{n} p_{z,tz}\ln p_{z,tz}$, where $E_z$ is the entropy of the z-th evaluation index, $p_{z,tz}$ is the probability distribution of the tz-th feature value under the z-th evaluation index, and $n$ is the total number of feature values;
The redundancy is calculated from the information entropy: $d_z = 1 - E_z$;
The weight of each evaluation index is calculated from the redundancy: $w_z = \frac{d_z}{\sum_{z=1}^{3} d_z}$, where 3 is the total number of evaluation indexes.
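As a rough illustration of the entropy weight method and composite scoring, the sketch below assumes the common normalized-entropy form (entropy divided by ln n, redundancy equal to one minus entropy, weights proportional to redundancy). The function names and the input layout are hypothetical; each row holds the already-normalized index values of one feature.

```python
import math
from typing import List, Tuple


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """matrix[k][z] is the normalized, non-negative value of evaluation
    index z for feature k. Returns one weight per evaluation index."""
    n_features = len(matrix)
    n_indexes = len(matrix[0])
    if n_features < 2:
        raise ValueError("need at least two features")
    redundancy = []
    for z in range(n_indexes):
        column = [row[z] for row in matrix]
        total = sum(column)
        if total <= 0:
            raise ValueError("each index column needs a positive sum")
        probs = [v / total for v in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n_features)
        redundancy.append(1.0 - entropy)
    total_redundancy = sum(redundancy)
    if total_redundancy == 0:  # assumption: fall back to equal weights
        return [1.0 / n_indexes] * n_indexes
    return [d / total_redundancy for d in redundancy]


def best_feature(matrix: List[List[float]]) -> Tuple[int, List[float]]:
    """Return the index of the feature with the highest composite score."""
    weights = entropy_weights(matrix)
    scores = [sum(w * v for w, v in zip(weights, row)) for row in matrix]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

An index whose values are identical across all features carries no discriminating information, so its entropy is 1 and its weight falls to approximately zero.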
Further, setting the outlier threshold $\theta$ comprises:
The outlier threshold $\theta$ is set using neighborhood distances: the mean $\mu$ and standard deviation $\sigma$ of the missing-value-processed data set are calculated, the coefficient of variation is calculated as $CV = \sigma / \mu$, and the K value is set by a formula in the number of samples of the data set, an adjustment coefficient, and a smoothing factor [the formula and the coefficient range are not recoverable from the source text];
Once the K value is determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance. The distance threshold is set to 95% by rule of thumb; if the average K-nearest-neighbor distance of a data point is greater than the 95% distance threshold, the data point is an outlier.
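The neighborhood-distance check can be sketched as follows. Two points are assumptions: the 95% distance threshold is read here as the 95th percentile of the average K-nearest-neighbor distances, and K is passed in directly because its formula is not legible in the source. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_knn_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Mean Euclidean distance from each point to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(distances[:k]) / k)
    return result


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0 to 100) with linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def knn_outlier_indexes(points: Sequence[Sequence[float]], k: int, q: float = 95.0) -> List[int]:
    """Indexes of points whose average k-NN distance exceeds the q-th percentile."""
    averages = average_knn_distances(points, k)
    threshold = percentile(averages, q)
    return [i for i, d in enumerate(averages) if d > threshold]
```

With a tight cluster of points plus one distant point, only the distant point ends up above the percentile threshold.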
Further, predicting the value index data of the intellectual property with the SVR model based on the preprocessed data set comprises:
Constructing the sample set: the sample set comprises J_o groups of samples, and each group comprises a preprocessed data set and the corresponding intellectual property value index data;
Initializing hyperparameters: a radial basis function is selected as the kernel function; the penalty coefficient, the ε parameter, and the γ parameter form the hyperparameter combination; and the optimal hyperparameter combination is determined by grid search;
Calculating the initial loss: the initial loss function value of the model is calculated as the starting point of the optimization: $L = \frac{1}{2}\lVert w\rVert^2 + C\sum_{o=1}^{n2}\left(\xi_o + \xi_o^*\right)$, where $\frac{1}{2}\lVert w\rVert^2$ is the regularization term, $w$ is the weight vector of the input features, $b$ is the bias term of the model, $\xi$ is the error term, $C$ is the penalty coefficient, $n2$ is the total number of training samples, $o$ is the sample index, $\xi_o$ is the upper deviation error of the o-th sample, and $\xi_o^*$ is the lower deviation error of the o-th sample;
Iterative optimization: in each iteration, the model parameters $w$ and $b$ are updated by gradient descent, and a new loss function value is calculated;
Checking convergence: after each iteration, the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold.
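The loss calculation, iterative update, and convergence check can be illustrated with a deliberately simplified sketch. It uses a linear model instead of the radial basis function kernel named in the text, uses a subgradient step because the ε-insensitive loss is not differentiable everywhere, and omits grid search; all names and default values are hypothetical.

```python
from typing import List, Sequence, Tuple


def svr_loss(w: Sequence[float], b: float, X: Sequence[Sequence[float]],
             y: Sequence[float], C: float, eps: float) -> float:
    """0.5 * ||w||^2 + C * sum of epsilon-insensitive errors."""
    reg = 0.5 * sum(wi * wi for wi in w)
    slack = 0.0
    for xi, yi in zip(X, y):
        pred = sum(wi * xij for wi, xij in zip(w, xi)) + b
        slack += max(0.0, abs(yi - pred) - eps)
    return reg + C * slack


def fit_linear_svr(X: Sequence[Sequence[float]], y: Sequence[float], C: float = 1.0,
                   eps: float = 0.1, lr: float = 0.01, tol: float = 1e-6,
                   max_iter: int = 5000) -> Tuple[List[float], float, float]:
    """Subgradient descent on the linear SVR objective; stops when the loss
    change between two adjacent iterations is below tol."""
    w = [0.0] * len(X[0])
    b = 0.0
    prev = svr_loss(w, b, X, y, C, eps)
    cur = prev
    for _ in range(max_iter):
        grad_w = list(w)  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            residual = yi - (sum(wi * xij for wi, xij in zip(w, xi)) + b)
            if residual > eps:       # prediction too low
                sign = -1.0
            elif residual < -eps:    # prediction too high
                sign = 1.0
            else:                    # inside the epsilon tube: no error gradient
                continue
            for j, xij in enumerate(xi):
                grad_w[j] += C * sign * xij
            grad_b += C * sign
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        cur = svr_loss(w, b, X, y, C, eps)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return w, b, cur
```

With a fixed step size the loss may oscillate slightly near the optimum, so in practice the error threshold and learning rate would be tuned together.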
Further, extracting, transforming, and loading the basic data, the preprocessed data set, and the value index data of the intellectual property with the ETL tool comprises:
Extraction: the intellectual property basic data is obtained in real time from L_L data sources through the real-time data streaming technology Apache Kafka and passed directly into the ETL tool for processing; at the same time, the value index data predicted from the intellectual property behavior data and the preprocessed data set are imported into the ETL tool;
Transformation: for the intellectual property basic data, value index data, and preprocessed data set imported into the ETL tool, duplicates are removed, missing values are handled, date formats are unified, and field names are adjusted, converting the raw data into the standard format of the target database;
Loading: the transformed data is imported into the target database, and the table structure, field types, and indexes of the target database are designed;
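The transformation step can be sketched in plain Python. Everything specific here is hypothetical: the source field names, the target field names, the accepted date formats, and the choice to drop records that lack an application number or date are illustrative assumptions, not details from the source. Extraction from Kafka and loading into a database are omitted.

```python
from datetime import datetime
from typing import Dict, Iterable, List

# Hypothetical mapping from raw field names to target-database field names.
FIELD_MAP = {
    "appName": "application_name",
    "appNo": "application_number",
    "appDate": "application_date",
}
# Hypothetical set of date formats seen in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def normalize_date(text: str) -> str:
    """Convert a date string in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rename fields, drop incomplete and duplicate records, unify dates."""
    seen = set()
    rows = []
    for record in records:
        row = {FIELD_MAP.get(key, key): value for key, value in record.items()}
        number = row.get("application_number")
        raw_date = row.get("application_date")
        if not number or not raw_date:  # assumption: incomplete records are dropped
            continue
        if number in seen:              # de-duplicate on the application number
            continue
        seen.add(number)
        row["application_date"] = normalize_date(raw_date)
        rows.append(row)
    return rows
```

De-duplicating on the application number is one reasonable key choice; a real pipeline might instead impute missing fields or key on a combination of columns.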
Further, extracting keyword group B from the application name with TF-IDF comprises:
Application name preprocessing: the application name is segmented into sub-words with the HanLP tool, stop words are removed, and uppercase letters in the text are converted to lowercase to obtain the preprocessed text;
Calculating the term frequency: the occurrence frequency of each sub-word in the preprocessed text is $TF_d = \frac{n_d}{N}$, where $TF_d$ is the occurrence frequency of sub-word d, $n_d$ is the number of times sub-word d appears in the preprocessed text, and $N$ is the total word count of the preprocessed text;
Calculating the inverse text frequency: the inverse text frequency measures how common a sub-word is across the corpus; the larger the IDF, the better the sub-word discriminates a specific document: $IDF_d = \log\frac{M}{M_d}$, where $M$ is the total number of texts, $M_d$ is the number of texts containing sub-word d, and $IDF_d$ is the inverse text frequency of sub-word d;
Calculating the TF-IDF value: the term frequency TF of each sub-word is multiplied by its inverse text frequency to obtain the TF-IDF value, which reflects the importance of sub-word d in the text; the higher the TF-IDF value of a sub-word, the more likely it is a keyword;
The sub-words are sorted by TF-IDF value and a threshold A_Q is set; a sub-word whose TF-IDF value is greater than the threshold is taken as a keyword, yielding keyword group B;
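A compact sketch of the keyword extraction follows. It assumes the text has already been segmented, lowercased, and stripped of stop words (the HanLP step is not reproduced), uses the unsmoothed form of IDF, and treats a sub-word that appears in no corpus text as appearing in one to avoid division by zero; these choices and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf_scores(tokens: Sequence[str], corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """TF-IDF of each sub-word in tokens, with IDF computed over corpus."""
    counts = Counter(tokens)
    total_words = len(tokens)
    total_texts = len(corpus)
    scores = {}
    for term, count in counts.items():
        containing = sum(1 for text in corpus if term in text)
        containing = max(containing, 1)  # assumption: guard against unseen terms
        tf = count / total_words
        idf = math.log(total_texts / containing)
        scores[term] = tf * idf
    return scores


def extract_keywords(tokens: Sequence[str], corpus: Sequence[Sequence[str]],
                     threshold: float) -> List[str]:
    """Sub-words whose TF-IDF exceeds the threshold, highest score first."""
    scores = tf_idf_scores(tokens, corpus)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, score in ranked if score > threshold]
```

In a three-text corpus, a sub-word unique to one application name scores about 0.37 there, while one shared by two texts scores about 0.14, so a threshold of 0.2 keeps only the distinctive sub-word.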
Further, extracting keyword group A from the user input text with TF-IDF and matching it against keyword group B to obtain comprehensive matching degree data comprises:
The user input text is preprocessed, and keyword group A is extracted with TF-IDF;
Keyword group B is retrieved from the database and matched against keyword group A from the user input;
Vectors are obtained for keyword group A and keyword group B, each constructed from the TF-IDF values of the sub-words in the keyword group;
The similarity of keyword group A and keyword group B is calculated with cosine similarity: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A}\cdot\vec{B}}{\lVert\vec{A}\rVert\,\lVert\vec{B}\rVert}$, where $\vec{A}$ is the vector of keyword group A, $\vec{B}$ is the vector of keyword group B, $\vec{A}\cdot\vec{B}$ is the inner product, and $\lVert\vec{A}\rVert$ and $\lVert\vec{B}\rVert$ are the vector norms. The calculated cosine similarity lies between 0 and 1, and the larger the value, the more similar the two keyword groups;
The Jaccard similarity is calculated: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ is the size of the intersection of keyword group A and keyword group B and $|A \cup B|$ is the size of their union; the closer the value is to 1, the more similar the two keyword groups, and the closer to 0, the less similar;
The comprehensive matching degree is calculated from the cosine similarity, the Jaccard similarity, and the intellectual property value index: $M = \alpha\cos(\vec{A}, \vec{B}) + \beta J(A, B) + \delta V$, where $\alpha$, $\beta$, and $\delta$ are the weights of cosine similarity, Jaccard similarity, and the intellectual property value index, respectively, and $V$ is the intellectual property value index.
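The matching calculation can be sketched as below, with each keyword group represented as a mapping from sub-word to TF-IDF value. The weighted-sum form of the comprehensive matching degree follows the description, but the default weight values (0.4, 0.3, 0.3) are purely illustrative assumptions, as are the function names.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two keyword groups given as {sub-word: TF-IDF}."""
    vocabulary = sorted(set(a) | set(b))
    vec_a = [a.get(term, 0.0) for term in vocabulary]
    vec_b = [b.get(term, 0.0) for term in vocabulary]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Size of the keyword intersection divided by the size of the union."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)


def comprehensive_match(a: Dict[str, float], b: Dict[str, float], value_index: float,
                        w_cos: float = 0.4, w_jac: float = 0.3, w_val: float = 0.3) -> float:
    """Weighted sum of cosine similarity, Jaccard similarity, and value index.
    The default weights are illustrative only."""
    return (w_cos * cosine_similarity(a, b)
            + w_jac * jaccard_similarity(a, b)
            + w_val * value_index)
```

Because the value index enters the sum directly, it should be scaled to the same 0 to 1 range as the two similarity measures before it is combined with them.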
Further, setting the recommendation threshold based on the comprehensive matching degree data and deciding whether to recommend comprises:
The threshold is set using a sliding window method, with a window of fixed length C_K;
The mean and standard deviation of the comprehensive matching degree within the window are calculated, and an initial recommendation threshold is set: $T_0 = \mu_0 + \lambda\sigma_0$, where $\mu_0$ is the mean of the comprehensive matching degrees in the window, $\sigma_0$ is their standard deviation, and $\lambda$ is the adjustment coefficient;
When a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T = \mu + \lambda\sigma$, where $\mu$ is the mean of the updated window and $\sigma$ is its standard deviation;
When the comprehensive matching degree is above the recommendation threshold $T$, the intellectual property is recommended: its matching degree is high and it meets the user's needs;
When the comprehensive matching degree is below the recommendation threshold $T$, the intellectual property is not recommended: its matching degree is low and it does not meet the user's needs.
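The sliding-window threshold can be sketched with a fixed-length deque. The threshold form (window mean plus an adjustment coefficient times the standard deviation), the use of the population standard deviation, and the strict "greater than" comparison are assumptions made for this illustration; the class and parameter names are hypothetical.

```python
import statistics
from collections import deque
from typing import Iterable, Tuple


class SlidingThreshold:
    """Recommendation threshold recomputed over a fixed-length window
    of recent comprehensive matching degrees."""

    def __init__(self, window_size: int, adjustment: float, initial: Iterable[float]):
        # A deque with maxlen drops the oldest entry automatically on append.
        self.window = deque(initial, maxlen=window_size)
        self.adjustment = adjustment

    def threshold(self) -> float:
        mean = statistics.fmean(self.window)
        std = statistics.pstdev(self.window)
        return mean + self.adjustment * std

    def update_and_decide(self, match_degree: float) -> Tuple[bool, float]:
        """Slide the window, recompute the threshold, and decide."""
        self.window.append(match_degree)
        current = self.threshold()
        return match_degree > current, current
```

Starting from a window of four scores of 0.5 with an adjustment coefficient of 1.0, a new score of 0.9 raises the threshold to about 0.77 and is recommended, while a following score of 0.5 is not.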
The technical effects and advantages of the intellectual property operation management system based on big data are as follows:
The invention integrates the behavior data, basic data, and predicted value indexes of intellectual property through big data technology to provide a data-driven intellectual property operation management solution. It predicts the value of intellectual property with an SVR model, extracts keywords with TF-IDF, and calculates the comprehensive matching degree from cosine similarity and Jaccard similarity to provide users with accurate intellectual property recommendations. The recommendation threshold is adjusted dynamically with a sliding window method, so the system adapts in real time to changing data and user requirements, which further improves matching accuracy and recommendation effectiveness;
In addition, keyword matching relies not only on text content but also on the value index of the intellectual property, which improves the relevance of the matching results. The dynamic threshold adjustment mechanism lets the system adjust the recommendation standard flexibly as the actual data changes, avoids the matching bias caused by static rules, and further improves user experience and recommendation efficiency.
Drawings
FIG. 1 is a schematic diagram of the intellectual property operation management system based on big data according to the present invention;
FIG. 2 is a schematic diagram of the intellectual property operation management method based on big data according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
Example 1
Referring to FIG. 1, the intellectual property operation management system based on big data according to the present embodiment comprises:
a data acquisition module for acquiring behavior data and basic data of intellectual property;
the behavior data of the intellectual property comprises click frequency, collection (bookmark) frequency, download frequency, and citation frequency features, and is obtained by periodically crawling intellectual property websites using crawler technology;
the basic data comprises the application name, application number, application date, validity period, and applicant name, and is obtained through a public query interface or an API provided by an intellectual property database;
a data processing module for processing missing values and outliers in the behavior data of the intellectual property to obtain a preprocessed data set;
a prediction module for predicting value index data of the intellectual property from the preprocessed data set using an SVR model;
a data storage module for extracting, transforming, and loading the basic data, the preprocessed data set, and the value index data of the intellectual property using an ETL tool, processing the application name with TF-IDF to extract a keyword group B, and storing keyword group B as a new field in the target database;
a keyword matching module for extracting a keyword group A from user input text using TF-IDF and matching it against keyword group B to obtain comprehensive matching degree data;
and a recommendation module for setting a recommendation threshold based on the comprehensive matching degree data and deciding whether to recommend.
The preprocessed data set is obtained as follows:
the intellectual property behavior data is stacked into a single data set; missing values in the data set are handled with forward filling and backward filling to obtain a missing-value-processed data set; outliers are identified and handled with an isolation forest model; and the processed data is standardized to obtain the preprocessed data set;
Specifically, the click frequency, collection frequency, download frequency, and citation frequency features in the data set are traversed with an isnull check to identify missing values. For each missing value, forward filling yields the preceding data value and backward filling yields the following data value; the mean of the two values is calculated and used to fill the missing value.
Identifying and handling outliers with the isolation forest model comprises:
Initializing hyperparameters: the hyperparameters of the isolation forest are initialized, including the number of trees SL, the maximum depth SD of each tree, and the number of samples mm per tree;
Selecting training subsets: SL trees are set up, the training subset size of each tree is set to mm, and one subset is drawn from the missing-value-processed data set to train each tree;
Feature selection: the split feature is selected using multiple evaluation indexes;
Selecting a split point: a split point is chosen at random for the selected split feature, dividing the data into two subsets;
Recursive splitting: feature selection and split-point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;
Repeating the above process to build the SL trees, each tree trained on a different subset of the data;
Calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ of the data point is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ is the path length of the i-th data point in the j-th tree;
Calculating the average path length of each data point over all trees: $E(h(x_i)) = \frac{1}{SL}\sum_{j=1}^{SL} h_j(x_i)$, where $h_j(x_i)$ is the path length of data point $x_i$ on the j-th tree and $SL$ is the number of trees;
Calculating the outlier score from the average path length of each data point: $s(x_i) = 2^{-E(h(x_i))/c(mm)}$, where $c(mm)$ is a constant calculated from the number of samples mm that represents the expected path length: $c(mm) = 2H(mm-1) - \frac{2(mm-1)}{mm}$ with $H(k) \approx \ln k + \gamma$, where $\gamma \approx 0.5772$ is a constant (Euler's constant);
Stopping condition: iteration stops when the maximum depth of the tree is reached;
Outputting the result: an outlier threshold $\theta$ is set; a data point whose outlier score is below $\theta$ is normal, and a data point whose outlier score is above $\theta$ is an outlier; outliers are replaced using backward filling to obtain the preprocessed data set.
The specific way of selecting split features using multiple evaluation metrics includes:
The information gain, Gini index and variance of each feature are calculated and normalized, weights for the three indexes are set using the entropy weight method, and the comprehensive score of each feature is calculated as $S_t = w_1 \cdot IG_t + w_2 \cdot (1 - G_t) + w_3 \cdot V_t$, where $IG_t$, $G_t$ and $V_t$ respectively represent the normalized information gain, Gini index and variance of the t-th feature, and $w_1$, $w_2$ and $w_3$ respectively represent the weights of the information gain, Gini index and variance; the feature with the highest comprehensive score $S_t$ is taken as the selected feature;
The term $1 - G_t$ is used because a smaller Gini index is better whereas a larger information gain is better; since the two indexes point in opposite directions, the Gini index is inverted so that both become "larger is better" indexes;
The entropy weight method calculates the weights as follows: the information entropy of each evaluation index is calculated as $E_z = -\frac{1}{\ln n}\sum_{tz=1}^{n} p_{z,tz} \ln p_{z,tz}$, where $E_z$ represents the entropy of the z-th evaluation index, $p_{z,tz}$ represents the probability distribution of the tz-th feature value of the z-th evaluation index, and $n$ represents the total number of feature values;
The redundancy is calculated from the information entropy $E_z$ as $d_z = 1 - E_z$;
The weight of each evaluation index is calculated from the redundancy as $w_z = d_z / \sum_{z'=1}^{3} d_{z'}$, where 3 represents the total number of evaluation indexes.
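The entropy weight method and the comprehensive score can be sketched as follows. This assumes the common form of the method (entropy normalized by $\ln n$, redundancy equal to one minus entropy) and assumes the Gini index is inverted as one minus its normalized value; the function names and sample numbers are hypothetical.

```python
import math


def entropy_weights(columns):
    """columns: one list of normalized, non-negative scores per evaluation index."""
    redundancies = []
    for col in columns:
        total = sum(col)
        probs = [v / total for v in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(col))
        redundancies.append(1.0 - entropy)  # redundancy of this index
    norm = sum(redundancies)
    return [d / norm for d in redundancies]


def select_split_feature(info_gain, gini, variance):
    """Return (index of best feature, comprehensive scores of all features)."""
    inverted_gini = [1.0 - g for g in gini]  # make Gini "larger is better"
    w = entropy_weights([info_gain, inverted_gini, variance])
    scores = [
        w[0] * ig + w[1] * g + w[2] * v
        for ig, g, v in zip(info_gain, inverted_gini, variance)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Three candidate features with already-normalized index values.
best, scores = select_split_feature(
    info_gain=[0.9, 0.2, 0.4],
    gini=[0.1, 0.8, 0.5],
    variance=[0.7, 0.3, 0.5],
)
weights = entropy_weights([[0.9, 0.1, 0.5], [0.5, 0.5, 0.5], [0.2, 0.8, 0.4]])
```

An index whose values are identical across features carries no information, so its entropy is 1 and its weight is close to 0, as the second column of the `weights` example shows.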
The specific way of setting the outlier threshold includes:
The outlier threshold is set using neighborhood distances: the mean $\mu$ and standard deviation $\sigma$ of the missing-value-processed data set are calculated, the coefficient of variation $CV = \sigma / \mu$ is calculated, and the K value is set from the number of samples in the data set, an adjustment coefficient and a smoothing factor;
With the K value determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance; the distance threshold is set empirically to 95%, and a data point whose average K-nearest-neighbor distance is greater than the distance threshold is an outlier.
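A minimal sketch of the neighborhood-distance test follows. Two points are assumptions: K is passed in directly because the formula for K is not reproduced here, and the 95% threshold is interpreted as the 95th percentile (nearest-rank) of the average K-nearest-neighbor distances. The function names are hypothetical.

```python
import math


def knn_mean_distances(points, k):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def knn_outliers(points, k, q=95):
    """Indices of points whose average k-NN distance exceeds the threshold."""
    mean_d = knn_mean_distances(points, k)
    threshold = percentile(mean_d, q)
    return [i for i, d in enumerate(mean_d) if d > threshold]


# Twenty points on a line plus one distant point.
points = [(float(i), 0.0) for i in range(20)] + [(100.0, 100.0)]
flagged = knn_outliers(points, k=2)
```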
The specific way of predicting the value index data of intellectual property from the preprocessed data set using the SVR model includes:
The sample set comprises J_o groups of samples, and each group of samples comprises a preprocessed data set and the corresponding intellectual property value index data;
Initializing hyperparameters: a radial basis function is selected as the kernel function, combinations of the penalty coefficient, the epsilon parameter and the gamma parameter are defined, and the optimal hyperparameter combination is determined using a grid search method;
Calculating the initial loss function: the initial loss function value of the model, $L = \frac{1}{2}\lVert w \rVert^2 + C\sum_{o=1}^{n2}(\xi_o + \xi_o^*)$, is calculated as the starting point of the optimization, where $\frac{1}{2}\lVert w \rVert^2$ represents the regularization term, $w$ represents the weight vector of the input features, $b$ represents the bias term, $\xi$ represents the error term, $C$ is the penalty coefficient, n2 represents the total number of training samples, o represents the sample index, $\xi_o$ represents the error of the o-th sample on the upper deviation (predicted value greater than true value), and $\xi_o^*$ represents the error of the o-th sample on the lower deviation (predicted value smaller than true value);
Iterative optimization: in each iteration the model parameters $w$ and $b$ are updated using the gradient descent method, and a new loss function value is calculated during the update;
Checking the convergence condition: after each round of iteration, the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold;
In the SVR model, the convergence error threshold is set empirically to a fixed value; when the change in the loss function between adjacent iterations is less than this value, the model has converged.
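The loss calculation, the gradient-descent update and the convergence check can be sketched as below. This is an illustration under assumptions: it uses a linear model instead of the radial basis function kernel to stay short, it uses subgradients because the epsilon-insensitive error is not differentiable at the tube boundary, and the values of `c`, `eps`, `lr` and `tol` are hypothetical.

```python
def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svr_loss(w, b, xs, ys, c, eps):
    """0.5 * ||w||^2 + C * sum of epsilon-insensitive errors."""
    reg = 0.5 * sum(wi * wi for wi in w)
    err = sum(max(0.0, abs(predict(w, b, x) - y) - eps) for x, y in zip(xs, ys))
    return reg + c * err


def fit_linear_svr(xs, ys, c=1.0, eps=0.1, lr=0.01, tol=1e-6, max_iter=5000):
    """Subgradient descent that stops when the loss change drops below tol."""
    w, b = [0.0] * len(xs[0]), 0.0
    prev = svr_loss(w, b, xs, ys, c, eps)
    cur = prev
    for _ in range(max_iter):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularization term is w
        for x, y in zip(xs, ys):
            r = predict(w, b, x) - y
            if abs(r) <= eps:
                continue  # inside the epsilon tube: no error, no gradient
            sign = 1.0 if r > 0 else -1.0  # upper or lower deviation
            grad_w = [g + c * sign * xi for g, xi in zip(grad_w, x)]
            grad_b += c * sign
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        cur = svr_loss(w, b, xs, ys, c, eps)
        if abs(prev - cur) < tol:  # convergence check between adjacent rounds
            break
        prev = cur
    return w, b, cur


xs = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ys = [0.0, 2.0, 4.0, 6.0, 8.0]
initial_loss = svr_loss([0.0], 0.0, xs, ys, 1.0, 0.1)
w, b, final_loss = fit_linear_svr(xs, ys)
```

With a fixed step size the loss may keep oscillating slightly near the optimum, in which case the loop ends at `max_iter` rather than at the tolerance check.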
The specific way of extracting, converting and loading the basic data, the preprocessed data set and the value index data of the intellectual property using the ETL tool includes:
Extraction: intellectual property basic data is obtained in real time from L_L data sources through the real-time data streaming technology Apache Kafka and passed directly into the ETL tool for processing; at the same time, the value index data predicted from the intellectual property behavior data and the preprocessed data set are imported into the ETL tool;
Conversion: for the intellectual property basic data, value index data and preprocessed data set imported into the ETL tool, deduplication and missing-value processing are performed, date formats are unified and field names are adjusted, converting the original data into the standard format of the target database;
Loading: the converted data is imported into the target database, and the table structure, field types and indexes of the target database are designed;
The data sources include intellectual property databases, scientific paper databases, business analysis platforms, user behavior data, social media, and enterprise internal systems.
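The conversion step (deduplication, missing-value handling, date unification and field renaming) can be illustrated with a small sketch. The source field names, the target field names, the accepted date formats and the `"unknown"` placeholder are all hypothetical; no specific ETL tool or Kafka API is assumed.

```python
from datetime import datetime

# Hypothetical mapping from source field names to target-database field names.
FIELD_MAP = {
    "app_no": "application_number",
    "app_name": "application_name",
    "app_date": "application_date",
}
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(records):
    """Rename fields, drop duplicates, unify dates and fill missing applicants."""
    seen, rows = set(), []
    for rec in records:
        row = {FIELD_MAP.get(key, key): value for key, value in rec.items()}
        number = row.get("application_number")
        if number is None or number in seen:
            continue  # skip records without a key and duplicate records
        seen.add(number)
        row["application_date"] = normalize_date(row.get("application_date") or "")
        if not row.get("applicant_name"):
            row["applicant_name"] = "unknown"
        rows.append(row)
    return rows


records = [
    {"app_no": "CN1", "app_name": "A", "app_date": "2024/12/10", "applicant_name": None},
    {"app_no": "CN1", "app_name": "A", "app_date": "2024/12/10", "applicant_name": None},
    {"app_no": "CN2", "app_name": "B", "app_date": "2024.01.05", "applicant_name": "X"},
]
rows = transform(records)
```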
The specific method for extracting keyword group B from the application name using TF-IDF technology includes:
Application name preprocessing: the application name is segmented with the HanLP tool to obtain sub-words, stop words are removed, and uppercase letters in the text are converted to lowercase to obtain the preprocessed text;
Calculating the frequency of occurrence of each sub-word in the preprocessed text: $TF(d) = \frac{n_d}{N_w}$, where $TF(d)$ represents the frequency of occurrence of word d, $n_d$ represents the number of times word d appears in the preprocessed text, and $N_w$ represents the total number of words in the preprocessed text;
Calculating the inverse text frequency, which measures how rare a sub-word is across the corpus; the larger the IDF, the better the word discriminates a specific document. The formula is $IDF(d) = \log\frac{N_t}{n_t(d)}$, where $N_t$ represents the total number of texts, $n_t(d)$ represents the number of texts containing sub-word d, and $IDF(d)$ represents the inverse text frequency of sub-word d;
Calculating the TF-IDF value: the frequency of occurrence TF of each sub-word is multiplied by its inverse text frequency to obtain the TF-IDF value, which reflects the importance of sub-word d in the text; the higher the TF-IDF value of a sub-word, the more likely it is a keyword;
Sorting: the sub-words are sorted by TF-IDF value and a threshold A_Q is set; a sub-word whose TF-IDF value is greater than the threshold is taken as a keyword, yielding keyword group B;
The threshold A_Q is set empirically to 70%;
HanLP is an efficient, open-source natural language processing toolkit focused mainly on Chinese text. It provides word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, sentiment analysis and other functions, and can handle complex language features. It is suitable for application scenarios such as text mining, search engines and information extraction, offers high accuracy and flexibility, and supports multiple languages and customized training, making it an important tool in the field of Chinese language processing.
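The TF-IDF steps can be sketched on texts that have already been segmented into sub-words (the HanLP segmentation call itself is not shown). Two assumptions are made: the 70% threshold A_Q is interpreted as 70% of the highest TF-IDF value in the text, and a sub-word absent from the corpus is treated as appearing in one text to avoid division by zero. The function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokens, corpus):
    """TF-IDF value of every sub-word in one preprocessed text."""
    counts = Counter(tokens)
    total_words = len(tokens)
    total_texts = len(corpus)
    scores = {}
    for word, count in counts.items():
        texts_with_word = sum(1 for text in corpus if word in text)
        idf = math.log(total_texts / max(texts_with_word, 1))
        scores[word] = (count / total_words) * idf
    return scores


def extract_keywords(tokens, corpus, ratio=0.7):
    """Sub-words whose TF-IDF exceeds ratio * (highest TF-IDF), best first."""
    scores = tf_idf(tokens, corpus)
    if not scores:
        return []
    cutoff = ratio * max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, score in ranked if score > cutoff]


# Three already-segmented application names.
corpus = [
    ["big", "data", "system"],
    ["data", "storage", "system"],
    ["image", "system"],
]
scores = tf_idf(corpus[0], corpus)
keywords = extract_keywords(corpus[0], corpus)
```

In this example "system" appears in every text, so its IDF and TF-IDF are 0, while "big" appears in only one text and is kept as the keyword.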
The specific way of obtaining keyword A from the user input text using TF-IDF technology and matching it with keyword group B to obtain the comprehensive matching degree data includes:
Extracting keywords through TF-IDF and calculating similarity ensures that the user input is understood accurately; cosine similarity and Jaccard similarity evaluate keyword similarity from different angles, making the matching more comprehensive; in addition, combining the value indexes of the intellectual property allows intellectual property with practical value to be further screened on the basis of text similarity, improving the accuracy and practicality of the recommendation;
A user input text box is provided in which the user enters text as needed to obtain the user input text; the user input text is preprocessed, and keyword group A is extracted using TF-IDF technology;
Keyword group B is retrieved from the database and matched with keyword group A entered by the user; the vectors of keyword group A and keyword group B are obtained, each vector being constructed from the TF-IDF values of the sub-words in the keyword group;
The similarity of keyword group A and keyword group B is calculated using cosine similarity: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\lVert \vec{A} \rVert \, \lVert \vec{B} \rVert}$, where $\vec{A}$ represents the vector of keyword group A, $\vec{B}$ represents the vector of keyword group B, $\vec{A} \cdot \vec{B}$ represents the inner product, and $\lVert \vec{A} \rVert$ and $\lVert \vec{B} \rVert$ represent the Euclidean norms; the calculated cosine similarity lies between 0 and 1, and the larger the value, the more similar the two keyword groups;
The Jaccard similarity is calculated with the formula $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ represents the size of the intersection of keyword group A and keyword group B, and $|A \cup B|$ represents the size of their union; the closer the value is to 1, the more similar the two keyword groups, and the smaller the value, the less similar;
From the cosine similarity and the Jaccard similarity, combined with the intellectual property value index, the comprehensive matching degree is calculated: $Z = w_c \cdot \cos(\vec{A}, \vec{B}) + w_J \cdot J(A, B) + w_V \cdot V$, where $w_c$, $w_J$ and $w_V$ respectively represent the weights of the cosine similarity, the Jaccard similarity and the intellectual property value index, and $V$ represents the intellectual property value index.
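The matching calculation can be sketched with keyword groups represented as dictionaries that map each sub-word to its TF-IDF value. The weights `(0.4, 0.3, 0.3)` and the sample values are hypothetical, and the value index is assumed to be scaled to the range 0 to 1.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse TF-IDF vectors given as dicts."""
    words = set(vec_a) | set(vec_b)
    dot = sum(vec_a.get(w, 0.0) * vec_b.get(w, 0.0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(group_a, group_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(group_a), set(group_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matching_degree(vec_a, vec_b, value_index, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of cosine similarity, Jaccard similarity and value index."""
    w_cos, w_jac, w_val = weights
    cos = cosine_similarity(vec_a, vec_b)
    jac = jaccard_similarity(vec_a, vec_b)  # dict keys are the keywords
    return w_cos * cos + w_jac * jac + w_val * value_index


group_a = {"big": 0.5, "data": 0.5}        # from the user input text
group_b = {"data": 0.4, "storage": 0.6}    # stored with an application name
degree = matching_degree(group_a, group_b, value_index=0.8)
```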
The specific way of setting a recommendation threshold based on the comprehensive matching degree data and determining whether to recommend includes:
The threshold is set using a sliding window method, with a window of fixed length C_K;
The mean and standard deviation of the comprehensive matching degrees in the window are calculated, and the initial recommendation threshold is set: $T = \mu + k \cdot \sigma$, where $\mu$ represents the mean of the comprehensive matching degrees in the window, $\sigma$ represents the standard deviation of the comprehensive matching degrees in the window, and $k$ represents the adjustment coefficient;
When a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T' = \mu' + k \cdot \sigma'$, where $\mu'$ represents the mean of the updated window and $\sigma'$ represents the standard deviation of the updated window;
When the comprehensive matching degree reaches the recommendation threshold, the intellectual property is recommended: its matching degree is high, it meets the user's needs, and it is of value to the user;
When the comprehensive matching degree falls below the recommendation threshold, the intellectual property is not recommended: its matching degree is low, it does not meet the user's needs, and it is of no value to the user.
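The sliding-window threshold can be sketched as follows. The form "mean plus adjustment coefficient times standard deviation", the use of the population standard deviation, the value `k=1.0` and the class name are assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingThreshold:
    """Recommendation threshold recomputed over a fixed-length window."""

    def __init__(self, window_size, k=1.0):
        self.window = deque(maxlen=window_size)  # oldest value drops out automatically
        self.k = k  # adjustment coefficient

    def threshold(self):
        return mean(self.window) + self.k * pstdev(self.window)

    def update(self, degree):
        """Slide the window forward with a new matching degree."""
        self.window.append(degree)
        return self.threshold()

    def recommend(self, degree):
        """Add the new degree, then compare it with the updated threshold."""
        return degree >= self.update(degree)


tracker = SlidingThreshold(window_size=3, k=1.0)
for value in (0.5, 0.5, 0.5):
    tracker.update(value)
initial_threshold = tracker.threshold()
high = tracker.recommend(0.9)   # window becomes [0.5, 0.5, 0.9]
low = tracker.recommend(0.5)    # window becomes [0.5, 0.9, 0.5]
```

Because the threshold follows the recent distribution of matching degrees, the same degree can be recommended in a window of low values and rejected in a window of high values.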
This embodiment integrates the behavior data, basic data and predicted value indexes of intellectual property through big data technology, providing a data-driven solution for intellectual property operation management. Intellectual property value is predicted with an SVR model, keywords are extracted with TF-IDF technology, and the comprehensive matching degree is calculated from cosine similarity and Jaccard similarity, so that accurate intellectual property recommendations are provided to users. The recommendation threshold is adjusted dynamically by a sliding window method, so that the system adapts in real time to changing data and user needs, improving matching accuracy and recommendation effectiveness;
In addition, keyword matching relies not only on text content but also on the value indexes of the intellectual property, which improves the relevance of the matching results. The dynamic threshold adjustment mechanism lets the system adjust its recommendation criteria flexibly as the actual data change, avoiding the matching bias caused by static rules and further optimizing user experience and recommendation efficiency.
Example 2
Referring to Fig. 2, for details not described in this embodiment, see the description of Embodiment 1. This embodiment provides an intellectual property operation management method based on big data, comprising:
Step S1, collecting behavior data and basic data of intellectual property;
Step S2, performing missing-value and outlier processing on the behavior data of the intellectual property to obtain a preprocessed data set;
Step S3, inputting the preprocessed data set into the SVR model and predicting the value index data of the intellectual property;
Step S4, extracting, converting and loading the intellectual property basic data, the preprocessed data set and the value index data using an ETL tool, extracting keyword group B from the application name using TF-IDF technology, and storing keyword group B as a new field in a target database;
Step S5, obtaining keyword A from the text input by the user using TF-IDF technology, and matching it with keyword group B to obtain comprehensive matching degree data;
Step S6, setting a recommendation threshold based on the comprehensive matching degree data, and determining whether to recommend.
Example 3
This embodiment discloses an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the operation of the intellectual property operation management system based on big data.
Since the electronic device described in this embodiment is the electronic device used to implement the intellectual property operation management system based on big data of the embodiments of the present application, a person skilled in the art can, on the basis of that system, understand the specific implementation of the electronic device and its various modifications; how the electronic device implements the method of the embodiments of the present application is therefore not described in detail here. Any electronic device used by a person skilled in the art to implement the intellectual property operation management system based on big data of the embodiments of the present application falls within the scope of protection of the present application.
The above formulas are all dimensionless formulas used for numerical calculation; they were obtained by collecting a large amount of data and performing software simulation so as to approximate the real situation as closely as possible, and the preset parameters and thresholds in the formulas are set by those skilled in the art according to the actual situation.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples; all technical solutions belonging to the concept of the present invention fall within the protection scope of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also intended to fall within the protection scope of the present invention.

Claims (10)

1. A big data-based intellectual property operation management system, characterized by comprising:
a data acquisition module, which acquires behavior data and basic data of intellectual property;
a data processing module, which performs missing-value and outlier processing on the behavior data of the intellectual property to obtain a preprocessed data set;
a prediction module, which predicts the value index data of the intellectual property from the preprocessed data set using an SVR model;
a data storage module, which uses an ETL tool to extract, convert and load the basic data, the preprocessed data set and the value index data of the intellectual property, processes the application name using TF-IDF technology to extract keyword group B, and stores keyword group B as a new field in a target database;
a keyword matching module, which obtains keyword A from the user input text using TF-IDF technology and matches it with keyword group B to obtain comprehensive matching degree data;
a recommendation module, which sets a recommendation threshold based on the comprehensive matching degree data and determines whether to make a recommendation.

2. The big data-based intellectual property operation management system according to claim 1, characterized in that the behavior data of the intellectual property includes a click count feature, a favorite count feature, a download count feature and a citation count feature;
the basic data includes the application name, application number, application date, validity period and applicant name.

3. The big data-based intellectual property operation management system according to claim 2, characterized in that the preprocessed data set is obtained as follows:
for the intellectual property behavior data, a data set is formed using a stacking method, missing values in the data set are processed using the forward filling method and the backward filling method to obtain a missing-value-processed data set, outliers are then identified and processed using an isolation forest model, and the processed data is standardized to obtain the preprocessed data set;
the isnull technique is used to traverse the data set composed of the click count feature, favorite count feature, download count feature and citation count feature data and to identify missing values; for each missing value, the forward data value is obtained using the forward filling method, the backward data value is obtained using the backward filling method, the average of the forward data value and the backward data value is calculated, and the missing value is filled with the average.

4. The big data-based intellectual property operation management system according to claim 3, characterized in that the way of identifying and processing outliers using the isolation forest model includes:
initializing hyperparameters: the hyperparameters of the isolation forest are initialized;
selecting training subsets: SL trees are set, the training-subset size of each tree is set to mm, and one subset is drawn from the missing-value-processed data set for each round of training;
feature selection: split features are selected using multiple evaluation indexes;
split point selection: for the selected split feature, one split point is chosen at random, dividing the data into two subsets;
recursive splitting: feature selection and split-point selection are repeated on each subset, splitting the data into ever smaller subsets until the maximum tree depth SD is reached;
repeating the above process: M_L trees are constructed, each trained on a different subset of the data;
calculating the path length of each data point: in each tree, the path length $h_j(x_i)$ of the data point is calculated, i.e., the number of split nodes traversed from the root node to the data point, where $h_j(x_i)$ represents the path length of the i-th data point in the j-th tree;
calculating the average path length over all trees: for each data point $x_i$, the average path length over all trees is calculated as $E(h(x_i)) = \frac{1}{M_L}\sum_{j=1}^{M_L} h_j(x_i)$, where $h_j(x_i)$ represents the path length of data point $x_i$ on the j-th tree and $M_L$ represents the number of trees;
calculating outlier scores: the outlier score of each data point is evaluated from its average path length as $s(x_i) = 2^{-E(h(x_i))/c(m)}$, where $c(m)$ is a constant calculated from the number of samples m and represents the expected path length, with the formula $c(m) = 2(\ln(m-1) + \gamma) - \frac{2(m-1)}{m}$, where $\gamma$ is a constant;
stopping condition: iteration stops when the maximum depth of the tree is reached;
output result: an outlier threshold is set; when the outlier score of a data point does not exceed the threshold, the data point is normal, and when the outlier score exceeds the threshold, the data point is abnormal; abnormal data points are replaced and filled using the backward filling method to obtain the preprocessed data set.

5. The big data-based intellectual property operation management system according to claim 4, characterized in that the specific way of selecting split features using multiple evaluation indexes includes:
the information gain, Gini index and variance of each feature are calculated and normalized, weights for the information gain, Gini index and variance are set using the entropy weight method, the comprehensive score of each feature is calculated as $S_t = w_1 \cdot IG_t + w_2 \cdot (1 - G_t) + w_3 \cdot V_t$, and the feature with the highest comprehensive score $S_t$ is taken as the selected feature;
where $IG_t$, $G_t$ and $V_t$ respectively represent the normalized information gain, Gini index and variance of the t-th feature, and $w_1$, $w_2$ and $w_3$ respectively represent the weights of the information gain, Gini index and variance;
the entropy weight method calculates the weights as follows: the information entropy of each evaluation index is calculated as $E_z = -\frac{1}{\ln n}\sum_{tz=1}^{n} p_{z,tz} \ln p_{z,tz}$, where $E_z$ represents the entropy of the z-th evaluation index, $p_{z,tz}$ represents the probability distribution of the tz-th feature value of the z-th evaluation index, and $n$ represents the total number of feature values;
the redundancy is calculated from the information entropy $E_z$ as $d_z = 1 - E_z$;
the weight of each evaluation index is calculated from the redundancy as $w_z = d_z / \sum_{z'=1}^{3} d_{z'}$, where 3 represents the total number of evaluation indexes.

6. The big data-based intellectual property operation management system according to claim 5, characterized in that the specific way of setting the outlier threshold includes:
the outlier threshold is set using neighborhood distances: the mean $\mu$ and standard deviation $\sigma$ of the missing-value-processed data set are calculated, the coefficient of variation $CV = \sigma / \mu$ is calculated, and the K value is set from the number of samples in the data set, an adjustment coefficient and a smoothing factor;
with the K value determined, for each data point the Euclidean distances to its K nearest neighbors are calculated and averaged to obtain the average K-nearest-neighbor distance; the distance threshold is set empirically to 95%; when the average K-nearest-neighbor distance of a data point is greater than the distance threshold, the data point is an outlier.

7. The big data-based intellectual property operation management system according to claim 6, characterized in that the specific way of predicting the value index data of intellectual property from the preprocessed data set using the SVR model includes:
the sample set comprises J_o groups of samples, and each group of samples comprises a preprocessed data set and the corresponding intellectual property value index data;
initializing hyperparameters: a radial basis function is selected as the kernel function, combinations of the penalty coefficient, the epsilon parameter and the gamma parameter are defined, and the optimal hyperparameter combination is determined using a grid search method;
calculating the initial loss function: the initial loss function value of the model, $L = \frac{1}{2}\lVert w \rVert^2 + C\sum_{o=1}^{n2}(\xi_o + \xi_o^*)$, is calculated as the starting point of the optimization, where $\frac{1}{2}\lVert w \rVert^2$ represents the regularization term, $w$ represents the weight vector of the input features, $b$ represents the bias term, $\xi$ represents the error term, $C$ is the penalty coefficient, n2 represents the total number of training samples, o represents the sample index, $\xi_o$ represents the error of the o-th sample on the upper deviation, and $\xi_o^*$ represents the error of the o-th sample on the lower deviation;
iterative optimization: in each iteration the model parameters $w$ and $b$ are updated using the gradient descent method, and a new loss function value is calculated during the update;
checking the convergence condition: after each round of iteration, the current loss function value is compared with that of the previous round, and iteration stops when the change in loss between two adjacent iterations is smaller than a preset error threshold.

8. The big data-based intellectual property operation management system according to claim 7, characterized in that the specific way of processing the application name using TF-IDF technology to extract keyword group B includes:
application name preprocessing: the application name is segmented with the HanLP tool to obtain sub-words, stop words are removed, and uppercase letters in the text are converted to lowercase to obtain the preprocessed text;
calculating the frequency of occurrence of each sub-word in the preprocessed text: $TF(d) = \frac{n_d}{N_w}$, where $TF(d)$ represents the frequency of occurrence of word d, $n_d$ represents the number of times word d appears in the preprocessed text, and $N_w$ represents the total number of words in the preprocessed text;
calculating the inverse text frequency, with the formula $IDF(d) = \log\frac{N_t}{n_t(d)}$, where $N_t$ represents the total number of texts, $n_t(d)$ represents the number of texts containing sub-word d, and $IDF(d)$ represents the inverse text frequency of sub-word d;
calculating the TF-IDF value: the frequency of occurrence TF of each sub-word is multiplied by its inverse text frequency to obtain the TF-IDF value;
the sub-words are sorted by TF-IDF value and a threshold A_Q is set; a sub-word whose TF-IDF value is greater than the threshold is taken as a keyword, yielding keyword group B.

9. The big data-based intellectual property operation management system according to claim 8, characterized in that the specific way of obtaining keyword A from the user input text using TF-IDF technology and matching it with keyword group B to obtain the comprehensive matching degree data includes:
keyword group A is extracted from the user input text using TF-IDF technology;
the vectors of keyword group A and keyword group B are obtained, and the similarity of keyword group A and keyword group B is calculated using cosine similarity: $\cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\lVert \vec{A} \rVert \, \lVert \vec{B} \rVert}$, where $\vec{A}$ represents the vector of keyword group A, $\vec{B}$ represents the vector of keyword group B, $\vec{A} \cdot \vec{B}$ represents the inner product, and $\lVert \vec{A} \rVert$ and $\lVert \vec{B} \rVert$ represent the Euclidean norms;
the Jaccard similarity is calculated with the formula $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $|A \cap B|$ represents the size of the intersection of keyword group A and keyword group B, and $|A \cup B|$ represents the size of their union;
from the cosine similarity and the Jaccard similarity, combined with the intellectual property value index, the comprehensive matching degree is calculated: $Z = w_c \cdot \cos(\vec{A}, \vec{B}) + w_J \cdot J(A, B) + w_V \cdot V$, where $w_c$, $w_J$ and $w_V$ respectively represent the weights of the cosine similarity, the Jaccard similarity and the intellectual property value index, and $V$ represents the intellectual property value index.

10. The big data-based intellectual property operation management system according to claim 9, characterized in that the specific way of setting a recommendation threshold based on the comprehensive matching degree data and determining whether to make a recommendation includes:
the threshold is set using a sliding window method, with a window of fixed length C_K;
the mean and standard deviation of the comprehensive matching degrees in the window are calculated, and the initial recommendation threshold is set: $T = \mu + k \cdot \sigma$, where $\mu$ represents the mean of the comprehensive matching degrees in the window, $\sigma$ represents the standard deviation of the comprehensive matching degrees in the window, and $k$ represents the adjustment coefficient;
when a new comprehensive matching degree is received, the window slides forward: the oldest comprehensive matching degree is removed and the new one is added to obtain the updated window, and the mean and standard deviation are recalculated to obtain the recommendation threshold $T' = \mu' + k \cdot \sigma'$, where $\mu'$ represents the mean of the updated window and $\sigma'$ represents the standard deviation of the updated window;
when the comprehensive matching degree reaches the recommendation threshold, the intellectual property is recommended: its matching degree is high and it meets the user's needs;
when the comprehensive matching degree falls below the recommendation threshold, the intellectual property is not recommended: its matching degree is low and it does not meet the user's needs.
CN202411807543.4A 2024-12-10 2024-12-10 Intellectual property operation management system based on big data Pending CN119741156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411807543.4A CN119741156A (en) 2024-12-10 2024-12-10 Intellectual property operation management system based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411807543.4A CN119741156A (en) 2024-12-10 2024-12-10 Intellectual property operation management system based on big data

Publications (1)

Publication Number Publication Date
CN119741156A (en) 2025-04-01

Family

ID=95126067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411807543.4A Pending CN119741156A (en) 2024-12-10 2024-12-10 Intellectual property operation management system based on big data

Country Status (1)

Country Link
CN (1) CN119741156A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120031046A (en) * 2025-04-23 2025-05-23 苏州元脑智能科技有限公司 Text semantic segmentation method, device, equipment, medium and product

Similar Documents

Publication Publication Date Title
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN109829104A (en) Pseudo-linear filter model information search method and system based on semantic similarity
CN116756303B (en) Automatic generation method and system for multi-topic text abstract
CN110347796A (en) Short text similarity calculating method under vector semantic tensor space
CN119719312B (en) Intelligent government affair question-answering method, device, equipment and storage medium
CN110442702A (en) Searching method, device, readable storage medium storing program for executing and electronic equipment
CN117494815B (en) File-oriented credible large language model training and reasoning method and device
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
CN119741156A (en) Intellectual property operation management system based on big data
CN110851584A (en) Accurate recommendation system and method for legal provision
CN119106322A (en) User grouping method, device, equipment, storage medium and program product
CN115062151A (en) A text feature extraction method, text classification method and readable storage medium
CN118585609A (en) Multidimensional Data Driven Document Search System
CN119578559B (en) Intelligent agent automatic configuration method and system based on large model, knowledge base and tools
CN112149424A (en) Semantic matching method, apparatus, computer equipment and storage medium
CN119578422A (en) A method and system for building and expanding a social network of successful customers
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
JP5542732B2 (en) Data extraction apparatus, data extraction method, and program thereof
CN113571198B (en) Conversion rate prediction method, conversion rate prediction device, conversion rate prediction equipment and storage medium
JP2023528985A (en) Computer-implemented method for searching large-scale unstructured data with feedback loop and data processing apparatus or system therefor
CN114595305A (en) Intent recognition method based on semantic indexing
CN112667666A (en) SQL operation time prediction method and system based on N-gram
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
CN120234427B (en) Electronic government platform management method and system based on cloud data
CN120257966B (en) Human resource data management method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination