CN114385775B - Sensitive word recognition method based on big data - Google Patents

Info

Publication number
CN114385775B
Authority
CN
China
Prior art keywords
sensitive
word
words
text
trie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111636920.9A
Other languages
Chinese (zh)
Other versions
CN114385775A (en)
Inventor
周洁琴
周金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Inspector Intelligent Technology Co ltd
Original Assignee
Nanjing Inspector Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Inspector Intelligent Technology Co ltd
Priority to CN202111636920.9A
Publication of CN114385775A
Application granted
Publication of CN114385775B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sensitive word recognition method based on big data, which comprises the following steps. Step 1: collect text data using crawler software, label the text data as sensitive or normal to obtain sensitive text D1 and normal text D2, label the sensitive words with a sensitivity category and level, and store them in a sensitive word list S. Step 2: discover new words through an N-gram model and expand the sensitive word list S. Step 3: apply deformation processing to each sensitive word in the sensitive word list S to obtain deformed sensitive words. Step 4: filter sensitive words based on a Trie built from the sensitive word list S and a BERT model. The method improves the accuracy and efficiency of sensitive word review and recognition.

Description

Sensitive word recognition method based on big data
Technical Field
The invention relates to the field of big data research, in particular to a natural language processing method, and specifically relates to a sensitive word recognition method based on big data.
Background
With the continuous development of the internet, people can see a large amount of text information through various internet platforms. Some of this information is sensitive, for example content with terrorist tendencies, and if it is not identified and controlled it can disturb social stability and harm the public interest. The quality of text information therefore needs to be controlled: sensitive information should be recognized and handled promptly so that published content contains no sensitive information and a healthy network environment is maintained.
In the process of implementing the present invention, the inventors found at least the following problems in the prior art. Conventional techniques mainly rely on rule matching: a sensitive word list is constructed, largely by manual collection, which is costly and inefficient; the list is then traversed and matched against each text, and any text in which a sensitive word is found must be submitted to a reviewer for manual review. This approach has three drawbacks. First, as the number of sensitive words and of their deformed variants grows, the word list grows and cyclic matching slows the search. Second, reviewers must still make the final judgment manually, so the manual workload is high. Third, the method only detects whether a sensitive word appears and ignores the context in which it appears, which easily causes false alarms.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a sensitive word recognition method based on big data, which improves the accuracy and efficiency of sensitive word review and recognition. The technical solution is as follows:
The invention provides a sensitive word recognition method based on big data, which mainly comprises the following steps:
Step 1, collecting text data by using crawler software, labeling the text data as sensitive or normal to obtain sensitive text D1 and normal text D2, labeling the sensitive words with a sensitivity category and level, and storing the sensitive words in a sensitive word list S.
Step 2, discovering new words through an N-gram model and expanding the sensitive word list S:
The sensitive text D1 from step 1 is segmented with an N-gram model: the original text is split into windows of length N to obtain a number of spliced words of length N. The word frequency of each spliced word is counted and its frequency P is calculated, and the spliced words whose frequency exceeds a set threshold a are selected as candidate words w': $P(w) = \mathrm{count}_w / N$, where $\mathrm{count}_w$ is the number of sensitive texts containing the spliced word w and N (in this formula) is the total number of sensitive texts.
The degree of solidification I(x; y) of each candidate word is calculated as $I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)}$, where P(x, y) is the probability that character x and character y co-occur in the candidate word, P(x) is the probability that character x occurs alone, and P(y) is the probability that character y occurs alone.
The degree of freedom H(w') of each candidate word w' is calculated as follows:
$H(w') = \min\left\{ -\sum_{w'_l \in s_l} P(w'_l \mid w') \log P(w'_l \mid w'),\; -\sum_{w'_r \in s_r} P(w'_r \mid w') \log P(w'_r \mid w') \right\}$
where $s_l$ is the set of left adjacent characters of candidate word w'; $s_r$ is the set of right adjacent characters of candidate word w'; $P(w'_l \mid w')$ is the conditional probability that the left adjacent character $w'_l$ appears given that the candidate word w' appears; and $P(w'_r \mid w')$ is the conditional probability that the right adjacent character $w'_r$ appears given that the candidate word w' appears.
Candidate words that satisfy both conditions, namely I(x; y) greater than the solidification threshold b and H(w') greater than the degree-of-freedom threshold c, are taken as new words. The sensitivity level of a new word is set to low-risk, its sensitivity category is the category of the sensitive text in which it was found, and the new word is stored in the sensitive word list S together with its sensitivity level and category.
Step 3, applying deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m'. The deformations include: inserting special characters in the middle of the sensitive word, replacing one or more characters of the sensitive word with pinyin, splitting one or more characters of the sensitive word into their components, and replacing one or more characters of the sensitive word with traditional Chinese characters.
After the deformation processing, each deformed sensitive word m' is stored in the sensitive word list S with the same sensitivity category and level as the original sensitive word m, and the sensitive word list S is stored in a database.
Step 4, filtering sensitive words based on a Trie built from the sensitive word list S and a BERT model:
A sensitive word Trie is generated from the sensitive words, and the text content to be checked is looked up in the Trie in text order to obtain all the sensitive words it contains.
The sensitive text D1 and the normal text D2 are pooled and randomly divided into a training set and a test set, and a BERT model is trained on them. The Trie and the BERT model are then combined to recognize and filter sensitive words in the input text under detection.
Whether the input text contains sensitive words is determined with the Trie: a sensitive word Trie is generated from the sensitive word library, and Chinese matching is performed against it.
The matching result is then further judged with the BERT model:
If no sensitive word is contained, the text passes review directly;
If sensitive words are contained, the BERT model judges whether the text is sensitive. If the text is judged sensitive and a contained sensitive word is high-risk, the text is filtered out directly; if the text is judged sensitive and the contained sensitive words are low-risk, the sensitive words in the text are replaced with 'x'; if the text is judged normal, it is sent for manual review.
Preferably, in step 1, the sensitive words are labeled with a category and a level, specifically: the sensitive words are divided into five categories C1, C2, C3, C4 and C5 and into two levels, high-risk and low-risk.
Preferably, in step 4, Chinese matching against the sensitive word Trie is performed as follows: the input text is split into single characters using a regular expression; the first character of the text is looked up from the root node; if it is not found, the next character is looked up from the root node, and so on until a matching character is found; once a matching character is found, the next character is looked up among the child nodes of the node for that character, and so on until a leaf node is reached. After the traversal is complete, all matched characters are returned.
Preferably, the method further comprises continuously updating the training set and retraining the BERT model on the updated training set.
Further, according to the result of the manual review in step 4, if the BERT model judged a text to be normal but the manual review judged it to be sensitive, the text is added to the training data and the BERT model is retrained.
Compared with the prior art, one of the above technical solutions has the following beneficial effects: expanding the sensitive word list through new word discovery allows new sensitive words to be mined automatically and reduces manual effort; the screening method based on the Trie and the BERT model speeds up review, reduces the number of manual interventions and improves the accuracy of sensitive word recognition.
Drawings
Fig. 1 is a schematic diagram of a sensitive word Trie provided in an embodiment of the present disclosure.
Detailed Description
In order to clarify the technical scheme and working principle of the present invention, the following describes the embodiments of the present disclosure in further detail with reference to the accompanying drawings. Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The terms "step 1," "step 2," "step 3," and the like in the description and in the claims and in the foregoing drawings, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those described herein.
The embodiment of the disclosure provides a sensitive word recognition method based on big data, which mainly comprises the following steps:
Step 1, collecting text data by using crawler software, labeling the text data as sensitive or normal to obtain sensitive text D1 and normal text D2, labeling the sensitive words with a sensitivity category and level, and storing the sensitive words in a sensitive word list S. Preferably, the sensitive words are divided into five categories C1, C2, C3, C4 and C5 and into two levels, high-risk and low-risk.
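The record kept for each entry of the sensitive word list S can be sketched as below. This is a minimal illustration, not the patented implementation: the names `Level`, `SensitiveWord` and `add_word`, the use of an in-memory dictionary, and the assignment of the example word to category C1 are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    HIGH = "high-risk"
    LOW = "low-risk"


# The five category labels named in the text; their meanings are not specified.
CATEGORIES = {"C1", "C2", "C3", "C4", "C5"}


@dataclass(frozen=True)
class SensitiveWord:
    word: str       # surface form of the sensitive word
    category: str   # one of C1..C5
    level: Level    # high-risk or low-risk


def add_word(vocab: dict, word: str, category: str, level: Level) -> None:
    """Store a word in the list S; an existing entry is left unchanged."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    vocab.setdefault(word, SensitiveWord(word, category, level))


S = {}
add_word(S, "高利贷", "C1", Level.HIGH)  # category chosen arbitrarily for illustration
```

Keeping the category and level on the entry itself means that new words from step 2 and deformed words from step 3 can be stored in the same structure.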
Step 2, discovering new words through an N-gram model and expanding the sensitive word list S:
The sensitive text D1 from step 1 is segmented with an N-gram model: the original text is split into windows of length N to obtain a number of spliced words of length N. For example, when N is 2, segmenting "有人放高利贷" ("someone is lending money at usurious interest") yields spliced words including "有人", "人放", "高利" and "利贷".
The word frequency of each spliced word is counted and its frequency P is calculated, and the spliced words whose frequency exceeds a set threshold a are selected as candidate words w': $P(w) = \mathrm{count}_w / N$, where $\mathrm{count}_w$ is the number of sensitive texts containing the spliced word w and N (in this formula) is the total number of sensitive texts.
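The segmentation and frequency filter described above can be sketched as follows. The function names `ngrams` and `candidate_words` and the toy corpus are hypothetical; the sketch assumes a character-level sliding window and counts each text at most once per spliced word, matching the definition of the count as a number of texts.

```python
from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    """All character windows of length n, in text order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_words(texts: list[str], n: int, a: float) -> dict[str, float]:
    """Spliced words whose text-level frequency exceeds threshold a."""
    doc_count = Counter()
    for t in texts:
        # A set is used so that a text contributes at most 1 to each count.
        doc_count.update(set(ngrams(t, n)))
    total = len(texts)
    return {w: c / total for w, c in doc_count.items() if c / total > a}


corpus = ["有人放高利贷", "高利贷害人", "今天天气好"]  # toy data for illustration
candidates = candidate_words(corpus, n=2, a=0.5)
```

With this toy corpus only "高利" and "利贷" occur in more than half of the texts, so they are the only candidates kept.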
The degree of solidification I(x; y) of each candidate word is calculated as $I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)}$, where P(x, y) is the probability that character x and character y co-occur in the candidate word, P(x) is the probability that character x occurs alone, and P(y) is the probability that character y occurs alone. If I(x; y) is greater than the solidification threshold b, the candidate word w' composed of character x and character y satisfies one of the conditions for being a new word.
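As an illustration of the solidification measure, the sketch below computes the pointwise mutual information of a two-character candidate. Several choices here are assumptions rather than details from the text: probabilities are estimated from raw character and adjacent-pair counts over the corpus, the logarithm is base 2, and the function name `solidification` is hypothetical.

```python
import math
from collections import Counter


def solidification(texts: list[str], candidate: str) -> float:
    """Pointwise mutual information of a two-character candidate word."""
    if len(candidate) != 2:
        raise ValueError("this sketch only handles two-character candidates")
    x, y = candidate
    char_counts, pair_counts = Counter(), Counter()
    for t in texts:
        char_counts.update(t)  # counts every character of the text
        pair_counts.update(t[i:i + 2] for i in range(len(t) - 1))
    if pair_counts[candidate] == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = pair_counts[candidate] / sum(pair_counts.values())
    p_x = char_counts[x] / sum(char_counts.values())
    p_y = char_counts[y] / sum(char_counts.values())
    return math.log2(p_xy / (p_x * p_y))
```

A large positive value means the two characters appear together far more often than their individual frequencies would suggest, which is the sense in which the candidate is "solid".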
The degree of freedom H(w') of each candidate word w' is calculated as follows:
$H(w') = \min\left\{ -\sum_{w'_l \in s_l} P(w'_l \mid w') \log P(w'_l \mid w'),\; -\sum_{w'_r \in s_r} P(w'_r \mid w') \log P(w'_r \mid w') \right\}$
where $s_l$ is the set of left adjacent characters of candidate word w'; $s_r$ is the set of right adjacent characters of candidate word w'; $P(w'_l \mid w')$ is the conditional probability that the left adjacent character $w'_l$ appears given that the candidate word w' appears; and $P(w'_r \mid w')$ is the conditional probability that the right adjacent character $w'_r$ appears given that the candidate word w' appears.
If H(w') is greater than the degree-of-freedom threshold c, the candidate word has many distinct left and right neighbors and therefore qualifies as a new word with respect to the degree of freedom. The solidification threshold b and the degree-of-freedom threshold c are tuned according to the new words obtained.
Candidate words that satisfy both conditions, namely I(x; y) greater than the solidification threshold b and H(w') greater than the degree-of-freedom threshold c, are taken as new words. The sensitivity level of a new word is set to low-risk, its sensitivity category is the category of the sensitive text in which it was found, and the new word is stored in the sensitive word list S together with its sensitivity level and category.
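The degree of freedom and the combined new-word test can be sketched as follows. This is a hedged illustration: it assumes the left and right adjacency entropies are combined by taking the smaller of the two, uses a base-2 logarithm, and the names `freedom` and `is_new_word` are hypothetical.

```python
import math
from collections import Counter


def _entropy(counter: Counter) -> float:
    """Shannon entropy (base 2) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def freedom(texts: list[str], w: str) -> float:
    """Smaller of the left and right adjacency entropies of candidate w."""
    left, right = Counter(), Counter()
    for t in texts:
        start = t.find(w)
        while start != -1:
            if start > 0:
                left[t[start - 1]] += 1      # character just before w
            end = start + len(w)
            if end < len(t):
                right[t[end]] += 1           # character just after w
            start = t.find(w, start + 1)
    return min(_entropy(left), _entropy(right))


def is_new_word(i_xy: float, h_w: float, b: float, c: float) -> bool:
    """Both thresholds must be exceeded for a candidate to count as new."""
    return i_xy > b and h_w > c
```

Taking the minimum means a candidate that always follows the same character, or is always followed by the same character, scores low even if its other side is varied.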
Step 3, applying deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m'. The deformations include: ① inserting special characters in the middle of the sensitive word, for example inserting "#" into "高利贷" ("usury") to give a form such as "高利#贷"; ② replacing one or more characters of the sensitive word with pinyin, for example "高利贷" can yield "gaolidai", "gaoli贷" and the like; ③ splitting one or more characters of the sensitive word into their components, for example in "高利贷" only the character "贷" can be split, so the split deformation of "高利贷" is "高利代贝"; ④ replacing one or more characters of the sensitive word with traditional Chinese characters, for example, because the traditional forms of "高" and "利" are identical to their simplified forms, the only traditional-character deformation of "高利贷" is "高利貸".
After the deformation processing, each deformed sensitive word m' is stored in the sensitive word list S with the same sensitivity category and level as the original sensitive word m, and the sensitive word list S is stored in a database.
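The four deformation types can be sketched with small lookup tables. The tables below (`SPECIAL_CHARS`, `PINYIN`, `SPLIT`, `TRADITIONAL`) are hypothetical and cover only the characters of the running example; a real system would need full pinyin, component and simplified-to-traditional dictionaries. For brevity the sketch replaces one character at a time, plus the all-pinyin form, rather than enumerating every combination.

```python
# Hypothetical lookup tables covering only the example word.
SPECIAL_CHARS = ["*", "#"]
PINYIN = {"高": "gao", "利": "li", "贷": "dai"}
SPLIT = {"贷": "代贝"}          # character split into its components
TRADITIONAL = {"贷": "貸"}      # simplified -> traditional


def deform(word: str) -> set[str]:
    """Generate deformed variants of a sensitive word."""
    variants = set()
    # 1) insert a special character between adjacent characters
    for i in range(1, len(word)):
        for ch in SPECIAL_CHARS:
            variants.add(word[:i] + ch + word[i:])
    # 2)-4) replace a single character via pinyin, split or traditional form
    for table in (PINYIN, SPLIT, TRADITIONAL):
        for i, c in enumerate(word):
            if c in table:
                variants.add(word[:i] + table[c] + word[i + 1:])
    # pinyin for the whole word, when every character is in the table
    if all(c in PINYIN for c in word):
        variants.add("".join(PINYIN[c] for c in word))
    variants.discard(word)  # the original word is not a deformation of itself
    return variants


variants = deform("高利贷")
```

Each variant would then be stored in S with the category and level of the original word, as described above.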
Step 4, filtering sensitive words based on a Trie built from the sensitive word list S and a BERT model:
A sensitive word Trie is generated from the sensitive words, and the text content to be checked is looked up in the Trie in text order to obtain all the sensitive words it contains.
The sensitive text D1 and the normal text D2 are pooled and randomly divided into a training set and a test set, which are used to train the BERT model.
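The pooling and random split can be sketched as below. The split ratio, the fixed seed, the 1/0 label encoding and the function name `split_dataset` are assumptions for illustration; the text does not specify them, and the BERT fine-tuning itself is not shown.

```python
import random


def split_dataset(sensitive, normal, test_ratio=0.2, seed=0):
    """Pool labeled texts, shuffle them, and split into (train, test)."""
    # Label 1 marks sensitive text (D1), label 0 marks normal text (D2).
    samples = [(t, 1) for t in sensitive] + [(t, 0) for t in normal]
    rng = random.Random(seed)   # seeded so the split is reproducible
    rng.shuffle(samples)
    n_test = int(len(samples) * test_ratio)
    return samples[n_test:], samples[:n_test]


d1 = [f"sensitive text {i}" for i in range(5)]   # placeholder data
d2 = [f"normal text {i}" for i in range(5)]      # placeholder data
train_set, test_set = split_dataset(d1, d2)
```

Shuffling before splitting keeps both classes represented in the two sets on average; a stratified split would guarantee it.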
By combining the Trie and the BERT model, sensitive words in the input text under detection are recognized and filtered, which reduces the misjudgment rate.
Whether the input text contains sensitive words is determined with the Trie: a sensitive word Trie is generated from the sensitive word library, and Chinese matching is performed against it.
A sensitive word Trie is generated from the sensitive word library. Taking "高利贷" ("usury") as an example, the resulting Trie is shown in Fig. 1.
"高利贷" is split into the three characters "高", "利" and "贷";
The root node is checked for a child node for the character "高". If one exists, the nodes "利" and "贷" are added below it in turn; if not, a "高" node is first added under the root node and the remaining characters are then added below it.
Chinese matching against the sensitive word Trie is illustrated with "有人放高利贷" ("someone is lending money at usurious interest"): the text is split into single characters using a regular expression, and the first character "有" is looked up from the root node. If it is found, matching continues with the next character "人" below it; if not, the next character "人" is looked up from the root node, and so on until a matching character is found. When a matching character such as "高" is found, the node for the character "利" is looked up under the "高" node. After the traversal is complete, all matched characters are returned.
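The construction and matching procedure can be sketched as a small Trie class. The names `TrieNode`, `SensitiveTrie` and `find_all` are hypothetical. Two details are assumptions of the sketch: Python string iteration stands in for the regular-expression split into single characters, and an end-of-word flag is used instead of relying on leaf nodes, so that a sensitive word that is a prefix of a longer one is still reported.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a sensitive word ends at this node


class SensitiveTrie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        """Add a word, reusing existing nodes for shared prefixes."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def find_all(self, text: str) -> list[tuple[int, str]]:
        """Return (start index, word) for every sensitive word in text."""
        hits = []
        for start in range(len(text)):
            node = self.root              # restart from the root each time
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break                 # no word continues with this character
                if node.is_end:
                    hits.append((start, text[start:end + 1]))
        return hits


trie = SensitiveTrie(["高利贷", "高利#贷"])
matches = trie.find_all("有人放高利贷")
```

Lookup cost depends on the text length and the longest sensitive word rather than on the size of the word list, which is why the Trie avoids the slowdown of cyclic matching as the list grows.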
The matching result is then further judged with the BERT model:
If no sensitive word is contained, the text passes review directly;
If sensitive words are contained, the BERT model judges whether the text is sensitive. If the text is judged sensitive and a contained sensitive word is high-risk, the text is filtered out directly; if the text is judged sensitive and the contained sensitive words are low-risk, the sensitive words in the text are replaced with 'x'; if the text is judged normal, it is sent for manual review.
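The decision rules above can be sketched as a single function. The classifier is passed in as a plain callable standing in for the trained BERT model; the names `Action` and `audit`, the "high"/"low" level strings, and the choice to write one 'x' per masked character are assumptions of the sketch.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    BLOCK = "block"
    MASK = "mask"
    MANUAL_REVIEW = "manual review"


def audit(text, hits, levels, is_sensitive):
    """Decide what to do with a text.

    hits: sensitive words found by the Trie
    levels: dict mapping each word to "high" or "low"
    is_sensitive: callable text -> bool, standing in for the BERT model
    """
    if not hits:
        return Action.PASS, text              # no sensitive word: pass directly
    if not is_sensitive(text):
        return Action.MANUAL_REVIEW, text     # word found but model says normal
    if any(levels[w] == "high" for w in hits):
        return Action.BLOCK, None             # sensitive and high-risk: filter out
    masked = text
    for w in hits:
        masked = masked.replace(w, "x" * len(w))
    return Action.MASK, masked                # sensitive and low-risk: mask words
```

For example, `audit("a bad b", ["bad"], {"bad": "low"}, lambda t: True)` returns `(Action.MASK, "a xxx b")`.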
Preferably, the method further comprises continuously updating the training set and retraining the BERT model on the updated training set.
Preferably, according to the result of the manual review in step 4, if the BERT model judged a text to be normal but the manual review judged it to be sensitive, the text is added to the training data and the BERT model is retrained, which improves the accuracy of the BERT model's judgments.
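The feedback rule can be sketched as a filter over a review log. The log format (text, model label, human label), the label strings and the function name `collect_retraining_samples` are hypothetical; the retraining step itself is not shown.

```python
def collect_retraining_samples(review_log):
    """Select texts the model called normal but reviewers called sensitive.

    review_log: iterable of (text, model_label, human_label) tuples, where
    each label is "normal" or "sensitive". Returns (text, 1) pairs, with 1
    marking the corrected "sensitive" label.
    """
    return [
        (text, 1)
        for text, model_label, human_label in review_log
        if model_label == "normal" and human_label == "sensitive"
    ]


log = [
    ("text A", "normal", "sensitive"),   # model missed it: add to training data
    ("text B", "normal", "normal"),      # model and reviewer agree
    ("text C", "sensitive", "sensitive"),
]
new_samples = collect_retraining_samples(log)
```

Only the disagreements where the model was too lenient are collected, since those are the cases the text singles out for retraining.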
While the invention has been described above by way of example with reference to the accompanying drawings, it is to be understood that the invention is not limited to the particular embodiments described, but is capable of numerous insubstantial modifications of the inventive concepts and technical solutions; or the above conception and technical scheme of the invention are directly applied to other occasions without improvement and equivalent replacement, and all are within the protection scope of the invention.

Claims (4)

1. The sensitive word recognition method based on big data is characterized by mainly comprising the following steps of:
Step 1, collecting text data by using crawler software, labeling the text data as sensitive or normal to obtain sensitive text D1 and normal text D2, labeling the sensitive words with a sensitivity category and level, and storing the sensitive words in a sensitive word list S;
Step 2, discovering new words through an N-gram model and expanding the sensitive word list S:
segmenting the sensitive text D1 from step 1 with an N-gram model, the original text being split into windows of length N to obtain a number of spliced words of length N; counting the word frequency of each spliced word, calculating its frequency P, and selecting the spliced words whose frequency exceeds a set threshold a as candidate words w': $P(w) = \mathrm{count}_w / N$, wherein $\mathrm{count}_w$ is the number of sensitive texts containing the spliced word w and N (in this formula) is the total number of sensitive texts;
calculating the degree of solidification I(x; y) of each candidate word as $I(x; y) = \log \frac{P(x, y)}{P(x)\,P(y)}$, wherein P(x, y) is the probability that character x and character y co-occur in the candidate word, P(x) is the probability that character x occurs alone, and P(y) is the probability that character y occurs alone;
calculating the degree of freedom H(w') of each candidate word w' as $H(w') = \min\left\{ -\sum_{w'_l \in s_l} P(w'_l \mid w') \log P(w'_l \mid w'),\; -\sum_{w'_r \in s_r} P(w'_r \mid w') \log P(w'_r \mid w') \right\}$, wherein $s_l$ is the set of left adjacent characters of candidate word w'; $s_r$ is the set of right adjacent characters of candidate word w'; $P(w'_l \mid w')$ is the conditional probability that the left adjacent character $w'_l$ appears given that the candidate word w' appears; and $P(w'_r \mid w')$ is the conditional probability that the right adjacent character $w'_r$ appears given that the candidate word w' appears;
taking as new words the candidate words that satisfy both conditions, namely I(x; y) greater than the solidification threshold b and H(w') greater than the degree-of-freedom threshold c, setting the sensitivity level of each new word to low-risk and its sensitivity category to the category of the sensitive text in which it was found, and storing the new word in the sensitive word list S together with its sensitivity level and category;
Step 3, applying deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m', wherein the deformations include: inserting special characters in the middle of the sensitive word, replacing one or more characters of the sensitive word with pinyin, splitting one or more characters of the sensitive word into their components, and replacing one or more characters of the sensitive word with traditional Chinese characters;
after the deformation processing, storing each deformed sensitive word m' in the sensitive word list S with the same sensitivity category and level as the original sensitive word m, and storing the sensitive word list S in a database;
Step 4, filtering sensitive words based on a Trie built from the sensitive word list S and a BERT model:
generating a sensitive word Trie from the sensitive words, and looking up the text content to be checked in the Trie in text order to obtain all the sensitive words it contains;
pooling the sensitive text D1 and the normal text D2, randomly dividing them into a training set and a test set, training a BERT model, and combining the Trie and the BERT model to recognize and filter sensitive words in the input text under detection;
determining whether the input detection text contains sensitive words according to the Trie: generating a sensitive word Trie according to the sensitive word library, and performing Chinese matching according to the sensitive word Trie;
And further judging the matched result according to the BERT model:
If the sensitive word is not contained, directly passing the auditing;
If the sensitive words are contained, judging through a BERT model, judging whether the text is a sensitive text, if the text is a sensitive text, and if the contained sensitive words are high-risk, directly filtering the text; if the text is sensitive and the contained sensitive words are low-risk, replacing the sensitive words contained in the text by 'x'; if the text is judged to be normal text, performing manual auditing;
the Chinese matching is carried out according to the sensitive word Trie, specifically: splitting an input detection text into single words by using a regular expression, searching a first character of the detection text from a root node, and if the first character is not found, searching a next character from the root node until a character meeting the condition is found; if the character meeting the condition is found, continuing to search the node of the next character under the descendant node of the node corresponding to the character until the leaf node is reached, and returning all the matched characters after the circulation traversal is completed.
2. The big data-based sensitive word recognition method according to claim 1, wherein in step 1, sensitive words are classified and labeled according to grades, specifically: the sensitive words are divided into five categories of C1, C2, C3, C4 and C5, and the sensitive words are divided into two grades of high-risk sensitive words and low-risk sensitive words.
3. A method of big data based sensitive word recognition according to any of claims 1-2, further comprising continuously updating the training set, and updating the training BERT model based on the training set.
4. The big data based sensitive word recognition method of claim 3, wherein according to the result of the manual review in step 4, if the BERT model is judged to be a normal text, but the manual review is judged to be a sensitive text, the text is used as training data, and the BERT model is retrained.
CN202111636920.9A 2021-12-29 2021-12-29 Sensitive word recognition method based on big data Active CN114385775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111636920.9A CN114385775B (en) 2021-12-29 2021-12-29 Sensitive word recognition method based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111636920.9A CN114385775B (en) 2021-12-29 2021-12-29 Sensitive word recognition method based on big data

Publications (2)

Publication Number Publication Date
CN114385775A CN114385775A (en) 2022-04-22
CN114385775B CN114385775B (en) 2024-06-04

Family

ID=81199172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111636920.9A Active CN114385775B (en) 2021-12-29 2021-12-29 Sensitive word recognition method based on big data

Country Status (1)

Country Link
CN (1) CN114385775B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936553A (en) * 2022-06-22 2022-08-23 深圳市百川数安科技有限公司 Method, device and storage medium for extracting sensitive words in Internet community
CN115510500B (en) * 2022-11-18 2023-02-28 北京国科众安科技有限公司 Sensitive analysis method and system for text content
CN116028750B (en) * 2022-12-30 2024-05-07 北京百度网讯科技有限公司 Web page text review method and device, electronic device and medium
CN116562297B (en) * 2023-07-07 2023-09-26 北京电子科技学院 Chinese sensitive word deformation identification method and system based on HTRIE tree

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280130A (en) * 2017-12-22 2018-07-13 中国电子科技集团公司第三十研究所 A method of finding sensitive data in text big data
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 A kind of multi-level natural language anti-spam text method and system
CN112487149A (en) * 2020-12-10 2021-03-12 浙江诺诺网络科技有限公司 Text auditing method, model, equipment and storage medium
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
CN113486654A (en) * 2021-07-28 2021-10-08 焦点科技股份有限公司 Sensitive word bank construction and expansion method based on prior topic clustering
CN113822059A (en) * 2021-09-18 2021-12-21 北京云上曲率科技有限公司 Chinese sensitive text recognition method, device, storage medium and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014075174A1 (en) * 2012-11-19 2014-05-22 Imds America Inc. Method and system for the spotting of arbitrary words in handwritten documents

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
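An improved new-word synthesis algorithm based on multi-character mutual information and adjacency entropy; Wang Xin; Modern Computer (Professional Edition); 2018-04-15 (No. 11); pp. 7-11 *
A text classification method for recognizing Chinese sensitive web pages; Chen Xin et al.; Measurement & Control Technology; 2011-05-18 (No. 05); pp. 27-31, 40 *
A filtering algorithm for deformed sensitive words based on an improved Trie tree; Ye Qing; Modern Computer (Professional Edition); 2018-11-25 (No. 33); pp. 3-7 *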

Also Published As

Publication number Publication date
CN114385775A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN114385775B (en) Sensitive word recognition method based on big data
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN108733748B (en) Cross-border product quality risk fuzzy prediction method based on commodity comment public sentiment
CN111831824B (en) Public opinion positive and negative classification method
CN112131352A (en) A detection method and detection system for bad information in web page text
CN111966944B (en) A model construction method for multi-level user review security audit
CN110704715B (en) Network bullying detection method and system
CN108280130A (en) A method of finding sensitive data in text big data
CN106202372A (en) A method for sentiment classification of network text information
CN115828180A (en) A log anomaly detection method based on parsing optimization and temporal convolutional network
CN111626050B (en) Microblog emotion analysis method based on expression dictionary and emotion general knowledge
CN114265931B (en) Consumer policy perception analysis method and system based on big data text mining
CN118467985A (en) Training scoring method based on natural language
CN111190873A (en) A log pattern extraction method and system for cloud native system log training
CN116244446B (en) Social media cognitive threat detection method and system
CN118211171B (en) A target path mining method based on knowledge graph
CN118133221A (en) A privacy data classification and grading method
CN110928985A (en) Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm
CN115994531A (en) Multi-dimensional text comprehensive identification method
CN118467743B (en) Log anomaly detection method based on bidirectional parallel tree optimization log parsing
CN111597423B (en) Performance evaluation method and device of interpretable method of text classification model
Aslam et al. Web-AM: An efficient boilerplate removal algorithm for Web articles
CN114764440B (en) A main event deduplication method based on graph node selection and optimization
Wang et al. Sentiment detection and visualization of Chinese micro-blog

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant