CN114444475B - Word segmentation method and device based on corpus - Google Patents
Word segmentation method and device based on corpus
- Publication number
- CN114444475B
- Authority
- CN
- China
- Prior art keywords
- corpus
- word
- threshold
- difference
- character combinations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention relates to a corpus-based word segmentation method comprising the following steps: obtaining a target corpus; splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks; extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations appears in user comments from a preset time period; and segmenting the target corpus according to those counts. The invention segments words automatically, discovers new words faster, and avoids word segmentation ambiguity.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a word segmentation method and device based on corpus.
Background
Word segmentation is the basis of current text analysis; almost all applications based on text analysis require it.
Existing word segmentation approaches are lexicon-based, and qualifying keywords must be manually filtered and screened at regular intervals. Typical lexicon-based segmenters include mmseg, ik and paoding. Lexicon-based methods have the following drawbacks: the lexicon must be maintained manually, new words are discovered slowly, and segmentation ambiguity arises easily.
Disclosure of Invention
The invention aims to remedy at least one of the defects of the prior art and provides a corpus-based word segmentation method and device.
To achieve the above purpose, the invention adopts the following technical scheme:
Specifically, a corpus-based word segmentation method is provided, comprising the following steps:
acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
and segmenting the target corpus according to the respective numbers of occurrences.
Further, extracting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to have the form ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters.
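The extraction rule can be illustrated with a minimal sketch. The function name `char_combinations` and the `lengths` parameter are illustrative assumptions and do not appear in the patent text; the sketch only shows how sliding over a segment yields every run of 2 and 3 adjacent characters.

```python
def char_combinations(segment, lengths=(2, 3)):
    """Return every run of adjacent characters of the given lengths.

    `lengths` defaults to 2 and 3, matching the AB/BC and ABC structures.
    """
    combos = []
    for n in lengths:
        # Slide a window of width n across the segment.
        for i in range(len(segment) - n + 1):
            combos.append(segment[i:i + n])
    return combos


print(char_combinations("ABCD"))
```

For the segment `ABCD` this lists the two-character runs `AB`, `BC`, `CD` followed by the three-character runs `ABC`, `BCD`.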
Further, the user comments from the preset time period are all user comments made within the last half year.
Further, segmenting the target corpus according to the respective numbers of occurrences comprises analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the three-character combination ABC, the number of occurrences n1 of the combination AB and the number of occurrences n2 of the combination BC;
if the difference n1 - n2 is greater than a first threshold, AB is judged to be a word;
if the difference n2 - n1 is greater than the first threshold, A is judged to be a single-character word, and either BC or BCA' is judged to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is judged to be a word.
Further, the method also comprises:
when the difference n2 - n1 is greater than the first threshold, obtaining the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is judged to be a word, and if that difference is not lower than the second threshold, BC is judged to be a word;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is judged to be a word, and if the difference between the count of BC and the count of BCA' is not lower than the second threshold, ABC is judged to be a word.
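The two-threshold rule can be sketched as a single decision function. This is an illustration under stated assumptions, not the patented implementation: the function name `decide`, the dictionary keys, and the threshold values used in the usage note are hypothetical. One further assumption is made in the last branch, where ABC is compared with ABCA' in both directions, although the text above refers to BC and BCA' in its final clause.

```python
def decide(counts, first_threshold, second_threshold):
    """Return the words chosen for one window of characters A, B, C, A'.

    `counts` maps the labels "AB", "BC", "ABC", "BCA'" and "ABCA'" to their
    occurrence counts; missing labels are treated as 0 (an assumption).
    """
    n1 = counts.get("AB", 0)
    n2 = counts.get("BC", 0)
    if n1 - n2 > first_threshold:
        # AB clearly dominates: AB is a word.
        return ["AB"]
    if n2 - n1 > first_threshold:
        # BC clearly dominates: A stands alone; pick BC or the longer BCA'.
        gap = n2 - counts.get("BCA'", 0)
        return ["A", "BCA'" if gap < second_threshold else "BC"]
    # No clear winner: pick ABC or the longer ABCA' (assumed comparison).
    gap = counts.get("ABC", 0) - counts.get("ABCA'", 0)
    return ["ABCA'" if gap < second_threshold else "ABC"]
```

With hypothetical thresholds of 1000 and 50, `decide({"AB": 10530, "BC": 4442}, 1000, 50)` returns `["AB"]`, because 10530 - 4442 = 6088 exceeds the first threshold.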
Further, the method also comprises: when adjacent characters are extracted from each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, storing the character combinations in corresponding hash structures, and using those hash structures to quickly count how many times each character combination appears in the user comments from the preset time period.
The invention also provides a corpus-based word segmentation device, comprising:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus segment splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a quantity counting module, used for counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
and a word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences.
Further, the device also comprises:
a hash structure generation module, used for storing the character combinations in corresponding hash structures when adjacent characters are extracted from each sub-corpus segment according to the preset rule, and for using those hash structures to quickly count how many times each character combination appears in the user comments from the preset time period.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
The beneficial effects of the invention are as follows:
According to the invention, a mathematical model of word segmentation is built from the corpus: adjacent characters are extracted from each sub-corpus segment according to a preset rule to obtain character combinations of different lengths, the number of times each combination appears in user comments from a preset time period is counted, and the target corpus is segmented according to those counts. Words can thus be segmented automatically, new words are discovered faster, and word segmentation ambiguity is avoided.
Drawings
The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments illustrated in the accompanying drawings, in which like reference numerals designate like or similar elements. The drawings described below are merely some examples of the present disclosure; those of ordinary skill in the art can derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a flow chart of the corpus-based word segmentation method of the present invention.
Detailed Description
The conception, specific structure, and technical effects produced by the present application will be clearly and completely described below with reference to the embodiments and the drawings to fully understand the objects, aspects, and effects of the present application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
Referring to FIG. 1, embodiment 1 of the present invention proposes a corpus-based word segmentation method comprising the following steps:
Step 110, obtaining a target corpus;
Step 120, splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
Step 130, extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
Step 140, counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
Step 150, segmenting the target corpus according to the respective numbers of occurrences.
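Step 120 can be sketched with a regular-expression split. The punctuation set below is an assumption chosen for illustration (common Chinese and ASCII marks plus whitespace); the patent does not enumerate which punctuation marks are used, and the name `split_into_segments` is hypothetical.

```python
import re

# Assumed separator set: common full-width and ASCII punctuation, plus whitespace.
PUNCTUATION = r"[，。！？；：、,.!?;:\s]+"


def split_into_segments(corpus):
    """Split a corpus at punctuation marks and drop empty pieces."""
    return [piece for piece in re.split(PUNCTUATION, corpus) if piece]


print(split_into_segments("ab,cd。ef!"))
```

Each returned piece is one sub-corpus segment that can then be passed to the character-combination step.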
As a preferred embodiment of the present invention, extracting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to have the form ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters.
As a preferred embodiment of the present invention, the user comments from the preset time period are all user comments made within the last half year.
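Restricting the statistics to a preset time period can be sketched as a simple filter. The representation of a comment as a `(text, posted_at)` pair, the function name `recent_comments`, and the approximation of half a year as 182 days are all assumptions made for this illustration.

```python
from datetime import datetime, timedelta


def recent_comments(comments, now, window_days=182):
    """Keep the text of comments posted within the last `window_days` days.

    `comments` is an iterable of (text, posted_at) pairs, where posted_at
    is a datetime; 182 days approximates half a year (an assumption).
    """
    cutoff = now - timedelta(days=window_days)
    return [text for text, posted_at in comments if posted_at >= cutoff]
```

A shorter or longer window can be selected through `window_days` without changing the rest of the pipeline.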
As a preferred embodiment of the present invention, segmenting the target corpus according to the respective numbers of occurrences comprises:
analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the three-character combination ABC, the number of occurrences n1 of the combination AB and the number of occurrences n2 of the combination BC;
if the difference n1 - n2 is greater than a first threshold, AB is judged to be a word;
if the difference n2 - n1 is greater than the first threshold, A is judged to be a single-character word, and either BC or BCA' is judged to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is judged to be a word.
The first threshold is defined by the user. As a general guideline, when one count exceeds the other by a factor of 5 or more, joining the characters can be considered more reasonable than separating them.
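The three-way judgment on the first threshold can be written compactly as follows. The symbol $T_1$ for the first threshold is introduced here for illustration only; $n_1$ and $n_2$ are the counts of AB and BC defined above.

```latex
\text{judgment}(n_1, n_2) =
\begin{cases}
  AB \text{ is a word}, & n_1 - n_2 > T_1, \\
  A \text{ is a single-character word, and } BC \text{ or } BCA' \text{ is a word}, & n_2 - n_1 > T_1, \\
  ABC \text{ or } ABCA' \text{ is a word}, & |n_1 - n_2| \le T_1.
\end{cases}
```

The three conditions are mutually exclusive and cover every pair of counts, so each ABC window receives exactly one judgment.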
In the present preferred embodiment, the corpus to be analyzed is given first: "it is considered that the burden is reduced not only for students."
The corpus is first split at punctuation marks, for example into the segments "everybody thinks" and "not only the work load of students is reduced".
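Building the mathematical model
Any three consecutive characters are denoted by the letters A, B and C.
The three characters ABC can be cut into two overlapping two-character combinations, AB and BC.
The corresponding counts obtained from the corpus statistics can be written as in Table 1.
| Character combination | Count |
|---|---|
| ABC | n0 |
| AB | n1 |
| BC | n2 |
TABLE 1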
Case one:
n1 is far greater than n2, which means AB is very likely a word.
Case two:
n1 is far smaller than n2, which means A is likely a single-character word, while BC is likely, but not necessarily, a word; it may instead be part of a subsequent word.
Case three:
If n1 and n2 do not differ significantly, ABC may be a word or part of a longer word. ABC cannot yet be regarded as a word; the following character must be added and the whole analyzed again.
Specifically:
Example analysis one: see Table 2
| Character combination | Count |
|---|---|
| Happy capital (ABC) | 2855 |
| Well-known (AB) | 10530 |
| Jiadu (BC) | 4442 |
TABLE 2
The corpus statistics show that the AB combination (10530 occurrences) appears far more often than the BC combination (4442 occurrences). AB ("everyone") can therefore be regarded as a word, which leaves the third character C ("all") as a single-character word.
That is, the three-character string ABC should be segmented into the word AB followed by the single character C.
Example analysis two: see Table 3
| Character combination | Count |
|---|---|
| Are all considered (ABC) | 101 |
| All Belief (AB) | 271 |
| Consider (BC) | 3547 |
TABLE 3
Similarly, in this corpus the BC combination "consider" (3547 occurrences) appears far more often than the AB combination (271 occurrences), so "consider" is more likely a word and the leading character "all" is more likely a single-character word. The analysis proceeds exactly as in example analysis one, which makes it convenient to process recursively in a program.
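The repeated window analysis of examples one and two can be sketched as a loop over a five-character string. The Latin letters P, Q, R, S, T are placeholders standing in for the five characters of the example string; the counts are those of Tables 2 and 3. The threshold of 1000, the function name `segment`, and the handling of the last one or two characters (emitted as one word) are assumptions, and the second-threshold refinement is omitted for brevity.

```python
# Counts from Tables 2 and 3, keyed by placeholder letters.
COUNTS = {
    "PQR": 2855,   # ABC in Table 2
    "PQ": 10530,   # AB in Table 2
    "QR": 4442,    # BC in Table 2
    "RST": 101,    # ABC in Table 3
    "RS": 271,     # AB in Table 3
    "ST": 3547,    # BC in Table 3
}


def segment(text, counts, first_threshold):
    """Segment `text` by applying the AB/BC comparison window by window."""
    words = []
    i = 0
    while len(text) - i >= 3:
        n1 = counts.get(text[i:i + 2], 0)      # count of AB
        n2 = counts.get(text[i + 1:i + 3], 0)  # count of BC
        if n1 - n2 > first_threshold:
            words.append(text[i:i + 2])        # case one: AB is a word
            i += 2
        elif n2 - n1 > first_threshold:
            words.append(text[i])              # case two: A stands alone
            i += 1
        else:
            words.append(text[i:i + 3])        # case three: keep ABC together
            i += 3
    if i < len(text):
        words.append(text[i:])                 # assumed: the tail is one word
    return words


print(segment("PQRST", COUNTS, first_threshold=1000))
```

The first window selects `PQ` (10530 - 4442 = 6088), the second window splits off `R` (3547 - 271 = 3276), and the remaining `ST` is emitted as the final word.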
Example analysis results:
"everyone considers" should be segmented into "everyone", "all" and "consider". The word segmentation result is correct.
Next, another example is taken from the same corpus.
Example analysis three (4 characters): see Table 4
| Character combination | Count |
|---|---|
| Not only (ABC) | 387 |
| Not only (AB) | 977 |
| Only (BC) | 1302 |
TABLE 4
Here the counts of AB (977) and BC (1302) do not differ greatly, and the count of the three-character combination ABC (387) is not small either, so the subsequent character must be brought into the analysis. According to the corpus above,
the string is recombined as the four-character string "not just". It can then be analyzed again by merging any two adjacent characters into one unit, so that the string once more fits the ABC model structure; see Table 5.
TABLE 5
Final analysis result
From the above analysis,
the AB grouping is more suitable than the BC grouping; that is, "not just" is more likely a single word. The word segmentation is correct.
As a preferred embodiment of the invention, the method further comprises:
when the difference n2 - n1 is greater than the first threshold, obtaining the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is judged to be a word, and if that difference is not lower than the second threshold, BC is judged to be a word;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is judged to be a word, and if the difference between the count of BC and the count of BCA' is not lower than the second threshold, ABC is judged to be a word.
As a preferred embodiment of the present invention, the method further comprises: when adjacent characters are extracted from each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, storing the character combinations in corresponding hash structures, and quickly counting the character combinations using those hash structures.
In the present preferred embodiment,
segmenting the corpus by traversal matching is a very slow process. Building the corpus into a hash structure first speeds up segmentation to a level comparable with lexicon-based analysis. Adjacent characters of the corpus are combined into character combinations. Taking the same corpus as an example,
"not only the student homework is reduced",
character combinations with lengths from 2 to 4 are generated from the corpus; see Table 6.
TABLE 6
Generating the combinations in this way for the entire corpus produces the hash structure, from which the counts can be obtained very quickly.
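The hash structure can be sketched with Python's `collections.Counter`, which is a dictionary (hash table) keyed by character combination. The function name `build_index` and its parameters are illustrative assumptions; the inputs are assumed to be segments already split at punctuation marks.

```python
from collections import Counter


def build_index(segments, min_len=2, max_len=4):
    """Count every run of adjacent characters of length min_len..max_len."""
    index = Counter()
    for segment in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(segment) - n + 1):
                index[segment[i:i + n]] += 1
    return index


index = build_index(["ABCD", "ABC"])
print(index["AB"], index["ABCD"], index["ZZ"])
```

Building the index costs one pass over the corpus; afterwards each count is a single average constant-time lookup instead of a scan of all comments, and a combination that never occurs simply yields 0.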
The invention also provides a corpus-based word segmentation device, comprising:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus segment splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a quantity counting module, used for counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
and a word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences.
As a preferred embodiment of the invention, the device further comprises:
a hash structure generation module, used for storing the character combinations in corresponding hash structures when adjacent characters are extracted from each sub-corpus segment according to the preset rule, and for using those hash structures to quickly count how many times each character combination appears in the user comments from the preset time period.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiment by means of a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal or a software distribution medium. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate under the legislation and patent practice of a given jurisdiction; for example, in certain jurisdictions the computer-readable medium does not include electrical carrier signals and telecommunications signals.
While the present invention has been described in considerable detail and with particularity with respect to several embodiments, it is not intended to be limited to any such detail or embodiment. Rather, it is to be construed by reference to the appended claims, interpreted broadly in view of the prior art, so as to effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventors for the purpose of providing a useful description; insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents of it.
The above are merely preferred embodiments of the present invention, and the invention is not limited to them. Any solution that achieves the technical effects of the present invention by the same means falls within its scope of protection. Various modifications and variations of the technical solution and/or the embodiments are possible within the scope of the invention.
Claims (6)
1. A corpus-based word segmentation method, characterized by comprising the following steps:
acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
segmenting the target corpus according to the respective numbers of occurrences;
wherein extracting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to have the form ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters;
wherein segmenting the target corpus according to the respective numbers of occurrences comprises:
analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the three-character combination ABC, the number of occurrences n1 of the combination AB and the number of occurrences n2 of the combination BC;
if the difference n1 - n2 is greater than a first threshold, judging AB to be a word;
if the difference n2 - n1 is greater than the first threshold, judging A to be a single-character word, and either BC or BCA' to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, judging either ABC or ABCA' to be a word;
the method further comprising:
when the difference n2 - n1 is greater than the first threshold, obtaining the count of BC and the count of BCA', judging BCA' to be a word if the difference between the count of BC and the count of BCA' is lower than a second threshold, and judging BC to be a word if the difference between the count of BC and the count of BCA' is not lower than the second threshold;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA', judging ABCA' to be a word if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, and judging ABC to be a word if the difference between the count of BC and the count of BCA' is not lower than the second threshold.
2. The corpus-based word segmentation method according to claim 1, wherein the user comments from the preset time period are all user comments made within the last half year.
3. The corpus-based word segmentation method according to claim 1, further comprising: when adjacent characters are extracted from each sub-corpus segment according to the preset rule to obtain the character combinations of different lengths, storing the character combinations in corresponding hash structures, and using those hash structures to quickly count how many times each character combination appears in the user comments from the preset time period.
4. A corpus-based word segmentation device, characterized by comprising:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus segment splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for extracting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a quantity counting module, used for counting the number of times each of the character combinations of different lengths appears in user comments from a preset time period;
a word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences;
wherein extracting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to have the form ABCA'B'C' and is divided into runs of 2 adjacent characters and runs of 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters;
wherein segmenting the target corpus according to the respective numbers of occurrences comprises:
analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the three-character combination ABC, the number of occurrences n1 of the combination AB and the number of occurrences n2 of the combination BC;
if the difference n1 - n2 is greater than a first threshold, judging AB to be a word;
if the difference n2 - n1 is greater than the first threshold, judging A to be a single-character word, and either BC or BCA' to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, judging either ABC or ABCA' to be a word;
and further comprising:
when the difference n2 - n1 is greater than the first threshold, obtaining the count of BC and the count of BCA', judging BCA' to be a word if the difference between the count of BC and the count of BCA' is lower than a second threshold, and judging BC to be a word if the difference between the count of BC and the count of BCA' is not lower than the second threshold;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA', judging ABCA' to be a word if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, and judging ABC to be a word if the difference between the count of BC and the count of BCA' is not lower than the second threshold.
5. The corpus-based word segmentation device according to claim 4, further comprising:
a hash structure generation module, used for storing the character combinations in corresponding hash structures when adjacent characters are extracted from each sub-corpus segment according to the preset rule, and for using those hash structures to quickly count how many times each character combination appears in the user comments from the preset time period.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of claims 1-3.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111333372.2A CN114444475B (en) | 2021-11-11 | 2021-11-11 | Word segmentation method and device based on corpus |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114444475A (en) | 2022-05-06 |
| CN114444475B (en) | 2025-01-03 |
Family
ID=81364001
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111333372.2A Active CN114444475B (en) | 2021-11-11 | 2021-11-11 | Word segmentation method and device based on corpus |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114444475B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115130472B (en) * | 2022-08-31 | 2023-02-21 | 北京澜舟科技有限公司 | Method, system and readable storage medium for segmenting subwords based on BPE |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9465791B2 (en) * | 2007-02-09 | 2016-10-11 | International Business Machines Corporation | Method and apparatus for automatic detection of spelling errors in one or more documents |
| CN106445906A (en) * | 2015-08-06 | 2017-02-22 | 北京国双科技有限公司 | Generation method and apparatus for medium-and-long phrase in domain lexicon |
| CN105117054B (en) * | 2015-08-12 | 2018-04-17 | 珠海优特物联科技有限公司 | A kind of recognition methods of handwriting input and system |
| CN111931491B (en) * | 2020-08-14 | 2023-11-14 | 中国工商银行股份有限公司 | Domain dictionary construction method and device |
Non-Patent Citations (1)
| Title |
|---|
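| Statistics-based dictionary-free word segmentation method; Fu Saixiang et al.; Journal of Guangxi Academy of Sciences; 2002-11-30; Vol. 18 (No. 4); pp. 252-254 * |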
Also Published As
| Publication number | Publication date |
|---|---|
| CN114444475A (en) | 2022-05-06 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||