CN114444475B - Word segmentation method and device based on corpus - Google Patents

Info

Publication number
CN114444475B
Authority
CN
China
Prior art keywords
corpus
word
threshold
difference
character combinations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111333372.2A
Other languages
Chinese (zh)
Other versions
CN114444475A (en
Inventor
李森和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Original Assignee
GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority to CN202111333372.2A
Publication of CN114444475A
Application granted
Publication of CN114444475B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a corpus-based word segmentation method comprising the following steps: obtaining a target corpus; splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks; intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations of different lengths appears in user comments of a preset time period; and segmenting the target corpus according to the respective numbers of occurrences. The invention segments words intelligently, discovers new words faster, and does not produce word segmentation ambiguity.

Description

Word segmentation method and device based on corpus
Technical Field
The invention relates to the field of artificial intelligence, in particular to a word segmentation method and device based on corpus.
Background
Word segmentation is the basis of current text analysis; almost all applications built on text analysis require it.
Existing word segmentation is dictionary-based, and qualified keywords must be filtered and screened manually at regular intervals. Typical dictionary-based segmenters include mmseg, ik and paoding. Dictionary-based methods have the following defects: the dictionary must be maintained manually, new words are discovered slowly, and word segmentation ambiguity arises easily.
Disclosure of Invention
The invention aims to remedy at least one of the defects of the prior art, and provides a corpus-based word segmentation method and device.
To achieve the above purpose, the invention adopts the following technical scheme.
Specifically, a corpus-based word segmentation method is provided, comprising the following steps:
acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
counting the number of times each of the character combinations of different lengths appears in user comments of a preset time period;
segmenting the target corpus according to the respective numbers of occurrences.
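The punctuation-based splitting step can be pictured with a minimal Python sketch. The source does not list which punctuation marks are used, so the `PUNCTUATION` set and the function name `split_into_segments` are assumptions made for illustration only.

```python
import re

# Assumed punctuation set (hypothetical); the source does not enumerate the marks.
PUNCTUATION = r"[，。！？；：、,.!?;:]+"

def split_into_segments(corpus: str) -> list[str]:
    """Split a target corpus into sub-corpus segments at punctuation marks."""
    parts = re.split(PUNCTUATION, corpus)
    # Drop empty pieces produced by leading or trailing punctuation.
    return [p.strip() for p in parts if p.strip()]
```

Each returned segment contains no punctuation, so later steps only ever combine characters that were adjacent within one clause.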
Further, intercepting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to be ABCA'B'C'... and is divided into runs of 2 adjacent characters and 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters.
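As an illustration of this rule, the following sketch enumerates every run of 2 and 3 adjacent characters in one segment. The function name `character_combinations` and its `lengths` parameter are hypothetical, not names from the source.

```python
def character_combinations(segment: str, lengths=(2, 3)) -> list[str]:
    """Return every run of adjacent characters of the given lengths."""
    combos = []
    for n in lengths:
        # Slide a window of width n across the segment.
        for i in range(len(segment) - n + 1):
            combos.append(segment[i:i + n])
    return combos

print(character_combinations("ABCD"))
# ['AB', 'BC', 'CD', 'ABC', 'BCD']
```

For the segment ABCD this yields the basic structures AB, BC and ABC, plus the same structures shifted one character to the right.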
Further, the user comments of the preset time period are the comments of all users within half a year.
Further, segmenting the target corpus according to the respective numbers of occurrences comprises: analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the ABC character combination, the number of occurrences n1 of the AB character combination, and the number of occurrences n2 of the BC character combination;
if n1 minus n2 is greater than a first threshold, AB is judged to be a word;
if n2 minus n1 is greater than the first threshold, A is judged to be a single-character word, and either BC or BCA' is judged to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is judged to be a word.
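The three-way comparison can be sketched as a small function. The name `classify` and the returned labels are hypothetical; the source leaves the first threshold to the user, so it is passed in as a parameter here.

```python
def classify(n1: int, n2: int, first_threshold: int) -> str:
    """Compare the AB count (n1) with the BC count (n2)."""
    if n1 - n2 > first_threshold:
        return "AB"      # AB is judged to be a word
    if n2 - n1 > first_threshold:
        return "A|BC"    # A stands alone; BC or BCA' is a word
    return "ABC"         # counts are close; ABC or ABCA' is a word
```

The second and third outcomes are still ambiguous between a shorter and a longer candidate, which is what the second threshold resolves.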
Further, the method also comprises:
when n2 minus n1 is greater than the first threshold, obtaining the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is judged to be a word, and otherwise BC is judged to be a word;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is judged to be a word, and otherwise ABC is judged to be a word.
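A minimal sketch of the second-threshold check follows. The function name `refine` and its return labels are hypothetical. The reading that the difference is taken as the shorter candidate's count minus the longer candidate's count is an assumption, though a natural one, since every occurrence of BCA' is also an occurrence of BC.

```python
def refine(count_short: int, count_long: int, second_threshold: int) -> str:
    """Choose between a candidate (BC or ABC) and its extension by A'.

    count_short: occurrences of the shorter candidate (BC or ABC).
    count_long:  occurrences of the extended candidate (BCA' or ABCA').
    """
    if count_short - count_long < second_threshold:
        # The short form almost always appears with A' attached.
        return "long"
    return "short"
```

A small difference means the shorter string rarely occurs without the extra character, so the longer string is the better word candidate.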
Further, the method also comprises: when intercepting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into corresponding hash structures, and quickly counting, from the formed hash structures, the number of times the character combinations appear in user comments of the preset time period.
The invention also provides a corpus-based word segmentation device, comprising:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus segment splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a quantity counting module, used for counting the number of times each of the character combinations of different lengths appears in user comments of a preset time period;
a word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences.
Further, the device also comprises:
a hash structure generation module, used for, when adjacent characters are intercepted from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into corresponding hash structures and quickly counting, from the formed hash structures, the number of times the character combinations appear in user comments of the preset time period.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described in any of the preceding claims.
The beneficial effects of the invention are as follows:
The method builds a mathematical word segmentation model over the corpus: adjacent characters are intercepted from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, the number of times each combination appears in user comments of a preset time period is counted, and the target corpus is segmented according to these counts. Words can thus be segmented intelligently, new words are discovered faster, and word segmentation ambiguity is not produced.
Drawings
The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments illustrated in the accompanying drawings, in which like reference numerals designate like or similar elements, and which, as will be apparent to those of ordinary skill in the art, are merely some examples of the present disclosure, from which other drawings may be made without inventive effort, wherein:
FIG. 1 is a flow chart of a word segmentation method based on corpus of the present invention;
Detailed Description
The conception, specific structure, and technical effects produced by the present application will be clearly and completely described below with reference to the embodiments and the drawings to fully understand the objects, aspects, and effects of the present application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
Referring to FIG. 1, embodiment 1 of the present invention proposes a corpus-based word segmentation method, including the following steps:
Step 110, obtaining a target corpus;
Step 120, splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
Step 130, intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
Step 140, counting the number of times each of the character combinations of different lengths appears in user comments of a preset time period;
Step 150, segmenting the target corpus according to the respective numbers of occurrences.
As a preferred embodiment of the present invention, intercepting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations of different lengths follows this rule:
each sub-corpus segment is assumed to be ABCA'B'C'... and is divided into runs of 2 adjacent characters and 3 adjacent characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are arbitrary single characters.
As a preferred embodiment of the present invention, the user comments of the preset time period are the comments of all users within half a year.
As a preferred embodiment of the present invention, segmenting the target corpus according to the respective numbers of occurrences includes:
analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the ABC character combination, the number of occurrences n1 of the AB character combination, and the number of occurrences n2 of the BC character combination;
if n1 minus n2 is greater than a first threshold, AB is judged to be a word;
if n2 minus n1 is greater than the first threshold, A is judged to be a single-character word, and either BC or BCA' is judged to be a word;
if the absolute value of the difference between n1 and n2 is not greater than the first threshold, either ABC or ABCA' is judged to be a word.
The first threshold is defined by the user; in general, with a threshold of 5 times or more, the combination can be considered more reasonable than the separation.
In this preferred embodiment, the corpus to be analyzed is given first: "Everyone thinks that reducing the burden is not just reducing students' homework."
The corpus is first split at punctuation marks, for example into "everyone thinks" and "reducing the burden is not just reducing students' homework".
Mathematical model creation
Each character is denoted by one of the three letters A, B and C.
The three characters ABC can then be cut into the two combinations AB and BC.
The counts obtained from corpus statistics (n0 for ABC, n1 for AB, n2 for BC, as defined above) can be written as in Table 1.
TABLE 1
Case one:
n1 is far greater than n2, so AB is very likely a word.
Case two:
n1 is far smaller than n2, so A is likely a single-character word, while BC is likely, but not necessarily, a word; it may be part of a subsequent word.
Case three:
n1 and n2 do not differ significantly, so ABC may be a word or part of a longer word. ABC cannot yet be taken as a word; the following character must be added and the whole analyzed.
Specifically:
Example analysis one: see Table 2
Character combination          Count
大家都 (ABC)                    2855
大家 (AB, "everyone")           10530
家都 (BC)                       4442
TABLE 2
Corpus statistics show that 大家 (AB) occurs far more often than 家都 (BC). 大家 ("everyone") can therefore be considered a word, and 都 ("all") becomes a single-character word. That is,
大家都 should be segmented as 大家 / 都.
Example analysis two: see Table 3
Character combination          Count
都认为 (ABC)                    101
都认 (AB)                       271
认为 (BC, "consider")           3547
TABLE 3
Similarly, in this corpus 认为 (BC) occurs far more often than 都认 (AB), so 认为 ("consider") is more likely a word and 都 ("all") a single-character word. The analysis proceeds as in example analysis one, and lends itself to recursive processing in a program.
Example analysis results:
大家都认为 ("everyone thinks") should be segmented into 大家 ("everyone"), 都 ("all") and 认为 ("consider"). The word segmentation result is correct.
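The two example analyses can be replayed with a simplified left-to-right pass. This is only a sketch: the function name `segment`, the threshold value of 1000, the handling of case three (keeping ABC together without looking at A') and the handling of a tail shorter than three characters are all assumptions. The Chinese strings are the combinations that the counts in Tables 2 and 3 appear to correspond to, which is likewise an assumption.

```python
def segment(text: str, counts: dict[str, int], first_threshold: int) -> list[str]:
    """Greedy pass applying the AB / BC count comparison at each position."""
    words = []
    i = 0
    while len(text) - i >= 3:
        ab = text[i:i + 2]
        bc = text[i + 1:i + 3]
        n1 = counts.get(ab, 0)
        n2 = counts.get(bc, 0)
        if n1 - n2 > first_threshold:
            words.append(ab)             # case one: AB is a word
            i += 2
        elif n2 - n1 > first_threshold:
            words.append(text[i])        # case two: A stands alone
            i += 1
        else:
            words.append(text[i:i + 3])  # case three, simplified: keep ABC
            i += 3
    if i < len(text):
        words.append(text[i:])           # short tail emitted as-is (assumption)
    return words

# Counts taken from Tables 2 and 3.
counts = {"大家都": 2855, "大家": 10530, "家都": 4442,
          "都认为": 101, "都认": 271, "认为": 3547}
print(segment("大家都认为", counts, first_threshold=1000))
# ['大家', '都', '认为']
```

At the first position 10530 - 4442 = 6088 exceeds the threshold, so AB is taken as a word; at the next position 3547 - 271 = 3276 exceeds it in the other direction, so the single character is split off and the remaining two characters form the last word.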
Next, another example is taken from the same corpus.
Example analysis three (4 characters): see Table 4
Character combination          Count
不仅仅 (ABC, "not only")        387
不仅 (AB, "not only")           977
仅仅 (BC, "only")               1302
TABLE 4
Here the counts of 不仅 (AB) and 仅仅 (BC) do not differ greatly, and the count of the three-character combination 不仅仅 is not small either, so the following character must be brought into the analysis. According to the corpus above,
the combination is extended to the four-character "not just" string. Any two adjacent characters can then be merged into a single unit, the ABC model rebuilt, and the analysis repeated; see Table 5.
TABLE 5
Final analysis result
From the above analysis,
the AB combination is more suitable than the BC combination, i.e. "not just" is more likely a word. The word segmentation is correct.
As a preferred embodiment of the invention, the method further comprises:
when n2 minus n1 is greater than the first threshold, obtaining the count of BC and the count of BCA'; if the difference between the count of BC and the count of BCA' is lower than a second threshold, BCA' is judged to be a word, and otherwise BC is judged to be a word;
when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the count of ABC and the count of ABCA'; if the difference between the count of ABC and the count of ABCA' is lower than the second threshold, ABCA' is judged to be a word, and otherwise ABC is judged to be a word.
As a preferred embodiment of the present invention, the method further includes, when intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths, forming the character combinations into corresponding hash structures, and performing rapid statistics on the character combinations according to the formed hash structures.
In this preferred embodiment,
segmenting the corpus by traversal matching would be a very slow process. Building the corpus into a hash structure first speeds up segmentation to a level comparable with dictionary-based analysis. The characters of the corpus are combined exhaustively. Taking the same corpus as an example,
"reducing the burden is not just reducing students' homework",
all combinations of adjacent characters with lengths from 2 to 4 are generated from the corpus. See Table 6.
TABLE 6
Generating these combinations over the whole corpus produces the hash structure, from which the counts can be obtained very quickly.
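A hash structure of this kind can be sketched with Python's `collections.Counter`, which is a dictionary subclass. The function name `build_count_table` and its parameters are hypothetical, and the inputs are assumed to have already been split at punctuation.

```python
from collections import Counter

def build_count_table(segments: list[str], min_len: int = 2, max_len: int = 4) -> Counter:
    """Count every run of adjacent characters of length min_len..max_len."""
    table = Counter()
    for seg in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(seg) - n + 1):
                table[seg[i:i + n]] += 1
    return table

table = build_count_table(["abcab"])
print(table["ab"], table["abc"], table["zz"])
# 2 1 0
```

Once the table is built in a single pass over the comments, looking up n0, n1 or n2 for any combination is an average constant-time dictionary access instead of a scan over the corpus.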
The invention also provides a corpus-based word segmentation device, comprising:
a target corpus acquisition module, used for acquiring a target corpus;
a sub-corpus segment splitting module, used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
a character combination acquisition module, used for intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;
a quantity counting module, used for counting the number of times each of the character combinations of different lengths appears in user comments of a preset time period;
a word segmentation module, used for segmenting the target corpus according to the respective numbers of occurrences.
As a preferred embodiment of the invention, the device further comprises,
The hash structure generation module is used for intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths, forming the character combinations into corresponding hash structures, and carrying out rapid statistics on the occurrence times of the character combinations in user comments of a preset time period according to the formed hash structures.
The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described in any of the preceding claims.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be expanded or restricted as required by the legislation and patent practice of a jurisdiction; for example, in some jurisdictions the computer-readable medium does not include electrical carrier signals and telecommunication signals.
While the present invention has been described in considerable detail and with particularity with respect to several described embodiments, it is not intended to be limited to any such detail or embodiment, but is to be construed, by reference to the appended claims, so as to provide the broadest possible interpretation of such claims in view of the prior art and thereby effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventors for the purpose of providing a useful description, although insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents thereof.
The above are merely preferred embodiments of the present invention, and the present invention is not limited to them; any solution that achieves the technical effects of the present invention by the same means shall fall within the protection scope of the present invention. Various modifications and variations of the technical solution and/or the embodiments are possible within the scope of the invention.

Claims (6)

1. The corpus-based word segmentation method is characterized by comprising the following steps of:
Acquiring a target corpus;
splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
intercepting adjacent characters from each sub-corpus according to a preset rule to obtain a plurality of character combinations with different lengths;
Counting the number of times of occurrence of a plurality of character combinations with different lengths in user comments of a preset time period respectively;
word segmentation is carried out on the target corpus according to the times of the respective occurrence;
Specifically, intercepting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations with different lengths follows this rule:
Each sub-corpus segment is assumed to be ABCA'B'C'..., and is divided according to the mode of adjacent 2 characters and adjacent 3 characters, so that a plurality of character combinations of 2 characters and 3 characters taking AB, BC and ABC as basic structures are obtained, wherein A, B, C is any single character;
Specifically, the target corpus is segmented according to the times of the respective occurrence, including,
Analyzing each basic structure to obtain the number n0 of the words of the ABC characters, the number n1 of the AB character combinations and the number n2 of the BC character combinations in a preset time period;
if the difference between n1 and n2 is greater than a first threshold, judging AB as a word;
If the difference between n2 and n1 is greater than a first threshold, judging that A is a single word, BC is a word or BCA' is a word;
If the absolute value of the difference between n1 and n2 is not greater than a first threshold, judging ABC as a word or ABCA' as a word;
the method further comprises,
When the difference between n2 and n1 is greater than a first threshold, acquiring the number of BC and the number of BCA', judging BCA' as a word if the difference between the number of BC and the number of BCA' is lower than a second threshold, and judging BC as a word if the difference between the number of BC and the number of BCA' is not lower than the second threshold;
when the absolute value of the difference between n1 and n2 is not greater than a first threshold, the number of ABC and the number of ABCA' are obtained, if the difference between the number of ABC and the number of ABCA' is lower than a second threshold, the ABCA' is judged to be a word, and if the difference between the number of BC and the number of BCA' is not lower than the second threshold, the ABC is judged to be a word.
2. The corpus-based word segmentation method according to claim 1, wherein the user comments of the preset time period are the comments of all users within half a year.
3. The corpus-based word segmentation method according to claim 1, further comprising, when intercepting adjacent characters of each sub-corpus according to a preset rule to obtain a plurality of character combinations with different lengths, forming the character combinations into corresponding hash structures, and carrying out rapid statistics on the number of times that the character combinations appear in user comments of a preset time period according to the formed hash structures.
4. A corpus-based word segmentation device, characterized by comprising:
The target corpus acquisition module is used for acquiring target corpus;
The sub-corpus segment splitting module is used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;
The character combination acquisition module is used for intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths;
The number statistics module is used for counting the number of times that a plurality of character combinations with different lengths respectively appear in user comments in a preset time period;
the word segmentation module is used for segmenting the target corpus according to the times of occurrence respectively;
Specifically, intercepting adjacent characters from each sub-corpus segment according to the preset rule to obtain a plurality of character combinations with different lengths follows this rule:
Each sub-corpus segment is assumed to be ABCA'B'C'..., and is divided according to the mode of adjacent 2 characters and adjacent 3 characters, so that a plurality of character combinations of 2 characters and 3 characters taking AB, BC and ABC as basic structures are obtained, wherein A, B, C is any single character;
Specifically, the target corpus is segmented according to the times of the respective occurrence, including,
Analyzing each basic structure to obtain the number n0 of the words of the ABC characters, the number n1 of the AB character combinations and the number n2 of the BC character combinations in a preset time period;
if the difference between n1 and n2 is greater than a first threshold, judging AB as a word;
If the difference between n2 and n1 is greater than a first threshold, judging that A is a single word, BC is a word or BCA' is a word;
If the absolute value of the difference between n1 and n2 is not greater than a first threshold, judging ABC as a word or ABCA' as a word;
further comprising,
When the difference between n2 and n1 is greater than a first threshold, acquiring the number of BC and the number of BCA', judging BCA' as a word if the difference between the number of BC and the number of BCA' is lower than a second threshold, and judging BC as a word if the difference between the number of BC and the number of BCA' is not lower than the second threshold;
when the absolute value of the difference between n1 and n2 is not greater than a first threshold, the number of ABC and the number of ABCA' are obtained, if the difference between the number of ABC and the number of ABCA' is lower than a second threshold, the ABCA' is judged to be a word, and if the difference between the number of BC and the number of BCA' is not lower than the second threshold, the ABC is judged to be a word.
5. The corpus-based word segmentation apparatus as set forth in claim 4, further comprising,
The hash structure generation module is used for intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths, forming the character combinations into corresponding hash structures, and carrying out rapid statistics on the occurrence times of the character combinations in user comments of a preset time period according to the formed hash structures.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of claims 1-3.
CN202111333372.2A 2021-11-11 2021-11-11 Word segmentation method and device based on corpus Active CN114444475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111333372.2A CN114444475B (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111333372.2A CN114444475B (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Publications (2)

Publication Number Publication Date
CN114444475A CN114444475A (en) 2022-05-06
CN114444475B true CN114444475B (en) 2025-01-03

Family

ID=81364001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111333372.2A Active CN114444475B (en) 2021-11-11 2021-11-11 Word segmentation method and device based on corpus

Country Status (1)

Country Link
CN (1) CN114444475B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130472B (en) * 2022-08-31 2023-02-21 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9465791B2 (en) * 2007-02-09 2016-10-11 International Business Machines Corporation Method and apparatus for automatic detection of spelling errors in one or more documents
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN105117054B (en) * 2015-08-12 2018-04-17 珠海优特物联科技有限公司 A kind of recognition methods of handwriting input and system
CN111931491B (en) * 2020-08-14 2023-11-14 中国工商银行股份有限公司 Domain dictionary construction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于统计的无词典分词方法;傅赛香 等;广西科学院学报;20021130;第18卷(第4期);第252页-254页 *

Also Published As

Publication number Publication date
CN114444475A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110874531B (en) Topic analysis method and device and storage medium
CN110457672B (en) Keyword determination method and device, electronic equipment and storage medium
CN110532369B (en) Question and answer pair generation method and device and server
JPH10187754A (en) Device and method for classifying document
CN102890698A (en) Method for automatically describing microblogging topic tag
CN108846431B (en) Video bullet screen emotion classification method based on improved Bayesian model
CN114444475B (en) Word segmentation method and device based on corpus
JP2019082841A (en) Generation program, generation method and generation device
CN110347934B (en) Text data filtering method, device and medium
CN106897258A (en) The computational methods and device of a kind of text otherness
Hewlett et al. Fully unsupervised word segmentation with BVE and MDL
CN105138631A (en) Knowledge base construction method and device
Akachar et al. ACSIMCD: A 2-phase framework for detecting meaningful communities in dynamic social networks
CN111611457B (en) Page classification method, device, equipment and storage medium
CN112597313A (en) Short text clustering method and device, electronic equipment and storage medium
CN110413985B (en) Related text segment searching method and device
CN115982346A (en) Question-answer library construction method, terminal device and storage medium
CN114036907A (en) A text data augmentation method based on domain features
CN115809319A (en) Question answering method and device
CN106547758B (en) Data binning method and device
JP2004341948A (en) Concept extraction system, concept extraction method, program, and storage medium
Damiran et al. Author Identification-An Experiment Based on Mongolian Literature Using Decision Trees
CN110413750A (en) The method and apparatus for recalling standard question sentence according to user's question sentence
CN115186651A (en) Analysis method, device, terminal equipment and medium for customer fault complaint content
CN106776529B (en) Business emotion analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant