CN114444475B

CN114444475B - Word segmentation method and device based on corpus

Info

Publication number: CN114444475B
Application number: CN202111333372.2A
Authority: CN
Inventors: 李森和
Original assignee: GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Current assignee: GUANGZHOU JIANHE NETWORK TECHNOLOGY CO LTD
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2025-01-03
Anticipated expiration: 2041-11-11
Also published as: CN114444475A

Abstract

The invention relates to a corpus-based word segmentation method comprising the following steps: obtaining a target corpus; splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks; intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths; counting the number of times each of the character combinations of different lengths appears in the user comments of a preset time period; and segmenting the target corpus according to the respective occurrence counts. The invention segments words intelligently, discovers new words faster, and does not produce word segmentation ambiguity.

Description

Word segmentation method and device based on corpus

Technical Field

The invention relates to the field of artificial intelligence, in particular to a word segmentation method and device based on corpus.

Background

Word segmentation is the basis of current text analysis; almost all applications built on text analysis require it.

Existing word segmentation approaches are based on a word lexicon, and qualified keywords must be manually filtered and screened at regular intervals. Typical lexicon-based segmenters include mmseg, ik and paoding. Lexicon-based methods have the following defects: the lexicon must be maintained manually, new words are discovered slowly, and word segmentation ambiguity arises easily.

Disclosure of Invention

The invention aims to at least solve one of the defects in the prior art and provides a word segmentation method and device based on corpus.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

Specifically, a corpus-based word segmentation method is provided, comprising the following steps:

acquiring a target corpus;

splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;

counting the number of times each of the character combinations of different lengths appears in the user comments of a preset time period;

and segmenting the target corpus according to the respective occurrence counts.

Further, specifically, the adjacent characters of each sub-corpus segment are intercepted according to the following preset rule to obtain the character combinations of different lengths:

Each sub-corpus segment is assumed to be ABCA'B'C'... and is divided by adjacent 2 characters and adjacent 3 characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are each an arbitrary single character.
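The interception rule can be illustrated with a minimal Python sketch. The function name `char_combinations` and the sample string are hypothetical and do not come from the patent; the sketch only assumes that "adjacent 2 characters and adjacent 3 characters" means every contiguous run of that length.

```python
def char_combinations(segment: str, lengths=(2, 3)) -> list[str]:
    """Return every run of adjacent characters of the requested lengths."""
    combos = []
    for n in lengths:
        # Slide a window of width n across the segment.
        for i in range(len(segment) - n + 1):
            combos.append(segment[i:i + n])
    return combos


# Hypothetical segment: letters stand in for single characters.
print(char_combinations("ABCD"))
```

For the four-character segment above, the sketch yields the 2-character runs AB, BC, CD followed by the 3-character runs ABC, BCD.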

Further, specifically, the user comments of the preset time period are the comments of all users within half a year.

Further, segmenting the target corpus according to the respective occurrence counts comprises analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the ABC character combination, the number of occurrences n1 of the AB character combination and the number of occurrences n2 of the BC character combination;

if the difference between n1 and n2 is greater than a first threshold, AB is judged to be a word;

if the difference between n2 and n1 is greater than the first threshold, A is judged to be a single-character word, and BC or BCA' is judged to be a word;

if the absolute value of the difference between n1 and n2 is not greater than the first threshold, ABC or ABCA' is judged to be a word.
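The three-way judgment on n1 and n2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify`, the string labels it returns and the numbers in the usage lines are hypothetical, and the choice between the shorter and the longer candidate (BC versus BCA', ABC versus ABCA') is left out.

```python
def classify(n1: int, n2: int, first_threshold: int) -> str:
    """Apply the first-threshold rule to the counts of AB (n1) and BC (n2)."""
    if n1 - n2 > first_threshold:
        return "AB"      # AB is judged to be a word
    if n2 - n1 > first_threshold:
        return "A|BC"    # A stands alone; BC (or BCA') is a word
    return "ABC"         # counts are close; ABC (or ABCA') is a word


# Illustrative counts only, not taken from the patent.
print(classify(100, 10, 50))
print(classify(10, 100, 50))
print(classify(60, 50, 50))
```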

Further, the method also comprises:

when the difference between n2 and n1 is greater than the first threshold, obtaining the number of BC and the number of BCA'; if the difference between the number of BC and the number of BCA' is lower than a second threshold, BCA' is judged to be a word, and if that difference is not lower than the second threshold, BC is judged to be a word;

when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the number of ABC and the number of ABCA'; if the difference between the number of ABC and the number of ABCA' is lower than the second threshold, ABCA' is judged to be a word, and if the difference between the number of BC and the number of BCA' is not lower than the second threshold, ABC is judged to be a word.
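A minimal sketch of the second-threshold comparison follows. The name `resolve_extension`, its return labels and the sample counts are hypothetical. The sketch assumes that the "difference" is the count of the shorter combination minus the count of its one-character extension, and that each branch compares a combination with its own extension (BC with BCA', ABC with ABCA').

```python
def resolve_extension(count_short: int, count_long: int,
                      second_threshold: int) -> str:
    """Choose between a combination and its one-character extension.

    count_short: occurrences of the shorter combination (e.g. BC or ABC)
    count_long:  occurrences of the extended one (e.g. BCA' or ABCA')
    """
    if count_short - count_long < second_threshold:
        # Nearly every occurrence of the short form continues into the
        # long form, so the long form is taken as the word.
        return "long"
    return "short"


# Illustrative counts only.
print(resolve_extension(100, 95, 10))
print(resolve_extension(100, 20, 10))
```

The intuition is that a small gap means the shorter combination rarely occurs without the following character, which suggests the longer string is the actual word.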

Further, the method also comprises: when intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into a corresponding hash structure, and using the formed hash structure to quickly count the number of times the character combinations appear in the user comments of the preset time period.
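One way to realize such a hash structure is a hash-table counter keyed by character combination. The sketch below uses Python's `collections.Counter`; the function name `build_count_table` and the sample comments are hypothetical, and punctuation splitting is omitted to keep the example short.

```python
from collections import Counter


def build_count_table(comments, lengths=(2, 3)) -> Counter:
    """Count every adjacent-character combination across all comments."""
    table = Counter()
    for comment in comments:
        for n in lengths:
            for i in range(len(comment) - n + 1):
                table[comment[i:i + n]] += 1
    return table


# Hypothetical comments: letters stand in for single characters.
table = build_count_table(["ABCD", "ABXY"])
print(table["AB"], table["BC"], table["ZZ"])
```

Each lookup is a single hash access, and a combination that never occurs simply reads as zero.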

The invention also provides a corpus-based word segmentation device, comprising the following modules:

The target corpus acquisition module is used for acquiring target corpus;

The sub-corpus segment splitting module is used for splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

The character combination acquisition module is used for intercepting adjacent characters according to a preset rule for each sub-corpus segment to obtain a plurality of character combinations with different lengths;

the quantity counting module is used for counting the number of times that a plurality of words with different lengths respectively appear in user comments of a preset time period;

and the word segmentation module is used for segmenting the target corpus according to the times of the respective occurrence.

Further, the device also comprises:

a hash structure generation module, used for forming the character combinations into a corresponding hash structure when adjacent characters are intercepted from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, and for using the formed hash structure to quickly count the number of times the character combinations appear in the user comments of the preset time period.
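The module layout can be pictured as one small class with a method per module. This is a rough sketch rather than the device itself: the class name `SegmentationDevice`, the method names, the punctuation set and the sample comment are all hypothetical, and only the first-threshold rule is implemented.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SegmentationDevice:
    first_threshold: int
    counts: Counter = field(default_factory=Counter)  # hash structure

    def acquire(self, source: str) -> str:
        """Target corpus acquisition module (here: pass the text through)."""
        return source

    def split(self, corpus: str) -> list[str]:
        """Sub-corpus segment splitting module (assumed punctuation set)."""
        return [s for s in re.split(r"[,.!?;，。！？；、]+", corpus) if s]

    def combinations(self, segment: str) -> list[str]:
        """Character combination acquisition module (lengths 2 and 3)."""
        return [segment[i:i + n] for n in (2, 3)
                for i in range(len(segment) - n + 1)]

    def count(self, comments: list[str]) -> None:
        """Quantity counting module, backed by the hash structure."""
        for comment in comments:
            for seg in self.split(comment):
                self.counts.update(self.combinations(seg))

    def judge(self, abc: str) -> str:
        """Word segmentation module, first-threshold rule only."""
        n1, n2 = self.counts[abc[:2]], self.counts[abc[1:3]]
        if n1 - n2 > self.first_threshold:
            return "AB"
        if n2 - n1 > self.first_threshold:
            return "A|BC"
        return "ABC"


device = SegmentationDevice(first_threshold=1)
device.count(["AB,AB,AB,BC"])  # hypothetical comment text
print(device.judge("ABC"))
```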

The invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described in any of the preceding claims.

The beneficial effects of the invention are as follows:

The invention builds a mathematical word segmentation model for the corpus: adjacent characters are intercepted from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, the number of times each combination appears in the user comments of a preset time period is counted, and the target corpus is segmented according to those counts. Word segmentation is thus performed intelligently, new words are discovered faster, and word segmentation ambiguity is not produced.

Drawings

The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments illustrated in the accompanying drawings, in which like reference numerals designate like or similar elements. The drawings described below are merely some examples of the present disclosure, and those of ordinary skill in the art may derive other drawings from them without inventive effort.

FIG. 1 is a flow chart of the corpus-based word segmentation method of the present invention.

Detailed Description

The conception, specific structure, and technical effects produced by the present application will be clearly and completely described below with reference to the embodiments and the drawings to fully understand the objects, aspects, and effects of the present application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The same reference numbers will be used throughout the drawings to refer to the same or like parts.

Referring to FIG. 1, Embodiment 1 of the present invention proposes a corpus-based word segmentation method, including the following steps:

Step 110, obtaining a target corpus;

Step 120, splitting the target corpus into a plurality of sub-corpus segments according to punctuation marks;

Step 130, intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths;

Step 140, counting the number of times each of the character combinations of different lengths appears in the user comments of a preset time period;

Step 150, segmenting the target corpus according to the respective occurrence counts.

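As a preferred embodiment of the present invention, specifically, the adjacent characters of each sub-corpus segment are intercepted according to the following preset rule to obtain the character combinations of different lengths: each sub-corpus segment is assumed to be ABCA'B'C'... and is divided by adjacent 2 characters and adjacent 3 characters, yielding 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are each an arbitrary single character.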

As a preferred embodiment of the present invention, specifically, the user comments of the preset time period are the comments of all users within half a year.

As a preferred embodiment of the present invention, specifically, segmenting the target corpus according to the respective occurrence counts comprises:

analyzing each basic structure to obtain, for the preset time period, the number of occurrences n0 of the ABC character combination, the number of occurrences n1 of the AB character combination and the number of occurrences n2 of the BC character combination.

The first threshold is defined by the user; in general, when the counts differ by a factor of 5 or more, combining the characters can be considered more reasonable than separating them.

In this preferred embodiment, the corpus to be analyzed is given first: "everyone considers that not only the student homework burden is reduced."

The corpus is first split according to punctuation marks, for example into "everyone considers" and "not only the student homework burden is reduced".
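Splitting on punctuation can be sketched with a regular expression. The punctuation set below (common ASCII marks plus full-width CJK marks) and the function name `split_segments` are assumptions; the patent does not list which marks are used. Letters stand in for the characters of the example sentence.

```python
import re

# Assumed punctuation set: ASCII marks and their full-width counterparts.
PUNCTUATION = r"[,.!?;:，。！？；：、]+"


def split_segments(corpus: str) -> list[str]:
    """Split a corpus into sub-corpus segments, dropping empty pieces."""
    return [s.strip() for s in re.split(PUNCTUATION, corpus) if s.strip()]


# Hypothetical corpus with one comma and one full stop.
print(split_segments("ABCDE，FGHIJ。"))
```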

Mathematical model creation

Each character is denoted by one of the three letters A, B and C; then

the three characters ABC can be cut into the two combinations AB and BC.

Their occurrence counts in the corpus can be tabulated as in Table 1.

Character combination	Count
ABC	n0
AB	n1
BC	n2

TABLE 1

Case one:

n1 is far greater than n2, which means that AB is very likely a word.

Case two:

n1 is far smaller than n2, which means that A is likely a single-character word, while BC is possibly, but not necessarily, a word; it may be part of a subsequent word.

Case three:

n1 and n2 do not differ significantly, so ABC may be a word or part of a longer word. ABC cannot yet be taken as a word at this point; the following character must be added and the whole analyzed again.
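The three cases can be written compactly as a piecewise rule. The symbol T_1 for the first threshold is introduced here for illustration only; the patent names the threshold but gives it no symbol.

```latex
\[
\text{judgment}(n_1, n_2) =
\begin{cases}
AB \text{ is a word}, & n_1 - n_2 > T_1 \\[2pt]
A \text{ is a single-character word; } BC \text{ or } BCA' \text{ is a word}, & n_2 - n_1 > T_1 \\[2pt]
ABC \text{ or } ABCA' \text{ is a word}, & |n_1 - n_2| \le T_1
\end{cases}
\]
```

The three conditions are mutually exclusive and cover every pair of counts, so exactly one case applies to each ABC window.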

Specifically,

Example analysis one: see Table 2.

Character combination	Count
"everyone" + "all" (ABC)	2855
"everyone" (AB)	10530
second character of "everyone" + "all" (BC)	4442

TABLE 2

The corpus statistics show that "everyone" (AB) occurs far more often than the BC combination. "Everyone" can therefore be considered a word, and "all" becomes a single-character word. That is,

the three characters should be segmented as "everyone" and "all".

Example analysis two: see Table 3.

Character combination	Count
"all" + "consider" (ABC)	101
"all" + first character of "consider" (AB)	271
"consider" (BC)	3547

TABLE 3

Similarly, in this corpus "consider" (BC) occurs far more often than the AB combination, so "consider" is more likely a word and "all" is more likely a single-character word. The analysis proceeds in the same way as in example analysis one, which makes the procedure convenient to implement recursively in a program.

Example analysis results:

"Everyone all consider" should be segmented into "everyone", "all" and "consider". The word segmentation result is correct.

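The result of example analyses one and two can be reproduced with a simplified greedy pass over the segment. This is a sketch under several assumptions and not the patent's full procedure: the letters a to e stand in for the five characters, so that "ab" plays the role of "everyone", "c" of "all" and "de" of "consider"; the counts are the AB and BC values from Tables 2 and 3; the first threshold of 1000 is a hypothetical choice; a tail of one or two characters is emitted as-is; and the second-threshold step is omitted.

```python
def segment(text: str, counts: dict[str, int], first_threshold: int) -> list[str]:
    """Greedy left-to-right segmentation using only the first-threshold rule."""
    words, i = [], 0
    while i < len(text):
        if len(text) - i <= 2:
            # Assumption: a tail of one or two characters is kept whole.
            words.append(text[i:])
            break
        n1 = counts.get(text[i:i + 2], 0)      # count of AB
        n2 = counts.get(text[i + 1:i + 3], 0)  # count of BC
        if n1 - n2 > first_threshold:
            words.append(text[i:i + 2])        # AB is a word
            i += 2
        elif n2 - n1 > first_threshold:
            words.append(text[i])              # A stands alone
            i += 1
        else:
            words.append(text[i:i + 3])        # ABC is a word
            i += 3
    return words


# Counts from Tables 2 and 3, keyed by stand-in letters.
counts = {"ab": 10530, "bc": 4442, "cd": 271, "de": 3547}
print(segment("abcde", counts, first_threshold=1000))
```

With these inputs the pass emits "ab", then "c", then "de", which matches the three-word result stated above.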
Next, another example is taken from the same corpus.

Example analysis three (4 characters): see Table 4.

Character combination	Count
three-character "not only" (ABC)	387
first two characters of "not only" (AB)	977
last two characters, "only" (BC)	1302

TABLE 4

Here the counts of AB and BC do not differ greatly, and the count of the three-character combination ABC is not small either, so the following character must be brought into the analysis. Following the corpus above, the string is extended by one character to the four-character "not just"; any two adjacent characters are then merged into a single unit, the ABC model is rebuilt and the analysis is run again; see Table 5.

TABLE 5

Final analysis result

The above analysis shows that the AB combination is more suitable than the BC combination; that is, "not just" is more likely a word. The word segmentation is correct.
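As a check on the three example analyses, the count differences from Tables 2 to 4 work out as follows. The symbol T_1 for the first threshold is an assumption made for this illustration.

```latex
\begin{align*}
\text{Table 2: } & n_1 - n_2 = 10530 - 4442 = 6088 \\
\text{Table 3: } & n_2 - n_1 = 3547 - 271 = 3276 \\
\text{Table 4: } & |n_1 - n_2| = |977 - 1302| = 325
\end{align*}
```

Under the difference-based rule, any first threshold with 325 <= T_1 < 3276 reproduces all three outcomes: case one for Table 2, case two for Table 3, and case three for Table 4.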

As a preferred embodiment of the invention, the method further comprises,

As a preferred embodiment of the present invention, the method further includes: when intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into a corresponding hash structure and using the formed hash structure to count the character combinations quickly.

In this preferred embodiment, segmenting the corpus by traversal matching would be a very slow process. If the corpus is first built into a hash structure, word segmentation is accelerated and can reach a speed comparable to lexicon-based analysis. Character combinations are generated from the corpus. The analysis again uses the corpus segment

"not only the student homework burden is reduced".

For this segment, character combinations with lengths from 2 to 4 are generated; see Table 6.

TABLE 6

Generating the combinations in this way over the whole corpus produces the hash structure, from which the counts can be obtained very quickly.
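The speed argument can be illustrated by comparing a precomputed hash table with a naive rescan. In this sketch the names `combos`, `build_hash` and `count_by_traversal` and the sample segments are hypothetical; combinations are taken to be contiguous runs of 2 to 4 characters.

```python
def combos(segment: str, min_len: int = 2, max_len: int = 4):
    """Yield every contiguous run of min_len to max_len characters."""
    for n in range(min_len, max_len + 1):
        for i in range(len(segment) - n + 1):
            yield segment[i:i + n]


def build_hash(segments) -> dict[str, int]:
    """One pass over the corpus; afterwards each count is one lookup."""
    table: dict[str, int] = {}
    for seg in segments:
        for c in combos(seg):
            table[c] = table.get(c, 0) + 1
    return table


def count_by_traversal(segments, combo: str) -> int:
    """Naive baseline: rescan every segment for every query."""
    total = 0
    for seg in segments:
        for i in range(len(seg) - len(combo) + 1):
            if seg[i:i + len(combo)] == combo:
                total += 1
    return total


# Hypothetical segments: letters stand in for single characters.
segments = ["ABCDE", "BCDAB"]
table = build_hash(segments)
print(table["BCD"], count_by_traversal(segments, "BCD"))
```

Both approaches return the same count; the difference is that the hash table pays the scanning cost once, while traversal pays it again for every combination queried.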

The target corpus acquisition module is used for acquiring target corpus;

As a preferred embodiment of the invention, the device further comprises,

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.

The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions the computer-readable medium does not include electrical carrier signals and telecommunications signals.

While the present invention has been described in considerable detail and with particularity with respect to several embodiments, it is not intended to be limited to any such detail or embodiment; rather, it is to be construed with reference to the appended claims, which are to be interpreted as broadly as the prior art allows so as to effectively encompass the intended scope of the invention. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventors for the purpose of providing a useful description; insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents of it.

The above are merely preferred embodiments of the present invention. The invention is not limited to these embodiments, and anything that achieves the technical effects of the invention by the same means falls within its scope of protection. Various modifications and variations of the technical solution and/or the embodiments are possible within the scope of the invention.

Claims

1. A corpus-based word segmentation method, characterized by comprising the following steps:

Acquiring a target corpus;

Counting the number of times of occurrence of a plurality of character combinations with different lengths in user comments of a preset time period respectively;

word segmentation is carried out on the target corpus according to the times of the respective occurrence;

specifically, the adjacent characters of each sub-corpus segment are intercepted according to the following preset rule to obtain the character combinations of different lengths:

each sub-corpus segment is assumed to be ABCA'B'C' and is divided by adjacent 2 characters and adjacent 3 characters, yielding a plurality of 2-character and 3-character combinations with AB, BC and ABC as basic structures, where A, B and C are each any single character;

specifically, segmenting the target corpus according to the respective occurrence counts comprises:

if the difference between n2 and n1 is greater than a first threshold, judging that A is a single-character word and that BC or BCA' is a word;

if the absolute value of the difference between n1 and n2 is not greater than the first threshold, judging that ABC or ABCA' is a word;

the method further comprises:

when the difference between n2 and n1 is greater than the first threshold, obtaining the number of BC and the number of BCA', judging BCA' to be a word if the difference between the number of BC and the number of BCA' is lower than a second threshold, and judging BC to be a word if that difference is not lower than the second threshold;

when the absolute value of the difference between n1 and n2 is not greater than the first threshold, obtaining the number of ABC and the number of ABCA', judging ABCA' to be a word if the difference between the number of ABC and the number of ABCA' is lower than the second threshold, and judging ABC to be a word if the difference between the number of BC and the number of BCA' is not lower than the second threshold.

2. The corpus-based word segmentation method according to claim 1, wherein the user comments of the preset time period are the comments of all users within half a year.

3. The corpus-based word segmentation method according to claim 1, further comprising: when intercepting adjacent characters from each sub-corpus segment according to a preset rule to obtain a plurality of character combinations of different lengths, forming the character combinations into a corresponding hash structure, and using the formed hash structure to quickly count the number of times the character combinations appear in the user comments of the preset time period.

4. A corpus-based word segmentation device, characterized by comprising:

The target corpus acquisition module is used for acquiring target corpus;

The number statistics module is used for counting the number of times that a plurality of character combinations with different lengths respectively appear in user comments in a preset time period;

the word segmentation module is used for segmenting the target corpus according to the times of occurrence respectively;

the method may further comprise the steps of,

5. The corpus-based word segmentation apparatus as set forth in claim 4, further comprising,

6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any of claims 1-3.