
CN110750980B - Phrase corpus acquisition method and phrase corpus acquisition device

Info

Publication number: CN110750980B
Application number: CN201911352915.8A
Authority: CN
Inventors: 杨萌萌; 郝玉峰; 黄宇凯; 邵志明; 曹琼; 李科
Original assignee: Beijing Speechocean Technology Co ltd
Current assignee: Beijing Speechocean Technology Co ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-05
Anticipated expiration: 2039-12-25
Also published as: CN110750980A

Abstract

The invention relates to the technical field of speech synthesis and provides a phrase corpus acquisition method and a phrase corpus acquisition device. The phrase corpus acquisition method comprises the following steps: acquiring a long sentence corpus to be processed; splitting the long sentence corpus to be processed to obtain at least one clause (sub-sentence) corpus; comparing the word count of the clause corpus with a preset sentence length threshold; and, if the word count of the clause corpus is less than or equal to the preset sentence length threshold, keeping the clause corpus as a short sentence corpus. With this method, the long sentence corpus to be processed is split into independent clause corpora for processing, which improves the utilization rate of the corpus during cleaning, reduces the loss of useful corpora in the long sentence corpus to be processed, and saves the cost of manual proofreading.

Description

Phrase corpus acquisition method and phrase corpus acquisition device

Technical Field

The present invention relates generally to the field of speech synthesis technology, and more particularly, to a phrase corpus acquiring method and a phrase corpus acquiring device.

Background

In the related art, corpus cleaning includes steps such as sentence segmentation, sentence length filtering, and text deduplication. Through corpus cleaning, the specified unprocessed corpora in the original corpus set are processed into the text that is actually needed. During cleaning, the length of each sentence after segmentation is restricted to a range, so that the length variance of the cleaned text stays within a controllable range.

In practical applications, some corpora, such as news, encyclopedia, and legal corpora, consist of long sentences, so about 40% to 60% of their sentences are filtered out during corpus cleaning because they are too long. For some scarce corpora, such as business-conversation and technical corpora, filtering by sentence length leaves very little cleaned corpus, so automatic cleaning cannot meet the needs of later text use.

As a result, the corpus cleaning methods described above yield a low corpus utilization rate after cleaning and lose part of the original corpus information, and cannot meet the later demand for corpora.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a phrase corpus acquiring method and a phrase corpus acquiring device.

In a first aspect, an embodiment of the present invention provides a phrase corpus acquisition method, including: acquiring a long sentence corpus to be processed; splitting the long sentence corpus to be processed to obtain at least one clause (sub-sentence) corpus; comparing the word count of the clause corpus with a preset sentence length threshold; and, if the word count of the clause corpus is less than or equal to the preset sentence length threshold, keeping the clause corpus as a short sentence corpus.

In an embodiment, splitting the long sentence corpus to be processed to obtain at least one clause corpus includes: judging, through a sequence labeling model, whether the long sentence corpus to be processed contains independent clauses; and, if it does, splitting the long sentence corpus to be processed according to punctuation marks to obtain the clause corpus.

In an embodiment, splitting the long sentence corpus to be processed to obtain at least one clause corpus includes: judging, through dependency syntax analysis, whether the long sentence corpus to be processed contains parallel clauses; and, if it does, splitting the long sentence corpus to be processed into a plurality of parallel clause corpora.

In another embodiment, judging through dependency syntax analysis whether the long sentence corpus to be processed contains clauses in a parallel relation includes: obtaining the core word of the long sentence corpus to be processed through dependency syntax analysis; and judging, based on the dependency syntax analysis, whether parallel clauses exist according to whether the long sentence corpus to be processed contains parallel words that stand in a parallel relation with the core word. Splitting the long sentence corpus to be processed according to the parallel relation includes: if the long sentence corpus to be processed has clauses containing the parallel words, splitting it into a clause corpus containing the core word and clause corpora containing the parallel words.

In an embodiment, splitting the long sentence corpus to be processed to obtain at least one clause corpus further includes: if the long sentence corpus to be processed does not contain parallel clauses, performing component extraction on the long sentence corpus to be processed.

In another embodiment, performing component extraction on the long sentence corpus to be processed includes: based on dependency syntax analysis, extracting components from the long sentence corpus to be processed according to its sentence structure to obtain the clause corpus.

In an embodiment, the phrase corpus acquisition method further includes: if the word count of the clause corpus is greater than the preset sentence length threshold, judging through dependency syntax analysis whether the clause corpus contains parallel clauses.

In an embodiment, the phrase corpus acquisition method further includes: performing phrase checking on the short sentence corpus, and retaining the short sentence corpus that passes the phrase checking.

In another embodiment, performing phrase checking on the short sentence corpus and retaining the short sentence corpus that passes the phrase checking includes: obtaining the perplexity of the short sentence corpus through a trained language model; and comparing the perplexity with a preset perplexity threshold, and retaining the short sentence corpus whose perplexity is smaller than the preset perplexity threshold.
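The perplexity check described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the language model, so a toy character-level bigram model with add-one smoothing stands in for it, and the names BigramLM and filter_by_perplexity are hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Toy character-level bigram model with add-one smoothing (an assumption;
    the source only says that a language model supplies the perplexity)."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        symbols = set()
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            symbols.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # One extra slot stands for any character never seen in training.
        self.vocab_size = len(symbols) + 1

    def perplexity(self, sentence):
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            prob = (self.bigram_counts[(prev, cur)] + 1) / (
                self.context_counts[prev] + self.vocab_size
            )
            log_prob += math.log(prob)
        # Perplexity is the exponential of the average negative log-probability.
        return math.exp(-log_prob / (len(tokens) - 1))


def filter_by_perplexity(model, short_sentences, threshold):
    """Keep only the short sentences whose perplexity is below the threshold."""
    return [s for s in short_sentences if model.perplexity(s) < threshold]
```

In practice the threshold would be tuned on held-out text; a sentence whose character sequence resembles the training data scores a lower perplexity than a scrambled one and is therefore retained.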

In another embodiment, acquiring the long sentence corpus to be processed includes: acquiring a corpus set to be processed, which contains at least one corpus to be processed, and comparing the length of each corpus to be processed with a preset corpus sentence length threshold. If the length of the corpus to be processed is greater than or equal to the preset corpus sentence length threshold, the corpus is acquired as a long sentence corpus to be processed. If the length of the corpus to be processed is less than the preset corpus sentence length threshold, the corpus is a short sentence corpus to be processed, and phrase checking is performed on it.

In a second aspect, an embodiment of the present invention provides a phrase corpus acquisition device, including: an acquisition module, configured to acquire the long sentence corpus to be processed and to keep the clause corpus as a short sentence corpus when the word count of the clause corpus is less than or equal to a preset sentence length threshold; a splitting module, configured to split the long sentence corpus to be processed to obtain at least one clause corpus; and a comparison module, configured to compare the word count of the clause corpus with the preset sentence length threshold.

In an embodiment, the splitting module splits the long sentence corpus to be processed to obtain at least one clause corpus in the following manner: judging, through a sequence labeling model, whether the long sentence corpus to be processed contains independent clauses; and, if it does, splitting the long sentence corpus to be processed according to punctuation marks to obtain the clause corpus.

In another embodiment, the splitting module splits the long sentence corpus to be processed to obtain at least one clause corpus in the following manner: judging, through dependency syntax analysis, whether the long sentence corpus to be processed contains parallel clauses; and, if it does, splitting the long sentence corpus to be processed into a plurality of parallel clause corpora.

In an embodiment, the splitting module judges through dependency syntax analysis whether the long sentence corpus to be processed contains clauses in a parallel relation in the following manner: obtaining the core word of the long sentence corpus to be processed through dependency syntax analysis; and judging, based on the dependency syntax analysis, whether parallel clauses exist according to whether the long sentence corpus to be processed contains parallel words that stand in a parallel relation with the core word. The splitting module splits the long sentence corpus to be processed according to the parallel relation in the following manner: if the long sentence corpus to be processed has clauses containing the parallel words, splitting it into a clause corpus containing the core word and clause corpora containing the parallel words.

In an embodiment, the splitting module further splits the long sentence corpus to be processed to obtain at least one clause corpus in the following manner: if the long sentence corpus to be processed does not contain parallel clauses, performing component extraction on the long sentence corpus to be processed.

In another embodiment, the splitting module performs component extraction on the long sentence corpus to be processed in the following manner: based on dependency syntax analysis, extracting components from the long sentence corpus to be processed according to its sentence structure to obtain the clause corpus.

In one embodiment, when the word count of the clause corpus is greater than the preset sentence length threshold, the splitting module is further configured to judge through dependency syntax analysis whether the clause corpus contains parallel clauses.

In an embodiment, the phrase corpus acquisition device further includes: a checking module, configured to perform phrase checking on the short sentence corpus and to retain the short sentence corpus that passes the phrase checking.

In another embodiment, the checking module performs phrase checking on the short sentence corpus and retains the short sentence corpus that passes the phrase checking in the following manner: obtaining the perplexity of the short sentence corpus through a trained language model; and comparing the perplexity with a preset perplexity threshold, and retaining the short sentence corpus whose perplexity is smaller than the preset perplexity threshold.

In another embodiment, the acquisition module acquires the long sentence corpus to be processed in the following manner: acquiring a corpus set to be processed, which contains at least one corpus to be processed, and comparing the length of each corpus to be processed with a preset corpus sentence length threshold. If the length of the corpus to be processed is greater than or equal to the preset corpus sentence length threshold, the corpus is acquired as a long sentence corpus to be processed. If the length of the corpus to be processed is less than the preset corpus sentence length threshold, the corpus is a short sentence corpus to be processed, and phrase checking is performed on it.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory for storing instructions; and a processor for calling the instructions stored in the memory to execute the phrase corpus acquisition method.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions which, when executed by a processor, perform the phrase corpus acquisition method.

The invention provides a phrase corpus acquisition method and a phrase corpus acquisition device. The long sentence corpus to be processed is split into independent clause corpora for processing, which improves the utilization rate of the corpus during cleaning, reduces the information loss of the long sentence corpus to be processed, and saves manual proofreading cost.

Drawings

The above and other objects, features and advantages of embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 is a schematic diagram illustrating a phrase corpus acquisition method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating another phrase corpus obtaining method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating another phrase corpus acquiring method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating another phrase corpus acquiring method according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a phrase corpus acquisition operation according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a phrase corpus acquiring apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an electronic device provided by an embodiment of the invention;

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way.

It should be noted that although the expressions "first", "second", etc. are used herein to describe different modules, steps, data, etc. of the embodiments of the present invention, the expressions "first", "second", etc. are merely used to distinguish between different modules, steps, data, etc. and do not indicate a particular order or degree of importance. Indeed, the terms "first," "second," and the like are fully interchangeable.

The phrase corpus acquisition method is applied to a corpus to be processed, and a corpus suitable for building a speech synthesis database is obtained through corpus cleaning. In practical applications, to ensure corpus quality, the length of the cleaned corpus is limited so that the variance of the corpus length after cleaning stays within a controllable range, which facilitates corpus management.

FIG. 1 is a diagram illustrating a phrase corpus acquisition method according to an exemplary embodiment. As shown in FIG. 1, the phrase corpus acquisition method 10 includes the following steps S11 to S14.

In step S11, a long sentence corpus to be processed is acquired.

In the embodiment of the present disclosure, the long sentence corpus to be processed is a specified, unprocessed long sentence corpus. The sentence length that qualifies a corpus as a long sentence corpus to be acquired is determined by a preset corpus sentence length threshold. The long sentence corpus to be processed is obtained from a local corpus or from a database in the cloud. Its content may include news, dialog, encyclopedia, or legal text, which is not limited in this disclosure.

In an implementation scenario, the corpora in the corpus set to be processed, which contains at least one sentence of corpus to be processed, are acquired separately according to the preset corpus sentence length threshold. Corpora whose sentence length is greater than or equal to the preset corpus sentence length threshold are taken as long sentence corpora to be processed and are cleaned using the long sentence corpus cleaning procedure. Corpora whose sentence length is less than the preset corpus sentence length threshold are either acquired directly as short sentence corpora or cleaned using a short sentence corpus cleaning procedure. Separating the corpora to be processed by sentence length saves corpus cleaning cost and speeds up the cleaning process. It also reduces the loss of useful corpora in the corpus set to be processed and thus improves its utilization rate.
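The separation by sentence length described in this scenario can be sketched as below. It is a minimal illustration, not the patented implementation: the function name partition_by_length is hypothetical, and sentence length is assumed to be the raw character count of each string.

```python
def partition_by_length(corpus_set, length_threshold):
    """Separate a corpus set into long and short sentence corpora.

    Assumption: sentence length is measured as the number of characters,
    punctuation included; the source does not fix the unit.
    """
    long_sentences, short_sentences = [], []
    for sentence in corpus_set:
        if len(sentence) >= length_threshold:
            long_sentences.append(sentence)   # goes to long-sentence cleaning
        else:
            short_sentences.append(sentence)  # kept or cleaned as a short sentence
    return long_sentences, short_sentences


long_part, short_part = partition_by_length(["short one.", "x" * 40], 20)
```

Only the long part needs the comparatively expensive splitting steps, which is where the cost saving mentioned above comes from.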

In step S12, the long sentence corpus to be processed is split to obtain at least one sub-sentence corpus.

In the embodiment of the disclosure, the acquired long sentence corpus to be processed is split to obtain at least one independent clause corpus. Splitting the long sentence corpus to be processed helps retain as much of its effective information as possible and improves its utilization rate.

In one embodiment, whether the long sentence corpus contains independent clauses is judged from its punctuation marks in combination with context, and the long sentence corpus to be processed is split when independent clauses exist.

In an implementation scenario, the punctuation marks in the long sentence corpus to be processed are removed in advance to obtain a long sentence corpus without punctuation marks. A sequence labeling model, for example a Bi-LSTM-CRF model (bidirectional long short-term memory network with a conditional random field), combines context and named entity recognition to predict the punctuation positions of the unpunctuated long sentence corpus, yielding a long sentence corpus with predicted punctuation marks. A dynamic programming algorithm, for example the LCS (longest common subsequence) algorithm, then aligns the original long sentence corpus with the version carrying the predicted punctuation marks, and whether the long sentence corpus contains independent clauses is judged from how the punctuation mark at each corresponding position changes. If the version with predicted punctuation marks (not including the sentence-final punctuation) contains a sentence-end mark where the original long sentence corpus has a pause mark, it is determined that an independent clause exists, and the long sentence corpus to be processed is split into a plurality of clause corpora at the predicted sentence-end marks. If the version with predicted punctuation marks (not including the sentence-final punctuation) contains no sentence-end mark, it is determined that the long sentence corpus to be processed contains no independent clause. Sentence-end marks are, for example, the period, the question mark, and the exclamation mark. Pause marks are, for example, the comma and the semicolon.

For example, the long sentence corpus to be processed is: "The bad feeling grew stronger and stronger, before going to sleep he even told me he would get up very early the next day for a meeting, and he did not wake me." The version with predicted punctuation marks obtained by the sequence labeling model is: "The bad feeling grew stronger and stronger. Before going to sleep he even told me he would get up very early the next day for a meeting, and he did not wake me." Comparison shows that the predicted mark after "stronger and stronger" is a period, whereas the corresponding position in the long sentence corpus to be processed is a comma. The long sentence corpus is therefore split into "The bad feeling grew stronger and stronger." and "Before going to sleep he even told me he would get up very early the next day for a meeting, and he did not wake me." Splitting the long sentence corpus to be processed by punctuation marks quickly shortens its sentence length. Turning long sentences into short sentences before corpus cleaning improves cleaning efficiency and reduces cleaning cost.

In another implementation scenario, a lookup table that maps punctuation marks to their corresponding label symbols is preset, and the sequence labeling model is trained. According to the preset lookup table, the punctuated training sentences in the obtained training sentence set and the corresponding training sentences without punctuation are labeled, a training set and a test set for the sequence labeling model are constructed, and the Bi-LSTM-CRF model is trained.
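The alignment-and-compare step can be illustrated with the sketch below. It assumes the predicted punctuation is already available as a string (the Bi-LSTM-CRF model itself is not shown), aligns the two strings character by character with a longest-common-subsequence table, and splits wherever an original pause mark corresponds to a predicted sentence-end mark. The function names and the mark sets are hypothetical.

```python
END_MARKS = set("。？！.?!")
PAUSE_MARKS = set("，；,;")


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def split_on_predicted_ends(original, predicted):
    """Split `original` where it has a pause mark but `predicted` has an end mark."""
    clauses, start = [], 0
    prev_i, prev_j = -1, -1
    # A sentinel pair closes the last gap after the final aligned character.
    for i, j in lcs_pairs(original, predicted) + [(len(original), len(predicted))]:
        pauses = [k for k in range(prev_i + 1, i) if original[k] in PAUSE_MARKS]
        end_mark = next((c for c in predicted[prev_j + 1:j] if c in END_MARKS), None)
        if pauses and end_mark:
            cut = pauses[0]
            clauses.append(original[start:cut].strip() + end_mark)
            start = cut + 1
        prev_i, prev_j = i, j
    tail = original[start:].strip()
    if tail:
        clauses.append(tail)
    return clauses
```

Because the sentence-final mark is identical in both strings, it is aligned rather than treated as a change, which matches the rule that sentence-final punctuation is excluded from the comparison. If the prediction leaves every pause mark unchanged, the function returns the original sentence unsplit.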

In step S13, the word count of the clause corpus is compared with a preset sentence length threshold.

In the embodiment of the present disclosure, the word count of each clause corpus obtained by splitting is compared with the preset sentence length threshold, so as to control the length of the resulting short sentence corpus and meet the requirements for building the corpus.

In step S14, a clause corpus in which the number of words is less than or equal to a preset sentence length threshold is retained as a phrase corpus.

In the embodiment of the present disclosure, the clause corpus with the number of words less than or equal to the preset sentence length threshold is retained according to the actual requirement of the corpus, and is stored as the short sentence corpus to provide a large amount of short sentence corpus meeting the standard for later speech synthesis or other applications.

Through the above embodiment, the long sentence corpus to be processed is split into independent clause corpora for processing, which improves the utilization rate of the corpus to be processed, improves corpus cleaning efficiency, and saves manual proofreading cost.
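Steps S11 to S14 can be summarized in a short sketch. It assumes that the splitting strategy is supplied as a function, that the "word count" is the number of letters, digits, or Chinese characters in the clause, and that names such as acquire_short_sentences are illustrative only.

```python
def char_count(text):
    """Count characters, ignoring punctuation and whitespace (an assumption)."""
    return sum(1 for ch in text if ch.isalnum())


def acquire_short_sentences(long_sentences, split, max_len):
    """S11-S14: split each long sentence and keep clauses within the threshold."""
    kept = []
    for sentence in long_sentences:                 # S11: acquire
        for clause in split(sentence):              # S12: split into clauses
            if char_count(clause) <= max_len:       # S13: compare with threshold
                kept.append(clause)                 # S14: keep as short sentence
    return kept


def split_on_periods(sentence):
    """Stand-in splitter for the demo; a real one would use the models above."""
    return [part.strip() + "." for part in sentence.split(".") if part.strip()]


result = acquire_short_sentences(["aaa. bbbbbbbbbb. cc."], split_on_periods, 5)
```

Passing the splitter as a parameter keeps the length filter independent of whether punctuation prediction, parallel-clause splitting, or component extraction produced the clauses.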

FIG. 2 is a diagram illustrating another phrase corpus acquisition method, according to an example embodiment. As shown in FIG. 2, the phrase corpus acquiring method 20 includes the following steps S21-S25.

In the embodiment of the present disclosure, the steps S21, S24, and S25 are respectively the same as the steps S11, S13, and S14 in the phrase corpus acquiring method 10, and are not repeated herein.

In step S21, a long sentence corpus to be processed is acquired.

In step S22, it is determined whether or not there are parallel clauses in the long sentence corpus to be processed by dependency syntax analysis.

In the embodiment of the disclosure, dependency syntax analysis is used to analyze the constituents of the long sentence corpus to be processed, obtain its core word, and judge whether the core word stands in a parallel relation with clauses other than the one containing it. When parallel clauses exist, the long sentence corpus to be processed is split into a plurality of parallel clause corpora. When no parallel clauses exist, components are extracted from the long sentence corpus to be processed according to its sentence constituents, so that its effective information is retained, the resulting clause corpus remains structurally complete, and retention of the effective content is not affected. For example, the content of the long sentence corpus to be processed is: "Unlike other events, this warm-up match of the women's hockey team is completely closed, there is no audience, no interviews are given to reporters, and there is no event bulletin." As shown in Table 1 below, dependency parsing yields the word segmentation, word positions, dependency relationships, and parts of speech of the long sentence corpus to be processed, from which its core word is determined. The word whose dependency relationship word position is 0, that is, the word with no head, carries the core relationship in the dependency structure and is the core word of the long sentence corpus to be processed.

Word	Word position	Dependency relationship word position	Dependency relationship	Part of speech
with	1	4	adverbial-head structure	p
other	2	3	attributive-head relation	r
events	3	1	preposition-object relation	nz
different	4	11	adverbial-head structure	a
，	5	4	punctuation	w
women's hockey	6	9	attributive-head relation	n
(structural particle)	7	6	right adjunct relation	u
this	8	9	attributive-head relation	r
warm-up match	9	11	subject-predicate relation	l
completely	10	11	adverbial-head structure	ad
closed	11	0	core relation	v
，	12	11	punctuation	w
there is no	13	11	parallel relation	v
audience	14	13	verb-object relation	n
，	15	13	punctuation	w
give	16	17	attributive-head relation	v
reporter	17	18	subject-predicate relation	n
interview	18	11	parallel relation	vn
，	19	18	punctuation	w
there is no	20	11	parallel relation	v
event	21	22	attributive-head relation	n
bulletin	22	20	verb-object relation	n
。	23	11	punctuation	w

TABLE 1
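Reading the core word and its parallel words off Table 1 can be sketched as follows. The rows reproduce the word position, dependency relationship word position, and relation columns of Table 1; the short relation labels and the function name are shorthand chosen for this illustration, and the parser that would produce such rows is not shown.

```python
# (word position, dependency relationship word position, relation) per Table 1.
TABLE_1 = [
    (1, 4, "adverbial"), (2, 3, "attributive"), (3, 1, "prep-object"),
    (4, 11, "adverbial"), (5, 4, "punct"), (6, 9, "attributive"),
    (7, 6, "right-adjunct"), (8, 9, "attributive"), (9, 11, "subject"),
    (10, 11, "adverbial"), (11, 0, "core"), (12, 11, "punct"),
    (13, 11, "parallel"), (14, 13, "verb-object"), (15, 13, "punct"),
    (16, 17, "attributive"), (17, 18, "subject"), (18, 11, "parallel"),
    (19, 18, "punct"), (20, 11, "parallel"), (21, 22, "attributive"),
    (22, 20, "verb-object"), (23, 11, "punct"),
]


def find_core_and_parallel(rows):
    """Return the core word position and the positions of its parallel words."""
    # The core word is the only word whose head position is 0.
    core = next(pos for pos, head, _ in rows if head == 0)
    # Parallel words attach directly to the core word with a parallel relation.
    parallel = [pos for pos, head, rel in rows if head == core and rel == "parallel"]
    return core, parallel


core, parallel = find_core_and_parallel(TABLE_1)
print(core, parallel)  # 11 [13, 18, 20]
```

A non-empty list of parallel words indicates that the sentence contains parallel clauses and can be split; an empty list leads to component extraction instead.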

In another embodiment, whether the acquired long sentence corpus to be processed contains independent clauses is first judged through a sequence labeling model. The long sentence corpus to be processed is then split through dependency syntax analysis on the basis of the independent clauses obtained, which saves dependency parsing time and improves the splitting rate.

In step S23, if the long sentence corpus to be processed has parallel clauses, the long sentence corpus to be processed is divided into a plurality of parallel clause corpora.

In the embodiment of the disclosure, dependency parsing determines that the long sentence corpus to be processed contains parallel words that stand in a parallel relation with the core word and are distributed across different clauses, the clauses being separated by periods or semicolons. The clause-initial position of each clause containing a parallel word is obtained, and the long sentence corpus to be processed is split there. For any split clause that does not end with a sentence-end mark, punctuation prediction is performed to determine its ending mark, so that the structure of each split clause corpus is complete. For example, as shown in Table 1 above, dependency parsing determines that "closed" at word position 11 is the core word, and the corresponding parallel words are "there is no" at word position 13, "interview" at word position 18, and "there is no" at word position 20. According to the positions of the clauses containing the parallel words, the sentence "Unlike other events, this warm-up match of the women's hockey team is completely closed, there is no audience, no interviews are given to reporters, and there is no event bulletin." is split into "Unlike other events, this warm-up match of the women's hockey team is completely closed.", "There is no audience.", "No interviews are given to reporters.", and "There is no event bulletin."
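Splitting at the clause-initial position of each parallel word can be sketched as below. The token list uses English glosses of the 23 words in Table 1, the parallel word positions are given as input, and the final step of predicting an ending mark for each clause is left out. The function name and the punctuation set are assumptions of this sketch.

```python
PUNCT = {"，", "；", "。", ",", ";", "."}


def split_at_parallel_clauses(tokens, parallel_positions):
    """Cut a token list at the start of each clause containing a parallel word.

    Positions are 1-based, as in Table 1.
    """
    cut_starts = set()
    for pos in parallel_positions:
        k = pos
        # Walk left until the token just before position k is punctuation.
        while k > 1 and tokens[k - 2] not in PUNCT:
            k -= 1
        cut_starts.add(k)
    clauses, current = [], []
    for pos, token in enumerate(tokens, start=1):
        if pos in cut_starts and current:
            clauses.append(current)
            current = []
        current.append(token)
    if current:
        clauses.append(current)
    return clauses


TOKENS = [
    "with", "other", "events", "different", "，", "women's hockey", "de",
    "this", "warm-up match", "completely", "closed", "，", "there is no",
    "audience", "，", "give", "reporter", "interview", "，", "there is no",
    "event", "bulletin", "。",
]
clauses = split_at_parallel_clauses(TOKENS, [13, 18, 20])
print([len(c) for c in clauses])  # [12, 3, 4, 4]
```

The four resulting token groups correspond to the four clauses of the example; each still carries its trailing pause mark, which the punctuation prediction step would replace with a sentence-end mark.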

In one embodiment, to keep the dependency relations of the split clause corpus complete, the syntactic structure of the clause corpus is supplemented. The subject of the core word, that is, the word in the subject-predicate relation identified by syntactic analysis, is obtained in advance. It is then judged whether the clause containing a parallel word has a word in a subject-predicate relation. If it does not, the subject-predicate word from the clause containing the core word is added to the clause containing the parallel word, in accordance with the Chinese-language tendency to omit elements already given in the preceding text. A language model computes the perplexity for each candidate position in the parallel clause, and the position with the lowest perplexity value is taken as the position at which to add the subject-predicate word. After the word has been added, the language model is used to judge, from the perplexity values, whether conjunctions and adverbs that form an adverbial relation with the parallel word in that clause need to be deleted, so that the split clause corpus has a normal syntactic structure, reads smoothly, and contains no redundant sentence components.
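Choosing where to add the omitted subject can be sketched as a search over insertion positions. The sketch assumes that the lowest-scoring position is the one selected and that the language model is supplied as a scoring function; the toy scorer below merely counts unseen word pairs and stands in for a real perplexity computation. All names are hypothetical.

```python
def best_insertion(tokens, word, perplexity):
    """Try every insertion position and return the lowest-perplexity result."""
    candidates = [tokens[:i] + [word] + tokens[i:] for i in range(len(tokens) + 1)]
    return min(candidates, key=perplexity)


# Toy stand-in for a language model: fewer unseen adjacent pairs = lower score.
SEEN_PAIRS = {("match", "has"), ("has", "no"), ("no", "audience")}


def toy_perplexity(tokens):
    unseen = sum(1 for pair in zip(tokens, tokens[1:]) if pair not in SEEN_PAIRS)
    return 1 + unseen


completed = best_insertion(["has", "no", "audience"], "match", toy_perplexity)
print(completed)  # ['match', 'has', 'no', 'audience']
```

The same scoring function could then be reused to decide whether dropping a conjunction or adverb lowers the score, as the paragraph above describes.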

In step S24, the word count of the clause corpus is compared with a preset sentence length threshold.

In step S25, a clause corpus in which the number of words is less than or equal to a preset sentence length threshold is retained as a phrase corpus.

With the above-described embodiment, the syntactic structure of the long sentence corpus to be processed and the dependency relations between the words in the sentence are determined based on dependency syntax analysis, and the parallel relations between clauses are determined from them. Based on the parallel relations, the long sentence corpus to be processed is split into a plurality of independent clause corpora. Because the split follows the sentence structure of the long sentence corpus to be processed, the splitting is more accurate and the structural integrity of the clause corpora is improved.

Based on the same inventive concept, the present disclosure provides another phrase corpus obtaining method.

FIG. 3 is a diagram illustrating yet another phrase corpus acquisition method, according to an example embodiment. As shown in fig. 3, the phrase corpus acquiring method 30 includes the following steps S31 to S35.

In the embodiment of the present disclosure, step S31, step S32, step S34, and step S35 are the same as the implementation of step S21, step S22, step S24, and step S25 in the phrase corpus acquiring method 20, and are not repeated herein.

In step S31, a long sentence corpus to be processed is acquired.

In step S32, it is determined whether or not there are parallel clauses in the long sentence corpus to be processed by dependency syntax analysis.

In step S33, if there is no parallel clause in the long sentence corpus to be processed, the long sentence corpus to be processed is subjected to component extraction.

In the embodiment of the disclosure, dependency parsing determines that the long sentence corpus to be processed contains no parallel word having a parallel relationship with the core word, which indicates that the long sentence corpus contains no clauses in a parallel relationship. Component extraction is then performed on the long sentence corpus according to the dependency parse, shortening the sentence and yielding a new corpus, namely the clause corpus of the long sentence corpus to be processed. In one implementation scenario, components are extracted according to adverbial relationships, coordinated nouns and other coordinated words, and the subject-predicate-object relationships. According to the dependency parse, if the long sentence corpus contains a clause that stands in an adverbial relationship with the core word, that clause is deleted. If the long sentence corpus contains coordinated noun components, at least one of them is retained and the rest are deleted. When the attributive modifying the subject of the core word exceeds a preset word-count threshold, the attributive is deleted. When the object clause corresponding to the core word is too long, only the object clause is retained. Component extraction thus shortens the long sentence corpus while preserving as much of its useful information as possible, which improves the utilization rate of the long sentence corpus to be processed.
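One of the extraction rules above, deleting an over-long attributive of the subject, can be sketched as follows. The sketch assumes a dependency parse is already available as a list of tokens with head indices and LTP-style labels (`HED`, `SBV`, `ATT`); those labels, the `Token` type, the function names, and the hand-written parse are illustrative assumptions, and the other rules (adverbial clauses, coordinated nouns, object clauses) are not covered.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    word: str
    head: int  # index of the governing token, -1 for the root
    rel: str   # dependency label (LTP-style labels assumed)


def subtree(tokens: List[Token], root: int) -> Set[int]:
    """Collect the indices of root and all of its descendants."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in nodes and tok.head in nodes:
                nodes.add(i)
                changed = True
    return nodes


def trim_long_attributes(tokens: List[Token], max_att_words: int) -> List[str]:
    """Drop attributives of the core word's subject that exceed max_att_words."""
    core = next((i for i, t in enumerate(tokens) if t.rel == "HED"), None)
    if core is None:
        return [t.word for t in tokens]
    subjects = [i for i, t in enumerate(tokens) if t.rel == "SBV" and t.head == core]
    drop: Set[int] = set()
    for subj in subjects:
        for i, tok in enumerate(tokens):
            if tok.rel == "ATT" and tok.head == subj:
                nodes = subtree(tokens, i)
                if len(nodes) > max_att_words:
                    drop |= nodes
    return [t.word for i, t in enumerate(tokens) if i not in drop]


# Hand-written parse of "来自 北京 的 朋友 到 了" (illustrative only).
parsed = [
    Token("来自", 3, "ATT"),
    Token("北京", 0, "VOB"),
    Token("的", 0, "RAD"),
    Token("朋友", 4, "SBV"),
    Token("到", -1, "HED"),
    Token("了", 4, "RAD"),
]
shortened = trim_long_attributes(parsed, max_att_words=2)
```

The attributive here spans three tokens, so a threshold of 2 removes it while a threshold of 3 leaves the sentence unchanged.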

In step S34, the word count of the clause corpus is compared with a preset sentence length threshold.

In step S35, a clause corpus in which the number of words is less than or equal to a preset sentence length threshold is retained as a phrase corpus.

Through the embodiment, the sentence components of the long sentence corpus to be processed are determined through dependency relationship analysis, and the components of the long sentence corpus to be processed are extracted according to the sentence components. The method and the device have the advantages that effective information in the long sentence corpus to be processed is reserved to the maximum extent while the long sentence corpus to be processed is shortened, the utilization rate of the long sentence corpus to be processed is improved, and information loss is reduced.

In one embodiment, when the long sentence corpus to be processed has independent clauses, the long sentence corpus to be processed is divided into a plurality of clause corpuses. And comparing the obtained multiple clause corpora according to a preset sentence length threshold. And if the word number of the obtained clause corpus is larger than a preset sentence length threshold value, analyzing the clause corpus by using dependency syntax analysis, and splitting the clause corpus into a plurality of parallel clause corpuses or shortening the word number length of the clause corpus through component extraction. The method is beneficial to maximally preserving effective information in the long sentence corpus to be processed and improving the utilization rate of the long sentence corpus to be processed.
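The second pass described above can be sketched as follows. The names `refine_clauses` and `resplit` are hypothetical; `resplit` stands in for the dependency-based step (parallel-clause splitting or component extraction), sentence length is measured with `len()`, and discarding pieces that are still too long after one extra pass is an assumption of this sketch.

```python
from typing import Callable, List


def refine_clauses(
    clauses: List[str],
    max_len: int,
    resplit: Callable[[str], List[str]],
) -> List[str]:
    """Keep short clauses; send over-long ones through one more splitting pass."""
    result: List[str] = []
    for clause in clauses:
        if len(clause) <= max_len:
            result.append(clause)
        else:
            # Assumption: pieces that are still too long are discarded.
            result.extend(p for p in resplit(clause) if len(p) <= max_len)
    return result


def halve(clause: str) -> List[str]:
    """Toy stand-in for dependency-based splitting: cut the clause in half."""
    mid = len(clause) // 2
    return [clause[:mid], clause[mid:]]


refined = refine_clauses(["ab", "abcdefgh"], max_len=4, resplit=halve)
```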

FIG. 4 is a diagram illustrating yet another phrase corpus acquisition method, according to an example embodiment. As shown in fig. 4, the phrase corpus acquiring method 40 includes the following steps S41 to S45.

In the embodiment of the present disclosure, the steps S41 to S44 are the same as the steps S11 to S14 in the phrase corpus obtaining method 10, and are not repeated herein.

In step S41, a long sentence corpus to be processed is acquired.

In step S42, the long sentence corpus to be processed is split to obtain at least one sub-sentence corpus.

In step S43, the word count of the clause corpus is compared with a preset sentence length threshold.

In step S44, a clause corpus in which the number of words is less than or equal to a preset sentence length threshold is retained as a phrase corpus.

In step S45, phrase check is performed on the phrase corpus, and the phrase corpus that passes the phrase check is retained.

In the embodiment of the present disclosure, the obtained phrase corpus is subjected to phrase verification through the language model, and the verified phrase corpus is retained, so as to ensure that the obtained phrase corpus is a new sentence with a complete structure, smooth semantic meaning and meeting the requirement of sentence length. The method is beneficial to improving the construction quality of the corpus and saving the manual proofreading time.

In one embodiment, a perplexity value for the phrase corpus is obtained from a language model. The perplexity value is compared with a preset perplexity threshold, and phrase corpuses whose perplexity value is smaller than the preset perplexity threshold are retained, which helps keep only high-quality phrase corpuses. In one implementation scenario, the word-segmented training corpus is fed into a language-model training tool, for example SRILM, which is configured to train a 5-gram language model; the trained model judges the quality of input text according to its perplexity. The trained language model then scores each acquired phrase corpus to measure its quality.
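The perplexity check can be illustrated with a small self-contained sketch. It does not use SRILM or a 5-gram model: a bigram model with add-one smoothing is swapped in purely to keep the example short, and the class name `BigramLM`, the toy training data, and the threshold of 5.0 are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


class BigramLM:
    """Add-one-smoothed bigram model, a toy stand-in for a trained 5-gram model."""

    def __init__(self, corpus: Sequence[Sequence[str]]) -> None:
        self.history: Counter = Counter()
        self.bigrams: Counter = Counter()
        vocab = {"</s>"}
        for sent in corpus:
            toks = ["<s>"] + list(sent) + ["</s>"]
            vocab.update(sent)
            self.history.update(toks[:-1])
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def perplexity(self, sent: Sequence[str]) -> float:
        toks = ["<s>"] + list(sent) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.history[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))


def filter_by_perplexity(
    lm: BigramLM, candidates: List[List[str]], threshold: float
) -> List[List[str]]:
    """Retain only candidates whose perplexity is below the threshold."""
    return [c for c in candidates if lm.perplexity(c) < threshold]


lm = BigramLM([["他", "喜欢", "唱歌"], ["她", "喜欢", "跳舞"]])
fluent = ["他", "喜欢", "跳舞"]
scrambled = ["跳舞", "他", "喜欢"]
kept = filter_by_perplexity(lm, [fluent, scrambled], threshold=5.0)
```

With this toy data the fluent candidate scores a lower perplexity than the scrambled one, so only it is retained; a suitable threshold for a real model would have to be tuned on held-out data.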

Through the embodiment, the retained short sentence corpus is secondarily screened through phrase verification, so that the retained phrase corpus quality is improved, the manual proofreading time is saved, and the cost is saved.

Based on the same inventive concept, the present disclosure further provides a workflow for obtaining phrase corpuses.

FIG. 5 is a diagram illustrating yet another phrase corpus acquisition method, according to an example embodiment. As shown in fig. 5, the phrase corpus acquiring method 50 includes the following steps S51 to S57.

In step S51, the long sentence corpus to be processed is obtained according to the preset corpus sentence length threshold.

In the embodiment of the present disclosure, the sentence length of the long sentence corpus to be processed, which needs to be acquired, is determined according to the preset corpus sentence length threshold. And taking the linguistic data to be processed, of which the sentence length is greater than or equal to a preset linguistic data sentence length threshold value, as long sentence linguistic data to be processed.
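A minimal sketch of this selection step follows. The function name `partition_by_length` is hypothetical, and measuring sentence length as the number of characters via `len()` is an assumption; a word-segmented count could be substituted.

```python
from typing import List, Tuple


def partition_by_length(corpus: List[str], long_threshold: int) -> Tuple[List[str], List[str]]:
    """Split a corpus set into long sentences (>= threshold) and the rest."""
    long_sentences = [s for s in corpus if len(s) >= long_threshold]
    short_sentences = [s for s in corpus if len(s) < long_threshold]
    return long_sentences, short_sentences


long_part, short_part = partition_by_length(["今天下雨", "今天下雨，路很滑；明天晴天"], 10)
```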

In step S52, it is determined whether there is an independent clause in the long sentence corpus to be processed.

In the embodiment of the disclosure, a sequence labeling model determines whether the long sentence corpus to be processed contains independent clauses. If independent clauses exist, the long sentence corpus to be processed is split to generate a plurality of clause corpuses, and step S55 is executed to compare the word number of each clause corpus with a preset sentence length threshold. If there is no independent clause, step S53 is executed to determine whether there is a parallel clause in the long sentence corpus to be processed.
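The split on independent clauses can be sketched as follows, cutting at punctuation marks that the model flags as clause boundaries. The sequence labeling model itself is not reproduced: the callback `is_boundary` is a hypothetical stand-in for its per-position decision, and `toy_boundary` is a rule used only to make the example runnable.

```python
from typing import Callable, List

PUNCTUATION = "，,；;。"


def split_independent(text: str, is_boundary: Callable[[str, int], bool]) -> List[str]:
    """Cut the text at punctuation marks that the labeler flags as boundaries."""
    clauses: List[str] = []
    start = 0
    for i, ch in enumerate(text):
        if ch in PUNCTUATION and is_boundary(text, i):
            piece = text[start:i].strip()
            if piece:
                clauses.append(piece)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        clauses.append(tail)
    return clauses


def toy_boundary(text: str, index: int) -> bool:
    """Toy labeler (hypothetical): treat semicolons and full stops as boundaries."""
    return text[index] in "；;。"


independent = split_independent("今天下雨，路很滑；明天晴天。", toy_boundary)
```

If the labeler flags no boundary, the function returns the whole sentence as a single clause, which corresponds to falling through to the parallel-clause check.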

In step S53, it is determined whether or not a long sentence corpus to be processed contains a parallel clause.

In the embodiment of the disclosure, the long sentence corpus to be processed is analyzed by dependency parsing to obtain its core word. Based on the syntactic structure and the dependency relationships among words given by the parse, it is determined whether the core word has a parallel relationship with a word in another clause. If a parallel relationship exists, step S54a is executed to split the long sentence corpus into a plurality of parallel clause corpuses. If no parallel relationship exists, step S54b is executed to perform component extraction on the long sentence corpus to be processed.

In step S54a, the long sentence corpus to be processed is divided into a plurality of juxtaposed clause corpuses.

In the embodiment of the disclosure, dependency parsing determines that the long sentence corpus to be processed contains parallel words having a parallel relationship with the core word, and that the parallel words are located in different clauses. The starting position of the clause in which each parallel word is located is obtained, and the long sentence corpus to be processed is split at those positions to obtain a plurality of parallel clause corpuses.
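The split at the start of each parallel clause can be sketched as follows. The sketch assumes a parse is already available with LTP-style labels (`HED` for the core word, `COO` for a parallel word); the labels, the `Token` type, the function name `split_parallel`, and the hand-written parse are illustrative assumptions rather than the disclosed implementation.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    head: int  # index of the governing token, -1 for the root
    rel: str   # dependency label (LTP-style labels assumed)


CLAUSE_PUNCT = {"，", ",", "；", ";"}


def split_parallel(tokens: List[Token]) -> List[List[str]]:
    """Split at the start of every clause whose head word is parallel to the core word."""
    core = next((i for i, t in enumerate(tokens) if t.rel == "HED"), None)
    if core is None:
        return [[t.word for t in tokens]]
    cuts = set()
    for i, tok in enumerate(tokens):
        if tok.rel == "COO" and tok.head == core:
            start = i
            # Walk back to the first token after the preceding punctuation mark.
            while start > 0 and tokens[start - 1].word not in CLAUSE_PUNCT:
                start -= 1
            if start > core:  # the parallel word sits in a later clause
                cuts.add(start)
    bounds = [0] + sorted(cuts) + [len(tokens)]
    clauses: List[List[str]] = []
    for a, b in zip(bounds, bounds[1:]):
        words = [t.word for t in tokens[a:b]]
        while words and words[-1] in CLAUSE_PUNCT:
            words.pop()  # drop the separator left at the end of the clause
        if words:
            clauses.append(words)
    return clauses


# Hand-written parse of "他 喜欢 唱歌 ， 也 喜欢 跳舞" (illustrative only).
sentence = [
    Token("他", 1, "SBV"),
    Token("喜欢", -1, "HED"),
    Token("唱歌", 1, "VOB"),
    Token("，", 1, "WP"),
    Token("也", 5, "ADV"),
    Token("喜欢", 1, "COO"),
    Token("跳舞", 5, "VOB"),
]
parallel_clauses = split_parallel(sentence)
```

The second clause in this example has no subject of its own, which is the situation the subject-completion embodiment is meant to handle.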

In step S54b, the long sentence corpus to be processed is subjected to component extraction.

In the embodiment of the disclosure, it is determined that the long sentence corpus to be processed does not contain parallel words having a parallel relationship with the core words through dependency parsing, and the long sentence corpus to be processed is subjected to component extraction to shorten the sentence length of the long sentence corpus to be processed, so as to obtain a new corpus, i.e., a sub-sentence corpus of the long sentence corpus to be processed.

In step S55, the word count of the clause corpus is compared with the preset sentence length threshold, and the clause corpus having the word count smaller than or equal to the preset sentence length threshold is retained.

In the embodiment of the present disclosure, the word number of each split clause corpus is compared with the preset sentence length threshold, and the clause corpuses whose word number is less than or equal to the preset sentence length threshold are retained, so as to control the length of the obtained short sentence corpus. Each retained clause corpus is taken as a short sentence corpus.

In step S56, the retained clause corpus is phrase checked.

In the embodiment of the present disclosure, the obtained phrase corpus is subjected to phrase verification through the language model, and the verified phrase corpus is retained, so as to ensure that the obtained phrase corpus is a new sentence with a complete structure, smooth semantic meaning and meeting the requirement of sentence length.

In step S57, a phrase corpus is obtained.

In the embodiment of the present disclosure, the phrase corpus checked by the phrase is retained, so as to obtain a short sentence corpus meeting the requirement.

Through the embodiment, the long sentence corpus to be processed is split through independent clause detection and dependency syntax analysis, the long sentence corpus is split into independent clause corpuses, and short sentence corpuses which are complete in structure, smooth in semanteme and meet the requirement of sentence length are obtained after phrase verification. The method is beneficial to improving the utilization rate of the long sentence corpus to be processed, reducing the loss of corpus information and saving labor cost.
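The branching of steps S52 to S57 can be summarized in a short sketch. Every callable passed in is a hypothetical placeholder for the corresponding component (sequence labeling model, dependency parser, language-model check), the function name `acquire_short_sentences` is an assumption, and step S51 (selecting long sentences) is taken to have happened already.

```python
from typing import Callable, List


def acquire_short_sentences(
    sentence: str,
    *,
    max_len: int,
    has_independent: Callable[[str], bool],
    split_independent: Callable[[str], List[str]],
    has_parallel: Callable[[str], bool],
    split_parallel: Callable[[str], List[str]],
    extract_components: Callable[[str], str],
    passes_check: Callable[[str], bool],
) -> List[str]:
    """Route one long sentence through steps S52 to S57 of the workflow."""
    if has_independent(sentence):                   # S52
        clauses = split_independent(sentence)
    elif has_parallel(sentence):                    # S53 -> S54a
        clauses = split_parallel(sentence)
    else:                                           # S53 -> S54b
        clauses = [extract_components(sentence)]
    kept = [c for c in clauses if len(c) <= max_len]  # S55
    return [c for c in kept if passes_check(c)]       # S56, S57


# Toy components, for illustration only.
toy = dict(
    max_len=5,
    has_independent=lambda s: ";" in s,
    split_independent=lambda s: s.split(";"),
    has_parallel=lambda s: False,
    split_parallel=lambda s: [s],
    extract_components=lambda s: s[:5],
    passes_check=lambda c: True,
)
from_split = acquire_short_sentences("aaa;bbbbbbbb", **toy)
from_extract = acquire_short_sentences("cccccccccc", **toy)
```

Passing the components in as callables keeps the control flow separate from the models, so each stage can be replaced or tested on its own.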

Based on the same inventive concept, the present disclosure provides a schematic diagram of a phrase corpus acquiring device.

Fig. 6 is a schematic diagram illustrating a phrase corpus acquiring apparatus according to an exemplary embodiment. As shown in fig. 6, the phrase corpus acquiring apparatus 100 includes the following modules.

The obtaining module 110 is configured to obtain a long sentence corpus to be processed, and when the number of words in the sub-sentence corpus is less than or equal to a preset sentence length threshold, keep the sub-sentence corpus as a short sentence corpus.

The splitting module 120 is configured to split the long sentence corpus to be processed to obtain at least one sub-sentence corpus.

The comparing module 130 is configured to compare the word number of the clause corpus with a preset sentence length threshold.

In an embodiment, the splitting module 120 splits the long sentence corpus to be processed to obtain at least one clause corpus as follows. A sequence labeling model determines whether the long sentence corpus to be processed has independent clauses. If it does, the long sentence corpus to be processed is split at punctuation marks to obtain the clause corpus.

In another embodiment, the splitting module 120 splits the long sentence corpus to be processed to obtain at least one clause corpus as follows. Dependency syntax analysis determines whether parallel clauses exist in the long sentence corpus to be processed. If parallel clauses exist, the long sentence corpus to be processed is split into a plurality of parallel clause corpuses.

In an embodiment, the splitting module 120 determines whether the long sentence corpus to be processed has clauses in a parallel relationship by dependency parsing as follows. The core word of the long sentence corpus to be processed is obtained through dependency syntax analysis. Based on the dependency syntax analysis, whether the long sentence corpus to be processed has parallel clauses is determined according to whether it contains parallel words having a parallel relationship with the core word. The splitting module 120 splits the long sentence corpus to be processed according to the parallel relationship as follows: if the long sentence corpus to be processed has a clause containing a parallel word, the long sentence corpus is split into a clause corpus containing the core word and a clause corpus containing the parallel word.

In an embodiment, the splitting module 120 further splits the long sentence corpus to be processed to obtain at least one clause corpus as follows: if the long sentence corpus to be processed does not have parallel clauses, component extraction is performed on the long sentence corpus to be processed.

In another embodiment, the splitting module 120 performs component extraction on the long sentence corpus to be processed as follows: based on dependency syntax analysis, components are extracted from the long sentence corpus to be processed according to its sentence structure to obtain the clause corpus.

In an embodiment, when the number of words in the clause corpus is greater than the preset sentence length threshold, the splitting module 120 is further configured to determine whether there are parallel clauses in the clause corpus through dependency parsing.

In another embodiment, the obtaining module 110 obtains the long sentence corpus to be processed as follows. A corpus set to be processed, which comprises at least one corpus to be processed, is obtained, and the length of each corpus to be processed is compared with a preset corpus sentence length threshold. If the length of a corpus to be processed is greater than or equal to the preset corpus sentence length threshold, that corpus is obtained as a long sentence corpus to be processed. If the length is smaller than the preset corpus sentence length threshold, the corpus is a short sentence corpus to be processed, and phrase verification is performed on it.

The functions implemented by the modules in the apparatus correspond to the steps in the method described above, and for concrete implementation and technical effects, please refer to the description of the method steps above, which is not described herein again.

As shown in fig. 7, one embodiment of the present invention provides an electronic device 200. The electronic device 200 includes a memory 210, a processor 220, and an Input/Output (I/O) interface 230. The memory 210 is used for storing instructions. The processor 220 is configured to call the instructions stored in the memory 210 to execute the phrase corpus acquiring method according to the embodiment of the present invention. The processor 220 is connected to the memory 210 and the I/O interface 230, respectively, for example, via a bus system and/or other connection mechanism (not shown). The memory 210 may be used to store programs and data, including a program for phrase corpus acquisition according to an embodiment of the present invention, and the processor 220 executes various functional applications and data processing of the electronic device 200 by executing the programs stored in the memory 210.

In an embodiment of the present invention, the processor 220 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA), and the processor 220 may be one or a combination of a Central Processing Unit (CPU) or other Processing units with data Processing capability and/or instruction execution capability.

Memory 210 in embodiments of the present invention may comprise one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile Memory may include, for example, a Random Access Memory (RAM), a cache Memory (cache), and/or the like. The nonvolatile Memory may include, for example, a Read-only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like.

In the embodiment of the present invention, the I/O interface 230 may be used to receive input (e.g., numeric or character information) and to generate key signal inputs related to user settings and function control of the electronic device 200, and may also output various information (e.g., images or sounds) to the outside. The I/O interface 230 may include one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a mouse, a joystick, a trackball, a microphone, a speaker, a touch panel, and the like.

In some embodiments, the invention provides a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, perform any of the methods described above.

Although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.

The methods and apparatus of the present invention can be accomplished with standard programming techniques with rule based logic or other logic to accomplish the various method steps. It should also be noted that the words "means" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving inputs.

Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code, which is executable by a computer processor for performing any or all of the described steps, operations, or procedures.

The foregoing description of the implementation of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.

Claims

1. A phrase corpus acquiring method is characterized by comprising the following steps:

acquiring long sentence linguistic data to be processed;

judging whether the long sentence corpus to be processed has independent clauses or not through a sequence labeling model;

if the long sentence corpus to be processed has independent clauses, splitting the long sentence corpus to be processed according to punctuation to obtain a clause corpus;

if the long sentence corpus to be processed does not have independent clauses, judging whether parallel clauses exist in the long sentence corpus to be processed or not through dependency syntax analysis;

if the long sentence corpus to be processed has parallel clauses, splitting the long sentence corpus to be processed into a plurality of parallel clause corpuses;

comparing the word number of the clause corpus with a preset sentence length threshold;

and if the word number of the clause corpus is smaller than or equal to the preset sentence length threshold, keeping the clause corpus as a short sentence corpus.

2. The method of claim 1,

the judging whether the long sentence corpus to be processed has clauses in parallel relation through dependency syntax analysis comprises the following steps:

obtaining core words of the long sentence corpus to be processed through dependency syntax analysis;

judging whether the long sentence corpus to be processed has parallel clauses or not according to whether the long sentence corpus to be processed contains parallel words having a parallel relation with the core words or not based on the dependency syntax analysis;

the splitting of the long sentence corpus to be processed into a plurality of parallel clause corpuses comprises:

and splitting the long sentence corpus to be processed into a sub-sentence corpus containing the core words and a sub-sentence corpus containing the parallel words.

3. The method of claim 1, further comprising:

and if the long sentence corpus to be processed does not have parallel clauses, extracting the components of the long sentence corpus to be processed.

4. The method according to claim 3, wherein the component extraction of the long sentence corpus to be processed comprises:

and based on the dependency syntax analysis, performing the component extraction on the long sentence corpus to be processed according to the sentence structure of the long sentence corpus to be processed to obtain the sub-sentence corpus.

5. The method of claim 1, further comprising:

if the word number of the clause corpus is greater than the preset sentence length threshold value, then:

and judging whether the clause corpus has parallel clauses or not through dependency syntax analysis.

6. The method of claim 1, further comprising:

and performing phrase check on the short sentence corpus, and reserving the short sentence corpus which passes the phrase check.

7. The method according to claim 6, wherein said phrase checking said corpus of phrases, and retaining said corpus of phrases that pass said phrase checking comprises:

obtaining the confusion degree of the phrase corpus through a language training model;

and comparing the confusion degree with a preset confusion threshold value, and reserving the phrase corpus of which the confusion degree is smaller than the preset confusion threshold value.

8. The method according to claim 6, wherein the obtaining the long sentence corpus to be processed comprises:

obtaining a corpus set to be processed, and comparing the corpus length of the corpus set to be processed with a preset corpus sentence length threshold, wherein the corpus set to be processed comprises at least one sentence of corpus to be processed;

if the length of the linguistic data to be processed is larger than or equal to the preset linguistic data sentence length threshold value, obtaining the linguistic data to be processed, wherein the linguistic data to be processed is long sentence linguistic data to be processed;

and if the length of the linguistic data to be processed is smaller than the preset threshold of the length of the linguistic data sentences, performing phrase verification on the linguistic data to be processed, wherein the linguistic data are short sentence linguistic data to be processed.

9. A phrase corpus acquiring apparatus, comprising:

the acquisition module is used for acquiring long sentence linguistic data to be processed;

splitting module for

the comparison module is used for comparing the word number of the clause corpus with a preset sentence length threshold;

the obtaining module is further configured to, when the number of words of the clause corpus is less than or equal to a preset sentence length threshold, reserve the clause corpus as a short sentence corpus.

10. The apparatus of claim 9,

the splitting module judges whether the long sentence corpus to be processed has clauses in parallel relation or not through dependency syntax analysis in the following mode:

the splitting module splits the long sentence corpus to be processed into a plurality of parallel sub-sentence corpuses in the following way:

11. The apparatus of claim 9, wherein the splitting module is further configured to:

12. The apparatus according to claim 11, wherein the splitting module performs component extraction on the long sentence corpus to be processed by adopting the following method:

13. The apparatus of claim 9, wherein when the word count of the clause corpus is greater than the preset sentence length threshold:

the splitting module is further configured to determine whether parallel clauses exist in the clause corpus through dependency syntax analysis.

14. The apparatus of claim 9, further comprising:

and the checking module is used for carrying out phrase checking on the short sentence linguistic data and reserving the short sentence linguistic data which passes the phrase checking.

15. The apparatus according to claim 14, wherein the checking module performs phrase checking on the corpus of phrases in a manner that retains the corpus of phrases that pass the phrase checking:

16. The apparatus according to claim 14, wherein the obtaining module obtains the long sentence corpus to be processed by:

17. An electronic device, wherein the electronic device comprises:

a memory to store instructions; and

a processor for calling the instructions stored in the memory to execute the phrase corpus acquiring method according to any one of claims 1-8.

18. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, perform the phrase corpus acquisition method of any one of claims 1-8.