CN109800430B - Semantic understanding method and system - Google Patents
- Publication number
- CN109800430B CN201910046962.3A
- Authority
- CN
- China
- Prior art keywords
- regular expression
- corpus
- regular
- sample
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The present invention provides a semantic understanding method and system. The method includes: obtaining corpus samples and building a corpus from them; obtaining feature information from the corpus samples and generating corresponding regular expressions from the feature information; extracting the feature information in the regular expressions to generate multiple different regular-formula combinations; analyzing the presence accuracy of each piece of feature information in each regular-formula combination and filtering the combinations; analyzing, based on the corpus, the regular-formula probability of each regular formula, where the regular formulas are the regular expressions and the filtered regular-formula combinations; generating a probabilistic semantic model from the regular formulas and their probabilities; obtaining a user corpus; generating a corresponding user regular formula from the user corpus; and comparing the user regular formula with the probabilistic semantic model to obtain the user semantics of the user corpus. The invention parses the acquired user corpus based on the probabilistic semantic model to obtain the most likely user semantics.
Description
Technical field
The present invention relates to the technical field of language processing, and in particular to a semantic understanding method and system.
Background
With the rapid development of networks, intelligent information processing is becoming increasingly common. Computers, smart devices, and the like may need to process tens of thousands of messages every day. A smart device generally analyzes a corpus to obtain corresponding regular expressions and uses them to parse the corpus.
However, in corpus processing based on word segmentation, the regular expression built after segmentation may contain multiple parts of speech, and the regular expression corresponding to an acquired user corpus may match several regular expressions, so semantics and entities still cannot be determined accurately. A semantic understanding method and system is therefore needed that parses the user corpus to obtain the most likely corresponding user semantics.
Summary of the invention
The purpose of the present invention is to provide a semantic understanding method and system that parses the acquired user corpus based on a probabilistic semantic model, thereby obtaining the most likely user semantics.
The technical solution provided by the present invention is as follows:
The present invention provides a semantic understanding method, comprising:
obtaining corpus samples, and building a corpus from the corpus samples;
obtaining feature information from the corpus samples, and generating corresponding regular expressions from the feature information;
extracting the feature information in the regular expressions to generate multiple different regular-formula combinations;
analyzing the presence accuracy of each piece of feature information in each regular-formula combination, and filtering the regular-formula combinations;
analyzing, based on the corpus, the regular-formula probability corresponding to each regular formula, the regular formulas being the regular expressions and the filtered regular-formula combinations;
generating a probabilistic semantic model from the regular formulas and the regular-formula probabilities;
obtaining a user corpus;
generating a corresponding user regular formula from the user corpus;
comparing the user regular formula with the probabilistic semantic model to obtain the user semantics of the user corpus.
Further, obtaining feature information from the corpus samples and generating corresponding regular expressions from the feature information specifically includes:
segmenting the corpus samples by a word segmentation technique to obtain sample word segments and the part of speech of each sample word segment;
determining the sample connective words among the sample word segments according to the sample word segments and their parts of speech;
analyzing the sentence structure of the corpus samples to obtain the association relationships between the sample word segments;
generating the corresponding regular expressions from the feature information, the feature information including the sample word segments, their parts of speech, the sample connective words, and the association relationships.
Further, analyzing the presence accuracy of each piece of feature information in each regular-formula combination and filtering the regular-formula combinations specifically includes:
analyzing the presence accuracy of each piece of feature information in each regular-formula combination, and filtering the regular-formula combinations;
if the presence of every piece of feature information in a regular-formula combination is accurate, keeping that regular-formula combination;
if the presence of at least one piece of feature information in a regular-formula combination is inaccurate, deleting that regular-formula combination.
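The keep-or-delete rule above can be sketched in a few lines of Python. The source does not say how the presence accuracy of a piece of feature information is judged, so the predicate `needs_subject` below is a purely hypothetical stand-in (a verb counts as accurately present only when the combination also contains a noun); `pos_of` and `filter_combinations` are likewise illustrative names.

```python
def pos_of(feature):
    """Return the part-of-speech tag of a feature written as 'pos(word)'."""
    return feature.split("(", 1)[0]


def needs_subject(feature, combo):
    # Hypothetical accuracy rule, not taken from the source:
    # a verb is only accepted when the combination also contains a noun.
    if pos_of(feature) == "verb":
        return any(pos_of(other) == "noun" for other in combo)
    return True


def filter_combinations(combos, is_accurate):
    """Keep a combination only if every feature in it passes the check."""
    return [c for c in combos if all(is_accurate(f, c) for f in c)]


combos = [
    ("noun(鲸鱼)", "verb(喷水)"),
    ("pronoun(会)", "verb(喷水)"),
]
kept = filter_combinations(combos, needs_subject)
print(kept)
```

A combination is dropped as soon as one feature fails the check, which mirrors the "at least one inaccurate" condition; any real accuracy test would replace `needs_subject`.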
Further, analyzing, based on the corpus, the regular-formula probability corresponding to each regular formula, the regular formulas being the regular expressions and the filtered regular-formula combinations, specifically includes:
determining the sample keywords of each regular formula according to the feature information in that regular formula;
analyzing the sample keyword probability with which each sample keyword appears in the corpus;
determining the regular-formula probability of the corresponding regular formula according to the sample keyword probabilities.
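One way to read these three steps is shown in the sketch below. The source does not define either probability, so two assumptions are made and should be treated as such: a keyword's probability is taken as the fraction of corpus samples that contain it, and a regular formula's probability is taken as the product of its keyword probabilities. The corpus sentences are invented illustration data.

```python
from math import prod


def keyword_probability(keyword, corpus):
    """Assumed definition: share of corpus samples containing the keyword."""
    if not corpus:
        return 0.0
    return sum(keyword in sample for sample in corpus) / len(corpus)


def formula_probability(keywords, corpus):
    """Assumed definition: product of the keyword probabilities."""
    return prod(keyword_probability(k, corpus) for k in keywords)


# Invented toy corpus, for illustration only.
corpus = ["鲸鱼会喷水", "鲸鱼吃磷虾", "喷泉会喷水", "狗会叫"]
p = formula_probability(["鲸鱼", "喷水"], corpus)
print(p)
```

Other aggregations (a mean, a minimum, or a count of whole-formula matches) would fit the text equally well; the product is used here only because it is simple.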
Further, comparing the user regular formula with the probabilistic semantic model to obtain the user semantics of the user corpus specifically includes:
comparing the user regular formula with the probabilistic semantic model;
if the user regular formula matches multiple regular formulas in the probabilistic semantic model, determining a target regular formula according to the regular-formula probabilities of the matching regular formulas;
parsing the user corpus according to the target regular formula to obtain the user semantics.
The present invention also provides a semantic understanding system, comprising:
a corpus building module, which obtains corpus samples and builds a corpus from the corpus samples;
an expression generation module, which obtains feature information from the corpus samples obtained by the corpus building module and generates corresponding regular expressions from the feature information;
an expression combination module, which extracts the feature information in the regular expressions generated by the expression generation module to generate multiple different regular-formula combinations;
a filtering module, which analyzes the presence accuracy of each piece of feature information in each regular-formula combination generated by the expression combination module and filters the regular-formula combinations;
a probability analysis module, which analyzes, based on the corpus built by the corpus building module, the regular-formula probability corresponding to each regular formula, the regular formulas being the regular expressions and the filtered regular-formula combinations;
a model generation module, which generates a probabilistic semantic model from the regular formulas obtained by the expression generation module and the filtering module and from the regular-formula probabilities obtained by the probability analysis module;
a corpus acquisition module, which obtains a user corpus;
a processing module, which generates a corresponding user regular formula from the user corpus obtained by the corpus acquisition module;
a comparison module, which compares the user regular formula obtained by the processing module with the probabilistic semantic model generated by the model generation module to obtain the user semantics of the user corpus.
Further, the expression generation module specifically includes:
a word segmentation unit, which segments the corpus samples obtained by the corpus building module by a word segmentation technique to obtain sample word segments and the part of speech of each sample word segment;
a connective word determination unit, which determines the sample connective words among the sample word segments according to the sample word segments and parts of speech obtained by the word segmentation unit;
a relationship analysis unit, which analyzes the sentence structure of the corpus samples obtained by the corpus building module to obtain the association relationships between the sample word segments obtained by the word segmentation unit;
an expression generation unit, which generates the corresponding regular expressions from the feature information, the feature information including the sample word segments and parts of speech obtained by the word segmentation unit, the sample connective words obtained by the connective word determination unit, and the association relationships determined by the relationship analysis unit.
Further, the filtering module specifically includes:
an accuracy analysis unit, which analyzes the presence accuracy of each piece of feature information in each regular-formula combination generated by the expression combination module so that the regular-formula combinations can be filtered;
a filtering unit, which keeps a regular-formula combination if the accuracy analysis unit finds that the presence of every piece of feature information in it is accurate;
the filtering unit also deletes a regular-formula combination if the accuracy analysis unit finds that the presence of at least one piece of feature information in it is inaccurate.
Further, the probability analysis module specifically includes:
a keyword analysis unit, which determines the sample keywords of each regular formula according to the feature information in that regular formula;
a probability analysis unit, which analyzes the sample keyword probability with which each sample keyword determined by the keyword analysis unit appears in the corpus built by the corpus building module;
a probability determination unit, which determines the regular-formula probability of the corresponding regular formula according to the sample keyword probabilities obtained by the probability analysis unit.
Further, the comparison module specifically includes:
a comparison unit, which compares the user regular formula with the probabilistic semantic model;
a processing unit, which, if the comparison unit finds that the user regular formula matches multiple regular formulas in the probabilistic semantic model, determines a target regular formula according to the regular-formula probabilities of the matching regular formulas;
a parsing unit, which parses the user corpus according to the target regular formula obtained by the processing unit to obtain the user semantics.
The semantic understanding method and system provided by the present invention can bring at least one of the following beneficial effects:
1. In the present invention, corresponding regular expressions are generated from the corpus samples, the feature information in the regular expressions is then extracted to obtain regular-formula combinations, and the regular-formula probabilities of the regular expressions and the regular-formula combinations are analyzed against the corpus, so that the acquired user corpus is parsed into its most likely user semantics.
2. In the present invention, the corpus samples are segmented by a word segmentation technique and their sentence structure is analyzed to generate the corresponding regular expressions, which makes it easier to derive the corresponding regular-formula combinations and analyze the regular-formula probabilities afterwards.
3. In the present invention, the feature information in each regular-formula combination generated by extracting feature information from the regular expressions is analyzed to judge whether the combination is logical and has actual meaning, so that the regular-formula combinations can be filtered.
Description of the drawings
The above characteristics, technical features, advantages, and implementations of the semantic understanding method and system are further described below in a clear and understandable manner through preferred embodiments, with reference to the accompanying drawings.
Fig. 1 is a flowchart of one embodiment of the semantic understanding method of the present invention;
Fig. 2 is a flowchart of another embodiment of the semantic understanding method of the present invention;
Fig. 3 is a flowchart of another embodiment of the semantic understanding method of the present invention;
Fig. 4 is a flowchart of another embodiment of the semantic understanding method of the present invention;
Fig. 5 is a flowchart of another embodiment of the semantic understanding method of the present invention;
Fig. 6 is a schematic structural diagram of one embodiment of the semantic understanding system of the present invention;
Fig. 7 is a schematic structural diagram of another embodiment of the semantic understanding system of the present invention.
Explanation of reference numerals:
100 semantic understanding system
110 corpus building module
120 expression generation module; 121 word segmentation unit; 122 connective word determination unit; 123 relationship analysis unit; 124 expression generation unit
130 expression combination module
140 filtering module; 141 accuracy analysis unit; 142 filtering unit
150 probability analysis module; 151 keyword analysis unit; 152 probability analysis unit; 153 probability determination unit
160 model generation module
170 corpus acquisition module
180 processing module
190 comparison module; 191 comparison unit; 192 processing unit; 193 parsing unit
Detailed description
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, specific embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings and other embodiments from them without creative effort.
To keep the drawings concise, each drawing shows only schematically the parts relevant to the present invention; they do not represent the actual structure of a product. In addition, to keep the drawings concise and easy to understand, where several components in a drawing have the same structure or function, only one of them is drawn schematically or only one is labeled. Herein, "one" means not only "only one" but may also mean "more than one".
In one embodiment of the present invention, as shown in Fig. 1, a semantic understanding method includes:
S100 Obtain corpus samples, and build a corpus from the corpus samples.
Specifically, a large number of corpus samples are obtained and a corpus is built from them. A corpus sample may be standard written language or user speech, audio, and the like, because voice input and text input are both mainstream modes of human-computer interaction.
In addition, because the entire analysis operates on written text, any collected speech files such as user speech or audio must first be converted into recognized text, which is then processed accordingly.
S200 Obtain feature information from the corpus samples, and generate corresponding regular expressions from the feature information.
Specifically, the parts of speech of the word segments and the sentence structure in each sentence of a corpus sample are analyzed to obtain the feature information contained in the corpus sample, and the regular expression corresponding to that corpus sample is then generated from the feature information.
S300 Extract the feature information in the regular expressions to generate multiple different regular-formula combinations.
Specifically, extracting the feature information in a regular expression to generate multiple different regular-formula combinations amounts to permuting and combining the feature information contained in the regular expression to obtain a number of regular-formula combinations. For example, first select any two pieces of feature information to form a combination, then select any three pieces, and so on, until all regular-formula combinations have been obtained.
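The enumeration described for S300 (all choices of two features, then three, and so on) can be sketched with `itertools.combinations`. This is a minimal illustration that assumes the original order of the features is preserved within each combination; the text also mentions permutation, which this sketch does not cover. The function name `formula_combinations` and the `pos(word)` feature notation are illustrative.

```python
from itertools import combinations


def formula_combinations(features):
    """Yield every combination of two or more features, smallest size first."""
    for size in range(2, len(features) + 1):
        for combo in combinations(features, size):
            yield combo


features = ["noun(鲸鱼)", "pronoun(会)", "verb(喷水)"]
combos = list(formula_combinations(features))
for combo in combos:
    print("#".join(combo))
```

For n features this produces 2^n - n - 1 combinations, so the filtering in S400 matters once sentences grow beyond a handful of segments.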
S400 Analyze the presence accuracy of each piece of feature information in each regular-formula combination, and filter the regular-formula combinations.
Specifically, some of the regular-formula combinations obtained by the above permutation and combination may have an unreasonable sentence-constituent structure. The presence accuracy of each piece of feature information in each combination is therefore analyzed so that the combinations can be filtered.
S500 Analyze, based on the corpus, the regular-formula probability corresponding to each regular formula, the regular formulas being the regular expressions and the filtered regular-formula combinations.
Specifically, the regular expressions obtained directly from the corpus samples, together with the regular-formula combinations that remain after filtering, are the regular formulas that conform to sentence-structure logic. Based on all corpus samples in the corpus, the probability with which each regular formula occurs in the corpus, that is, its regular-formula probability, is analyzed.
S600 Generate a probabilistic semantic model from the regular formulas and the regular-formula probabilities.
Specifically, a probabilistic semantic model is generated from the regular formulas and their regular-formula probabilities; the model stores the mapping between each regular formula and its probability.
S700 Obtain a user corpus.
Specifically, a user corpus is obtained. When a smart device acquires a user corpus, voice input and text input are both mainstream interaction modes; whatever the form of the acquired user corpus, the system ultimately processes text, so a user corpus acquired as speech must first be converted to text.
S800 Generate a corresponding user regular formula from the user corpus.
Specifically, the corresponding user regular formula is generated from the parts of speech of the word segments and the sentence structure in the sentences of the acquired user corpus.
S900 Compare the user regular formula with the probabilistic semantic model to obtain the user semantics of the user corpus.
Specifically, the user regular formula obtained above is compared one by one with the regular formulas in the probabilistic semantic model; among the matching regular formulas, the one with the highest regular-formula probability is selected and used to parse the user corpus, yielding the corresponding user semantics.
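The comparison in S900 can be sketched as a lookup over a mapping from regular formula to probability. The source does not define what counts as a match, so the sketch assumes that a model formula matches when its features appear, in order, within the user regular formula; `matches`, `best_match`, the `#`-separated notation, and the probabilities are all illustrative.

```python
def features_of(formula):
    """Split a '#'-separated regular formula into its features."""
    return formula.split("#")


def matches(user_formula, model_formula):
    # Assumed matching rule: the model formula's features occur in the
    # user formula in the same order (a subsequence test).
    remaining = iter(features_of(user_formula))
    return all(feature in remaining for feature in features_of(model_formula))


def best_match(user_formula, model):
    """Return the matching formula with the highest probability, or None."""
    candidates = [f for f in model if matches(user_formula, f)]
    if not candidates:
        return None
    return max(candidates, key=model.get)


# Toy probabilistic semantic model: regular formula -> probability (made up).
model = {
    "noun(鲸鱼)#verb(喷水)": 0.4,
    "noun(鲸鱼)#pronoun(会)#verb(喷水)": 0.1,
    "noun(狗)#verb(叫)": 0.5,
}
target = best_match("noun(鲸鱼)#pronoun(会)#verb(喷水)", model)
print(target)
```

Two model formulas match the user formula here, and the probability breaks the tie, which is the situation the method is meant to resolve.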
In this embodiment, corresponding regular expressions are generated from the corpus samples, the feature information in the regular expressions is then extracted to obtain regular-formula combinations, and the regular-formula probabilities of the regular expressions and the regular-formula combinations are analyzed against the corpus, so that the acquired user corpus is parsed into its most likely user semantics.
Another embodiment of the present invention is an optimized version of the above embodiment and, as shown in Fig. 2, includes:
S100 Obtain corpus samples, and build a corpus from the corpus samples.
S200 Obtain feature information from the corpus samples, and generate corresponding regular expressions from the feature information.
Step S200, obtaining feature information from the corpus samples and generating corresponding regular expressions from the feature information, specifically includes:
S210 Segment the corpus samples by a word segmentation technique to obtain sample word segments and the part of speech of each sample word segment.
Specifically, the corpus samples are segmented by a word segmentation technique: the part of speech of the words in each sentence of a corpus sample is identified, and each sentence is then divided, according to those parts of speech, into segments such as characters, words, and phrases. This yields the sample word segments contained in the corpus sample and their parts of speech.
For example, for the corpus sample "鲸鱼会喷水" ("whales can spout water"), segmentation yields the sample word segments "鲸鱼" (whale), "会" (can), and "喷水" (spout water); the part of speech of "鲸鱼" is noun, that of "会" is pronoun, and that of "喷水" is verb.
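The source does not name a particular word segmentation technique. As a stand-in, the sketch below uses forward maximum matching over a toy lexicon, which is a simple dictionary-based approach and not necessarily what the patent intends. The lexicon `LEXICON` and its tags are hypothetical and follow the tags used in the regular expression given for this example.

```python
# Hypothetical toy lexicon: word -> part-of-speech tag.
LEXICON = {"鲸鱼": "noun", "会": "pronoun", "喷水": "verb"}


def segment(sentence, lexicon=LEXICON):
    """Forward maximum matching: at each position take the longest known word."""
    max_len = max(map(len, lexicon))
    result = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if word in lexicon:
                result.append((word, lexicon[word]))
                break
        else:
            # No lexicon entry starts here: emit a single unknown character.
            size = 1
            result.append((sentence[i], "unknown"))
        i += size
    return result


tagged = segment("鲸鱼会喷水")
print(tagged)
```

A production system would more likely rely on a trained segmenter and tagger; the point here is only the shape of the output, a list of (segment, part of speech) pairs.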
S220 Determine the sample connective words among the sample word segments according to the sample word segments and their parts of speech.
Specifically, the sample connective words among the sample word segments are determined according to the sample word segments and their parts of speech, for example "和" (and), "或" (or), "不但" (not only), and "而且" (but also). These connective words help determine the relationships between the sample word segments.
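As a minimal sketch of S220, connective words can be picked out by checking both the segment itself and its part-of-speech tag. The word list comes from the examples in the text; the tag name `conjunction` and the function `sample_connectives` are assumptions made for illustration.

```python
# Connective words listed as examples in the text.
CONNECTIVES = {"和", "或", "不但", "而且"}


def sample_connectives(tagged_segments):
    """Return segments that are known connectives and tagged as conjunctions."""
    return [
        word
        for word, pos in tagged_segments
        if word in CONNECTIVES and pos == "conjunction"  # assumed tag name
    ]


tagged = [("鲸鱼", "noun"), ("和", "conjunction"), ("海豚", "noun")]
found = sample_connectives(tagged)
print(found)
```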
S230 Analyze the sentence structure of the corpus samples to obtain the association relationships between the sample word segments.
Specifically, after the sample word segments and their parts of speech have been obtained by word segmentation as described above, the association relationships between the sample word segments in a corpus sample are analyzed according to the sentence structure of that corpus sample.
For example, for the corpus sample "鲸鱼会喷水" ("whales can spout water"), segmentation yields the sample word segments "鲸鱼" (whale), "会" (can), and "喷水" (spout water); the part of speech of "鲸鱼" is noun, that of "会" is pronoun, and that of "喷水" is verb. Analyzing the sentence structure of the corpus sample shows that the noun "鲸鱼" and the verb "喷水" are in a subject-predicate relationship.
S240 Generate the corresponding regular expressions from the feature information, the feature information including the sample word segments, their parts of speech, the sample connective words, and the association relationships.
Specifically, the corresponding regular expression is generated from the sample word segments, their parts of speech, the sample connective words, and the association relationships. For example, for the corpus sample "鲸鱼会喷水" ("whales can spout water"), segmentation yields the sample word segments "鲸鱼" (whale), "会" (can), and "喷水" (spout water); the part of speech of "鲸鱼" is noun, that of "会" is pronoun, and that of "喷水" is verb. Analyzing the sentence structure of the corpus sample shows that the noun "鲸鱼" and the verb "喷水" are in a subject-predicate relationship, and the resulting regular expression is: noun(鲸鱼)#pronoun(会)#verb(喷水).
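The regular expression in this example has a simple shape: each segment is written as its part of speech followed by the segment in parentheses, and segments are joined with `#`. A short sketch of that serialization follows, with English tag names used for readability; `build_regular_expression` is an illustrative name, and how connective words and association relationships would be encoded is not shown in the source, so they are left out.

```python
def build_regular_expression(tagged_segments):
    """Serialize (segment, part of speech) pairs as 'pos(segment)' joined by '#'."""
    return "#".join(f"{pos}({word})" for word, pos in tagged_segments)


tagged = [("鲸鱼", "noun"), ("会", "pronoun"), ("喷水", "verb")]
expression = build_regular_expression(tagged)
print(expression)
```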
S300 Extract the feature information from each regular expression to generate multiple different regular expression combinations.
S400 Analyze whether the presence of each piece of feature information in each regular expression combination is accurate, and screen the regular expression combinations accordingly.
S500 Analyze, based on the corpus, the probability of each regular expression, where the regular expressions comprise the original regular expressions and the screened regular expression combinations.
S600 Generate a probabilistic semantic model from the regular expressions and the regular expression probabilities.
S700 Acquire a user corpus.
S800 Generate a corresponding user regular expression from the user corpus.
S900 Compare the user regular expression with the probabilistic semantic model to obtain the user semantics of the user corpus.
In this embodiment, the corpus samples are segmented using word segmentation and their sentence structure is analyzed to generate the corresponding regular expressions, which makes it easier to subsequently derive the regular expression combinations and analyze the regular expression probabilities.
Another embodiment of the present invention is an optimized embodiment of the above embodiment and, as shown in Figure 3, includes:
S100 Acquire corpus samples and build a corpus from the corpus samples.
S200 Acquire feature information from the corpus samples and generate corresponding regular expressions from the feature information.
S300 Extract the feature information from each regular expression to generate multiple different regular expression combinations.
S400 Analyze whether the presence of each piece of feature information in each regular expression combination is accurate, and screen the regular expression combinations accordingly.
Step S400 specifically includes:
S410 Analyze whether the presence of each piece of feature information in each regular expression combination is accurate, and screen the regular expression combinations accordingly.
Specifically, extracting the feature information from a regular expression produces several different regular expression combinations, some of which may have an unreasonable sentence-constituent structure. For example, for the corpus sample "鲸鱼会喷水", the corresponding regular expression is noun(鲸鱼)#pronoun(会)#verb(喷水), and some of the regular expression combinations generated by extracting its feature information are: noun(鲸鱼)#pronoun(会), pronoun(会)#verb(喷水), and noun(鲸鱼)#verb(喷水). Of these, the combination noun(鲸鱼)#pronoun(会) is structurally incomplete: the pairing of the noun (鲸鱼) with the pronoun (会) is not accurate, and the combination does not carry complete semantics.
Therefore, the presence of each piece of feature information in each regular expression combination must be checked for accuracy, and the regular expression combinations screened accordingly.
S420 If the presence of every piece of feature information in a regular expression combination is accurate, keep that regular expression combination.
Specifically, if the presence of every piece of feature information in a regular expression combination is accurate, the combination carries real semantics, so it is kept.
S430 If the presence of at least one piece of feature information in a regular expression combination is inaccurate, delete that regular expression combination.
Specifically, if the presence of at least one piece of feature information in a regular expression combination is inaccurate, for example in a combination generated by extracting a noun, a pronoun, and subject-predicate relationship feature information where the subject-predicate relationship is inaccurate, the regular expression combination is deleted.
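The keep-or-delete rule of S420 and S430 can be sketched as a filter over combinations. The text does not say how the accuracy of a piece of feature information is decided, so the check below (`toy_check`) is purely a hypothetical stand-in chosen to reproduce the whale example; the names `screen`, `Feature`, and `Combination` are likewise assumptions.

```python
from typing import Callable, List, Tuple

Feature = Tuple[str, str]            # (part of speech, word)
Combination = Tuple[Feature, ...]

def screen(combos: List[Combination],
           is_accurate: Callable[[Feature, Combination], bool]) -> List[Combination]:
    """Keep a combination only if every feature passes the check (S420);
    a single failing feature removes the whole combination (S430)."""
    return [c for c in combos if all(is_accurate(f, c) for f in c)]

def toy_check(feature: Feature, combo: Combination) -> bool:
    # Hypothetical rule: the word tagged "pronoun" in the example (会)
    # is only accurate when the combination also contains a verb.
    if feature[0] == "pronoun":
        return any(pos == "verb" for pos, _ in combo)
    return True

whale, can, spout = ("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")
combos = [(whale, can), (can, spout), (whale, spout)]
kept = screen(combos, toy_check)
print(kept)
```

Passing the check as a function keeps the all-or-nothing screening logic separate from whatever linguistic test is actually used.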
S500 Analyze, based on the corpus, the probability of each regular expression, where the regular expressions comprise the original regular expressions and the screened regular expression combinations.
S600 Generate a probabilistic semantic model from the regular expressions and the regular expression probabilities.
S700 Acquire a user corpus.
S800 Generate a corresponding user regular expression from the user corpus.
S900 Compare the user regular expression with the probabilistic semantic model to obtain the user semantics of the user corpus.
In this embodiment, the feature information in each regular expression combination generated from a regular expression is analyzed to judge whether the combination is logical and carries actual semantics, and the regular expression combinations are screened on that basis.
Another embodiment of the present invention is an optimized embodiment of the above embodiment and, as shown in Figure 4, includes:
S100 Acquire corpus samples and build a corpus from the corpus samples.
S200 Acquire feature information from the corpus samples and generate corresponding regular expressions from the feature information.
S300 Extract the feature information from each regular expression to generate multiple different regular expression combinations.
S400 Analyze whether the presence of each piece of feature information in each regular expression combination is accurate, and screen the regular expression combinations accordingly.
S500 Analyze, based on the corpus, the probability of each regular expression, where the regular expressions comprise the original regular expressions and the screened regular expression combinations.
Step S500 specifically includes:
S510 Determine the sample keywords of each regular expression according to the feature information in that regular expression.
Specifically, sample keywords are determined from the feature information in each regular expression. For example, they can be determined from the association relationships, by taking the sample word segment that is the subject of a subject-predicate relationship as the sample keyword; or from the parts of speech, by taking the sample word segments tagged as verbs or nouns in the regular expression as sample keywords. The rule for determining sample keywords is set by the user according to actual needs, and a regular expression may have one sample keyword or several.
For example, for the corpus sample "鲸鱼会喷水", the resulting regular expression is: noun(鲸鱼)#pronoun(会)#verb(喷水). If the subject of the subject-predicate relationship is taken as the sample keyword, the sample keyword is "鲸鱼"; if the verbs and nouns are taken as sample keywords, the sample keywords are "鲸鱼" and "喷水". The number of sample keywords is therefore not limited to one.
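The two keyword rules in the example can be sketched as follows. The data layout (a token list plus a small dictionary of relations) and the function names are assumptions made for illustration; the text leaves the rule itself to the user.

```python
from typing import Dict, List, Sequence, Tuple

Feature = Tuple[str, str]  # (part of speech, word)

def keywords_by_subject(relations: Dict[str, str]) -> List[str]:
    """Rule 1: the subject of the subject-predicate relationship is the keyword."""
    return [relations["subject"]] if "subject" in relations else []

def keywords_by_pos(tokens: List[Feature],
                    wanted: Sequence[str] = ("noun", "verb")) -> List[str]:
    """Rule 2: every word tagged with one of the wanted parts of speech is a keyword."""
    return [word for pos, word in tokens if pos in wanted]

tokens = [("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")]
relations = {"subject": "鲸鱼", "predicate": "喷水"}

subject_keywords = keywords_by_subject(relations)
pos_keywords = keywords_by_pos(tokens)
print(subject_keywords, pos_keywords)
```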
S520 Analyze the sample keyword probability with which each sample keyword appears in the corpus.
Specifically, once the sample keywords of each regular expression have been determined, the sample keyword probability is computed as the proportion of corpus samples in the corpus that contain the sample keyword. For example, for the corpus sample "鲸鱼会喷水", the resulting regular expression is noun(鲸鱼)#pronoun(会)#verb(喷水) and the sample keywords are "鲸鱼" and "喷水"; the probability of each is then computed over all corpus samples. If the corpus contains 100 corpus samples, of which 20 contain "鲸鱼" and 10 contain "喷水", the sample keyword probability of "鲸鱼" is 0.2 and that of "喷水" is 0.1.
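The counting in this example can be sketched as follows. The corpus below is synthetic and built only to reproduce the 20-in-100 and 10-in-100 figures; testing for a keyword with a plain substring check is an assumption (a match on segmented words would be closer to the method described), as is the function name.

```python
from typing import Dict, Iterable, List

def keyword_probabilities(corpus: List[str], keywords: Iterable[str]) -> Dict[str, float]:
    """Fraction of corpus samples containing each keyword (substring test)."""
    total = len(corpus)
    return {
        k: (sum(k in sample for sample in corpus) / total if total else 0.0)
        for k in keywords
    }

# Synthetic corpus of 100 samples: 20 contain "鲸鱼", 10 contain "喷水".
corpus = ["鲸鱼会喷水"] * 10 + ["鲸鱼很大"] * 10 + ["今天天气很好"] * 80
probs = keyword_probabilities(corpus, ["鲸鱼", "喷水"])
print(probs)
```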
S530 Determine the probability of each regular expression according to its sample keyword probabilities.
Specifically, the probability of a regular expression is determined from its sample keyword probabilities. If the regular expression has only one sample keyword, that keyword's sample keyword probability is the probability of the regular expression. If the regular expression has multiple sample keywords, the highest of their sample keyword probabilities is taken as the probability of the regular expression.
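Both cases of S530 reduce to taking a maximum, as the short sketch below shows. The function name and the choice to raise an error for an expression with no keywords are assumptions; the text does not cover that case.

```python
from typing import Dict

def expression_probability(keyword_probs: Dict[str, float]) -> float:
    """Probability of a regular expression: the highest keyword probability.
    With a single keyword this is simply that keyword's probability."""
    if not keyword_probs:
        raise ValueError("a regular expression needs at least one sample keyword")
    return max(keyword_probs.values())

single = expression_probability({"鲸鱼": 0.2})
multiple = expression_probability({"鲸鱼": 0.2, "喷水": 0.1})
print(single, multiple)
```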
S600 Generate a probabilistic semantic model from the regular expressions and the regular expression probabilities.
S700 Acquire a user corpus.
S800 Generate a corresponding user regular expression from the user corpus.
S900 Compare the user regular expression with the probabilistic semantic model to obtain the user semantics of the user corpus.
In this embodiment, the sample keywords of each regular expression are determined by preset rules, and the sample keyword probabilities are then analyzed against the corpus to obtain the probability of the corresponding regular expression.
Another embodiment of the present invention is an optimized embodiment of the above embodiment and, as shown in Figure 5, includes:
S100 Acquire corpus samples and build a corpus from the corpus samples.
S200 Acquire feature information from the corpus samples and generate corresponding regular expressions from the feature information.
S300 Extract the feature information from each regular expression to generate multiple different regular expression combinations.
S400 Analyze whether the presence of each piece of feature information in each regular expression combination is accurate, and screen the regular expression combinations accordingly.
S500 Analyze, based on the corpus, the probability of each regular expression, where the regular expressions comprise the original regular expressions and the screened regular expression combinations.
S600 Generate a probabilistic semantic model from the regular expressions and the regular expression probabilities.
S700 Acquire a user corpus.
S800 Generate a corresponding user regular expression from the user corpus.
S900 Compare the user regular expression with the probabilistic semantic model to obtain the user semantics of the user corpus.
Step S900 specifically includes:
S910 Compare the user regular expression with the probabilistic semantic model.
Specifically, the user regular expression obtained from the user corpus is compared and matched against the regular expressions in the probabilistic semantic model one by one.
S920 If the user regular expression matches multiple regular expressions in the probabilistic semantic model, determine a target regular expression according to the probabilities of the matching regular expressions.
S930 Parse the user corpus according to the target regular expression to obtain the user semantics.
Specifically, because the user regular expression contains multiple pieces of feature information, some of them may match regular expression 1 in the probabilistic semantic model while others match regular expression 2, so that the user regular expression matches several regular expressions in the model. In that case the probabilities of the matching regular expressions are compared to determine the target regular expression; for instance, the probabilities of regular expression 1 and regular expression 2 above are compared. The resulting target regular expression is the one most likely to parse the user corpus correctly, and the user corpus is parsed according to it to obtain the corresponding user semantics.
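The selection in S910 to S930 can be sketched as a lookup over the model. Several things here are assumptions made for illustration: the model is represented as a dictionary from a tuple of features to a probability, a model entry "matches" when all of its features occur in the user regular expression, and the model contents and probabilities are invented.

```python
from typing import Dict, List, Optional, Tuple

Feature = Tuple[str, str]                 # (part of speech, word)
Expression = Tuple[Feature, ...]

def select_target(user_features: List[Feature],
                  model: Dict[Expression, float]) -> Optional[Expression]:
    """Return the matching model expression with the highest probability,
    or None when nothing in the model matches."""
    user = set(user_features)
    matches = [(expr, p) for expr, p in model.items() if set(expr) <= user]
    if not matches:
        return None
    return max(matches, key=lambda m: m[1])[0]

whale, can, spout = ("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")
model = {
    (whale, spout): 0.2,                            # plays the role of regular expression 1
    (can, spout): 0.1,                              # plays the role of regular expression 2
    (("noun", "海豚"), ("verb", "跳跃")): 0.3,        # does not match the user input
}
target = select_target([whale, can, spout], model)
print(target)
```

When two matching entries share the highest probability, `max` returns the first one encountered; the text does not specify a tie-breaking rule.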
In this embodiment, the corresponding user regular expression is obtained from the user corpus and compared with the probabilistic semantic model to obtain the target regular expression, and the user corpus is parsed with that target regular expression to obtain the most likely user semantics.
In one embodiment of the present invention, as shown in Figure 6, a semantic understanding system 100 includes:
A corpus building module 110, which acquires corpus samples and builds a corpus from the corpus samples.
Specifically, the corpus building module 110 acquires a large number of corpus samples and builds a corpus from them. The corpus samples may be standard written language or user speech, audio, and the like, since both voice input and text input are mainstream modes of human-computer interaction.
In addition, because the whole analysis operates on written text, any collected speech files such as user speech or audio must first be converted into recognized text, which is then processed accordingly.
An expression generation module 120, which acquires feature information from the corpus samples acquired by the corpus building module 110 and generates corresponding regular expressions from the feature information.
Specifically, the expression generation module 120 analyzes the parts of speech and the sentence structure of the sentences in a corpus sample to obtain the feature information the sample contains, and then generates the regular expression corresponding to that corpus sample from the feature information.
An expression combination module 130, which extracts the feature information from each regular expression generated by the expression generation module 120 to generate multiple different regular expression combinations.
Specifically, the expression combination module 130 extracts the feature information from each regular expression to generate multiple different regular expression combinations, which amounts to permuting and combining the feature information contained in the regular expression. For example, any two pieces of feature information are first selected and combined into a regular expression combination, then any three, and so on until all regular expression combinations have been obtained.
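The enumeration described here (first pairs, then triples, and so on) can be sketched with `itertools.combinations`. Treating the selection as order-preserving combinations rather than full permutations is an assumption, as are the function name and the minimum size of two.

```python
from itertools import combinations
from typing import List, Tuple

Feature = Tuple[str, str]  # (part of speech, word)

def all_combinations(features: List[Feature],
                     min_size: int = 2) -> List[Tuple[Feature, ...]]:
    """Enumerate every selection of min_size..len(features) features,
    keeping the features in their original order."""
    result: List[Tuple[Feature, ...]] = []
    for size in range(min_size, len(features) + 1):
        result.extend(combinations(features, size))
    return result

features = [("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")]
combos = all_combinations(features)
for combo in combos:
    print("#".join(f"{pos}({word})" for pos, word in combo))
```

For three pieces of feature information this yields the three pairs listed in the whale example plus the full three-element selection, which coincides with the original regular expression.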
A screening module 140, which analyzes whether the presence of each piece of feature information in each regular expression combination generated by the expression combination module 130 is accurate, and screens the regular expression combinations accordingly.
Specifically, the regular expression combinations obtained by the permutation and combination above may include some with an unreasonable sentence-constituent structure, so the screening module 140 analyzes whether the presence of each piece of feature information in each regular expression combination is accurate and screens the combinations on that basis.
A probability analysis module 150, which analyzes the probability of each regular expression based on the corpus built by the corpus building module 110, where the regular expressions comprise the original regular expressions and the screened regular expression combinations.
Specifically, the regular expressions obtained directly from the user corpus, together with the regular expression combinations that remain after screening, are the regular expressions that conform to sentence-structure logic. The probability analysis module 150 uses all corpus samples in the corpus to analyze the probability with which each of these regular expressions appears in the corpus.
A model generation module 160, which generates a probabilistic semantic model from the regular expressions obtained by the expression generation module 120 and the screening module 140 and the regular expression probabilities obtained by the probability analysis module 150.
Specifically, the model generation module 160 generates the probabilistic semantic model from the regular expressions and their corresponding probabilities, establishing in the model a mapping between each regular expression and its probability.
A corpus acquisition module 170, which acquires a user corpus.
Specifically, the corpus acquisition module 170 acquires the user corpus. When a smart device acquires the user corpus, voice input and text input are both mainstream modes of interaction; whatever form the acquired user corpus takes, however, the system ultimately processes text, so a user corpus acquired as speech must first be converted into text.
A processing module 180, which generates a corresponding user regular expression from the user corpus acquired by the corpus acquisition module 170.
Specifically, the processing module 180 generates the corresponding user regular expression from the parts of speech and the sentence structure of the sentences in the acquired user corpus.
A comparison module 190, which compares the user regular expression obtained by the processing module 180 with the probabilistic semantic model generated by the model generation module 160 to obtain the user semantics of the user corpus.
Specifically, the comparison module 190 compares the user regular expression with the regular expressions in the probabilistic semantic model one by one, selects the matching regular expression with the highest probability, and parses the user corpus with it to obtain the corresponding user semantics.
In this embodiment, corresponding regular expressions are generated from the corpus samples, the feature information in the regular expressions is extracted to obtain regular expression combinations, and the probabilities of the regular expressions and the regular expression combinations are analyzed against the corpus, so that the acquired user corpus can be parsed into its most likely user semantics.
Another embodiment of the present invention is an optimized embodiment of the above embodiment and, as shown in Figure 7, includes:
A corpus building module 110, which acquires corpus samples and builds a corpus from the corpus samples.
An expression generation module 120, which acquires feature information from the corpus samples acquired by the corpus building module 110 and generates corresponding regular expressions from the feature information.
The expression generation module 120 specifically includes:
A word segmentation unit 121, which segments the corpus samples acquired by the corpus building module 110 using word segmentation to obtain sample word segments and the part of speech of each sample word segment.
Specifically, the word segmentation unit 121 segments the corpus sample using word segmentation: it identifies the part of speech of the words in each sentence of the corpus sample and, based on those parts of speech, divides each sentence into segments such as characters, words, and phrases. This yields the sample word segments contained in the corpus sample and their corresponding parts of speech.
For example, for the corpus sample "鲸鱼会喷水" ("whales can spout water"), word segmentation yields the sample word segments "鲸鱼" (whale), "会" (can), and "喷水" (spout water); the part of speech of "鲸鱼" is noun, that of "会" is pronoun, and that of "喷水" is verb.
A successor word determination unit 122, which determines the sample successor words among the sample word segments according to the sample word segments and parts of speech obtained by the word segmentation unit 121.
Specifically, the successor word determination unit 122 determines the sample successor words among the sample word segments according to the sample word segments and their parts of speech. Sample successor words are connectives such as "和" (and), "或" (or), "不但" (not only), and "而且" (but also), and they help determine the relationships between sample word segments.
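One simple way to pick out such words is sketched below. The lexicon is seeded only with the four connectives named in the text, and both the "conjunction" tag and the example sentence ("鲸鱼和海豚会喷水") are assumptions introduced for illustration; the text does not describe the actual decision procedure.

```python
from typing import List, Tuple

Feature = Tuple[str, str]  # (part of speech, word)

# Hypothetical lexicon, seeded with the connectives named in the text.
SUCCESSOR_WORDS = {"和", "或", "不但", "而且"}

def find_successor_words(tokens: List[Feature]) -> List[str]:
    """Return words that are tagged as conjunctions or appear in the lexicon."""
    return [word for pos, word in tokens
            if pos == "conjunction" or word in SUCCESSOR_WORDS]

tokens = [("noun", "鲸鱼"), ("conjunction", "和"), ("noun", "海豚"),
          ("pronoun", "会"), ("verb", "喷水")]
found = find_successor_words(tokens)
print(found)
```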
A relationship analysis unit 123, which analyzes the sentence structure of the corpus samples acquired by the corpus building module 110 to obtain the association relationships between the sample word segments obtained by the word segmentation unit 121.
Specifically, word segmentation yields the sample word segments contained in the corpus sample and their parts of speech; the relationship analysis unit 123 then analyzes the association relationships between those sample word segments according to the sentence structure of the corpus sample.
For example, for the corpus sample "鲸鱼会喷水", word segmentation yields the sample word segments "鲸鱼", "会", and "喷水". The part of speech of "鲸鱼" is noun, that of "会" is pronoun, and that of "喷水" is verb. Analyzing the sentence structure of the corpus sample shows that the noun "鲸鱼" and the verb "喷水" stand in a subject-predicate relationship.
An expression generation unit 124, which generates a corresponding regular expression according to the feature information, where the feature information includes the sample word segments and parts of speech obtained by the word segmentation unit 121, the sample successor words obtained by the successor word determination unit 122, and the association relationships determined by the relationship analysis unit 123.
Specifically, the expression generation unit 124 generates a corresponding regular expression from the sample word segments, their parts of speech, the sample successor words, and the association relationships. For example, for the corpus sample "鲸鱼会喷水", word segmentation yields "鲸鱼", "会", and "喷水", tagged as noun, pronoun, and verb respectively. Analyzing the sentence structure shows that the noun "鲸鱼" and the verb "喷水" stand in a subject-predicate relationship, and the resulting regular expression is: noun(鲸鱼)#pronoun(会)#verb(喷水).
An expression combination module 130, which extracts the feature information from each regular expression generated by the expression generation module 120 to generate multiple different regular expression combinations.
A screening module 140, which analyzes whether the presence of each piece of feature information in each regular expression combination generated by the expression combination module 130 is accurate, and screens the regular expression combinations accordingly.
The screening module 140 specifically includes:
An accuracy analysis unit 141, which analyzes whether the presence of each piece of feature information in each regular expression combination generated by the expression combination module 130 is accurate, and screens the regular expression combinations accordingly.
Specifically, extracting the feature information from a regular expression produces several different regular expression combinations, some of which may have an unreasonable sentence-constituent structure. For example, for the corpus sample "鲸鱼会喷水", the corresponding regular expression is noun(鲸鱼)#pronoun(会)#verb(喷水), and some of the regular expression combinations generated by extracting its feature information are: noun(鲸鱼)#pronoun(会), pronoun(会)#verb(喷水), and noun(鲸鱼)#verb(喷水). Of these, the combination noun(鲸鱼)#pronoun(会) is structurally incomplete: the pairing of the noun (鲸鱼) with the pronoun (会) is not accurate, and the combination does not carry complete semantics.
Therefore, the accuracy analysis unit 141 is needed to analyze whether the presence of each piece of feature information in each regular expression combination is accurate, so that the regular expression combinations can be screened.
The screening unit 142 saves a regular-expression combination if the accuracy analysis unit 141 finds that the existence of every piece of feature information in that combination is accurate.
Specifically, if the screening unit 142 determines that the existence of every piece of feature information in a regular-expression combination is accurate, the combination has real semantics and is therefore saved.
The screening unit 142 deletes a regular-expression combination if the accuracy analysis unit 141 finds that the existence of at least one piece of feature information in that combination is inaccurate.
Specifically, if the screening unit 142 determines that the existence of at least one piece of feature information in a regular-expression combination is inaccurate, the combination is deleted. An example is a combination generated by extracting noun, pronoun, and subject-predicate-relation feature information in which the subject-predicate relation is inaccurate.
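The generate-then-screen step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the accuracy rule in `is_accurate` (a combination must contain both a noun and a verb) are assumptions introduced here, since the text does not specify how unit 141 judges accuracy.

```python
from itertools import combinations

# Feature information of the sample "鲸鱼会喷水" as (tag, word) pairs.
# The tags follow the patent's own labelling of the example.
FEATURES = [("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")]


def generate_combinations(features, size=2):
    """Extract every order-preserving subset of `size` features."""
    return [list(combo) for combo in combinations(features, size)]


def is_accurate(combo):
    """Hypothetical stand-in for accuracy analysis unit 141.

    Assumed rule: a combination is accurate only if it contains
    both a noun and a verb. The patent does not state the rule.
    """
    tags = {tag for tag, _ in combo}
    return {"noun", "verb"} <= tags


def screen(combos):
    """Stand-in for screening unit 142: keep accurate combinations only."""
    return [combo for combo in combos if is_accurate(combo)]


def render(combo):
    """Write a combination in the tag(word)#tag(word) notation of the text."""
    return "#".join(f"{tag}({word})" for tag, word in combo)


combos = generate_combinations(FEATURES)
kept = [render(combo) for combo in screen(combos)]
```

Under the assumed rule, the incomplete combination noun(鲸鱼)#pronoun(会) is discarded, matching the example in the text; whether pronoun(会)#verb(喷水) should also be discarded depends on the real accuracy rule, which the text leaves open.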
The probability analysis module 150 analyzes, according to the corpus built by the corpus building module 110, the regular-formula probability corresponding to each regular formula, where the regular formulas comprise the regular expressions and the screened regular-expression combinations.
The probability analysis module 150 specifically includes:
The keyword analysis unit 151 determines the sample keywords of each regular formula according to the feature information in that regular formula.
Specifically, the keyword analysis unit 151 determines sample keywords according to the feature information in each regular formula. For example, it may determine them from an association relation, taking the sample word segment corresponding to the subject of a subject-predicate relation as the sample keyword; or it may determine them from part of speech, taking the sample word segments corresponding to the verbs or nouns in the regular formula as the sample keywords. The rule for determining sample keywords is set by the user according to actual needs, and there may be one sample keyword or several.
For example, for the corpus sample “鲸鱼会喷水” (“whales can spout water”), the regular expression is noun(鲸鱼)#pronoun(会)#verb(喷水). If the sample word segment corresponding to the subject of the subject-predicate relation is taken as the sample keyword, the sample keyword is “鲸鱼”. If the sample word segments corresponding to verbs and nouns are taken as sample keywords, the sample keywords are “鲸鱼” and “喷水”. The number of sample keywords is therefore not limited to one.
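The two keyword rules in this example can be sketched as below. The data layout (a list of `(tag, word)` tokens plus a list of `(relation, subject_index, predicate_index)` triples) and both function names are assumptions made for illustration; the text only says that the rule is configurable.

```python
# Tokens of the regular expression noun(鲸鱼)#pronoun(会)#verb(喷水).
TOKENS = [("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水")]

# Hypothetical encoding of association relations:
# (relation name, index of subject token, index of predicate token).
RELATIONS = [("subject-predicate", 0, 2)]


def keywords_by_subject(tokens, relations):
    """Rule 1: the subject of each subject-predicate relation is a keyword."""
    return [
        tokens[subject][1]
        for name, subject, _predicate in relations
        if name == "subject-predicate"
    ]


def keywords_by_pos(tokens, tags=("noun", "verb")):
    """Rule 2: every word whose part of speech is in `tags` is a keyword."""
    return [word for tag, word in tokens if tag in tags]


subject_keywords = keywords_by_subject(TOKENS, RELATIONS)
pos_keywords = keywords_by_pos(TOKENS)
```

With the first rule the only keyword is “鲸鱼”; with the second the keywords are “鲸鱼” and “喷水”, as in the example.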
The probability analysis unit 152 analyzes the sample keyword probability, that is, the probability with which each sample keyword determined by the keyword analysis unit 151 occurs in the corpus built by the corpus building module 110.
Specifically, after the sample keywords of each regular formula are determined, the probability analysis unit 152 computes, for each sample keyword, the proportion of corpus samples in the corpus that contain it. For example, for the corpus sample “鲸鱼会喷水” (“whales can spout water”), the regular expression is noun(鲸鱼)#pronoun(会)#verb(喷水) and the determined sample keywords are “鲸鱼” and “喷水”, so the sample keyword probabilities of “鲸鱼” and “喷水” over all corpus samples are computed separately. Suppose the corpus contains 100 corpus samples, of which 20 contain “鲸鱼” and 10 contain “喷水”; then the sample keyword probability of “鲸鱼” is 0.2 and that of “喷水” is 0.1.
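The arithmetic of this example (20 of 100 samples give 0.2, 10 of 100 give 0.1) can be reproduced with a short sketch. The function name and the toy corpus are made up for illustration, and plain substring matching is assumed in place of matching against segmented words.

```python
def keyword_probability(keyword, corpus):
    """Fraction of corpus samples that contain `keyword`.

    Assumption: containment is tested by substring match; a real system
    would more likely compare against the segmented words of each sample.
    """
    if not corpus:
        return 0.0
    hits = sum(1 for sample in corpus if keyword in sample)
    return hits / len(corpus)


# Toy corpus of 100 samples: 20 mention 鲸鱼, and 10 of those also mention 喷水.
corpus = ["鲸鱼会喷水"] * 10 + ["鲸鱼很大"] * 10 + ["今天天气很好"] * 80

p_whale = keyword_probability("鲸鱼", corpus)   # 20 / 100
p_spout = keyword_probability("喷水", corpus)   # 10 / 100
```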
The probability determining unit 153 determines the regular-formula probability of each regular formula according to the sample keyword probabilities obtained by the probability analysis unit 152.
Specifically, the probability determining unit 153 determines the regular-formula probability of a regular formula from its sample keyword probabilities. If the regular formula has only one sample keyword, the sample keyword probability of that keyword is the regular-formula probability. If the regular formula has multiple sample keywords, the highest of their sample keyword probabilities is the regular-formula probability.
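The rule above reduces to taking a maximum, which also covers the single-keyword case. A minimal sketch, with a hypothetical function name and the probabilities from the earlier example:

```python
def regular_formula_probability(keyword_probabilities):
    """Return the highest sample keyword probability of a regular formula.

    `keyword_probabilities` maps each sample keyword to its probability.
    With a single keyword the maximum is simply that keyword's probability.
    """
    if not keyword_probabilities:
        raise ValueError("a regular formula needs at least one sample keyword")
    return max(keyword_probabilities.values())


single = regular_formula_probability({"鲸鱼": 0.2})
multiple = regular_formula_probability({"鲸鱼": 0.2, "喷水": 0.1})
```

Both calls yield 0.2: in the second case “鲸鱼” has the higher sample keyword probability, so it determines the regular-formula probability.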
The model generation module 160 generates a probabilistic semantic model according to the regular formulas obtained by the expression generation module 120 and the screening module 140 and the regular-formula probabilities obtained by the probability analysis module 150.
The corpus acquisition module 170 acquires a user corpus.
The processing module 180 generates a corresponding user regular formula according to the user corpus acquired by the corpus acquisition module 170.
The comparison module 190 compares the user regular formula obtained by the processing module 180 with the probabilistic semantic model generated by the model generation module 160 to obtain the user semantics of the user corpus.
The comparison module 190 specifically includes:
The comparison unit 191 compares the user regular formula with the probabilistic semantic model.
Specifically, the comparison unit 191 matches the user regular formula obtained from the user corpus against the regular formulas in the probabilistic semantic model one by one.
The processing unit 192 determines a target regular formula according to the regular-formula probabilities of the matching regular formulas if the comparison unit 191 finds that the user regular formula matches multiple regular formulas in the probabilistic semantic model.
The parsing unit 193 parses the user corpus according to the target regular formula obtained by the processing unit 192 to obtain the user semantics.
Specifically, because the user regular formula contains multiple pieces of feature information, part of that feature information may match regular formula 1 in the probabilistic semantic model while another part matches regular formula 2, so that the user regular formula matches multiple regular formulas in the model. In that case the processing unit 192 compares the regular-formula probabilities of the matching regular formulas, for example those of regular formula 1 and regular formula 2 above, to determine the target regular formula. The target regular formula obtained is the most likely regular formula for parsing the user corpus, so the parsing unit 193 parses the user corpus according to it and obtains the corresponding user semantics.
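The matching and selection performed by units 191 and 192 can be sketched as follows. Several points are assumptions made for illustration rather than details from the text: a model entry is taken to match when all of its features occur in the user regular formula, the target is taken to be the matching entry with the highest regular-formula probability, and the model contents and probability values are invented.

```python
def matches(user_features, model_features):
    """Assumed criterion: every feature of the model entry occurs in the user formula."""
    return set(model_features) <= set(user_features)


def choose_target(user_features, model):
    """Return the matching (features, probability) entry with the highest probability.

    `model` is a list of (features, probability) pairs.
    Returns None when nothing in the model matches.
    """
    candidates = [entry for entry in model if matches(user_features, entry[0])]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[1])


# User regular formula for "鲸鱼会喷水".
user = (("noun", "鲸鱼"), ("pronoun", "会"), ("verb", "喷水"))

# Illustrative probabilistic semantic model (values are made up).
model = [
    ((("noun", "鲸鱼"), ("verb", "喷水")), 0.2),     # "regular formula 1"
    ((("pronoun", "会"), ("verb", "喷水")), 0.1),    # "regular formula 2"
    ((("noun", "海豚"), ("verb", "跳跃")), 0.3),     # does not match the user formula
]

target = choose_target(user, model)
```

Here two entries match different parts of the user regular formula, and the one with probability 0.2 is selected even though a non-matching entry has a higher probability.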
In this embodiment, corpus samples are segmented using word segmentation technology and their sentence structure is analyzed to generate the corresponding regular expressions. The feature information in each regular-expression combination generated by extracting feature information from a regular expression is then analyzed to judge whether the combination is logical and has actual semantics, and the regular-expression combinations are screened accordingly.
The sample keywords of each regular formula are determined by preset rules, and the sample keyword probability of each sample keyword is analyzed against the corpus to obtain the regular-formula probability of the corresponding regular formula. A user regular formula is obtained from the user corpus and compared with the probabilistic semantic model to obtain the target regular formula, and the user corpus is parsed with that target regular formula to obtain the most likely user semantics.
It should be noted that the above embodiments can be freely combined as required. The above is only a preferred embodiment of the present invention. For those of ordinary skill in the art, several improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910046962.3A CN109800430B (en) | 2019-01-18 | 2019-01-18 | Semantic understanding method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109800430A CN109800430A (en) | 2019-05-24 |
| CN109800430B CN109800430B (en) | 2023-06-27 |
Family
ID=66559710
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910046962.3A Active CN109800430B (en) | 2019-01-18 | 2019-01-18 | Semantic understanding method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109800430B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111859949B (en) * | 2019-04-30 | 2023-04-25 | 广东小天才科技有限公司 | A method and system for parsing classical Chinese |
| CN110472031A (en) * | 2019-08-13 | 2019-11-19 | 北京知道创宇信息技术股份有限公司 | A kind of regular expression preparation method, device, electronic equipment and storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105786793A (en) * | 2015-12-23 | 2016-07-20 | 百度在线网络技术(北京)有限公司 | Method and device for analyzing semanteme of spoken language text information |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4047885B2 (en) * | 2005-10-27 | 2008-02-13 | 株式会社東芝 | Machine translation apparatus, machine translation method, and machine translation program |
| US8972328B2 (en) * | 2012-06-19 | 2015-03-03 | Microsoft Corporation | Determining document classification probabilistically through classification rule analysis |
| CN103177087B (en) * | 2013-03-08 | 2016-05-18 | 浙江大学 | A kind of similar Chinese medicine search method based on probability topic model |
| CN105512228B (en) * | 2015-11-30 | 2018-12-25 | 北京光年无限科技有限公司 | A kind of two-way question and answer data processing method and system based on intelligent robot |
| JP6558856B2 (en) * | 2016-03-31 | 2019-08-14 | 日本電信電話株式会社 | Morphological analyzer, model learning device, and program |
| CN107315737B (en) * | 2017-07-04 | 2021-03-23 | 北京奇艺世纪科技有限公司 | Semantic logic processing method and system |
| CN108304372B (en) * | 2017-09-29 | 2021-08-03 | 腾讯科技(深圳)有限公司 | Entity extraction method and device, computer equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109800430A (en) | 2019-05-24 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11544459B2 (en) | Method and apparatus for determining feature words and server | |
| WO2017198031A1 (en) | Semantic parsing method and apparatus | |
| CN109101551B (en) | Question-answer knowledge base construction method and device | |
| CN110297880B (en) | Corpus product recommendation method, apparatus, device and storage medium | |
| CN103678684A (en) | Chinese word segmentation method based on navigation information retrieval | |
| RU2618374C1 (en) | Identifying collocations in the texts in natural language | |
| CN104156352A (en) | Method and system for processing Chinese events | |
| CN110852095B (en) | Statement hotspot extraction method and system | |
| CN107301170A (en) | The method and apparatus of cutting sentence based on artificial intelligence | |
| CN114266256A (en) | A method and system for extracting new words in the field | |
| CN108536667A (en) | Chinese text recognition methods and device | |
| CN111723192B (en) | Code recommendation method and device | |
| CN110929520A (en) | Non-named entity object extraction method and device, electronic equipment and storage medium | |
| CN109271492A (en) | Automatic generation method and system of corpus regular expression | |
| CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
| CN108614814A (en) | A kind of abstracting method of evaluation information, device and equipment | |
| KR101805607B1 (en) | Method for making abstracts from Voice of Customer data | |
| CN109800430B (en) | Semantic understanding method and system | |
| CN118916499A (en) | Query method integrating AI large model and knowledge graph | |
| CN111062832A (en) | Auxiliary analysis method and device for intelligently providing patent answer and debate opinions | |
| CN111046168A (en) | Method, apparatus, electronic device, and medium for generating patent summary information | |
| CN108573025B (en) | Method and device for extracting sentence classification characteristics based on mixed template | |
| WO2024255290A1 (en) | Method for constructing fault knowledge graph, and computing apparatus | |
| CN111552783A (en) | Content analysis query method, device, equipment and computer storage medium | |
| CN111177312A (en) | Open source code searching method with grammar and semantics fused |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |