
CN104142917A - A method and system for constructing a hierarchical semantic tree for language understanding

Info

Publication number: CN104142917A
Application number: CN201410216929.8A
Authority: CN (China)
Prior art keywords: semantic, word, level, node, nodes
Legal status: Granted; Expired - Fee Related
Other languages: Chinese (zh)
Other versions: CN104142917B (granted publication)
Inventors: 晋耀红, 朱筠, 刘小蝶
Current Assignee: Beijing Normal University
Original Assignee: Beijing Normal University
Application filed by Beijing Normal University
Priority to CN201410216929.8A
Publication of CN104142917A
Application granted
Publication of CN104142917B

Landscapes

  • Machine Translation (AREA)

Abstract

A method and system for constructing a hierarchical semantic tree for language understanding. The method mainly comprises the following steps: segmenting the sentence into words and loading a semantic knowledge base; identifying all nodes of the sentence according to the LV rules, and identifying the level of each node according to semantic knowledge and the position and collocation of words; generating a special node from the end-of-sentence punctuation to serve as the root node of the semantic tree; merging the generated nodes to identify the semantic-edge chunks of the sentence and attaching the level-0 semantic edges to the root node as child nodes; and iterating over each child node until no lower-level semantic edges remain, attaching them to the child nodes as leaf nodes. Without any syntactic resources, the scheme obtains a semantic structure tree using only semantic information and the position and collocation of words, enabling a computer to reach the deep semantic level of natural language and to carry out various kinds of natural-language processing on the basis of understanding. It realizes the first step of natural-language semantic understanding and can be used for information retrieval, automatic summarization, machine translation, text classification, information filtering, and so on.

Description

A method and system for constructing a hierarchical semantic tree for language understanding

Technical Field

The invention relates to the field of natural language processing, and in particular to a method and system for constructing a hierarchical semantic tree using semantic knowledge and the position and collocation of words.

Background Art

With the development of electronic information technology, digital information resources are used ever more widely. This requires machines to understand natural language as well, completing various kinds of natural-language processing on the basis of "understanding", such as information retrieval, automatic summarization, machine translation, text classification, and information filtering. Enabling computers to reach the deep semantic level of natural language is therefore a precondition for achieving these goals. For a machine to grasp the meaning of natural language, it must first grasp the structure of natural-language sentences. Sentence structure is a basic structure of natural language and generally includes grammatical structure and semantic structure. A sentence structure tree is a simple, clear, and effective way to describe the semantics of a sentence. There are two main types of sentence structure trees: syntactic structure trees and semantic structure trees. Syntactic structure trees mainly include phrase-structure trees and dependency trees; their automatic construction is mostly realized with statistical methods on top of syntactic annotation, and makes little or no use of the semantic knowledge of words.

The construction of a semantic structure tree must use semantic knowledge. The semantic tree is built under the guidance of HNC (concept hierarchy network) theory, without any syntactic resources, using only semantic knowledge and the position and collocation of words. This enables a computer to reach the deep semantic level of natural language and carry out various kinds of natural-language processing on the basis of understanding, realizing the first step of natural-language semantic understanding and creating the conditions for subsequent applications such as information retrieval, machine translation, information filtering, and text classification.

Chinese patent document CN1606004A discloses a method and apparatus for identifying semantic structures from text: at least two candidate semantic structures are formed; a semantic score is determined for each candidate semantic structure based on its likelihood; a syntactic score is determined for each semantic structure based on the position of a word in the text and the position, within the semantic structure, of the semantic entity formed from that word; and the syntactic score and the semantic score are combined to select a semantic structure for at least a portion of the text. That scheme defines entity patterns comprising semantic types and probabilities, Markov probabilities, and semantic rules. Acquiring such semantic content requires training on large-scale data and is strongly dependent on the text domain; because of the complexity of the task, the results are not necessarily ideal, and since all subsequent operations depend on the result of this step, the overall effect is greatly reduced.

Summary of the Invention

The technical problem to be solved by the present invention is that the prior-art methods for identifying semantic structure require training on large-scale data and depend strongly on the text domain; the invention therefore proposes a method and system for constructing a hierarchical semantic tree that requires no training.

To solve the above technical problem, the present invention provides a method and system for constructing a hierarchical semantic tree for language understanding, comprising the following steps:

S1. Input the sentence to be processed, segment it into words, and load the semantic knowledge of the segmented words;

S2. Identify the semantic nodes of the sentence according to the word-segmentation result;

S3. Obtain the level of each semantic node using semantic knowledge and the position and collocation of words;

S4. Identify the semantic edges at different levels in the sentence;

S5. Generate the hierarchical semantic tree from the semantic edges of each level. A data-structure sketch of the objects handled by these steps is given below.
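
As a rough illustration, the objects that steps S2 to S5 manipulate (semantic nodes of the "first sequence", semantic-edge chunks of the "second sequence", and tree nodes) could be represented as in the following Python sketch; the class and field names are editorial assumptions rather than terminology from the patent.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SemanticNode:              # an entry of the "first sequence"
        tag: str                     # 'L0', 'L1', 'L1H', 'V', or 'SST'
        word: str
        pos: int                     # position of the word in the sentence
        level: int = 0               # 0 at clause level, -1 inside a chunk, and so on

    @dataclass
    class Chunk:                     # an entry of the "second sequence" (a semantic edge)
        tag: str                     # 'CHK_EG', 'CHK_ABK', 'CHK_L0', 'CHK_GBK', or 'CHK_SST'
        start: int
        end: int
        level: int = 0

    @dataclass
    class TreeNode:                  # a node of the hierarchical semantic tree
        chunk: Chunk
        children: List["TreeNode"] = field(default_factory=list)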

Preferably, in step S1, the sentence to be processed is segmented according to a domain dictionary and a general dictionary.

Preferably, the semantic knowledge includes the generalized concept classes of words and their subclasses; the generalized concept classes of words include dynamic, static, object, person, attribute, and logic.

Preferably, the process of "identifying the semantic nodes of the sentence according to the word-segmentation result" in step S2 includes:

for each segmented word, if the semantic knowledge of the word contains a logical concept, marking the word L, and if the semantic knowledge of the word contains a dynamic concept, marking it V;

performing LV exclusion processing on all words marked L or V;

re-marking all L marks according to their concept category and judging whether they have a post-mark; if so, marking the post-marked word L1H; and generating semantic nodes from all of the above marks.

Preferably, the process of "identifying the semantic nodes of the sentence according to the word-segmentation result" in step S2 further includes: generating a semantic node from the end-of-sentence punctuation to serve as the root node.

Preferably, the process of "obtaining the level of each semantic node using semantic knowledge and the position and collocation of words" in step S3 includes:

setting the default level of every L mark and V mark to 0; when two such marks are adjacent, the level of the second mark is reduced by one, to -1.

Preferably, the process of "identifying the semantic edges at different levels in the sentence" in step S4 includes:

performing core-verb identification on all semantic nodes marked V and generating chunks;

generating chunks for all semantic nodes marked L;

generating semantic edges from the chunks.

Preferably, the core-verb identification process includes:

excluding words that cannot serve as the core verb;

assigning different weights to the remaining words according to their composition and the features of the words themselves, and selecting the core verb from the weight ranking and position information.

Preferably, the process of generating the hierarchical semantic tree from the semantic edges of each level includes:

selecting the root node;

attaching the chunks of the highest level to the root node as child nodes, in the order they appear at that level;

traversing all child nodes and making every chunk that falls within the range of a child node a child of that node, until no new child nodes are produced.

A hierarchical semantic tree construction system corresponding to the above method comprises:

a preprocessing unit, which inputs the sentence to be processed, segments it into words, and loads the semantic knowledge of the segmented words;

a first-sequence generation unit, which identifies the semantic nodes of the sentence according to the word-segmentation result and obtains the level of each semantic node using semantic knowledge and the position and collocation of words;

a second-sequence generation unit, which identifies the semantic edges at different levels in the sentence;

a hierarchical-semantic-tree generation unit, which generates the hierarchical semantic tree from the semantic edges of each level.

Compared with the prior art, the above technical solution of the present invention has the following advantages:

(1) The hierarchical semantic tree construction method of this embodiment mainly comprises preprocessing, node identification, semantic-edge identification, and tree generation; outputting the resulting structure tree yields the hierarchical semantic tree. In this scheme, sentence analysis is carried out entirely with rule-based methods. The identification of nodes and their levels, and of semantic edges and their levels, controls the scheduling of rules at different levels and stages. Guided by this principle, the rules are first classified by level; each class of rules is invoked only at a fixed analysis level, and each rule focuses only on the analysis of linguistic phenomena in adjacent word strings without having to judge the overall situation, rule compatibility being resolved through scheduling instead.

(2) Without any syntactic resources, the hierarchical semantic tree construction method of the present invention obtains a semantic structure tree using only semantic information and the position and collocation of words, enabling a computer to reach the deep semantic level of natural language and to carry out various kinds of natural-language processing on the basis of understanding, thereby realizing the first step of natural-language semantic understanding. Semantic tree construction can be widely applied in natural language processing, for example in information retrieval, automatic summarization, machine translation, text classification, and information filtering. The semantic tree construction method of this embodiment has already been applied to Chinese-English machine translation of patent documents, significantly improving the readability and accuracy of the translations.

Brief Description of the Drawings

In order to make the content of the present invention easier to understand, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which:

Figure 1 is a flow chart of the hierarchical semantic tree construction method of the present invention;

Figure 2 is a flow chart of node generation in the hierarchical semantic tree construction method of the present invention;

Figure 3 is a flow chart of semantic-edge generation in the hierarchical semantic tree construction method of the present invention;

Figures 4 and 5 are schematic diagrams of the result of an application example of the hierarchical semantic tree construction method of the present invention;

Figure 6 is a structural block diagram of the hierarchical semantic tree construction system of the present invention.

Detailed Description of the Embodiments

Embodiment 1:

This embodiment provides a method and system for constructing a hierarchical semantic tree for language understanding. The semantic tree, i.e. the semantic structure tree, is defined for a single natural-language sentence and refers to the semantic relations between the feature chunk of the sentence (the core-verb chunk) and the other chunks it governs. For example, if the feature chunk V of a sentence is a verb expressing an action, that feature chunk determines that the sentence must contain an actor chunk, an object chunk, and a content chunk; only then is the semantics of the sentence complete. Although one of the latter three may be omitted in a given context, these four chunks are the necessary constituents of a semantically complete sentence and are called the main chunks. By contrast, auxiliary chunks are not necessary for a sentence to hold; they mainly express the manner, means, way, condition, time, and so on of the action. Both main chunks and auxiliary chunks can be signalled by certain logical concepts, which makes it possible to identify the semantic structure of a sentence with the LV criterion (logical concepts and dynamic concepts). The hierarchical semantic tree construction method of this embodiment uses the LV criterion to identify the main chunks and auxiliary chunks of a sentence; it can divide sentences automatically and, when used in language translation, greatly improves the readability and accuracy of machine translation.

In this embodiment, the main processing flow is as follows: the sentence to be processed S110 passes through preprocessing S120, node identification S130, semantic-edge identification S140, and semantic tree generation S150 to obtain the semantic tree S160. The flow chart is shown in Figure 1, and the method specifically comprises the following steps:

S1. Input the sentence to be processed, segment it into words, and load the semantic knowledge of the segmented words. The segmentation is performed according to a domain dictionary and a general dictionary.

S2. Identify the semantic nodes of the sentence according to the word-segmentation result. This mainly includes the following process: for each segmented word, if the semantic knowledge of the word contains a function-word (logical) sense, the word is marked L; if it contains a verb (dynamic) sense, it is marked V. LV exclusion processing is performed on all words marked L or V. All L marks are re-marked according to their concept category, and it is judged whether they have a post-mark; if so, the post-marked word is also marked. Semantic nodes are generated from all of the above marks.

The above process is carried out specifically as follows:

LV identification is performed on each word: if the semantic knowledge of the word contains a function-word sense, the word is marked L; if it contains a verb sense, the word is marked V. The semantic knowledge includes the generalized concept classes of words and their subclasses (i.e. concept categories); the generalized concept classes include dynamic, static, object, person, attribute, and logic.

LV exclusion processing is performed on all words marked L or V: if the word is preceded by a word such as "的" or "一种", its L or V mark is cancelled; if it is followed by a word such as "的", its L or V mark is cancelled.

For every L mark, if the concept category of the node is l1, the mark is changed to L1; it is then judged whether the node has a post-mark (in "当…时候", "时候" is the post-mark of "当"), and a mark L1H is generated for the post-marked word. If the concept category of the node is l0, the mark is changed to L0.

All L marks (including L0, L1, and L1H) and V marks, together with their position information, are turned into semantic nodes and recorded in a queue called the first sequence. If more than one semantic node is generated for a word, all of them are recorded in the first sequence.
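
A minimal Python sketch of this first-sequence construction is given below; for simplicity it uses plain dictionaries with the same fields as the data-structure sketch earlier. The dictionary layout, the exclusion word lists, and the post-mark table are illustrative assumptions based on the examples in the text, not an exhaustive encoding of the patent's rules.

    EXCLUDERS_BEFORE = {"的", "一种"}   # a preceding "的"/"一种" cancels an L/V mark
    EXCLUDERS_AFTER = {"的"}            # a following "的" also cancels the mark
    POST_MARKS = {"当": "时候"}          # e.g. "时候" is the post-mark of "当"

    def identify_nodes(words, knowledge):
        """words: segmented words in order; knowledge: word -> {'cc': ..., 'lv': ...}."""
        first_sequence = []
        for i, w in enumerate(words):
            info = knowledge.get(w, {})
            tags = []
            if info.get("lv") == "l":                    # logical sense -> L0 / L1
                tags.append("L1" if info.get("cc") == "l1" else "L0")
            if info.get("lv") == "v":                    # dynamic sense -> V
                # (the method lets a word carry both an l and a v sense and generate
                # several nodes; a single lv value per word is kept here for brevity)
                tags.append("V")
            if i > 0 and words[i - 1] in EXCLUDERS_BEFORE:
                tags = []                                # LV exclusion
            if i + 1 < len(words) and words[i + 1] in EXCLUDERS_AFTER:
                tags = []
            for tag in tags:
                first_sequence.append({"tag": tag, "word": w, "pos": i, "level": 0})
                if tag == "L1" and POST_MARKS.get(w) in words[i + 1:]:
                    j = words.index(POST_MARKS[w], i + 1)
                    first_sequence.append({"tag": "L1H", "word": words[j], "pos": j, "level": 0})
        # the end-of-sentence punctuation becomes the special root node SST
        if words and words[-1] in "。.!?！？":
            first_sequence.append({"tag": "SST", "word": words[-1], "pos": len(words) - 1, "level": 0})
        return first_sequence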

S3. Obtain the level of each semantic node using semantic knowledge and word positions. First, the default level of every L mark and V mark is set to 0; when two such marks are adjacent, the level of the second mark is reduced by one. Specifically:

LV level identification is performed on all semantic nodes in the first sequence; the default level of every L mark and V mark is 0;

when two L marks are adjacent, i.e. L1L2 occurs, the level of L2 is reduced by 1;

when an L and a V are adjacent, i.e. L1V2 occurs, the level of V2 is reduced by 1;

when a V and an L are adjacent, i.e. V1L2 occurs, the level of L2 is reduced by 1;

for the full-stop punctuation mark, a semantic node marked SST is generated and recorded in the first sequence.
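
The level-assignment rule can be sketched as follows (an illustrative simplification; nodes are the dictionaries produced by the previous sketch).

    def assign_levels(first_sequence):
        nodes = sorted((n for n in first_sequence if n["tag"] != "SST"),
                       key=lambda n: n["pos"])
        for prev, curr in zip(nodes, nodes[1:]):
            # two adjacent L/V marks: the second one drops one level (0 -> -1, and so on)
            if curr["pos"] == prev["pos"] + 1:
                curr["level"] = prev["level"] - 1
        return first_sequence

On the later example "把在书架上的那本数学书拿下来", the adjacent marks on "把" and "在" leave "把" at level 0 and push "在" to level -1, matching the description in Embodiment 2.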

S4. Identify the semantic edges at different levels in the sentence. This includes: first, performing core-verb identification on all semantic nodes marked V and generating chunks; then generating chunks for all semantic nodes marked L; and finally generating semantic edges from the chunks.

The specific procedure is as follows:

a queue called the second sequence is created;

EG identification is performed on all semantic nodes marked V in the first sequence, a chunk marked CHK_EG is generated, and the chunk is added to the second sequence;

all semantic nodes marked L in the first sequence are processed as follows:

for every semantic node marked L1, a chunk marked CHK_ABK is generated whose start position is the start position of the L1 node; it is judged whether an L1H follows the node, and if so the end position of the chunk is the end position of the L1H; if not, the end position of the chunk is the position immediately before (pos-1) the start of the next semantic node marked L. The chunk level is the level of the semantic node, and the chunk is added to the second sequence;

for every semantic node marked L0, a chunk marked CHK_L0 is generated whose start and end positions are the start and end positions of the L0; the chunk level is the level of the semantic node, and the chunk is added to the second sequence;

for every semantic node marked L0, a chunk marked CHK_GBK is also generated whose start position is the position immediately after (pos+1) the end of the L0 and whose end position is the position immediately before (pos-1) the start of the next chunk (marked CHK_EG, CHK_ABK, or CHK_L0); the chunk level is the level of the semantic node, and the chunk is added to the second sequence;

for the semantic node marked SST in the first sequence, a chunk marked CHK_SST is generated and added to the second sequence. The chunks CHK_SST, CHK_ABK, CHK_EG, and CHK_L0 obtained in this process are the semantic edges.
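
A simplified sketch of this second-sequence construction follows. It condenses the boundary rules above (for example, the ABK end position is approximated by the next marked node rather than the next L node) and takes the chosen core-verb position as input, so it is an approximation of the procedure rather than a faithful implementation; core-verb (EG) selection itself is described in the next paragraph and sketched after it.

    def make_chunk(tag, start, end, level):
        return {"tag": tag, "start": start, "end": end, "level": level}

    def identify_edges(first_sequence, eg_pos):
        """eg_pos: position of the node selected as the clause's core verb (EG)."""
        nodes = sorted(first_sequence, key=lambda n: n["pos"])
        second_sequence = []
        for i, node in enumerate(nodes):
            nxt = nodes[i + 1] if i + 1 < len(nodes) else None
            if node["tag"] == "V" and node["pos"] == eg_pos:
                second_sequence.append(make_chunk("CHK_EG", node["pos"], node["pos"], node["level"]))
            elif node["tag"] == "L1":
                if nxt is not None and nxt["tag"] == "L1H":
                    end = nxt["pos"]                  # ABK ends at the post-mark (L1H)
                elif nxt is not None:
                    end = nxt["pos"] - 1              # otherwise just before the next marked node
                else:
                    end = node["pos"]
                second_sequence.append(make_chunk("CHK_ABK", node["pos"], end, node["level"]))
            elif node["tag"] == "L0":
                second_sequence.append(make_chunk("CHK_L0", node["pos"], node["pos"], node["level"]))
                if nxt is not None:                   # GBK spans the words up to the next chunk-forming node
                    second_sequence.append(make_chunk("CHK_GBK", node["pos"] + 1, nxt["pos"] - 1, node["level"]))
            elif node["tag"] == "SST":
                second_sequence.append(make_chunk("CHK_SST", node["pos"], node["pos"], node["level"]))
        return second_sequence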

In the above process, EG identification means core-verb identification; it judges the weight of every dynamic concept as a candidate EG through a series of ordered weights. The process includes: first, excluding words that cannot serve as the core verb, i.e. preliminarily ruling out candidate EG words on the basis of ambiguities between dynamic and static concepts, logical concepts, attributes, and between different dynamic concepts. Then the remaining words are assigned different weights according to their collocations and the features of the words themselves, and the core verb is selected from the weight ranking and position information. In other words, an EG candidate is generated for every remaining word, different weights are assigned according to their composition or the features of the words themselves, and one suitable word is chosen as the EG of the sentence by jointly considering the weight ranking and position information.
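
As a rough illustration of this weighting idea, the snippet below scores the V-marked candidates of a clause. The cue words and weight values are assumptions chosen only to reproduce the "访问"/"对接" example discussed in Embodiment 2; they are not the patent's actual rule set.

    BOOST_BEFORE = {"可以", "快速"}   # preceding words assumed to raise a candidate's weight
    PENALTY_AFTER = {"的"}            # a following "的" is assumed to lower it

    def pick_eg(words, v_positions):
        """v_positions: positions of V-marked words, in sentence order."""
        best_pos, best_score = None, float("-inf")
        for pos in v_positions:
            score = 0.0
            if pos > 0 and words[pos - 1] in BOOST_BEFORE:
                score += 1.0
            if pos + 1 < len(words) and words[pos + 1] in PENALTY_AFTER:
                score -= 1.0
            if score > best_score:        # earlier candidates win ties
                best_pos, best_score = pos, score
        return best_pos

On the Embodiment 2 sentence, "访问" is boosted by the preceding "可以" and "快速" while "对接" is penalized by the following "的", so "访问" wins and becomes the CHK_EG of the clause.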

S5. Generate the hierarchical semantic tree from the semantic edges of each level. First, the root node is selected; then the chunks of the highest level are attached to the root node as child nodes, in the order they appear at that level; finally, all child nodes are traversed and every chunk that falls within the range of a child node becomes a child of that node, until no new leaf nodes are produced.
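
The assembly of the tree can be sketched as follows (illustrative; chunks are the dictionaries produced by the edge sketch above, and the recursion is a simplification of the scheduling described in the text).

    def build_tree(second_sequence):
        root_chunk = next(c for c in second_sequence if c["tag"] == "CHK_SST")
        rest = [c for c in second_sequence if c is not root_chunk]
        root = {"chunk": root_chunk, "children": []}
        if rest:
            attach(root, rest, max(c["level"] for c in rest))
        return root

    def attach(parent, chunks, level):
        # hang the chunks of the current level, in sentence order, under the parent
        here = sorted((c for c in chunks if c["level"] == level), key=lambda c: c["start"])
        for c in here:
            child = {"chunk": c, "children": []}
            parent["children"].append(child)
            # lower-level chunks whose span falls inside this chunk become its children
            inside = [d for d in chunks if d["level"] < level
                      and c["start"] <= d["start"] and d["end"] <= c["end"]]
            if inside:
                attach(child, inside, max(d["level"] for d in inside))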

The hierarchical semantic tree construction method of this embodiment mainly comprises the following steps: segmenting the sentence and loading the semantic knowledge base; identifying all nodes of the sentence and their levels according to the LV rules and language rules; generating a special node from the end-of-sentence punctuation to serve as the root node of the semantic tree; merging the generated nodes to identify the semantic-edge chunks of the sentence and attaching the level-0 semantic-edge chunks to the root node as child nodes; and traversing each child node until no lower-level semantic-edge chunks remain, attaching them to the child nodes as leaf nodes. Outputting the resulting structure tree yields the hierarchical semantic tree. In this embodiment the analysis of sentences is implemented entirely with rule-based methods. One reason rule systems are questioned is that, if rule descriptions are too simple, the results they produce either contradict one another or are insufficient to analyse the sentence; if one wants to rely entirely on rules to produce accurate analyses, every rule must describe complex linguistic phenomena, which makes the rules poorly generalizable, extremely labour-intensive to write, and therefore impractical. To resolve this contradiction, the present scheme uses the identification of nodes and their levels, and of semantic edges and their levels, to control the scheduling of rules at different levels and stages. Guided by this principle, the rules are first classified by level; each class of rules is invoked only at a fixed analysis level, and each rule focuses only on the analysis of linguistic phenomena in adjacent word strings, without having to judge the overall situation; rule compatibility is resolved through scheduling instead. Two strategies are adopted in this embodiment: first, greedy matching of rules is avoided, rule invocation is made hierarchical, and at each level the corresponding rules are invoked according to activation information; second, the scheduler selects and combines the results produced by the rules according to the sentence features of the different processing stages. This both reduces the number of rules that must be matched and reduces the influence of conflicts between different rules on the final analysis, strengthening control over rule invocation and making the rule-based construction of a hierarchical semantic tree possible.

The semantic tree above is constructed, under the guidance of concept hierarchy network theory and without any syntactic resources, using only semantic information and language rules. It enables a computer to reach the deep semantic level of natural language and to carry out various kinds of natural-language processing on the basis of understanding, realizing the first step of natural-language semantic understanding. Semantic tree construction can be widely applied in natural language processing, for example in information retrieval, automatic summarization, machine translation, text classification, and information filtering. The semantic tree construction method of this embodiment has already been applied to Chinese-English machine translation of patent documents, significantly improving the readability and accuracy of the translations.

Embodiment 2:

This embodiment gives a concrete hierarchical semantic tree construction method; its basic flow is also shown in Figure 1. The method 100 of this embodiment begins at step S110 with the input of the sentence to be processed. In step S120 the sentence is preprocessed: it is segmented according to a domain dictionary and a general dictionary, and the semantic knowledge of the words is loaded. The semantic knowledge mainly comprises the six generalized concept classes of words, namely V (dynamic), G (static), W (object), P (person), U (attribute), and L (logic), together with a number of subclasses under them. Next, in step S130, the semantic nodes of the sentence are identified and their levels distinguished: the first step applies the LV rules to the segmentation result to identify all semantic nodes, and the second step uses semantic knowledge and word positions to compare and determine the level of each node. Then, in step S140, the semantic edges at different levels of the sentence are identified: the semantic nodes identified at the clause level become clause-level semantic edges, and the semantic nodes identified at the chunk level become chunk-level semantic edges. In step S150 the hierarchical semantic tree is generated: according to the edge-identification results, the tree structure is generated level by level under the control of the scheduler. Finally, in step S160 the hierarchical semantic tree of the processed sentence is output.

Figure 2 is a schematic diagram illustrating node identification 300. As shown in Figure 2, the entry point S310 of node identification is the word-segmentation result of the corpus to be processed. In step S311, words and punctuation are treated differently. For words, semantic knowledge such as the concept category must be loaded for every word. The semantic knowledge covers two aspects: word attributes, which include the generalized concept class GCC, the concept category CC, the LV attribute LV, the morpheme attribute QH, and whether the word is a pure V verb CHUNV; and sentence-class attributes, which include generalized action sentence GXGY, the number of main chunks GBK_NUM, whether it is a block-extended sentence EPER, whether GBK2 is a prototype sentence degradation GBK2_YT, passive voice ALL_PASS, whether it forms a bidirectional relation sentence R0, and whether it forms a comparative judgment sentence JD0. In particular, the classification of concept categories and their descriptions are shown in the following table:

The basic format of a knowledge-base entry is as follows:

word form

$Feature[Value]$

For example:

半导体元件 (semiconductor element)

$GCC[W]CC[pw]$

表示 (express)

$CC[v]SC_GXY[GX]EPER[Y]GBK_NUM[3;4]SC_GBK1_PP[Y]$

Here, GCC[W] indicates that the generalized concept class of the entry ("半导体元件", semiconductor element) is object W, and CC[pw] indicates that its concept category is artefact pw. CC[v] indicates that the concept category of the entry ("表示", express) is verb; SC_GXY[GX] indicates a generalized action sentence; EPER[Y] indicates a block-extended sentence; GBK_NUM[3;4] indicates a sentence with three or four main chunks; and SC_GBK1_PP[Y] indicates that GBK1 must be a person or living being.
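
A small parser for entries in this "$Feature[Value]$" style could look like the following sketch; only the entry format comes from the text, while the parsing code and its handling of multi-valued fields such as GBK_NUM[3;4] are editorial assumptions.

    import re

    FEATURE = re.compile(r"([A-Za-z0-9_]+)\[([^\]]*)\]")

    def parse_entry(word_line, feature_line):
        """E.g. parse_entry('表示', '$CC[v]SC_GXY[GX]EPER[Y]GBK_NUM[3;4]SC_GBK1_PP[Y]$')"""
        features = {}
        for name, value in FEATURE.findall(feature_line.strip("$")):
            # multi-valued features such as GBK_NUM[3;4] are split on ';'
            features[name] = value.split(";") if ";" in value else value
        return word_line.strip(), features

    # parse_entry('半导体元件', '$GCC[W]CC[pw]$')
    # -> ('半导体元件', {'GCC': 'W', 'CC': 'pw'})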

For punctuation, the full stop generates a special semantic node, marked SST, which serves as the root node.

In step S330, LV identification is performed on each word: if the semantic knowledge of the word contains a logical (l) concept, a semantic node is generated and the word is marked L; if it contains a dynamic (v) concept or the like, a semantic node is generated and the word is marked V. At the same time, ambiguity exclusion is applied to the words marked V and L through corresponding disambiguation rules. For all words marked V, the following two rules serve as examples of this exclusion processing: if a word marked V is preceded by a word such as "的" or "一种", its V mark is cancelled; if it is followed by a word such as "的", its V mark is cancelled.

In step S340, for every L mark, if the concept category of the word is l1, the mark is changed to L1; it is then judged whether the word has a post-mark, and if so a mark L1H is generated for the post-marked word. For example, in the Chinese construction "当…时候", the concept category of "当" is l1, so its mark is changed to L1, while "时候" is the post-mark of "当" and is marked L1H. If the concept category of the word is l0, the mark is changed to L0, as for the Chinese word "把".

Step S350 yields all the identified nodes.

In step S360, LV level identification is performed on all nodes, i.e. the LEVEL information of the nodes is distinguished. LV level identification of all semantic nodes in the first sequence comprises the following operations. The default level of every L mark and V mark is 0. When two L marks are adjacent, i.e. L1 and L2 occur, the level of L2 is reduced by 1: for example, in "把在书架上的那本数学书拿下来" ("take down that maths book on the bookshelf"), "把" and "在" are two adjacent L concepts, so "把" is L1 with level 0 and "在" is L2 with level -1. When an L and a V are adjacent, i.e. L1 and V2 occur, the level of V2 is reduced by 1: for example, in "把位于书架上的那本数学书拿下来" ("take down that maths book located on the bookshelf"), "把" and "位于" are adjacent L and V concepts, so "把" is L1 with level 0 and "位于" is V2 with level -1. When a V and an L are adjacent, i.e. V1 and L2 occur, the level of L2 is reduced by 1: for example, in "应用与用户有关的模块" ("apply the modules related to the user"), "应用" and "与" are adjacent V and L concepts, so "应用" is V1 with level 0 and "与" is L2 with level -1.

In step S370, the result obtained is the complete set of nodes of the sentence with their LEVEL information distinguished, recorded in a queue called the first sequence: all L marks (including L0, L1, and L1H) and V marks, together with their position information in the sentence, are recorded as semantic nodes in the first sequence; if more than one semantic node is generated for a word, all of them are recorded in the first sequence; and the semantic node SST generated for the punctuation mark is recorded in the first sequence as well.

Figure 3 is a schematic diagram illustrating semantic-edge identification 400. As shown in Figure 3, the entry point of semantic-edge identification is the set of all nodes and their LEVEL information.

First, a queue called the second sequence is created.

In step S410, EG identification is performed on all semantic nodes marked V in the first sequence; a chunk marked CHK_EG is generated and added to the second sequence.

For example, in "本发明可以快速访问与电子设备10对接的各种设备。" ("The present invention can quickly access the various devices docked with the electronic device 10."), "访问" (access) and "对接" (dock) are semantic nodes marked V. Language rules raise or lower the weight of the two nodes: "访问" is boosted by the preceding words "可以" and "快速", while "对接" is penalized by the "的" immediately following it. In this sentence "访问" has the higher weight and is selected as the EG of the clause, marked CHK_EG.

In step S420, all semantic nodes marked L in the first sequence are processed as follows:

for every semantic node marked L1, a chunk marked CHK_ABK is generated whose start position is the start position of the L1 node; it is judged whether an L1H follows the node, and if so the end position of the chunk is the end position of the L1H; if not, the end position of the chunk is the position immediately before (pos-1) the start of the next semantic node marked L. The chunk level is the level of the semantic node, and the chunk is added to the second sequence.

The following examples illustrate the generation of CHK_ABK at the clause level. In "存储器130可以以不同方式被分离。" ("The memory 130 can be separated in different ways."), "以" is a semantic node marked L1 with no following node marked L1 or L1H, so a chunk marked CHK_ABK is generated whose start position is that of the "以" node and whose end position is the position just before the start of the CHK_EG; the CHK_ABK chunk of this sentence is therefore "以不同方式" ("in different ways"). In "本发明用刀片以螺旋滚动方式除去杂草。" ("The present invention removes weeds with a blade in a spiral rolling manner."), "用" is a semantic node marked L1 followed by another L1 node "以", so a chunk marked CHK_ABK is generated whose start position is that of "用" and whose end position is the position just before "以"; the first CHK_ABK chunk of this sentence is therefore "用刀片" ("with a blade"), and likewise "以螺旋滚动方式" ("in a spiral rolling manner") is another CHK_ABK of the sentence. Again, in "在电子设备10上呈现媒体内容" ("presenting media content on the electronic device 10"), "在" is a semantic node marked L1 followed by a node "上" marked L1H, so a chunk marked CHK_ABK is generated whose start position is that of "在" and whose end position is the position of "上"; the CHK_ABK chunk is therefore "在电子设备10上" ("on the electronic device 10"). The L1 and L1H nodes in these three examples are all at the clause level, whose default level is 0, so the level of the CHK_ABK is also 0. The following example illustrates the generation of CHK_ABK inside a chunk. In the sentence "用户有权访问通过操作系统137呈现的媒体内容。" ("The user has the right to access the media content presented through the operating system 137."), "访问" is the CHK_EG of the sentence, and "通过操作系统137呈现的媒体内容" ("the media content presented through the operating system 137") is a CHK_GBK chunk, degraded from the clause "通过操作系统137呈现媒体内容" ("present media content through the operating system 137"); within it, "呈现" is the V semantic node of the CHK_GBK chunk and can generate a CHK_EG whose level is -1, and "通过" is a semantic node marked L1 whose level is -1, so "通过操作系统137" can generate a chunk marked CHK_ABK. Similarly, the level of a CHK_ABK generated inside a GBK chunk is -1.

For every semantic node marked L0, a chunk marked CHK_L0 is generated whose start position is the start position of the L0 and whose end position is the end position of the L0; the chunk level is the level of the semantic node, and the chunk is added to the second sequence. The following example illustrates the generation of CHK_L0 at the clause level: in the sentence "用户将用户名和/或密码组合输入到用户接口150和/或认证设备70。" ("The user enters the user name and/or password combination into the user interface 150 and/or the authentication device 70."), "将" is marked L0 and its level information is 0, so it generates a chunk marked CHK_L0 whose start and end positions are both those of the L0. The following example illustrates the generation of CHK_L0 at the chunk level: in the chunk "由用户访问的媒体内容" ("the media content accessed by the user"), "由" is marked L0 and its level information is -1, so it generates a chunk marked CHK_L0 whose start and end positions are both those of the L0.

For the semantic node marked SST in the first sequence, a chunk marked CHK_SST is generated and added to the second sequence.

In step S430, the relationships between all CHK_L0 chunks and the CHK_ABK and CHK_EG chunks are used to generate chunks marked CHK_GBK, whose start position is the position immediately after (pos+1) the end of a CHK_L0 and whose end position is the position immediately before (pos-1) the start of the next chunk (marked CHK_ABK or CHK_EG); the chunk level is the level of the semantic node, and the chunk is added to the second sequence. For example, in the sentence above, "用户将用户名和/或密码组合输入到用户接口150和/或认证设备70。", "将" generates a CHK_L0 chunk and "输入到" generates a CHK_EG chunk, so "用户" (the user), "用户名和/或密码组合" (the user name and/or password combination), and "用户接口150和/或认证设备70" (the user interface 150 and/or the authentication device 70) are CHK_GBK chunks.

In step S440, the chunks CHK_EG, CHK_ABK, CHK_L0, and CHK_SST obtained above are all the semantic edges.

SST is taken as the root node; the first-level CHK_EG, CHK_L0, CHK_ABK, and CHK_GBK chunks are attached under it as its child nodes; the second-level CHK_EG, CHK_L0, CHK_ABK, and CHK_GBK chunks are attached under those child nodes as their children; and so on, until all remaining nodes are leaf nodes.

Embodiment 3:

This embodiment gives a concrete application example; Figures 4 and 5 are schematic diagrams of the hierarchical semantic tree constructed for the example sentence. As shown in Figure 4, the sentence to be processed is "网络浏览器使用统一资源定位符将HTML请求发送给由系统控制的服务器。" ("The web browser uses a uniform resource locator to send the HTML request to the server controlled by the system."). The clause-level semantic tree structure is: GBK1 "网络浏览器" (web browser) + ABK "使用统一资源定位符" (using a uniform resource locator) + L0 "将" + GBK2 "HTML请求" (HTML request) + EG "发送给" (send to) + GBK3 "由系统控制的服务器" (the server controlled by the system), with the CHK_SST chunk (the full stop) as the root node. The first-level semantic nodes are L1 (使用), L0 (将), and V (发送给), all three with level 0. The first-level semantic edges are CHK_ABK (使用统一资源定位符), CHK_L0 (将), CHK_EG (发送给), and CHK_GBK (网络浏览器, HTML请求, 由系统控制的服务器), all six chunks with level 0, attached as children of the root node. Since the CHK_EG "发送给" expresses an action of transfer, the semantic roles of the CHK_GBK chunks can be determined as follows: "网络浏览器" is the actor GBK1, "HTML请求" is the content GBK2, and "由系统控制的服务器" is the target GBK3. The chunk-level semantic relations of GBK1 and GBK2 are relatively simple: although "浏览器" is the semantic centre of "网络浏览器" and "请求" is the semantic centre of "HTML请求", there are no chunk-level semantic edges, so the words of these chunks are attached as leaf nodes. The chunk-level semantic tree structure inside GBK3 is: L0 "由" + GBK2 "系统" + EG "控制" + CHK_L1 "的" + GBK3 "服务器", with the GBK3 chunk as the root node of this subtree. The second-level semantic nodes are L0 (由), V (控制), and L1 (的), all three with level -1. The second-level semantic edges are CHK_L0 (由), CHK_EG (控制), CHK_L1 (的), and CHK_GBK (系统, 服务器), all five chunks with level -1, attached as child nodes. Since the CHK_EG "控制" is a concept expressing a generalized action, the semantic roles of the CHK_GBK chunks can be determined as follows: "系统" is the actor GBK1 and "服务器" is the content GBK2. The semantic tree established for this sentence in this embodiment is shown in Figures 4 and 5.
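
Since Figures 4 and 5 are not reproduced here, the structure described in this paragraph can be rendered informally as the indented outline below (an editorial rendering of the text above, not a reproduction of the figures; the leaf words inside GBK1 and GBK2 are omitted).

    CHK_SST "。" (root)
      CHK_GBK "网络浏览器" (GBK1, actor), level 0
      CHK_ABK "使用统一资源定位符", level 0
      CHK_L0 "将", level 0
      CHK_GBK "HTML请求" (GBK2, content), level 0
      CHK_EG "发送给", level 0
      CHK_GBK "由系统控制的服务器" (GBK3, target), level 0
        CHK_L0 "由", level -1
        CHK_GBK "系统" (actor), level -1
        CHK_EG "控制", level -1
        CHK_L1 "的", level -1
        CHK_GBK "服务器" (content), level -1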

Embodiment 4:

This embodiment provides a system implementing the hierarchical semantic tree construction method described in the above embodiments. The hierarchical semantic tree construction system 500 of this embodiment, whose structural block diagram is shown in Figure 6, comprises:

a preprocessing unit S520, which inputs the sentence to be processed, segments it into words, and loads the semantic knowledge of the segmented words;

a first-sequence generation unit S530, which identifies the semantic nodes of the sentence according to the word-segmentation result and obtains the level of each semantic node using semantic knowledge and word positions;

a second-sequence generation unit S540, which identifies the semantic edges at different levels in the sentence;

a hierarchical-semantic-tree generation unit S550, which generates the hierarchical semantic tree from the semantic edges of each level.

In addition, in implementation the system also includes a sentence input unit and a hierarchical semantic tree output unit S560.

Preferably, in the preprocessing unit S520, the sentence to be processed is segmented according to a domain dictionary and a general dictionary. In this embodiment the semantic knowledge includes the generalized concept classes of words and their subclasses; the generalized concept classes of words include dynamic, static, object, person, attribute, and logic.

Preferably, the first-sequence generation unit S530 includes:

a first subunit: for each segmented word, if the semantic knowledge of the word contains a logical concept, the word is marked L; if it contains a dynamic concept, it is marked V;

a second subunit: LV exclusion processing is performed on all words marked L or V;

a third subunit: all L marks are re-marked according to their concept category, and it is judged whether they have a post-mark; if so, the post-marked word is marked L1H; semantic nodes are generated from all of the above marks.

It also includes a fourth subunit: a semantic node generated from the end-of-sentence punctuation serves as the root node.

The first-sequence generation unit S530 further includes:

a fifth subunit: the default level of every L mark and V mark is 0; when two such marks are adjacent, the level of the second mark is reduced by one, to -1.

The second-sequence generation unit S540 includes:

a core-verb identification unit, which performs core-verb identification on all semantic nodes marked V and generates chunks;

a chunk generation unit, which generates chunks for all semantic nodes marked L;

a semantic-edge generation unit, which generates semantic edges from the chunks.

The core-verb identification performed in the core-verb identification unit further includes:

an exclusion subunit, which excludes words that cannot serve as the core verb;

a selection subunit, which assigns different weights to the remaining words according to their composition and the features of the words themselves and selects the core verb from the weight ranking and position information.

The hierarchical-semantic-tree generation unit S550 includes:

a root-node subunit, which selects the root node;

a child-node subunit, which attaches the chunks of the highest level to the root node as child nodes, in the order they appear at that level;

a traversal subunit, which traverses all child nodes and makes every chunk that falls within the range of a child node a child of that node, until no new child nodes are produced.

FIG. 6 is a schematic diagram illustrating the hierarchical semantic tree construction system 500 in an embodiment of the present invention. The hierarchical semantic tree construction system 500 comprises five units: a preprocessing unit S520, a first sequence generation unit S530, a second sequence generation unit S540, a hierarchical semantic tree generation unit S550 and a hierarchical semantic tree output unit S560. Step S510 represents the input of the sentence, generally a single complete sentence rather than a sentence group or a passage. The preprocessing unit S520 performs word segmentation on the sentence, handles paired special punctuation such as brackets, quotation marks and book-title marks, loads the semantic knowledge base, binds numbers and English abbreviations appearing in the sentence and loads their semantic information, processes effective punctuation such as commas, colons, enumeration commas and full stops and loads their semantic information, and applies disambiguation rules to words belonging to more than one category. The main purpose of the preprocessing unit is to remove interference so that the subsequent recognition steps are simpler and easier to carry out. The first sequence generation unit S530 mainly applies the LV principle to process all words containing an l or v concept so as to identify all L/V semantic nodes, and uses the positional relations among the LV semantic nodes to distinguish their levels: the default level is 0, denoting the first level, and the second level is -1. It also identifies punctuation-type semantic nodes according to the semantic information of effective punctuation such as commas, colons, enumeration commas and full stops. The second sequence generation unit S540 mainly identifies the node edges CHK_EG, CHK_L0, CHK_ABK and CHK_GBK and their levels according to all L/V/SST semantic nodes and their levels. The hierarchical semantic tree generation unit S550 mainly analyzes the internal structure of CHK_GBK, identifying coordinate semantic structures, the semantic structure of the above-mentioned degraded clauses, and other structures according to the combining markers inside the chunk. It should be noted in particular that the recognition of degraded clauses is similar to recognition at the clause level, except that the level information of CHK_ABK, CHK_L0 and CHK_EG is -1. The hierarchical semantic tree output unit S560 outputs the hierarchical semantic tree obtained from the result of the hierarchical semantic tree generation unit, specifically: SST is taken as the root node; the first-level CHK_EG, CHK_L0, CHK_ABK and CHK_GBK are attached below it as its child nodes; the second-level CHK_EG, CHK_L0, CHK_ABK and CHK_GBK are attached below those child nodes as their children; and so on, until all remaining nodes are leaf nodes.
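Viewed as a processing pipeline, the five units S520 through S560 chain together as sketched below; the function names and signatures are placeholders chosen for illustration, and each body merely names the operations attributed to the corresponding unit in the paragraph above:

```python
# Skeleton of the S520-S560 pipeline; every function body is a placeholder.
def preprocess(sentence, knowledge_base):              # S520
    """Segment words, handle paired and effective punctuation, bind numbers
    and abbreviations, load semantic knowledge, disambiguate ambiguous words."""
    ...

def generate_first_sequence(segmented_words):          # S530
    """Apply the LV principle to identify L/V semantic nodes, assign levels
    (0 by default, -1 for the second of two adjacent nodes), and create
    punctuation-type semantic nodes from effective punctuation."""
    ...

def generate_second_sequence(semantic_nodes):          # S540
    """Identify the chunks CHK_EG, CHK_L0, CHK_ABK, CHK_GBK and their levels
    from the L/V/SST nodes."""
    ...

def generate_tree_structure(chunks):                   # S550
    """Analyze the inside of CHK_GBK: coordinate structures and degraded
    clauses (whose CHK_ABK/CHK_L0/CHK_EG carry level -1)."""
    ...

def output_tree(analysis):                             # S560
    """Hang first-level chunks under the SST root node, second-level chunks
    under those, and so on until only leaf nodes remain."""
    ...

def build_hierarchical_semantic_tree(sentence, knowledge_base):
    words = preprocess(sentence, knowledge_base)
    nodes = generate_first_sequence(words)
    chunks = generate_second_sequence(nodes)
    analysis = generate_tree_structure(chunks)
    return output_tree(analysis)
```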

Obviously, the above embodiments are merely examples given for the sake of clarity and are not intended to limit the possible implementations. Those of ordinary skill in the art may make changes or modifications in other forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here, and the obvious changes or modifications derived therefrom still fall within the scope of protection of the present invention.

Claims (10)

1. A hierarchical semantic tree construction method and system for language understanding, characterized in that the method comprises the following steps:
S1: inputting a sentence to be processed, performing word segmentation on the sentence, and loading the semantic knowledge of the segmented words;
S2: identifying the semantic nodes of the sentence according to the word segmentation result;
S3: obtaining the level of each semantic node by using semantic knowledge together with word position and collocation;
S4: identifying the semantic edges of different levels in the sentence;
S5: generating the hierarchical semantic tree according to the semantic edges of each level.
2. The hierarchical semantic tree construction method according to claim 1, characterized in that, in step S1, the word segmentation of the sentence to be processed is performed according to a domain dictionary and a general dictionary.
3. The hierarchical semantic tree construction method according to claim 1 or 2, characterized in that the semantic knowledge comprises the generalized concept class of a word and its subclasses, the generalized concept classes of words comprising dynamic, static, thing, person, attribute and logic.
4. The hierarchical semantic tree construction method according to any one of claims 1 to 3, characterized in that the process of "identifying the semantic nodes of the sentence according to the word segmentation result" in step S2 comprises:
for each segmented word, labeling the word as L if its semantic knowledge contains a logic concept, and labeling it as V if its semantic knowledge contains a dynamic concept;
performing LV exclusion processing on all words labeled L or V;
marking all L labels according to their concept subclasses, judging whether each has a post-marker and, if so, labeling the post-marker word as L1H; and generating the semantic nodes from all of the above labels.
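A minimal sketch of the labeling step in claim 4, assuming that the loaded semantic knowledge maps each word to a set of concept tags; the tag names "logic" and "dynamic" and the dictionary-style node records are illustrative assumptions, and the LV exclusion and L1H post-marker handling are only indicated by a comment:

```python
# Illustrative L/V labeling per claim 4; concept-tag names are assumptions.
def label_semantic_nodes(words, knowledge):
    nodes = []
    for i, word in enumerate(words):
        concepts = knowledge.get(word, set())
        if "logic" in concepts:            # l concept -> label L
            label = "L"
        elif "dynamic" in concepts:        # v concept -> label V
            label = "V"
        else:
            continue
        nodes.append({"index": i, "word": word, "label": label, "level": 0})
    # LV exclusion and L1H post-marker labeling would further refine `nodes`.
    return nodes
```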
5. The hierarchical semantic tree construction method according to any one of claims 1 to 4, characterized in that the process of "identifying the semantic nodes of the sentence according to the word segmentation result" in step S2 further comprises: generating a semantic node from the sentence-final punctuation mark to serve as the root node.
6. The hierarchical semantic tree construction method according to any one of claims 1 to 5, characterized in that the process of "obtaining the level of each semantic node by using semantic knowledge together with word position and collocation" in step S3 comprises:
setting the default level of all L and V labels to 0, and, when two such labels are adjacent, lowering the level of the second label by one layer to -1.
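The level rule of claim 6 can be sketched as follows; treating adjacency as consecutive word indices is an assumption made for illustration:

```python
# Illustrative level assignment per claim 6; adjacency by word index is assumed.
def assign_levels(nodes):
    nodes = sorted(nodes, key=lambda n: n["index"])
    for prev, cur in zip(nodes, nodes[1:]):
        if cur["index"] == prev["index"] + 1:   # two L/V labels are adjacent
            cur["level"] = -1                   # second label drops one layer
    return nodes
```

For example, V labels at word indices 2 and 3 would leave the first at level 0 and lower the second to -1.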
7. The hierarchical semantic tree construction method according to any one of claims 1 to 6, characterized in that the process of "identifying the semantic edges of different levels in the sentence" in step S4 comprises:
performing core-verb identification on all semantic nodes labeled V and generating chunks;
generating chunks for all semantic nodes labeled L;
generating semantic edges from the chunks.
8. The hierarchical semantic tree construction method according to any one of claims 1 to 7, characterized in that the core-verb identification process comprises:
excluding words that cannot serve as the core verb;
assigning different weights to the remaining words according to their compositional features and the features of the words themselves, and selecting the core verb according to the weight ranking and positional information.
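A minimal sketch of the selection in claim 8; the claim does not specify the exclusion list, the features, or the weights, so the scoring and tie-breaking below are purely hypothetical placeholders:

```python
# Illustrative core-verb selection per claim 8; exclusion set, features,
# weights and the positional tie-break are all hypothetical.
def select_core_verb(v_nodes, excluded_words):
    candidates = [n for n in v_nodes if n["word"] not in excluded_words]
    if not candidates:
        return None

    def weight(node):
        w = 0.0
        if node.get("takes_object"):       # hypothetical compositional feature
            w += 1.0
        if node["level"] == 0:             # hypothetical preference for top level
            w += 0.5
        return w

    # Highest weight wins; ties are broken by sentence position (illustrative).
    return max(candidates, key=lambda n: (weight(n), -n["index"]))
```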
9. The hierarchical semantic tree construction method according to any one of claims 1 to 8, characterized in that the process of generating the hierarchical semantic tree according to the semantic edges of each level comprises:
selecting the root node;
attaching the highest-level chunks to the root node, in their order within that level, as child nodes;
traversing all child nodes and attaching every chunk within the span of each child node as that node's children, until no new child nodes are produced.
10. A hierarchical semantic tree construction system corresponding to the hierarchical semantic tree construction method of any one of claims 1 to 9, characterized in that it comprises:
a preprocessing unit, which inputs the sentence to be processed, performs word segmentation on it, and loads the semantic knowledge of the segmented words;
a first sequence generation unit, which identifies the semantic nodes of the sentence according to the word segmentation result and obtains the level of each semantic node by using semantic knowledge together with word position and collocation;
a second sequence generation unit, which identifies the semantic edges of different levels in the sentence;
a hierarchical semantic tree generation unit, which generates the hierarchical semantic tree according to the semantic edges of each level.
CN201410216929.8A 2014-05-21 2014-05-21 A method and system for constructing a hierarchical semantic tree for language understanding Expired - Fee Related CN104142917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410216929.8A CN104142917B (en) 2014-05-21 2014-05-21 A method and system for constructing a hierarchical semantic tree for language understanding


Publications (2)

Publication Number Publication Date
CN104142917A true CN104142917A (en) 2014-11-12
CN104142917B CN104142917B (en) 2018-05-01

Family

ID=51852093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410216929.8A Expired - Fee Related CN104142917B (en) 2014-05-21 2014-05-21 A method and system for constructing a hierarchical semantic tree for language understanding

Country Status (1)

Country Link
CN (1) CN104142917B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5424947A (en) * 1990-06-15 1995-06-13 International Business Machines Corporation Natural language analyzing apparatus and method, and construction of a knowledge base for natural language analysis
CN1430160A (en) * 2001-12-30 2003-07-16 中国科学院声学研究所 Expression and obtaining method of natural language sentence cluster
CN101059806A (en) * 2007-06-06 2007-10-24 华东师范大学 Word sense based local file searching method
US20090077113A1 (en) * 2005-05-12 2009-03-19 Kabire Fidaali Device and method for semantic analysis of documents by construction of n-ary semantic trees


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHU XIAOJIAN, JIN YAOHONG: "Hierarchical Semantic-Category-Tree Model for Chinese-English Machine Translation", China Communications *
LIU ZHIYING, GUO YANBO, JIN YAOHONG: "Research on Format Conversion in Chinese-English Machine Translation", Computer Engineering and Applications *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657348B (en) * 2015-02-06 2017-11-14 北京师范大学 Predicate language block extracting method and system in a kind of Chinese sentence
CN104657348A (en) * 2015-02-06 2015-05-27 北京师范大学 Method and system for extracting predicate chunk in Chinese sentence
CN105068995B (en) * 2015-08-19 2018-05-29 刘战雄 A method and device for natural language semantic computing based on interrogative semantics
CN105068995A (en) * 2015-08-19 2015-11-18 刘战雄 A method and device for natural language semantic computing based on interrogative semantics
CN107562731B (en) * 2015-08-19 2020-09-04 刘战雄 Natural language semantic calculation method and device based on question semantics
CN107562731A (en) * 2015-08-19 2018-01-09 刘战雄 A kind of method and device of the natural language semantic computation semantic based on query
CN106021286A (en) * 2016-04-29 2016-10-12 东北电力大学 Method for language understanding based on language structure
US11797825B2 (en) 2016-11-03 2023-10-24 Salesforce, Inc. Training a joint many-task neural network model using successive regularization
US11783164B2 (en) 2016-11-03 2023-10-10 Salesforce.Com, Inc. Joint many-task neural network model for multiple natural language processing (NLP) tasks
CN110192204A (en) * 2016-11-03 2019-08-30 易享信息技术有限公司 Deep neural network models that process data through multiple language task levels
CN110192203A (en) * 2016-11-03 2019-08-30 易享信息技术有限公司 Joint multi-task neural network model for multiple natural language processing (NLP) tasks
CN110192204B (en) * 2016-11-03 2023-09-29 硕动力公司 Deep neural network model for processing data through multiple language task hierarchies
CN110192203B (en) * 2016-11-03 2023-10-03 硕动力公司 Joint multitasking neural network model for multiple Natural Language Processing (NLP) tasks
CN106897371A (en) * 2017-01-18 2017-06-27 南京云思创智信息科技有限公司 Chinese text classification system and method
CN106897371B (en) * 2017-01-18 2020-04-21 南京云思创智信息科技有限公司 Chinese text classification system and method
CN109271621A (en) * 2017-07-18 2019-01-25 腾讯科技(北京)有限公司 Semanteme disambiguates processing method, device and its equipment
CN109271621B (en) * 2017-07-18 2023-04-18 腾讯科技(北京)有限公司 Semantic disambiguation processing method, device and equipment
CN107679035B (en) * 2017-10-11 2020-06-12 石河子大学 An information intent detection method, apparatus, device and storage medium
CN107679035A (en) * 2017-10-11 2018-02-09 石河子大学 A kind of information intent detection method, device, equipment and storage medium
CN110209829A (en) * 2018-02-12 2019-09-06 百度在线网络技术(北京)有限公司 Information processing method and device
CN110209829B (en) * 2018-02-12 2021-06-29 百度在线网络技术(北京)有限公司 Information processing method and device
CN108874791A (en) * 2018-07-06 2018-11-23 北京联合大学 A kind of semantic analysis based on minimum semantic chunk and Chinese-English sequence adjusting method and system
CN108874791B (en) * 2018-07-06 2022-05-24 北京联合大学 A method and system for semantic analysis and Chinese-English sequencing based on minimal semantic block
CN109446719A (en) * 2018-11-15 2019-03-08 北京神舟航天软件技术有限公司 A kind of implementation method of the customized product tree of element type
CN109815490A (en) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 Text analyzing method, apparatus, equipment and storage medium
CN109815490B (en) * 2019-01-04 2023-11-14 平安科技(深圳)有限公司 Text analysis method, device, equipment and storage medium
CN110276080A (en) * 2019-06-28 2019-09-24 第四范式(北京)技术有限公司 A kind of semantic processes method and system
CN110276080B (en) * 2019-06-28 2023-10-17 第四范式(北京)技术有限公司 Semantic processing method and system
CN113128226A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Named entity recognition method and device, electronic equipment and computer storage medium
CN113128226B (en) * 2019-12-31 2024-09-27 阿里巴巴集团控股有限公司 Named entity recognition method, named entity recognition device, electronic equipment and computer storage medium
CN111782781A (en) * 2020-05-29 2020-10-16 平安科技(深圳)有限公司 A semantic analysis method, device, computer equipment and storage medium
WO2021135103A1 (en) * 2020-05-29 2021-07-08 平安科技(深圳)有限公司 Method and apparatus for semantic analysis, computer device, and storage medium
CN111814487A (en) * 2020-07-17 2020-10-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and storage medium
CN111814487B (en) * 2020-07-17 2024-05-31 科大讯飞股份有限公司 Semantic understanding method, device, equipment and storage medium
CN115017913A (en) * 2022-04-21 2022-09-06 广州世纪华轲科技有限公司 Semantic component analysis method based on master-slave framework mode

Also Published As

Publication number Publication date
CN104142917B (en) 2018-05-01

Similar Documents

Publication Publication Date Title
CN104142917B (en) A method and system for constructing a hierarchical semantic tree for language understanding
Qi et al. Openhownet: An open sememe-based lexical knowledge base
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
RU2686000C1 (en) Retrieval of information objects using a combination of classifiers analyzing local and non-local signs
US9542477B2 (en) Method of automated discovery of topics relatedness
US9015035B2 (en) User modification of generative model for determining topics and sentiments
Cimiano et al. Automatic acquisition of taxonomies from text: FCA meets NLP
RU2626555C2 (en) Extraction of entities from texts in natural language
US9224103B1 (en) Automatic annotation for training and evaluation of semantic analysis engines
US20150278195A1 (en) Text data sentiment analysis method
US20150051900A1 (en) Unsupervised learning of deep patterns for semantic parsing
CN107315737A (en) A kind of semantic logic processing method and system
CN107220352A (en) The method and apparatus that comment collection of illustrative plates is built based on artificial intelligence
US8326833B2 (en) Implementing metadata extraction of artifacts from associated collaborative discussions
CN104246763B (en) Methods for processing text to construct text models
CN105975499A (en) Text subject detection method and system
WO2022116435A1 (en) Title generation method and apparatus, electronic device and storage medium
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
CN108875059A (en) For generating method, apparatus, electronic equipment and the storage medium of document label
US10223349B2 (en) Inducing and applying a subject-targeted context free grammar
US12106045B2 (en) Self-learning annotations to generate rules to be utilized by rule-based system
CN106156143A (en) Page processor and web page processing method
Banerjee et al. Generating abstractive summaries from meeting transcripts
Yang et al. PurExt: Automated Extraction of the Purpose‐Aware Rule from the Natural Language Privacy Policy in IoT

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180501

Termination date: 20210521

CF01 Termination of patent right due to non-payment of annual fee