CN110852056B

CN110852056B - Method, device and equipment for obtaining text similarity and readable storage medium

Info

Publication number: CN110852056B
Application number: CN201810827262.3A
Authority: CN
Inventors: 李鹏
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2024-09-24
Anticipated expiration: 2038-07-25
Also published as: CN110852056A; WO2020020287A1

Abstract

The invention discloses a method, device, equipment and readable storage medium for obtaining text similarity, belonging to the technical field of communication. The method comprises: obtaining numerical features of text pairs from a text pair data set; constructing a sample feature matrix from the numerical features of the text pairs; performing model training on the sample feature matrix and a prediction vector to obtain a prediction model; and obtaining a target text pair and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model. Text similarity is judged by obtaining multiple numerical features of the text pairs while taking both semantic and syntactic structure into account. The method has the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation and high accuracy, and improves user experience.

Description

A method, device, equipment and readable storage medium for obtaining text similarity

Technical Field

The present invention relates to the field of communication technology, and in particular to a method, device, equipment and readable storage medium for obtaining text similarity.

Background Art

In the era of information explosion, the demand for quickly and accurately obtaining required content from massive amounts of information grows day by day. Many applications have emerged to meet this demand, such as information retrieval, intelligent question answering, document duplicate checking and personalized recommendation. Text similarity calculation is one of the key core technologies behind these applications.

Text similarity has been widely discussed in different fields. Because application scenarios differ, its meaning varies, and there is no unified, generally accepted definition. From the perspective of information theory, text similarity is related to the commonalities and differences between texts: the greater the commonality and the smaller the difference, the higher the similarity between the texts; conversely, the smaller the commonality and the greater the difference, the lower the similarity.

Summary of the Invention

The present invention provides a method, device, equipment and readable storage medium for obtaining text similarity. Text similarity is judged by obtaining multiple numerical features of text pairs while taking both semantic and syntactic structure into account. The approach has the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation and high accuracy, and improves user experience.

The technical solution adopted by the present invention to solve the above technical problem is as follows:

According to one aspect of the present invention, a method for obtaining text similarity is provided, comprising:

obtaining numerical features of text pairs from a text pair data set;

constructing a sample feature matrix from the numerical features of the text pairs;

performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model;

obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model.

Optionally, the numerical features include: a semantic feature based on ordered edit distance, a semantic feature based on unordered edit distance, a semantic feature based on word meaning distance, and a syntactic feature based on dependency relations.

Optionally, obtaining the numerical features of the text pairs from the text pair data set includes:

obtaining a training corpus file, wherein the training corpus file includes a plurality of text pairs and a similarity score for each text pair;

obtaining a training data set from the training corpus file;

obtaining a word vector matrix from the training data set;

obtaining, according to the word vector matrix and the edit distance, a first improved edit distance between the texts of a pair as the semantic feature based on ordered edit distance;

obtaining, according to the edit distance and a bag-of-words model, a second improved edit distance between the texts of a pair as the semantic feature based on unordered edit distance;

obtaining, according to the word vector matrix, a word meaning distance between the texts of a pair as the semantic feature based on word meaning distance;

performing dependency syntactic analysis on the text pair and obtaining a syntactic distance between the texts of the pair as the syntactic feature based on dependency relations.

Optionally, obtaining the target text pair and obtaining the similarity score of the target text pair according to the sample feature matrix and the prediction model includes:

obtaining a target text pair, obtaining the numerical features of the target text pair, and forming a feature vector of the target text pair;

substituting the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

According to another aspect of the present invention, a device for obtaining text similarity is provided, comprising:

a training module, configured to obtain numerical features of text pairs from a text pair data set;

a matrix construction module, configured to construct a sample feature matrix from the numerical features of the text pairs;

a prediction module, configured to perform model training according to the sample feature matrix and a prediction vector to obtain a prediction model;

an online acquisition module, configured to obtain a target text pair and obtain a similarity score of the target text pair according to the sample feature matrix and the prediction model.

Optionally, the training module includes:

an acquisition unit, configured to obtain a training corpus file, wherein the training corpus file includes a plurality of text pairs and a similarity score for each text pair;

an extraction unit, configured to obtain a training data set from the training corpus file;

a word vector acquisition unit, configured to obtain a word vector matrix from the training data set;

an ordered edit distance acquisition unit, configured to obtain, according to the word vector matrix and the edit distance, a first improved edit distance between the texts of a pair as the semantic feature based on ordered edit distance;

an unordered edit distance acquisition unit, configured to obtain, according to the edit distance and a bag-of-words model, a second improved edit distance between the texts of a pair as the semantic feature based on unordered edit distance;

a word meaning distance acquisition unit, configured to obtain, according to the word vector matrix, a word meaning distance between the texts of a pair as the semantic feature based on word meaning distance;

a syntactic distance acquisition unit, configured to perform dependency syntactic analysis on the text pair and obtain a syntactic distance between the texts of the pair as the syntactic feature based on dependency relations.

Optionally, the online acquisition module includes:

a feature vector acquisition unit, configured to obtain a target text pair, obtain the numerical features of the target text pair, and form a feature vector of the target text pair;

a similarity acquisition unit, configured to substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

According to yet another aspect of the present invention, an electronic device is provided, comprising a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, wherein the application program is configured to execute the method for obtaining text similarity described above.

According to yet another aspect of the present invention, a readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method for obtaining text similarity described above.

Embodiments of the present invention provide a method, device, equipment and readable storage medium for obtaining text similarity. The method comprises: obtaining numerical features of text pairs from a text pair data set; constructing a sample feature matrix from the numerical features of the text pairs; performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model; and obtaining a target text pair and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model. Text similarity is judged by obtaining multiple numerical features of the text pairs while taking both semantic and syntactic structure into account. The method has the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation and high accuracy, and improves user experience.

Brief Description of the Drawings

FIG. 1 is a flow chart of a method for obtaining text similarity provided by Embodiment 1 of the present invention;

FIG. 2 is a flow chart of step S10 in FIG. 1;

FIG. 3 is a flow chart of step S40 in FIG. 1;

FIG. 4 is an exemplary structural block diagram of a device for obtaining text similarity provided by Embodiment 2 of the present invention;

FIG. 5 is an exemplary structural block diagram of the training module in FIG. 4;

FIG. 6 is an exemplary structural block diagram of the online acquisition module in FIG. 4.

The realization of the purpose, the functional features and the advantages of the present invention are further explained below in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

In order to make the technical problem to be solved, the technical solution and the beneficial effects of the present invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

Embodiment 1

As shown in FIG. 1, in this embodiment, a method for obtaining text similarity includes:

S10. Obtaining numerical features of text pairs from a text pair data set;

S20. Constructing a sample feature matrix from the numerical features of the text pairs;

S30. Performing model training according to the sample feature matrix and a prediction vector to obtain a prediction model;

S40. Obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model.

In this embodiment, text similarity is judged by obtaining multiple numerical features of a text pair while taking both semantic and syntactic structure into account. The semantic similarity between the texts is considered through features such as word meaning, edit distance and the bag-of-words model, and the grammatical similarity is considered through syntactic structure. Semantics and syntax are combined, and a neural network is used to perform higher-level feature extraction. The approach has the advantages of trainable weights, little manual intervention, simplicity and speed, ease of implementation and high accuracy, and improves user experience.

In step S10, a large amount of annotated text pair data is first prepared as the training corpus. Each sample in the training corpus is a text pair together with its annotated similarity score, which can be written formally as [text1; text2; score], where text1 and text2 are the two texts whose similarity is to be obtained and score is the annotated similarity score. The annotated score may come from manual annotation or from other prior information, such as a user's satisfaction with the system's answer in a question answering system, or a user's browsing of the ranked results in a retrieval system. All samples are saved in the file originalData.txt, one training sample per line, and text1, text2 and score within each sample may be separated by tab characters. The annotated similarity score is a real number between 0 and 1; the larger the number, the higher the similarity between the two texts, and vice versa. In particular, a score of 0 indicates that the two texts are completely unrelated, and a score of 1 indicates that they are identical. The precision of the score is not fixed and depends on its source: scores from manual annotation are mostly decimals with one digit of precision, such as 0.3 or 0.6, while scores from other application systems are mostly decimals with several digits of precision, such as 0.563 or 0.8192.

In this embodiment, the file originalData.txt has the following form:

我想问下在哪里可以购入中兴手机、中兴手机在哪里购买、0.769 (I would like to ask where I can buy a ZTE mobile phone; where to buy ZTE mobile phones; 0.769)

中兴公司在南京市雨花台区、南京雨花台区的中兴通讯公司、0.816 (ZTE Corporation is in Yuhuatai District, Nanjing; the ZTE Corporation in Yuhuatai District, Nanjing; 0.816)

智能问答系统团队又出新成果、智能问答领域日新月异、0.324 (The intelligent question answering system team has produced new results; the field of intelligent question answering is changing rapidly; 0.324)

办理信用卡的渠道有哪些、借记卡申请的方式、0.814 (What are the channels for applying for a credit card; how to apply for a debit card; 0.814)
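As a non-limiting illustration, a file of this kind could be read as sketched below. The function name `parse_samples` and the validation rules are assumptions made for the example and are not part of the embodiment; the sketch assumes the tab-separated layout described for originalData.txt and works on a list of lines so that it does not depend on a file being present.

```python
from typing import List, Tuple

Sample = Tuple[str, str, float]


def parse_samples(lines: List[str]) -> List[Sample]:
    """Parse lines of the form text1<TAB>text2<TAB>score (hypothetical helper)."""
    samples: List[Sample] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields")
        text1, text2, raw_score = parts
        score = float(raw_score)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"line {line_no}: score {score} is outside [0, 1]")
        samples.append((text1, text2, score))
    return samples


demo_lines = [
    "我想问下在哪里可以购入中兴手机\t中兴手机在哪里购买\t0.769\n",
    "智能问答系统团队又出新成果\t智能问答领域日新月异\t0.324\n",
]
samples = parse_samples(demo_lines)
```

Rejecting malformed lines early keeps later feature extraction from silently working on shifted fields.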

In this embodiment, assume that the training corpus file contains M lines of text pairs and that N numerical features are obtained for each text pair. The sample feature matrix extracted from the training corpus can then be expressed as $X \in \mathbb{R}^{M \times N}$. Taking the annotated similarity score of each text pair as the predicted value of that sample, a prediction vector $y \in \mathbb{R}^{M \times 1}$ can be extracted from the training corpus. The training data set can therefore be expressed as $D = [X, y]$.
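The assembly of D = [X, y] can be sketched as follows. This is only an illustration under stated assumptions: `build_dataset` is a hypothetical helper, and the two feature functions are trivial placeholders standing in for the four features of the embodiment, chosen only so that the example runs without external resources.

```python
from typing import Callable, List, Sequence, Tuple

FeatureFn = Callable[[str, str], float]
Sample = Tuple[str, str, float]


def build_dataset(
    samples: Sequence[Sample], feature_fns: Sequence[FeatureFn]
) -> Tuple[List[List[float]], List[float]]:
    """Return (X, y): X has M rows and N columns, y has M entries."""
    X = [[fn(t1, t2) for fn in feature_fns] for t1, t2, _ in samples]
    y = [score for _, _, score in samples]
    return X, y


# Placeholder features (assumptions, not the features of the embodiment).
def length_gap(t1: str, t2: str) -> float:
    return float(abs(len(t1) - len(t2)))


def char_overlap(t1: str, t2: str) -> float:
    return float(len(set(t1) & set(t2)))


demo_samples = [("abc", "abd", 0.8), ("abc", "xyz", 0.1)]
X, y = build_dataset(demo_samples, [length_gap, char_overlap])
```

Here M = 2 and N = 2; with the four features of the embodiment, each row of X would hold the four values computed for one text pair.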

In this embodiment, the numerical features include: a semantic feature based on ordered edit distance, a semantic feature based on unordered edit distance, a semantic feature based on word meaning distance, and a syntactic feature based on dependency relations.

In this embodiment, in addition to the ordered edit distance, the movement distance of unordered words is also considered. This adapts better to texts whose word order is simply reversed and can considerably improve the recall rate of the system. Moreover, this embodiment obtains syntactic similarity from the number of valid dependency pairs in a sentence, which better measures the number of words that have a dependency relationship with the core word of the sentence.

As shown in FIG. 2, in this embodiment, step S10 includes:

S11. Obtaining a training corpus file, wherein the training corpus file includes a plurality of text pairs and a similarity score for each text pair;

S12. Obtaining a training data set from the training corpus file;

S13. Obtaining a word vector matrix from the training data set;

In this embodiment, any word vector training method in common use in the industry may be adopted. Here the Word2Vec method is used, with the following specific steps:

S131. Generate a new training corpus file originalDataForWord2Vec.txt from the file originalData.txt: for each sample line in originalData.txt, take only text1 and text2, and store text1 and text2 on two separate lines.

The corpus file originalDataForWord2Vec.txt has the following form:

我想问下在哪里可以购入中兴手机 (I would like to ask where I can buy a ZTE mobile phone)

中兴手机在哪里购买 (Where to buy ZTE mobile phones)

中兴公司在南京市雨花台区 (ZTE Corporation is in Yuhuatai District, Nanjing)

南京雨花台区的中兴通讯公司 (The ZTE Corporation in Yuhuatai District, Nanjing)

智能问答系统团队又出新成果 (The intelligent question answering system team has produced new results)

智能问答领域日新月异 (The field of intelligent question answering is changing rapidly)

办理信用卡的渠道有哪些 (What are the channels for applying for a credit card)

借记卡申请的方式 (How to apply for a debit card)
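The conversion of step S131 can be illustrated with the following sketch. The helper name `to_word2vec_corpus` is hypothetical, tab-separated input lines are assumed, and the sketch operates on lists of strings rather than on files; writing the result to originalDataForWord2Vec.txt would be a separate, straightforward step.

```python
from typing import List


def to_word2vec_corpus(sample_lines: List[str]) -> List[str]:
    """Keep only text1 and text2 of each sample, one text per output line."""
    corpus: List[str] = []
    for line in sample_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # ignore blank or malformed lines in this sketch
        corpus.extend(parts[:2])  # drop the score, keep both texts
    return corpus


demo_lines = [
    "办理信用卡的渠道有哪些\t借记卡申请的方式\t0.814\n",
    "中兴公司在南京市雨花台区\t南京雨花台区的中兴通讯公司\t0.816\n",
]
corpus_lines = to_word2vec_corpus(demo_lines)
```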

S132. Train word vectors using word2vec, with the vector length denoted d_w (for example, d_w = 400).

S133. Denote the trained word2vec model as a real matrix in $\mathbb{R}^{|V| \times d_w}$, where V is the vocabulary consisting of all words in the corpus file, |V| is the number of words in the vocabulary, and $\mathbb{R}^{|V| \times d_w}$ denotes the real matrices with |V| rows and d_w columns.

S134. The word vector obtained from the model for a word w is a vector in $\mathbb{R}^{1 \times d_w}$, that is, a matrix with 1 row and d_w columns. Here w is a variable that can refer to any word, such as "中兴" (ZTE).

S14. Obtaining, according to the word vector matrix and the edit distance, a first improved edit distance between the texts of a pair as the semantic feature based on ordered edit distance;

In this embodiment, the edit operations defined for the first improved edit distance c_A are match (Mat), insert (Ins), delete (Del) and substitute (Sub), with corresponding operation costs c_Mat, c_Ins, c_Del and c_Sub. The specific calculation steps are as follows:

S141. Perform word segmentation and stop word removal on the texts text1 and text2 respectively to obtain the word sequences t1 and t2.

For example, text1 is "我想申请内购中兴手机了" (I want to apply for an internal purchase of a ZTE mobile phone). After word segmentation it is [我|想|申请|内购|中兴|手机|了], and after removing stop words the word sequence t1 is [申请|内购|中兴|手机] ([apply | internal purchase | ZTE | mobile phone]). text2 is "如何申请一下中兴产品的内购呢" (How do I apply for an internal purchase of ZTE products). After word segmentation it is [如何|申请|一下|中兴|产品|的|内购|呢], and after removing stop words the word sequence t2 is [如何|申请|中兴|产品|内购] ([how | apply | ZTE | product | internal purchase]). Here "我", "想", "了", "一下", "的" and "呢" are all stop words.
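The stop word removal of step S141 can be sketched as follows. Word segmentation itself is assumed to have been done already by an external tool, so the sketch starts from the segmented token lists given in the example; the name `remove_stop_words` and the stop word set, which contains only the six words named in the example, are illustrative assumptions.

```python
from typing import FrozenSet, List

# Only the stop words named in the example; a real list would be much longer.
STOP_WORDS: FrozenSet[str] = frozenset({"我", "想", "了", "一下", "的", "呢"})


def remove_stop_words(
    tokens: List[str], stop_words: FrozenSet[str] = STOP_WORDS
) -> List[str]:
    """Drop stop words while preserving the order of the remaining tokens."""
    return [tok for tok in tokens if tok not in stop_words]


t1 = remove_stop_words(["我", "想", "申请", "内购", "中兴", "手机", "了"])
t2 = remove_stop_words(["如何", "申请", "一下", "中兴", "产品", "的", "内购", "呢"])
```

Order is preserved because the ordered edit distance of step S14 depends on it.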

S142. Use a general method (such as a method based on dynamic programming) to calculate the edit path Path_A from word sequence t1 to word sequence t2 and the corresponding edit element sequence Elements_A.

For example, using a general method, the edit path Path_A from t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购] can be calculated as [Ins, Mat, Sub, Sub, Sub], and the corresponding edit element sequence Elements_A is [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购]. An element without an arrow denotes a Mat, Ins or Del operation, and an element with an arrow denotes a Sub operation.
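One possible dynamic programming realization of step S142 is sketched below. It is an assumption-laden illustration rather than the method of the embodiment: unit costs are used to find the path, ties are broken in favour of the diagonal (Mat or Sub) move, and the function name `edit_path` and the tuple representation of edit elements are hypothetical. Other tie-breaking rules can yield different paths of equal cost.

```python
from typing import List, Optional, Sequence, Tuple

Element = Tuple[Optional[str], Optional[str]]  # (word from t1, word from t2)


def edit_path(t1: Sequence[str], t2: Sequence[str]) -> Tuple[List[str], List[Element]]:
    """Word-level Levenshtein alignment with backtrace (unit costs)."""
    m, n = len(t1), len(t2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if t1[i - 1] == t2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    path: List[str] = []
    elements: List[Element] = []
    i, j = m, n
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and t1[i - 1] == t2[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if same else 1):
            path.append("Mat" if same else "Sub")  # diagonal move preferred
            elements.append((t1[i - 1], t2[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            path.append("Ins")  # word present only in t2
            elements.append((None, t2[j - 1]))
            j -= 1
        else:
            path.append("Del")  # word present only in t1
            elements.append((t1[i - 1], None))
            i -= 1
    path.reverse()
    elements.reverse()
    return path, elements


path_a, elements_a = edit_path(
    ["申请", "内购", "中兴", "手机"], ["如何", "申请", "中兴", "产品", "内购"]
)
```

With this tie-breaking rule, the sketch reproduces the path of the example, [Ins, Mat, Sub, Sub, Sub].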

S143. Obtain the corresponding edit operation cost vector Action_A from the edit path Path_A. Specifically, each edit operation is replaced by its operation cost to form the edit operation cost vector.

For example, if the edit path Path_A is [Ins, Mat, Sub, Sub, Sub], the corresponding edit operation cost vector is [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub].

S144. Calculate the edit element distance for each element in the edit element sequence Elements_A to obtain the edit element distance vector Dis_A. Specifically, the edit element distance for a Mat, Ins or Del operation is 1, and the edit element distance for a Sub operation is sim_cos(w₁, w₂), the cosine similarity between word w₁ and word w₂, which can be expressed as $\mathrm{sim}_{\cos}(w_1, w_2) = \frac{v_{w_1} \cdot v_{w_2}}{\lVert v_{w_1} \rVert \, \lVert v_{w_2} \rVert}$, where $v_w$ denotes the word vector of the word w.

For example, if the edit element sequence Elements_A is [如何, 申请, 内购→中兴, 中兴→产品, 手机→内购], the corresponding edit element distance vector Dis_A is [1, 1, 0.218, 0.294, 0.511].

S145. From the edit operation cost vector Action_A and the corresponding edit element distance vector Dis_A, calculate the first improved edit distance between the two texts as the element-wise weighted sum $c_A = \sum_i \mathrm{Action}_A[i] \cdot \mathrm{Dis}_A[i]$, which serves as the semantic feature based on ordered edit distance.

For example, if the edit operation cost vector is [c_Ins, c_Mat, c_Sub, c_Sub, c_Sub] and the corresponding edit element distance vector is [1, 1, 0.218, 0.294, 0.511], then:

c_A = 1*c_Ins + 1*c_Mat + 0.218*c_Sub + 0.294*c_Sub + 0.511*c_Sub.
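Steps S143 to S145 can be illustrated numerically with the sketch below. The embodiment does not give concrete values for the operation costs, so the values in `COSTS` are purely hypothetical, as are the function names; the distance vector is the one from the example, and `cosine_similarity` shows how a Sub entry could be derived from two word vectors.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical operation costs; the embodiment leaves them unspecified.
COSTS: Dict[str, float] = {"Mat": 0.0, "Ins": 1.0, "Del": 1.0, "Sub": 1.0}


def improved_edit_distance(
    path: Sequence[str], distances: Sequence[float], costs: Dict[str, float] = COSTS
) -> float:
    """Sum of operation cost times edit element distance over the edit path."""
    if len(path) != len(distances):
        raise ValueError("path and distances must have the same length")
    return sum(costs[op] * d for op, d in zip(path, distances))


path_a = ["Ins", "Mat", "Sub", "Sub", "Sub"]
dis_a = [1, 1, 0.218, 0.294, 0.511]  # values from the example
c_a = improved_edit_distance(path_a, dis_a)
```

With these assumed costs the Mat term vanishes and c_A is 1 + 0.218 + 0.294 + 0.511 = 2.023.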

S15. Calculating, according to the edit distance and a bag-of-words model, a second improved edit distance between the texts of a pair as the semantic feature based on unordered edit distance;

In this embodiment, the edit operations defined for the second improved edit distance c_B are match (Mat), insert (Ins) and delete (Del), with corresponding operation costs c_Mat, c_Ins and c_Del. The specific calculation steps are as follows:

S151. Perform word segmentation and stop word removal on the texts text1 and text2 to obtain the word sequences t1 and t2.

S152. Add all distinct words in the word sequences t1 and t2 to a set to form the bag of words BOW.

For example, for t1 = [申请|内购|中兴|手机] and t2 = [如何|申请|中兴|产品|内购], the resulting bag of words BOW is [如何|申请|内购|中兴|手机|产品].

S153. Calculate the edit distance from t1 to t2 based on the bag of words BOW, t1 and t2. For each word w in BOW: if the word or a synonym of it exists in both t1 and t2, record the operation Mat; if the word or a synonym exists in t1 but not in t2, record the operation Del; if the word or a synonym does not exist in t1 but exists in t2, record the operation Ins. After these checks have been performed for all words in BOW in turn, the edit path Path_B is obtained, and from it the corresponding edit operation cost vector Action_B.

For example, the edit path Path_B from t1 = [申请|内购|中兴|手机] to t2 = [如何|申请|中兴|产品|内购] is [Ins, Mat, Mat, Mat, Del, Ins], so the edit operation cost vector Action_B is [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins].

S154. Sum all elements of the edit operation cost vector Action_B to obtain the second improved edit distance c_B between the two texts as the semantic feature based on unordered edit distance.

For example, for the edit operation cost vector Action_B = [c_Ins, c_Mat, c_Mat, c_Mat, c_Del, c_Ins], c_B = c_Ins + c_Mat + c_Mat + c_Mat + c_Del + c_Ins.
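Steps S152 to S154 can be sketched as follows, under several simplifying assumptions: synonym matching is omitted (only identical words match), the operation costs are hypothetical values, and the bag of words is built in first-seen order (t1 first, then t2). The order of the resulting path may therefore differ from the example, but since c_B is a plain sum the result does not depend on that order.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical operation costs; the embodiment leaves them unspecified.
COSTS_B: Dict[str, float] = {"Mat": 0.0, "Ins": 1.0, "Del": 1.0}


def unordered_edit_path(
    t1: Sequence[str], t2: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Return (bag of words, edit path) ignoring word order and synonyms."""
    bow = list(dict.fromkeys(list(t1) + list(t2)))  # distinct words, first-seen order
    in_t1, in_t2 = set(t1), set(t2)
    path: List[str] = []
    for word in bow:
        if word in in_t1 and word in in_t2:
            path.append("Mat")
        elif word in in_t1:
            path.append("Del")  # only in t1
        else:
            path.append("Ins")  # only in t2
    return bow, path


def unordered_edit_distance(path: Sequence[str], costs: Dict[str, float] = COSTS_B) -> float:
    """Sum the operation costs along the unordered edit path."""
    return sum(costs[op] for op in path)


bow, path_b = unordered_edit_path(
    ["申请", "内购", "中兴", "手机"], ["如何", "申请", "中兴", "产品", "内购"]
)
c_b = unordered_edit_distance(path_b)
```

The path contains three Mat, one Del and two Ins operations, matching the operation counts of the example.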

S16. Calculating, according to the word vector matrix, the word meaning distance between the texts of a pair as the semantic feature based on word meaning distance;

In this step, word segmentation and stop word removal are first performed on the texts text1 and text2 to obtain the word sequences t1 and t2. Assume that the words contained in t1 are $w_1^1, w_2^1, \dots, w_m^1$ and the words contained in t2 are $w_1^2, w_2^2, \dots, w_n^2$. Taking $w_m^1$ as an example, the subscript m denotes the total number of words in the sequence t1 and the superscript 1 denotes that the word belongs to t1; similarly, for $w_n^2$, the subscript n denotes the total number of words in the sequence t2 and the superscript 2 denotes that the word belongs to t2. Next, the word meaning distance between each word of t1 and each word of t2 is calculated, and from these word-to-word distances the word meaning distance from a word in t1 to t2 and the word meaning distance from a word in t2 to t1 are defined. Finally, the word meaning similarity c_C between the two texts is calculated as the semantic feature based on word meaning distance.

S17. Performing dependency syntactic analysis on the text pair and calculating the syntactic distance between the texts of the pair as the syntactic feature based on dependency relations.

In this step, word segmentation and stop word removal are first performed on the texts text1 and text2 to obtain the word sequences t1 and t2. Next, a general method (such as the tools StanfordNLP or FNLP) is used to perform dependency syntactic analysis on t1 and t2 respectively, and the numbers of valid collocation pairs in t1 and t2 are counted and denoted p₁ and p₂. A valid collocation pair is a pair consisting of the core word of the sentence and a valid word that directly depends on it. The core word is the unique core word of the whole sentence obtained by dependency syntactic analysis; the valid words are the nouns, verbs and adjectives of the sentence after dependency syntactic analysis.

For example, for t1 = [申请|内购|中兴|手机], dependency syntactic analysis gives the core word "内购" (internal purchase), and the words that directly depend on it are "申请" (apply) and "手机" (mobile phone). Both are valid words, so the number of valid collocation pairs of t1 is 2. From p₁ and p₂, the syntactic structure distance between the two texts is calculated as c_D = |p₁ - p₂|, which serves as the syntactic feature based on dependency relations.
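The counting of valid collocation pairs can be illustrated as follows. The sketch does not call any real parser: each parse is supplied by hand as a list of hypothetical `Token` records (word, index of the governing word, coarse part-of-speech tag), and the tag names and the parse of t2 are assumptions made only for the example. The parse of t1 follows the example, with "内购" as core word.

```python
from typing import FrozenSet, List, NamedTuple


class Token(NamedTuple):
    word: str
    head: int  # index of the governing token, or -1 for the core word
    pos: str   # coarse part-of-speech tag (assumed tag set)


VALID_POS: FrozenSet[str] = frozenset({"NOUN", "VERB", "ADJ"})


def count_valid_pairs(parse: List[Token]) -> int:
    """Count valid words that depend directly on the unique core word."""
    roots = [i for i, tok in enumerate(parse) if tok.head == -1]
    if len(roots) != 1:
        raise ValueError("expected exactly one core word")
    root = roots[0]
    return sum(1 for tok in parse if tok.head == root and tok.pos in VALID_POS)


def syntactic_distance(parse1: List[Token], parse2: List[Token]) -> int:
    """c_D = |p1 - p2| for two dependency parses."""
    return abs(count_valid_pairs(parse1) - count_valid_pairs(parse2))


parse_t1 = [
    Token("申请", 1, "VERB"),
    Token("内购", -1, "VERB"),  # core word, as in the example
    Token("中兴", 3, "NOUN"),   # depends on "手机", not on the core word
    Token("手机", 1, "NOUN"),
]
parse_t2 = [  # hand-made parse, assumed for illustration only
    Token("如何", 1, "ADV"),
    Token("申请", -1, "VERB"),
    Token("中兴", 3, "NOUN"),
    Token("产品", 4, "NOUN"),
    Token("内购", 1, "NOUN"),
]
p1 = count_valid_pairs(parse_t1)
c_d = syntactic_distance(parse_t1, parse_t2)
```

Under the assumed parse of t2, only "内购" is a valid word depending directly on the core word, so p₂ = 1 and c_D = 1.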

As shown in FIG. 3, in this embodiment, step S40 includes:

S41. Obtain a target text pair, obtain the numerical features of the target text pair, and form the feature vector of the target text pair;

S42. Substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

In this embodiment, a network structure for training is built first, the model is then trained on the sample feature matrix X and the prediction vector y obtained in the previous section, and finally the model is saved for subsequent online scoring.

The network structure is a multi-layer perceptron (MLP), which is trained on the sample feature matrix X and the prediction vector y using a general-purpose training method.

After training, the obtained model parameters are denoted $W^{1*}$, $b^{1*}$, $W^{2*}$ and $b^{2*}$, where $W^{1*}$ is the connection weight of the first layer of the MLP, $b^{1*}$ is the bias of the first layer, $W^{2*}$ is the connection weight of the second layer, and $b^{2*}$ is the bias of the second layer. The prediction model can then be expressed as $f(x^T) = g^2\left(g^1\left(x^T W^{1*} + b^{1*}\right) W^{2*} + b^{2*}\right)$, where $x^T$ is the feature vector of the sample to be predicted (here, the target text pair), $g^1$ is the nonlinear activation function of the first layer, and $g^2$ is the nonlinear activation function of the second layer.

In this embodiment, for a target text pair t1 and t2 input into the system, the four numerical features c_A, c_B, c_C and c_D of the text pair are calculated in turn according to the feature calculation steps above, forming the feature vector x^T = [c_A, c_B, c_C, c_D] of the target text pair.

Substituting the feature vector of the target text pair into the prediction model yields the similarity score of the target text pair t1 and t2.
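A minimal sketch of the scoring step for a two-layer MLP, written in plain Python. The activation functions (tanh for the first layer, sigmoid for the second), the hidden size of 2, the row-vector convention and every weight value below are assumptions for illustration; the embodiment does not fix them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b, g):
    """One layer: g(x W + b), with x a row vector and W of shape (inputs, outputs)."""
    return [
        g(sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j])
        for j in range(len(b))
    ]

def predict(x, W1, b1, W2, b2, g1=math.tanh, g2=sigmoid):
    """Similarity score of one feature vector x = [c_A, c_B, c_C, c_D]."""
    hidden = dense(x, W1, b1, g1)
    return dense(hidden, W2, b2, g2)[0]

# Hypothetical trained parameters: 4 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.3, 0.1], [0.8, 0.4], [-0.6, 0.2]]
b1 = [0.0, 0.1]
W2 = [[1.2], [-0.7]]
b2 = [0.05]

x = [0.8, 0.9, 0.7, 1.0]  # hypothetical feature values for one target text pair
score = predict(x, W1, b1, W2, b2)
```

Choosing a sigmoid output keeps the score strictly between 0 and 1, which matches the range of the annotated similarity scores; other output activations would need the score to be clipped or rescaled.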

Embodiment 2

As shown in FIG. 4, in this embodiment, a device for obtaining text similarity includes:

A training module 10, used to obtain the numerical features of the text pairs according to the text pair data set;

A matrix construction module 20, used to construct a sample feature matrix from the numerical features of the text pairs;

A prediction module 30, used to perform model training according to the sample feature matrix and the prediction vector to obtain a prediction model;

An online acquisition module 40, used to acquire a target text pair and to acquire the similarity score of the target text pair according to the sample feature matrix and the prediction model.

In this embodiment, a large amount of annotated text pair data is first prepared as the training corpus. Each sample in the training corpus is a text pair together with its annotated similarity score, formally expressed as [text1; text2; score], where text1 and text2 are the texts whose similarity is to be calculated and score is the annotated similarity score. The annotated score may come from manual annotation or from other prior information, such as the user's satisfaction with the system's answer in a question-answering system, or the user's browsing behavior on the ranked results in a retrieval system. All samples are saved in the file originalData.txt, one training sample per line, with text1, text2 and score separated by tab characters. The annotated similarity score is a real number between 0 and 1; the larger the number, the more similar the text pair, and vice versa. In particular, a score of 0 means the texts are completely unrelated, and a score of 1 means the texts are identical. The precision of score is not fixed and depends on its source: manually annotated scores are mostly one-decimal values such as 0.3 or 0.6, while scores from other application systems are mostly multi-decimal values such as 0.563 or 0.8192.
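The file layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name `load_corpus`, the UTF-8 encoding mentioned in the usage note, the range check and the two demo lines (including their scores) are illustrative and not part of the embodiment.

```python
import io

def load_corpus(lines):
    """Parse lines of the form text1<TAB>text2<TAB>score into tuples."""
    samples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(parts)}")
        score = float(parts[2])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"line {line_no}: score {score} outside [0, 1]")
        samples.append((parts[0], parts[1], score))
    return samples

# In-memory stand-in for originalData.txt with hypothetical samples.
demo = io.StringIO(
    "申请内购中兴手机\t中兴手机在哪里购买\t0.6\n"
    "申请内购中兴手机\t借记卡申请的方式\t0.1\n"
)
samples = load_corpus(demo)
```

For a real corpus the same function can be fed an open file, for example `with open("originalData.txt", encoding="utf-8") as f: samples = load_corpus(f)`, assuming the file is UTF-8 encoded.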

In this embodiment, in addition to the ordered edit distance, the movement distance of unordered words is also considered, which adapts better to texts whose word order is simply reversed and can greatly improve the recall rate of the system. Moreover, this embodiment calculates the syntactic similarity from the number of valid dependency pairs in a sentence, which better captures the core word of the sentence and the number of words that depend on it.
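The effect of word order can be illustrated with the plain, unmodified forms of the two ideas: a word-level Levenshtein distance, which is order-sensitive, and a bag-of-words difference, which is not. Both functions are textbook baselines chosen for illustration and are not the first and second improved edit distances of this embodiment.

```python
from collections import Counter

def word_edit_distance(a, b):
    """Plain Levenshtein distance over word sequences (order-sensitive)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def bag_distance(a, b):
    """Order-insensitive distance: number of words left unmatched between two bags."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

t1 = ["申请", "内购", "手机"]
t2 = ["手机", "内购", "申请"]  # same words, reversed order

ordered = word_edit_distance(t1, t2)
unordered = bag_distance(t1, t2)
```

For this reversed pair the ordered distance is 2 while the bag distance is 0, which is the kind of case where an order-insensitive feature keeps a near-duplicate from being missed.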

As shown in FIG. 5, in this embodiment, the training module includes:

An acquisition unit 11, used to acquire a training corpus file, wherein the training corpus file includes a plurality of text pairs and the similarity score of each text pair;

An extraction unit 12, used to obtain a training data set according to the training corpus file;

A word vector acquisition unit 13, used to obtain a word vector matrix from the training data set;

Where to buy ZTE mobile phones

How to apply for a debit card

S134. The word vector obtained from the model for a word w can be expressed as a matrix with 1 row and d_w columns, where w is a variable that can refer to any word, such as "ZTE".

An ordered edit distance acquisition unit 14, used to acquire a first improved edit distance between text pairs according to the word vector matrix and the edit distance, as a semantic feature based on the ordered edit distance;

An unordered edit distance acquisition unit 15, used to acquire a second improved edit distance between text pairs according to the edit distance and the bag-of-words model, as a semantic feature based on the unordered edit distance;

A word meaning distance acquisition unit 16, used to acquire the word meaning distance between text pairs according to the word vector matrix, as a semantic feature based on the word meaning distance;

In this embodiment, first, the texts text1 and text2 are segmented into words and stop words are removed, giving the word sequences t1 and t2. Assume that t1 contains the words w_1^1, ..., w_m^1 and that t2 contains the words w_1^2, ..., w_n^2. Taking w_m^1 as an example, the subscript m is the total number of words in t1 and the superscript 1 indicates that the word belongs to t1; likewise, for w_n^2 the subscript n is the total number of words in t2 and the superscript 2 indicates that the word belongs to t2. Second, the word meaning distance between each word of t1 and each word of t2 is calculated, from which the word meaning distance from a word in t1 to the text t2 and the word meaning distance from a word in t2 to the text t1 are defined. Finally, the word meaning similarity between the two texts, c_C, is calculated as the semantic feature based on word meaning distance.

A syntactic distance acquisition unit 17, used to perform dependency syntactic analysis on the text pairs and acquire the syntactic distance between the text pairs, as a syntactic feature based on the dependency relationship.

In this embodiment, first, the texts text1 and text2 are segmented into words and stop words are removed, giving the word sequences t1 and t2. Second, dependency syntactic analysis is performed on t1 and t2 separately using a general-purpose tool (such as StanfordNLP or FNLP), and the numbers of valid collocation pairs in t1 and t2 are counted and recorded as p₁ and p₂. A valid collocation pair is a pair consisting of the core word of the sentence and a valid word that directly depends on it. The core word is the unique head word of the whole sentence produced by the dependency syntactic analysis; valid words are the nouns, verbs and adjectives identified by the analysis.

As shown in FIG. 6, in this embodiment, the online acquisition module includes:

A feature vector acquisition unit 41, used to acquire a target text pair, calculate the numerical features of the target text pair, and form the feature vector of the target text pair;

A similarity acquisition unit 42, used to substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

In this embodiment, a network structure for training is built first, the model is then trained on the sample feature matrix X and the prediction vector y obtained in the previous section, and finally the model is saved for subsequent online calculation.

The network structure is a multi-layer perceptron (MLP), which is trained on the sample feature matrix X and the prediction vector y using a general-purpose training method.

After training, the obtained model parameters are denoted $W^{1*}$, $b^{1*}$, $W^{2*}$ and $b^{2*}$, and the prediction model can be expressed as $f(x^T) = g^2\left(g^1\left(x^T W^{1*} + b^{1*}\right) W^{2*} + b^{2*}\right)$, where $x^T$ is the feature vector of the sample to be predicted and $g^1$ and $g^2$ are the nonlinear activation functions of the first and second layers of the MLP.

Substituting the feature vector of the target text pair into the prediction model yields the similarity score of the target text pair t1 and t2.

Embodiment 3

In this embodiment, an electronic device includes a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, wherein the application program is configured to execute the method for obtaining text similarity described in Embodiment 1.

Embodiment 4

An embodiment of the present invention provides a readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements any of the method embodiments for obtaining text similarity described above.

It should be noted that the above device, equipment and readable storage medium embodiments share the same concept as the method embodiments. Their specific implementation is described in detail in the method embodiments, and the technical features of the method embodiments apply correspondingly to the device embodiments, so they are not repeated here.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or a suitable combination thereof.

In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or implemented as hardware, or implemented as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on a computer-readable medium, which may include a computer storage medium (or non-transitory medium) and a communication medium (or transitory medium). As is known to those of ordinary skill in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and the scope of rights of the present invention is not limited thereby. Any modification, equivalent substitution or improvement made by those skilled in the art without departing from the scope and essence of the present invention shall fall within the scope of rights of the present invention.

Claims

1. A method for obtaining text similarity, characterized by comprising:

Obtaining numerical features of the text pairs according to the text pair data set, the numerical features including: semantic features based on ordered edit distance, semantic features based on unordered edit distance, semantic features based on word meaning distance, and syntactic features based on dependency relationship;

Constructing a sample feature matrix through the numerical features of the text pairs;

Performing model training according to the sample feature matrix and the prediction vector to obtain a prediction model;

Obtaining a target text pair, and obtaining a similarity score of the target text pair according to the sample feature matrix and the prediction model;

Wherein the semantic feature based on the ordered edit distance is the first improved edit distance between text pairs obtained according to the word vector matrix and the edit distance of the text pair data set, the semantic feature based on the unordered edit distance is the second improved edit distance between text pairs obtained according to the edit distance and the bag-of-words model, the semantic feature based on the word meaning distance is the word meaning distance between text pairs obtained according to the word vector matrix, and the syntactic feature based on the dependency relationship is the syntactic distance between the text pairs obtained by performing dependency syntactic analysis on the text pairs.

2. A method for obtaining text similarity according to claim 1, characterized in that the numerical features of the text pair obtained according to the text pair data set include:

Obtaining a training corpus file, wherein the training corpus file includes a plurality of text pairs and a similarity score of each text pair;

Obtaining a training data set according to the training corpus file;

Obtaining a word vector matrix from the training data set;

According to the word vector matrix and the edit distance, obtaining a first improved edit distance between the text pairs as a semantic feature based on the ordered edit distance;

According to the edit distance and the bag-of-words model, obtaining a second improved edit distance between the text pairs as a semantic feature based on the unordered edit distance;

According to the word vector matrix, obtaining the word meaning distance between the text pairs as a semantic feature based on the word meaning distance;

Performing dependency syntactic analysis on the text pairs and obtaining the syntactic distance between the text pairs as a syntactic feature based on the dependency relationship.

3. A method for obtaining text similarity according to claim 2, characterized in that the obtaining of the target text pair and obtaining the similarity score of the target text pair according to the sample feature matrix and the prediction model comprises:

Obtaining a target text pair, obtaining numerical features of the target text pair, and constructing a feature vector of the target text pair;

Substituting the feature vector of the target text pair into the prediction model to obtain a similarity score of the target text pair.

4. A device for obtaining text similarity, characterized by comprising:

A training module, used for obtaining numerical features of the text pairs according to the text pair data set, wherein the numerical features include: semantic features based on ordered edit distance, semantic features based on unordered edit distance, semantic features based on word meaning distance, and syntactic features based on dependency relationship;

A matrix construction module, used for constructing a sample feature matrix through the numerical features of the text pairs;

A prediction module, used for performing model training according to the sample feature matrix and the prediction vector to obtain a prediction model;

An online acquisition module, used to acquire a target text pair, and acquire a similarity score of the target text pair according to the sample feature matrix and the prediction model;

5. The device for obtaining text similarity according to claim 4, wherein the training module comprises:

An acquisition unit, used for acquiring a training corpus file, wherein the training corpus file includes a plurality of groups of text pairs and a similarity score of each group of text pairs;

An extraction unit, configured to obtain a training data set according to the training corpus file;

A word vector acquisition unit, used to obtain a word vector matrix from the training data set;

An ordered edit distance acquisition unit, used to acquire a first improved edit distance between text pairs according to the word vector matrix and the edit distance as a semantic feature based on the ordered edit distance;

An unordered edit distance acquisition unit, used to acquire a second improved edit distance between text pairs according to the edit distance and the bag-of-words model as a semantic feature based on the unordered edit distance;

A word meaning distance acquisition unit, used to acquire the word meaning distance between text pairs according to the word vector matrix as a semantic feature based on the word meaning distance;

A syntactic distance acquisition unit, used to perform dependency syntactic analysis on the text pairs and acquire the syntactic distance between the text pairs as a syntactic feature based on the dependency relationship.

6. The device for obtaining text similarity according to claim 5, characterized in that the online obtaining module comprises:

A feature vector acquisition unit, used to acquire a target text pair, acquire numerical features of the target text pair, and form a feature vector of the target text pair;

A similarity acquisition unit, used to substitute the feature vector of the target text pair into the prediction model to obtain the similarity score of the target text pair.

7. An electronic device comprising a memory, a processor and at least one application stored in the memory and configured to be executed by the processor, characterized in that the application is configured to execute the method for obtaining text similarity as described in any one of claims 1-3.

8. A readable storage medium, characterized in that a computer program is stored thereon, and when the program is executed by a processor, the method for obtaining text similarity as described in any one of claims 1 to 3 is implemented.