
CN114328805A - Text processing method, system, storage medium and terminal equipment - Google Patents


Info

Publication number: CN114328805A
Application number: CN202110902041.XA
Authority: CN (China)
Prior art keywords: language, cross, information, sample, training
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN114328805B (en)
Inventors: 梁云龙, 孟凡东, 徐金安, 陈钰枫
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110902041.XA; publication of CN114328805A; application granted; publication of CN114328805B


Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of the present invention disclose a text processing method, system, storage medium, and terminal device, applied to the field of artificial-intelligence-based information processing. The initial training model determined while pre-training the cross-lingual summarization model includes post-encoding processing modules on three branches, corresponding to three different tasks: determining translation information, monolingual summary information, and cross-lingual summary information. The three tasks share the same feature encoding module and the same feature extraction module. Because determining cross-lingual summary information can be viewed as the integration of the two sub-tasks of determining translation information and monolingual summary information, training the cross-lingual summarization model draws on the information of all three tasks and considers both the overall cross-lingual summarization task and its sub-tasks, so the trained model extracts cross-lingual summary information more accurately.

Description

A text processing method, system, storage medium, and terminal device

Technical Field

The present invention relates to the field of artificial-intelligence-based information processing, and in particular to a text processing method, system, storage medium, and terminal device.

Background Art

Cross-lingual summarization is the task of condensing the core information of a source-language text and organizing it into a summary in a target language. Research on cross-lingual summarization is of great significance for application scenarios such as cross-border e-commerce (helping users make decisions), public-opinion analysis (helping analysts filter redundant information), and content recommendation (recommending foreign-language news to users).

However, the cross-lingual summary information obtained by existing text processing is of relatively poor quality, and applying it in the various scenarios above leads to a poor user experience.

Summary of the Invention

The embodiments of the present invention provide a text processing method, system, storage medium, and terminal device that extract summaries across languages with a comparatively accurate cross-lingual summarization model.

In one aspect, an embodiment of the present invention provides a text processing method, including:

acquiring a target object and invoking a pre-trained cross-lingual summarization model;

extracting cross-lingual summary information of the target object through the cross-lingual summarization model;

wherein the cross-lingual summarization model is pre-trained through the following steps:

determining an initial training model, where the initial training model includes a feature extraction module, a feature encoding module, and post-encoding processing modules on three branches; the feature extraction module is configured to extract feature information of a sample object; the feature encoding module is configured to encode the feature information of the sample object to obtain encoded features; the post-encoding processing module on the first of the three branches is configured to determine translation information of the sample object according to the encoded features; the post-encoding processing module on the second branch is configured to determine monolingual summary information of the sample object according to the encoded features; and the post-encoding processing module on the third branch is configured to determine cross-lingual summary information of the sample object according to the encoded features;

determining training samples, where the training samples include multiple first sample objects with translation labels, multiple second sample objects with monolingual summary labels, and multiple third sample objects with cross-lingual summary labels; and

training the cross-lingual summarization model according to the initial training model and the training samples.

In another aspect, an embodiment of the present invention provides a text processing system, including:

an invoking unit, configured to acquire a target object and invoke a pre-trained cross-lingual summarization model; and

a summary extraction unit, configured to extract cross-lingual summary information of the target object through the cross-lingual summarization model.

The text processing system further includes:

a training unit, configured to determine an initial training model, where the initial training model includes a feature extraction module, a feature encoding module, and post-encoding processing modules on three branches; the feature extraction module is configured to extract feature information of a sample object; the feature encoding module is configured to encode the feature information of the sample object to obtain encoded features; the post-encoding processing module on the first branch is configured to determine translation information of the sample object according to the encoded features; the post-encoding processing module on the second branch is configured to determine monolingual summary information of the sample object according to the encoded features; and the post-encoding processing module on the third branch is configured to determine cross-lingual summary information of the sample object according to the encoded features; to determine training samples, where the training samples include multiple first sample objects with translation labels, multiple second sample objects with monolingual summary labels, and multiple third sample objects with cross-lingual summary labels; and to train the cross-lingual summarization model according to the initial training model and the training samples.

In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium storing multiple computer programs, the computer programs being adapted to be loaded by a processor to execute the text processing method described in the first aspect of the embodiments of the present invention.

In yet another aspect, an embodiment of the present invention provides a terminal device, including a processor and a memory;

the memory is configured to store multiple computer programs, the computer programs being loaded by the processor to execute the text processing method described in the first aspect of the embodiments of the present invention; and the processor is configured to implement each of the multiple computer programs.

It can be seen that, in the method of this embodiment, the text processing system extracts the cross-lingual summary information of a target object with a pre-trained cross-lingual summarization model. The initial training model determined while pre-training that model includes post-encoding processing modules on three branches, corresponding to three different tasks: determining translation information, monolingual summary information, and cross-lingual summary information. The three tasks share the same feature encoding module and the same feature extraction module. Because determining cross-lingual summary information can be viewed as the integration of the two sub-tasks of determining translation information and monolingual summary information, training the cross-lingual summarization model draws on the information of all three tasks and considers both the overall cross-lingual summarization task and its sub-tasks. As a result, even when the cross-lingual summarization dataset (i.e., the third sample objects among the training samples) is small, the trained cross-lingual summarization model extracts cross-lingual summary information comparatively accurately.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic diagram of a text processing method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a text processing method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of an initial training model determined in an embodiment of the present invention;

FIG. 4 is a flowchart of a method for training a cross-lingual summarization model in an embodiment of the present invention;

FIG. 5 is a flowchart of a method for training a cross-lingual summarization model in an application embodiment of the present invention;

FIG. 6 is a schematic diagram of an initial training model determined in an application embodiment of the present invention;

FIG. 7 is a schematic diagram of a distributed system to which the text processing method is applied in another application embodiment of the present invention;

FIG. 8 is a schematic diagram of a block structure in another application embodiment of the present invention;

FIG. 9 is a schematic diagram of the logical structure of a text processing system provided by an embodiment of the present invention;

FIG. 10 is a schematic diagram of the logical structure of a terminal device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The terms "first", "second", "third", "fourth", and the like (if present) in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can, for example, be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

An embodiment of the present invention provides a text processing method, mainly applied in a text processing system to extract the cross-lingual summary information of a target object. As shown in FIG. 1, the text processing system can extract cross-lingual summary information through the following steps:

acquiring a target object and invoking a pre-trained cross-lingual summarization model;

extracting cross-lingual summary information of the target object through the cross-lingual summarization model;

wherein the cross-lingual summarization model is pre-trained through the following steps:

determining an initial training model, where the initial training model includes a feature extraction module, a feature encoding module, and post-encoding processing modules on three branches; the feature extraction module is configured to extract feature information of a sample object; the feature encoding module is configured to encode the feature information of the sample object to obtain encoded features; the post-encoding processing module on the first branch is configured to determine translation information of the sample object according to the encoded features; the post-encoding processing module on the second branch is configured to determine monolingual summary information of the sample object according to the encoded features; and the post-encoding processing module on the third branch is configured to determine cross-lingual summary information of the sample object according to the encoded features;

determining training samples, where the training samples include multiple first sample objects with translation labels, multiple second sample objects with monolingual summary labels, and multiple third sample objects with cross-lingual summary labels; and

training the cross-lingual summarization model according to the initial training model and the training samples.

In practical applications, the text processing system can run on an application terminal or a server, for example a server implementing data recommendation, a dialogue summarization server, or a search server.

It should be noted that the above cross-lingual summarization model is a machine learning model based on artificial intelligence. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, to perceive the environment, to acquire knowledge, and to use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications span every field of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

In this way, the initial training model determined while pre-training the cross-lingual summarization model includes post-encoding processing modules on three branches, corresponding to three different tasks: determining translation information, monolingual summary information, and cross-lingual summary information. The three tasks share the same feature encoding module and the same feature extraction module. Because determining cross-lingual summary information can be viewed as the integration of the two sub-tasks of determining translation information and monolingual summary information, training the cross-lingual summarization model draws on the information of all three tasks and considers both the overall cross-lingual summarization task and its sub-tasks, so that even when the cross-lingual summarization dataset (i.e., the third sample objects among the training samples) is small, the trained cross-lingual summarization model extracts cross-lingual summary information comparatively accurately.

An embodiment of the present invention provides a text processing method, mainly a method executed by a text processing system; its flowchart is shown in FIG. 2 and includes the following steps.

Step 101: Acquire a target object and invoke a pre-trained cross-lingual summarization model.

It can be understood that, in one case, the text processing system provides a user interface through which a user can input the target object and thereby initiate the cross-lingual summarization procedure of this embodiment. In another case, the text processing system may actively take a piece of text information as the target object and initiate the cross-lingual summarization procedure of this embodiment. Here, the target object may be text information in some language.

After the text processing system acquires the target object, it invokes the cross-lingual summarization model preset in the system. The cross-lingual summarization model is a machine learning model that can be trained by a certain method, with its operating logic set into the text processing system in advance.

Step 102: Extract the cross-lingual summary information of the target object through the cross-lingual summarization model.

Before executing steps 101 and 102 above, the text processing system can pre-train the cross-lingual summarization model through the following steps.

Step 201: Determine an initial training model.

It can be understood that, when determining the initial training model, the text processing system determines the multi-layer structure included in the initial training model and the initial values of the parameters in each layer. Here, the parameters of the initial training model are the fixed parameters used by each layer during computation that do not need to be assigned on the fly, such as the parameter scale, the number of network layers, and the user vector length.

The structure of the initial training model can be as shown in FIG. 3. Specifically, the initial training model includes a feature extraction module, a feature encoding module, and post-encoding processing modules on three branches. The feature extraction module extracts the feature information of a sample object; the feature encoding module encodes the feature information of the sample object to obtain encoded features; the post-encoding processing module on the first of the three branches determines the translation information of the sample object according to the encoded features; the post-encoding processing module on the second branch determines the monolingual summary information of the sample object according to the encoded features; and the post-encoding processing module on the third branch determines the cross-lingual summary information of the sample object according to the encoded features.
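To make this structure concrete, the following is a minimal PyTorch sketch of a shared encoder with three task branches. It is an illustration under the assumption of a standard Transformer implementation; every class name, module name, and hyper-parameter here is an assumption for the example, not the patent's actual code.

```python
import torch.nn as nn

# Minimal sketch of the three-branch training model (assumed Transformer
# implementation; names and sizes are illustrative, not from the patent).
class ThreeBranchModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        # Feature extraction module: token embeddings (shared vocabulary assumed).
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared feature encoding module.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Three post-encoding branches: translation (mt), monolingual
        # summarization (ms), cross-lingual summarization (cls).
        def branch():
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerDecoder(dec_layer, num_layers)
        self.decoders = nn.ModuleDict({t: branch() for t in ("mt", "ms", "cls")})
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, vocab_size) for t in ("mt", "ms", "cls")})

    def forward(self, src_ids, tgt_ids, task):
        memory = self.encoder(self.embed(src_ids))   # shared encoded features
        out = self.decoders[task](self.embed(tgt_ids), memory)
        return self.heads[task](out)                 # logits; softmax applied in the loss
```

The point of the sketch is the sharing pattern: all three branches read the same `memory` produced by the one encoder, matching the shared feature extraction and feature encoding modules described above.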

Here, the translation information is the text information in another language corresponding to the sample object's text information in one language; the monolingual summary information is the summary, in the same language, of the sample object's text information in one language; and the cross-lingual summary information is the summary, in another language, of the sample object's text information in one language. For example, if the sample object is Chinese text, the translation information may be the corresponding English text, the monolingual summary information may be the corresponding Chinese summary, and the cross-lingual summary information may be the corresponding English summary.

Further, as shown in FIG. 3, the initial training model may also include an attention module for each of the three branches. Specifically:

(1) If the translation information of the sample object includes multiple translated words, these words are determined one after another. In this embodiment, when any one of the multiple translated words is determined, it is determined in combination with the features of the already-determined translated words. In this way, the relationship between adjacent translated words, i.e., the context information of each translated word, is taken into account while the words are determined, making the finally determined translation information more accurate.

The attention module of the first branch encodes the already-determined translated words based on the attention mechanism to obtain a first historical encoded feature, and the post-encoding processing module of the first branch determines the next translated word after the already-determined words according to the encoded features and the first historical encoded feature.

(2) If the monolingual summary information of the sample object includes multiple monolingual words, these words are determined one after another. In this embodiment, when any one of the multiple monolingual words is determined, it is determined in combination with the features of the already-determined monolingual words. In this way, the relationship between adjacent monolingual words, i.e., the context information of each monolingual word, is taken into account while the words are determined, making the finally determined monolingual summary information more accurate.

The attention module of the second branch encodes the already-determined monolingual words based on the attention mechanism to obtain a second historical encoded feature, and the post-encoding processing module of the second branch determines the next monolingual word after the already-determined words according to the encoded features and the second historical encoded feature.

(3) If the cross-lingual summary information of the sample object includes multiple cross-lingual words, these words are determined one after another. In this embodiment, when any one of the multiple cross-lingual words is determined, it is determined in combination with the features of the already-determined cross-lingual words. In this way, the relationship between adjacent cross-lingual words, i.e., the context information of each cross-lingual word, is taken into account while the words are determined, making the finally determined cross-lingual summary information more accurate.

During this process, the attention module of the third branch encodes the already-determined cross-lingual words based on the attention mechanism to obtain a third historical encoded feature, and the post-encoding processing module of the third branch determines the next cross-lingual word after the already-determined words according to the encoded features and the third historical encoded feature.

When the attention modules of the three branches encode based on the attention mechanism, they may encode based on a single-head self-attention mechanism (Self-Attention) or a multi-head self-attention mechanism (MultiHead). The attention mechanism processes a target mainly through an attention function, whose essence is a mapping from a query (Q) to a series of key (K)-value (V) pairs; the attention function yields the attention features of the target, which characterize the target's more important and noteworthy features. Specifically, when obtaining the first historical encoded feature above, the input of the attention function is the features of the already-determined translated words, and the resulting feature mainly characterizes the more important ones among them; when obtaining the second historical encoded feature, the input is the features of the already-determined monolingual words, and the resulting feature mainly characterizes the more important ones among them; and when obtaining the third historical encoded feature, the input is the features of the already-determined cross-lingual words, and the resulting feature mainly characterizes the more important ones among them.

The single-head self-attention mechanism obtains the final attention feature directly from its input (i.e., Q, K, and V). The multi-head self-attention mechanism first applies multiple linear transformations to its input (i.e., Q, K, and V), obtains corresponding attention features from each transformed input, and then combines the attention features obtained across the heads to determine the final attention feature.
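As a hedged illustration of the attention function just described (a query Q mapped against key-value pairs K, V), a scaled dot-product form and its multi-head variant might look as follows; the scaling factor, weight shapes, and function names are conventional assumptions rather than details given in the patent.

```python
import math
import torch

def attention(q, k, v):
    # Map a query against key-value pairs: weight each value by the
    # softmax-normalized similarity between the query and its key.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, num_heads, w_q, w_k, w_v, w_o):
    # Multi-head variant: linearly transform the inputs several times
    # (here w_q, w_k, w_v, w_o are assumed (d, d) weight matrices),
    # attend in each head, then combine the per-head attention features.
    b, t, d = q.shape
    def split(x, w):  # project, then reshape to (batch, heads, time, d_head)
        return (x @ w).view(b, -1, num_heads, d // num_heads).transpose(1, 2)
    heads = attention(split(q, w_q), split(k, w_k), split(v, w_v))
    return heads.transpose(1, 2).reshape(b, t, d) @ w_o
```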

In a specific implementation, the initial training model may be a neural network in the form of a multi-task convolutional neural network (MTCNN) or the like, for example the output network (ONet) of an MTCNN.

Step 202: Determine training samples, where the training samples include multiple first sample objects with translation labels, multiple second sample objects with monolingual summary labels, and multiple third sample objects with cross-lingual summary labels.

Step 203: Train the cross-lingual summarization model according to the initial training model and the training samples.

Specifically, as shown in FIG. 4, the text processing system can train the cross-lingual summarization model through the following steps.

Step 2031: Determine, through the initial training model, the translation information of each first sample object, the monolingual summary information of each second sample object, and the cross-lingual summary information of each third sample object.

Step 2032: Adjust the initial training model according to the translation information, monolingual summary information, and cross-lingual summary information determined by the initial training model, together with the translation labels, monolingual summary labels, and cross-lingual summary labels of the corresponding sample objects among the training samples, to obtain the final training model.

Specifically, the text processing system first calculates the overall loss function of the initial training model according to the translation information, monolingual summary information, and cross-lingual summary information of each sample object obtained in step 2031, together with the translation labels, monolingual summary labels, and cross-lingual summary labels of the corresponding sample objects among the training samples. The overall loss function indicates the errors between the translation information, monolingual summary information, and cross-lingual summary information produced by the initial training model for each sample object and the actual translation information, monolingual summary information, and cross-lingual summary information of the corresponding sample object, for example via a cross-entropy loss function. The system then adjusts the parameter values of the parameters in the initial training model according to the calculated overall loss function.

The process of training the model consists of minimizing the value of the above errors: the training process continuously optimizes the parameter values of the initial training model determined in step 201 through a series of mathematical optimization means such as backpropagation differentiation and gradient descent, driving the calculated value of the overall loss function to a minimum.

In this embodiment, calculating the overall loss function of the initial training model mainly includes the following parts:

calculating a first loss function for the post-encoding processing module of the first branch according to the translation information of the first sample objects determined by the initial training model and the translation labels of the corresponding first sample objects among the training samples; calculating a second loss function for the post-encoding processing module of the second branch according to the monolingual summary information of the second sample objects determined by the initial training model and the monolingual summary labels of the corresponding second sample objects among the training samples; calculating a third loss function for the post-encoding processing module of the third branch according to the cross-lingual summary information of the third sample objects determined by the initial training model and the cross-lingual summary labels of the corresponding third sample objects among the training samples; and calculating the overall loss function from the first, second, and third loss functions.

When calculating the overall loss function, a first weight value and a second weight value may be determined first; the overall loss function is then the sum of the first product (the first weight value times the first loss function), the second product (the second weight value times the second loss function), and the third loss function. The first and second weight values can be adjusted dynamically. Specifically, when the first weight value is adjusted, the adjusted first weight value is a function of the first weight value before adjustment, the total number of training steps of the post-encoding processing module of the first branch, and the current count of training steps; when the second weight value is adjusted, the adjusted second weight value is a function of the second weight value before adjustment, the total number of training steps of the post-encoding processing module of the second branch, and the current count of training steps.
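A minimal sketch of how the three per-branch losses and the two weight values could be combined, assuming the cross-entropy form mentioned above; the function and variable names are illustrative.

```python
import torch.nn.functional as F

def overall_loss(logits_mt, tgt_mt, logits_ms, tgt_ms, logits_cls, tgt_cls,
                 alpha, beta):
    # First/second/third loss functions: cross-entropy between each
    # branch's predictions (batch, time, vocab) and its labels (batch, time).
    l_mt = F.cross_entropy(logits_mt.transpose(1, 2), tgt_mt)
    l_ms = F.cross_entropy(logits_ms.transpose(1, 2), tgt_ms)
    l_cls = F.cross_entropy(logits_cls.transpose(1, 2), tgt_cls)
    # Overall loss: third loss plus the two weighted products.
    return l_cls + alpha * l_mt + beta * l_ms
```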

It should be noted that steps 2031 to 2032 above constitute one adjustment of the parameter values in the initial training model based on the translation information, monolingual summary information, and cross-lingual summary information determined by the initial training model. In practice, steps 2031 to 2032 need to be executed in a loop until the adjustment of the parameter values satisfies a certain stopping condition.

Therefore, after executing steps 2031 to 2032 of the above embodiment, the text processing system also needs to judge whether the current adjustment of the parameter values satisfies a preset stopping condition. If it does, the parameter values adjusted in step 2032 are taken as the parameter values of the finally trained model; if not, the system returns to execute steps 2031 to 2032 for the initial training model with the adjusted parameter values. The preset stopping condition includes, but is not limited to, any of the following: the difference between the currently adjusted parameter values and the previously adjusted parameter values is smaller than a threshold, i.e., the adjusted parameter values have converged; or the number of adjustments to the parameter values equals a preset number.
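The loop over steps 2031 to 2032 with the two stopping conditions might be outlined as follows; this is a hedged sketch in which `run_steps_2031_2032` is a hypothetical helper standing in for one forward pass plus one parameter adjustment, and `threshold` is an assumed convergence tolerance.

```python
def train_until_stop(model, optimizer, batches, max_adjustments, threshold):
    # Repeat steps 2031-2032 until a preset stopping condition holds.
    prev = None
    for count in range(max_adjustments):       # condition 2: preset number of adjustments
        run_steps_2031_2032(model, optimizer, batches)  # hypothetical helper
        params = [p.detach().clone() for p in model.parameters()]
        if prev is not None and all(
                (p - q).abs().max() < threshold for p, q in zip(params, prev)):
            break                              # condition 1: parameter values converged
        prev = params
    return model
```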

Step 2033: Determine the pre-trained cross-lingual summarization model, which may include the feature extraction module, the feature encoding module, and the post-encoding processing module of the third branch of the final training model.
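Since only the third branch is kept for deployment, extracting and using the pre-trained cross-lingual summarization model from the final training model could look like the sketch below, reusing the hypothetical `ThreeBranchModel` from earlier; greedy decoding and the `bos_id`/`eos_id` special tokens are assumptions for the example.

```python
import torch

@torch.no_grad()
def extract_cls_summary(model, src_ids, bos_id, eos_id, max_len=128):
    # Deployment keeps only the shared embedding/encoding modules and the
    # third (cls) branch; decode the cross-lingual summary greedily.
    memory = model.encoder(model.embed(src_ids))
    out = torch.full((src_ids.size(0), 1), bos_id,
                     dtype=torch.long, device=src_ids.device)
    for _ in range(max_len):
        dec = model.decoders["cls"](model.embed(out), memory)
        nxt = model.heads["cls"](dec[:, -1]).argmax(-1, keepdim=True)
        out = torch.cat([out, nxt], dim=1)
        if (nxt == eos_id).all():
            break
    return out  # token ids of the cross-lingual summary
```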

It can be seen that, in the method of this embodiment, the text processing system extracts the cross-lingual summary information of the target object with a pre-trained cross-lingual summarization model. The initial training model determined while pre-training that model includes post-encoding processing modules on three branches, corresponding to three different tasks: determining translation information, monolingual summary information, and cross-lingual summary information. The three tasks share the same feature encoding module and the same feature extraction module. Because determining cross-lingual summary information can be viewed as the integration of the two sub-tasks of determining translation information and monolingual summary information, training the cross-lingual summarization model draws on the information of all three tasks and considers both the overall cross-lingual summarization task and its sub-tasks, so that even when the cross-lingual summarization dataset (i.e., the third sample objects among the training samples) is small, the trained cross-lingual summarization model extracts cross-lingual summary information comparatively accurately.

A specific application example is used below to illustrate the text processing method of the present invention. The method of this embodiment may include the following two parts.

(1) As shown in FIG. 5, the cross-lingual summarization model can be trained through the following steps.

Step 301: Determine the training samples, which may specifically include three parts: the first part includes multiple first sample objects with translation labels, the second part includes multiple second sample objects with monolingual summary labels, and the third part includes multiple third sample objects with cross-lingual summary labels.

Step 302: Determine the initial training model, whose structure is shown in FIG. 6. It may include: a feature extraction module; a feature encoding module, specifically an encoder; post-encoding processing modules on three branches, each specifically including a decoder and a probability output (for example a softmax function); and attention modules on the three branches.

The feature extraction module extracts the feature information of the sample objects of each part, specifically embedding features, such as token embeddings and positional embeddings.
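A hedged sketch of this feature extraction step, assuming the usual sum of token embeddings and learned positional embeddings (the class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class Embedder(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos = nn.Embedding(max_len, d_model)      # positional embeddings

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        # Feature information of the sample object: token + position.
        return self.tok(ids) + self.pos(positions)
```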

The encoder encodes the embedding features output by the feature extraction module to obtain encoded features.

The attention module of the first branch encodes the already-determined translated words $Y_{mt,<t}$ based on the attention mechanism to obtain the first historical encoded feature $H^{mt}_{<t}$, which can be expressed by formula (1-1) below. Here the attention mechanism is specifically multi-head self-attention (a single-head self-attention mechanism may be used in other embodiments), and its input is the embedding feature $y_1$ of the determined translated words $Y_{mt,<t}$. The decoder MT of the first branch determines an interaction representation feature $O_{mt,t}$ from the first historical encoded feature $H^{mt}_{<t}$ and the encoded feature $H^{enc}_{mt}$ obtained for the first sample object, as in formula (1-2), where multi-head self-attention can be used within a feed-forward network (FFN) to determine $O_{mt,t}$ and $t$ indexes a translated word. The probability output of the first branch outputs, through the softmax function, the probability $p_{mt}$ of the translation information (including multiple translated words) of the first sample object, as in formula (1-3), where $W_{mt}$ and $b_{mt}$ are parameters to be learned whose initial values must be determined first, $X_{mt}$ denotes the embedding feature of the first sample object, and $Y_{mt,t}$ denotes the translated word following the determined words $Y_{mt,<t}$:

$$H^{mt}_{<t} = \mathrm{MultiHead}(y_1, y_1, y_1) \quad (1\text{-}1)$$

$$O_{mt,t} = \mathrm{FFN}\left(\mathrm{MultiHead}(H^{mt}_{<t}, H^{enc}_{mt}, H^{enc}_{mt})\right) \quad (1\text{-}2)$$

$$p_{mt}(Y_{mt,t} \mid Y_{mt,<t}, X_{mt}) = \mathrm{softmax}(W_{mt} O_{mt,t} + b_{mt}) \quad (1\text{-}3)$$
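As a rough code analogue of one decoding step of the first branch, corresponding to formulas (1-1) to (1-3) as reconstructed above; it assumes `self_attn` and `cross_attn` are `torch.nn.MultiheadAttention` modules created with `batch_first=True` and `ffn` is a small feed-forward network, and all names are illustrative.

```python
import torch

def mt_decode_step(y1, enc_mt, self_attn, cross_attn, ffn, w_mt, b_mt):
    # (1-1): multi-head self-attention over the embeddings y1 of the
    # already-determined translated words -> first historical feature.
    h_hist, _ = self_attn(y1, y1, y1)
    # (1-2): attend to the encoded features of the first sample object,
    # then apply the feed-forward network -> interaction feature O_mt,t.
    o, _ = cross_attn(h_hist, enc_mt, enc_mt)
    o_mt_t = ffn(o[:, -1])            # last position corresponds to step t
    # (1-3): softmax over a learned linear map -> probability of the
    # next translated word.
    return torch.softmax(o_mt_t @ w_mt + b_mt, dim=-1)
```

The second and third branches would follow the same pattern with their own decoders and output parameters.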

The attention module of the second branch encodes the already-determined monolingual words $Y_{ms,<t}$ based on the attention mechanism to obtain the second historical encoded feature $H^{ms}_{<t}$, which can be expressed by formula (2-1) below. Here the attention mechanism is specifically multi-head self-attention, and its input is the embedding feature $y_2$ of the determined monolingual words $Y_{ms,<t}$. The decoder MS of the second branch determines an interaction representation feature $O_{ms,t}$ from the second historical encoded feature $H^{ms}_{<t}$ and the encoded feature $H^{enc}_{ms}$ obtained for the second sample object, as in formula (2-2), where multi-head self-attention can be used within the FFN network to determine $O_{ms,t}$ and $t$ indexes a monolingual word. The probability output of the second branch outputs, through the softmax function, the probability $p_{ms}$ of the monolingual summary information (including multiple monolingual words) of the second sample object, as in formula (2-3), where $W_{ms}$ and $b_{ms}$ are parameters to be learned whose initial values must be determined first, $X_{ms}$ denotes the embedding feature of the second sample object, and $Y_{ms,t}$ denotes the monolingual word following the determined words $Y_{ms,<t}$:

$$H^{ms}_{<t} = \mathrm{MultiHead}(y_2, y_2, y_2) \quad (2\text{-}1)$$

$$O_{ms,t} = \mathrm{FFN}\left(\mathrm{MultiHead}(H^{ms}_{<t}, H^{enc}_{ms}, H^{enc}_{ms})\right) \quad (2\text{-}2)$$

$$p_{ms}(Y_{ms,t} \mid Y_{ms,<t}, X_{ms}) = \mathrm{softmax}(W_{ms} O_{ms,t} + b_{ms}) \quad (2\text{-}3)$$

The attention module of the third branch encodes the already-determined cross-lingual words $Y_{cls,<t}$ based on the attention mechanism to obtain the third historical encoded feature $H^{cls}_{<t}$, which can be expressed by formula (3-1) below. Here the attention mechanism is specifically multi-head self-attention, and its input is the embedding feature $y_3$ of the determined cross-lingual words $Y_{cls,<t}$. The decoder CLS of the third branch determines an interaction representation feature $O_{cls,t}$ from the third historical encoded feature $H^{cls}_{<t}$ and the encoded feature $H^{enc}_{cls}$ obtained for the third sample object, as in formula (3-2), where multi-head self-attention can be used within the FFN network to determine $O_{cls,t}$ and $t$ indexes a cross-lingual word. The probability output of the third branch outputs, through the softmax function, the probability $p_{cls}$ of the cross-lingual summary information (including multiple cross-lingual words) of the third sample object, as in formula (3-3), where $W_{cls}$ and $b_{cls}$ are parameters to be learned whose initial values must be determined first, $X_{cls}$ denotes the embedding feature of the third sample object, and $Y_{cls,t}$ denotes the cross-lingual word following the determined words $Y_{cls,<t}$:

$$H^{cls}_{<t} = \mathrm{MultiHead}(y_3, y_3, y_3) \quad (3\text{-}1)$$

$$O_{cls,t} = \mathrm{FFN}\left(\mathrm{MultiHead}(H^{cls}_{<t}, H^{enc}_{cls}, H^{enc}_{cls})\right) \quad (3\text{-}2)$$

$$p_{cls}(Y_{cls,t} \mid Y_{cls,<t}, X_{cls}) = \mathrm{softmax}(W_{cls} O_{cls,t} + b_{cls}) \quad (3\text{-}3)$$

Step 303: Determine, through the initial training model, the translation information of the first sample objects, the monolingual summary information of the second sample objects, and the cross-lingual summary information of the third sample objects.

Step 304: Calculate the first loss function according to the translation information of the first sample objects determined by the initial training model and the translation labels among the training samples, as in formula (4-1) below; calculate the second loss function according to the monolingual summary information of the second sample objects determined by the initial training model and the monolingual summary labels among the training samples, as in formula (4-2); and calculate the third loss function according to the cross-lingual summary information of the third sample objects determined by the initial training model and the cross-lingual summary labels among the training samples, as in formula (4-3):

$$L_{MT} = -\sum_{t} \log p_{mt}(Y_{mt,t} \mid Y_{mt,<t}, X_{mt}) \quad (4\text{-}1)$$

$$L_{MS} = -\sum_{t} \log p_{ms}(Y_{ms,t} \mid Y_{ms,<t}, X_{ms}) \quad (4\text{-}2)$$

$$L_{CLS} = -\sum_{t} \log p_{cls}(Y_{cls,t} \mid Y_{cls,<t}, X_{cls}) \quad (4\text{-}3)$$

Step 305: Calculate the overall loss function of the initial training model from the first loss function $L_{MT}$, the second loss function $L_{MS}$, and the third loss function $L_{CLS}$; specifically, it can be expressed by formula (5):

$$L = L_{CLS} + \alpha L_{MT} + \beta L_{MS} \quad (5)$$

Here, $\alpha$ and $\beta$ are the first weight value and the second weight value, respectively, and can be adjusted dynamically. Specifically, the first weight value can be adjusted through formula (6) below, and the second weight value through formula (7):

$$\alpha_2 = \max(0, \alpha_1 - d_1), \qquad d_1 = \alpha_1 \cdot t_2 / T_3 \quad (6)$$

$$\beta_2 = \max(0, \beta_1 - d_2), \qquad d_2 = \beta_1 \cdot t_2 / T_4 \quad (7)$$

Here, $\alpha_1$ and $\alpha_2$ are the first weight values before and after adjustment, $\beta_1$ and $\beta_2$ are the second weight values before and after adjustment, $t_2$ is the current count of training steps, and $T_3$ and $T_4$ are the total numbers of training steps of the monolingual summarization task and the cross-lingual summarization task, respectively. During training, the training steps are counted from 1; as the count increases, the first and second weight values gradually decrease until the current count equals the total number of training steps, i.e., when $t_2 = T_3$, both the first and second weight values are zero. In this way, while training the cross-lingual summarization model, the information of the translation task and the monolingual summarization task is used, yet the cross-lingual summarization task remains dominant, making the final cross-lingual summarization model more accurate.
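Formulas (6) and (7) can be sketched as a small schedule function, under the assumption that $\alpha_1$/$\beta_1$ denote the initial weight values so that the decay is linear in the step count; the function name is illustrative.

```python
def decayed_weight(initial_w, step, total_steps):
    # Formulas (6)/(7): the auxiliary-task weight shrinks linearly with
    # the current training-step count and reaches zero when
    # step == total_steps, letting the cross-lingual task dominate.
    d = initial_w * step / total_steps
    return max(0.0, initial_w - d)

# e.g. alpha = decayed_weight(1.0, t2, T3); beta = decayed_weight(1.0, t2, T4)
```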

Step 306: adjust the parameter values of the parameters in the initial training model according to the overall loss function, to obtain the final training model.

Step 307: judge whether the parameter values adjusted in step 306 satisfy a preset stop condition; if so, take the parameter values adjusted in step 306 as the parameter values of the finally trained model and continue with step 308; otherwise, return to step 303. A minimal sketch of this stop test is given below.
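The following sketch assumes the stop condition described later in this embodiment (a preset number of adjustments, or a parameter change below a threshold); all names are illustrative, not components named in this application.

```python
def should_stop(update_count, max_updates, params, prev_params, threshold):
    """Stop test of step 307 (hypothetical form combining both variants)."""
    if update_count >= max_updates:    # preset number of adjustments reached
        return True
    # largest absolute change of any parameter since the last adjustment
    max_delta = max(abs(p - q) for p, q in zip(params, prev_params))
    return max_delta < threshold       # parameter values have converged
```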

Step 308: determine that the pre-trained cross-language summarization model includes the feature extraction module, the encoder, and the third branch's decoder CLS, probability output module and attention module of the final training model, and preset the trained cross-language summarization model into the text processing system.

(2) When the text processing system initiates the process of obtaining the cross-language summary information of a target object, it can call the pre-trained cross-language summarization model and obtain the cross-language summary information of the target object directly through that model.

In an actual test, cross-language summarization models were trained with an existing method and with the method of this embodiment of the present invention: one model from Chinese to English and another from English to Chinese. After the cross-language summary information of the corresponding sample objects was obtained with the two models, the matching rate between the computed cross-language summary information and the actual cross-language summary information of the sample objects (the correct matching rate for short) was as shown in Table 1 below. It can be seen that the cross-language summary information obtained by the models trained with the method of this embodiment of the present invention is more accurate:

Table 1

It can be seen that the training of the cross-language summarization model in this embodiment integrates the realization information of three tasks (namely the cross-language translation task, the single-language summarization task and the cross-language summarization task), which enhances the performance of the trained model; in particular, even in scenarios where training samples with cross-language summary labels are scarce, the trained cross-language summarization model remains fairly accurate.

The text processing method of the present invention is described below with another specific application example. The text processing system in this embodiment of the present invention is mainly a distributed system 100, which may include a client 300 and a plurality of nodes 200 (computing devices of any form in the access network, such as servers and user terminals), the client 300 and the nodes 200 being connected through network communication.

Taking a blockchain system as an example of the distributed system, FIG. 7 is an optional schematic structural diagram of the distributed system 100 applied to a blockchain system according to an embodiment of the present invention. The system is formed by a plurality of nodes 200 (computing devices of any form in the access network, such as servers and user terminals) and a client 300; the nodes form a peer-to-peer (P2P, Peer To Peer) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In the distributed system, any machine, such as a server or a terminal, can join to become a node, and a node includes a hardware layer, a middle layer, an operating-system layer and an application layer.

Referring to the functions of each node in the blockchain system shown in FIG. 7, the functions involved include:

1) Routing: a basic function of a node, used to support communication between nodes.

In addition to the routing function, a node may also have the following functions:

2) Application: deployed in the blockchain to implement a specific service according to actual business needs; it records data related to the realization of the function to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
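Purely as an illustration of this sign-then-verify flow (the key scheme and record format are assumptions, not part of this application), record data can carry a signature that a receiving node checks before adding the record to a temporary block:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

record = b'{"task": "cross-language summary", "target": "doc-42"}'
signature = private_key.sign(record)  # digital signature carried with the record data

# On a receiving node: only records whose source and integrity verify are accepted.
temporary_block = []
try:
    public_key.verify(signature, record)  # raises InvalidSignature on failure
    temporary_block.append(record)
except InvalidSignature:
    pass  # reject tampered or wrongly signed record data
```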

For example, the service implemented by the application includes code for implementing a cross-language summarization function, and the cross-language summarization function mainly includes:

obtaining a target object and calling a pre-trained cross-language summarization model; extracting the cross-language summary information of the target object through the cross-language summarization model; wherein the cross-language summarization model is pre-trained through the following steps: determining an initial training model, the initial training model including a feature extraction module, a feature encoding module and post-encoding processing modules of three branches, where the feature extraction module is used to extract feature information of a sample object, the feature encoding module is used to encode the feature information of the sample object to obtain encoded features, the post-encoding processing module of the first of the three branches is used to determine the translation information of the sample object according to the encoded features, the post-encoding processing module of the second branch is used to determine the single-language summary information of the sample object according to the encoded features, and the post-encoding processing module of the third branch is used to determine the cross-language summary information of the sample object according to the encoded features; determining training samples, the training samples including a plurality of first sample objects and their translation labels, a plurality of second sample objects and their single-language summary labels, and a plurality of third sample objects and their cross-language summary labels; and training the cross-language summarization model according to the initial training model and the training samples.

3) Blockchain: a series of blocks that succeed one another in chronological order of generation; once a new block is added to the blockchain, it is not removed, and the blocks record the record data submitted by the nodes in the blockchain system.

FIG. 8 is an optional schematic diagram of a block structure provided by an embodiment of the present invention. Each block includes the hash value of the transaction records stored in the block (the hash value of the block itself) and the hash value of the previous block, and the blocks are connected by hash values to form a blockchain. In addition, a block may also include information such as a timestamp of when the block was generated. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another by cryptographic methods, each data block containing relevant information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
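As a purely illustrative sketch of this hash-chaining idea (assumed details, not code from this application), each block can store the hash of its predecessor, so that tampering with any earlier block breaks every later link:

```python
import hashlib
import json
import time

def make_block(records, prev_hash):
    """Build a block whose hash covers its records, timestamp and predecessor."""
    body = {"records": records, "prev_hash": prev_hash, "timestamp": time.time()}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

genesis = make_block(["genesis record"], prev_hash="0" * 64)
block_1 = make_block(["record data #1"], prev_hash=genesis["hash"])
# Changing genesis["records"] would change its hash and invalidate block_1's link.
```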

An embodiment of the present invention further provides a text processing system, whose schematic structural diagram is shown in FIG. 9. Specifically, the system may include:

a calling unit 10, configured to obtain a target object and call a pre-trained cross-language summarization model;

a summary extraction unit 11, configured to extract the cross-language summary information of the target object through the cross-language summarization model called by the calling unit 10.

The text processing system further includes:

a training unit 12, configured to determine an initial training model, the initial training model including a feature extraction module, a feature encoding module and post-encoding processing modules of three branches, where the feature extraction module is used to extract feature information of a sample object, the feature encoding module is used to encode the feature information of the sample object to obtain encoded features, the post-encoding processing module of the first of the three branches is used to determine the translation information of the sample object according to the encoded features, the post-encoding processing module of the second branch is used to determine the single-language summary information of the sample object according to the encoded features, and the post-encoding processing module of the third branch is used to determine the cross-language summary information of the sample object according to the encoded features; determine training samples, the training samples including a plurality of first sample objects and their translation labels, a plurality of second sample objects and their single-language summary labels, and a plurality of third sample objects and their cross-language summary labels; and train the cross-language summarization model according to the initial training model and the training samples. The above calling unit 10 then calls the cross-language summarization model trained by the training unit 12.

The initial training model determined by the training unit 12 further includes attention modules corresponding to the three branches. Specifically, the translation information of the sample object includes a plurality of translation words; the attention module of the first branch is used to encode the already determined translation words based on an attention mechanism to obtain first history-encoded features, and the post-encoding processing module of the first branch is used to determine the next translation word after the determined translation words according to the encoded features and the first history-encoded features. The single-language summary information of the sample object includes a plurality of single-language words; the attention module of the second branch is used to encode the already determined single-language words based on the attention mechanism to obtain second history-encoded features, and the post-encoding processing module of the second branch is used to determine the next single-language word according to the encoded features and the second history-encoded features. The cross-language summary information of the sample object includes a plurality of cross-language words; the attention module of the third branch is used to encode the already determined cross-language words based on the attention mechanism to obtain third history-encoded features, and the post-encoding processing module of the third branch is used to determine the next cross-language word according to the encoded features and the third history-encoded features. A sketch of such history-conditioned decoding is given below.
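The following sketch illustrates this decoding scheme under stated assumptions: greedy decoding, an attention module that maps the words decoded so far to history-encoded features, and a post-encoding processing module that scores the next word from the encoded features and that history; all names are placeholders rather than the actual components of this application.

```python
def decode_branch(encoded_features, attention_module, post_module,
                  bos_id, eos_id, max_len=128):
    """Greedy, history-conditioned decoding for one branch (illustrative)."""
    words = [bos_id]
    for _ in range(max_len):
        history = attention_module(words)                # history-encoded features
        scores = post_module(encoded_features, history)  # next-word distribution
        next_word = int(scores.argmax())                 # greedy choice
        if next_word == eos_id:
            break
        words.append(next_word)
    return words[1:]  # decoded translation / summary words without BOS
```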

The attention modules corresponding to the three branches encode based on a single-head self-attention mechanism or a multi-head self-attention mechanism.

Further, when training the cross-language summarization model according to the initial training model and the training samples, the training unit 12 is specifically configured to: determine, through the initial training model, the translation information of each first sample object, the single-language summary information of each second sample object and the cross-language summary information of each third sample object; adjust the initial training model according to the translation information, single-language summary information and cross-language summary information determined by the initial training model, and the translation labels, single-language summary labels and cross-language summary labels of the corresponding sample objects in the training samples, to obtain a final training model; and determine that the pre-trained cross-language summarization model includes the feature extraction module, the feature encoding module and the post-encoding processing module of the third branch of the final training model.

When adjusting the initial training model according to the translation information, single-language summary information and cross-language summary information determined by the initial training model and the translation labels, single-language summary labels and cross-language summary labels of the corresponding sample objects in the training samples, the training unit 12 is specifically configured to: calculate an overall loss function related to the initial training model according to this determined information and these labels, and adjust the parameter values of the parameters in the initial training model according to the overall loss function.

When calculating the overall loss function related to the initial training model, the training unit 12 is specifically configured to: calculate a first loss function related to the post-encoding processing module of the first branch from the translation information of the first sample objects determined by the initial training model and the translation labels of the corresponding first sample objects in the training samples; calculate a second loss function related to the post-encoding processing module of the second branch from the single-language summary information of the second sample objects determined by the initial training model and the single-language summary labels of the corresponding second sample objects in the training samples; calculate a third loss function related to the post-encoding processing module of the third branch from the cross-language summary information of the third sample objects determined by the initial training model and the cross-language summary labels of the corresponding third sample objects in the training samples; and calculate the overall loss function from the first, second and third loss functions.

When calculating the overall loss function from the first, second and third loss functions, the training unit 12 is specifically configured to determine a first weight value and a second weight value, and take the sum of the first product of the first weight value and the first loss function, the second product of the second weight value and the second loss function, and the third loss function as the overall loss function. In this case, the text processing system may further include an adjustment unit 13, configured to determine the adjusted first weight value as a value computed as a function of the first weight value before adjustment, the total number of training steps of the post-encoding processing module of the first branch and the current count of training steps, and to determine the adjusted second weight value as a value computed as a function of the second weight value before adjustment, the total number of training steps of the post-encoding processing module of the second branch and the current count of training steps.

Further, the training unit 12 is also configured to stop adjusting the parameter values when the number of adjustments made to the parameter values of the parameters in the initial training model equals a preset number, or when the difference between the currently adjusted parameter values and the previously adjusted parameter values is smaller than a threshold.

It can be seen that, in the text processing system of this embodiment, the summary extraction unit 11 extracts the cross-language summary information of the target object with a pre-trained cross-language summarization model, where the initial training model determined during pre-training includes post-encoding processing modules of three branches corresponding to three different tasks, namely determining translation information, single-language summary information and cross-language summary information, and the three tasks share the same feature encoding module and the same feature extraction module. Since the task of determining cross-language summary information can be regarded as an integration of the two sub-tasks of determining translation information and determining single-language summary information, the training of the cross-language summarization model exploits the information for realizing all three tasks and considers both the overall task of determining cross-language summary information and its sub-tasks. As a result, even when the cross-language summary data set (i.e., the third sample objects in the training samples) is small, the trained cross-language summarization model remains fairly accurate in extracting cross-language summary information.

An embodiment of the present invention further provides a terminal device, whose schematic structural diagram is shown in FIG. 10. The terminal device may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 20 (e.g., one or more processors), a memory 21, and one or more storage media 22 (e.g., one or more mass storage devices) storing application programs 221 or data 222. The memory 21 and the storage medium 22 may provide transient or persistent storage. A program stored in the storage medium 22 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the terminal device. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of instruction operations in the storage medium 22.

Specifically, the application programs 221 stored in the storage medium 22 include a cross-language summarization application, and this program may include the calling unit 10, the summary extraction unit 11, the training unit 12 and the adjustment unit 13 of the above text processing system, which are not described again here. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and execute, on the terminal device, the series of operations corresponding to the cross-language summarization application stored in the storage medium 22.

The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.

The steps performed by the text processing system described in the above method embodiments may be based on the structure of the terminal device shown in FIG. 10.

Another aspect of the embodiments of the present invention further provides a computer-readable storage medium storing a plurality of computer programs, the computer programs being adapted to be loaded by a processor to perform the text processing method performed by the above text processing system.

Another aspect of the embodiments of the present invention further provides a terminal device including a processor and a memory; the memory is used to store a plurality of computer programs, the computer programs being loaded by the processor to perform the text processing method performed by the above text processing system; and the processor is configured to implement each of the plurality of computer programs.

According to one aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the text processing method provided in the various optional implementations described above.

Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by instructing the relevant hardware through a program, and the program can be stored in a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

The text processing method, system, storage medium and terminal device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementations and application scope based on the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (12)

1. A method of text processing, comprising:
acquiring a target object, and calling a pre-trained cross-language abstract model;
extracting cross-language abstract information of the target object through the cross-language abstract model;
wherein the cross-language abstract model is pre-trained by:
determining an initial training model, wherein the initial training model comprises a feature extraction module, a feature coding module and a coding post-processing module of three branches, the feature extraction module is used for extracting feature information of a sample object, the feature coding module is used for coding the feature information of the sample object to obtain coded features, the coding post-processing module of a first branch in the three branches is used for determining translation information of the sample object according to the coded features, the coding post-processing module of a second branch in the three branches is used for determining single-language abstract information of the sample object according to the coded features, and the coding post-processing module of a third branch in the three branches is used for determining cross-language abstract information of the sample object according to the coded features;
determining a training sample, wherein the training sample comprises a plurality of first sample objects and translation labels thereof, a plurality of second sample objects and single-language abstract labels thereof, and a plurality of third sample objects and cross-language abstract labels thereof;
and training the cross-language abstract model according to the initial training model and the training samples.
2. The method of claim 1, wherein the initial training model further comprises attention modules corresponding to the three branches, respectively;
the translation information of the sample object comprises a plurality of translation words, the attention module of the first branch is used for encoding the determined translation words based on an attention mechanism to obtain first historical coded features, and the coding post-processing module of the first branch is used for determining another translation word after the determined translation words according to the coded features and the first historical coded features;
the single-language abstract information of the sample object comprises a plurality of single-language words, the attention module of the second branch is used for encoding the determined single-language words based on the attention mechanism to obtain second historical coded features, and the coding post-processing module of the second branch is used for determining another single-language word after the determined single-language word according to the coded features and the second historical coded features;
the cross-language abstract information of the sample object comprises a plurality of cross-language words, the attention module of the third branch is used for encoding the determined cross-language words based on the attention mechanism to obtain third historical coded features, and the coding post-processing module of the third branch is used for determining another cross-language word after the determined cross-language word according to the coded features and the third historical coded features.
3. The method of claim 2, wherein the attention modules corresponding to the three branches are used for encoding based on a single-head self-attention mechanism or based on a multi-head self-attention mechanism.
4. The method according to any one of claims 1 to 3, wherein the training of the cross-language abstract model based on the initial training model and training samples specifically comprises:
respectively determining translation information of the first sample object, single-language abstract information of the second sample object and cross-language abstract information of the third sample object through the initial training model;
adjusting the initial training model according to the translation information, the single-language abstract information and the cross-language abstract information determined by the initial training model, and the translation labels, the single-language abstract labels and the cross-language abstract labels of the corresponding sample objects in the training sample to obtain a final training model;
and determining that the pre-trained cross-language abstract model comprises a feature extraction module, a feature coding module and a coding post-processing module of a third branch in the final training model.
5. The method of claim 4, wherein the adjusting the initial training model based on the translation information, the single-language abstract information, and the cross-language abstract information determined by the initial training model and the translation labels, the single-language abstract labels, and the cross-language abstract labels of the corresponding sample objects in the training sample comprises:
calculating an overall loss function related to the initial training model according to the translation information, the single-language abstract information and the cross-language abstract information determined by the initial training model, and the translation label, the single-language abstract label and the cross-language abstract label of the corresponding sample object in the training sample;
and adjusting parameter values of parameters in the initial training model according to the overall loss function.
6. The method according to claim 5, wherein the calculating an overall loss function associated with the initial training model according to the translation information, the single-language abstract information and the cross-language abstract information determined by the initial training model and the translation labels, the single-language abstract labels and the cross-language abstract labels of the corresponding sample objects in the training sample comprises:
calculating a first loss function related to a coding post-processing module of the first branch according to the translation information of the first sample object determined by the initial training model and the translation label of the corresponding first sample object in the training sample;
calculating a second loss function related to the coding post-processing module of the second branch according to the single-language abstract information of the second sample object determined by the initial training model and the single-language abstract label of the corresponding second sample object in the training sample;
calculating a third loss function related to the coding post-processing module of the third branch according to the cross-language abstract information of the third sample object determined by the initial training model and the cross-language abstract label of the corresponding third sample object in the training sample;
and calculating the overall loss function according to the first loss function, the second loss function and the third loss function.
7. The method of claim 6, wherein said calculating the overall loss function from the first, second, and third loss functions comprises:
determining a first weight value and a second weight value;
taking a first product of the first weight value and a first loss function, a second product of a second weight value and a second loss function, and a sum of the third loss function as the overall loss function.
8. The method of claim 7, further comprising:
determining the adjusted first weight value as a value calculated as a function of the first weight value before adjustment, the total number of training steps of the coding post-processing module of the first branch and the current count of the number of training steps;
and determining the adjusted second weight value as a value calculated as a function of the second weight value before adjustment, the total number of training steps of the coding post-processing module of the second branch and the current count of the number of training steps.
9. The method of claim 5, wherein the adjusting of the parameter value is stopped when the number of times of adjusting the parameter value of the parameter in the initial training model is equal to a preset number of times or if the difference between the currently adjusted parameter value and the last adjusted parameter value is less than a threshold value.
10. A text processing system, comprising:
the calling unit is used for acquiring a target object and calling a pre-trained cross-language abstract model;
the abstract extracting unit is used for extracting cross-language abstract information of the target object through the cross-language abstract model;
the text processing system further includes:
the training unit is used for determining an initial training model, the initial training model comprises a feature extraction module, a feature coding module and three branch coding post-processing modules, the feature extraction module is used for extracting feature information of a sample object, the feature coding module is used for coding the feature information of the sample object to obtain coded features, the coding post-processing module of a first branch in the three branches is used for determining translation information of the sample object according to the coded features, the coding post-processing module of a second branch in the three branches is used for determining single-language abstract information of the sample object according to the coded features, and the coding post-processing module of a third branch in the three branches is used for determining cross-language abstract information of the sample object according to the coded features; determining a training sample, wherein the training sample comprises a plurality of first sample objects and translation labels thereof, a plurality of second sample objects and single-language abstract labels thereof, and a plurality of third sample objects and cross-language abstract labels thereof; and training the cross-language abstract model according to the initial training model and the training samples.
11. A computer-readable storage medium, characterized in that it stores a plurality of computer programs adapted to be loaded by a processor and to execute the text processing method according to any one of claims 1 to 9.
12. A terminal device comprising a processor and a memory;
the memory is used for storing a plurality of computer programs, and the computer programs are used for being loaded by the processor and executing the text processing method according to any one of claims 1 to 9; the processor is configured to implement each of the plurality of computer programs.
CN202110902041.XA 2021-08-06 2021-08-06 A text processing method, system, storage medium and terminal device Active CN114328805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110902041.XA CN114328805B (en) 2021-08-06 2021-08-06 A text processing method, system, storage medium and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110902041.XA CN114328805B (en) 2021-08-06 2021-08-06 A text processing method, system, storage medium and terminal device

Publications (2)

Publication Number Publication Date
CN114328805A true CN114328805A (en) 2022-04-12
CN114328805B CN114328805B (en) 2025-05-30

Family

ID=81044187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110902041.XA Active CN114328805B (en) 2021-08-06 2021-08-06 A text processing method, system, storage medium and terminal device

Country Status (1)

Country Link
CN (1) CN114328805B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8135575B1 (en) * 2003-08-21 2012-03-13 Google Inc. Cross-lingual indexing and information retrieval
CN110188358A (en) * 2019-05-31 2019-08-30 北京神州泰岳软件股份有限公司 The training method and device of Natural Language Processing Models
CN111382261A (en) * 2020-03-17 2020-07-07 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111581374A (en) * 2020-05-09 2020-08-25 联想(北京)有限公司 Text abstract obtaining method and device and electronic equipment
CN112541343A (en) * 2020-12-03 2021-03-23 昆明理工大学 Semi-supervised counterstudy cross-language abstract generation method based on word alignment
CN112732902A (en) * 2021-01-31 2021-04-30 云知声智能科技股份有限公司 Cross-language abstract generation method and device, electronic equipment and computer readable medium
CN112906385A (en) * 2021-05-06 2021-06-04 平安科技(深圳)有限公司 Text abstract generation method, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ELVYS LINHARES PONTES et al., "Cross-Lingual Speech-to-Text Summarization", Multimedia and Network Information Systems, 15 August 2018 (2018-08-15), page 385 *
YIN Mingming et al., "Cross-Language Sentence Summarization System Based on a Contrastive Attention Mechanism" (基于对比注意力机制的跨语言句子摘要系统), Computer Engineering (计算机工程), vol. 46, no. 5, 7 August 2019 (2019-08-07), pages 86-93 *

Also Published As

Publication number Publication date
CN114328805B (en) 2025-05-30

Similar Documents

Publication Publication Date Title
CN112784092A (en) Cross-modal image text retrieval method of hybrid fusion model
CN113095415A (en) Cross-modal hashing method and system based on multi-modal attention mechanism
CN113761220B (en) Information acquisition method, device, equipment and storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN111966811B (en) Intent recognition and slot filling method, device, readable storage medium and terminal device
CN118779469B (en) A method for constructing a multimodal knowledge base of a domain-wide large model based on feature representation
CN113987155B (en) Conversational retrieval method integrating knowledge graph and large-scale user log
CN113127643A (en) Deep learning rumor detection method integrating microblog themes and comments
CN114385803A (en) Extraction type reading understanding method based on external knowledge and segment selection
CN117785964B (en) Data processing method and system applied to network services
CN116775497B (en) Database test case generation demand description coding method
CN114610865A (en) Recall Text Recommended Methods, Apparatus, Equipment and Storage Media
WO2025055581A1 (en) Speech encoder training method and apparatus, and device, medium and program product
CN113822018A (en) Entity Relation Joint Extraction Method
CN114281934A (en) Text recognition method, device, equipment and storage medium
CN117972033A (en) Large model illusion detection method, device, computer equipment and storage medium
WO2025178877A1 (en) Advanced systems and methods for multi-modal ai: generative multi-modal large language and deep learning models with applications across diverse domains
CN115686868A (en) A Cross-Node Multimodal Retrieval Method Based on Federated Hash Learning
CN119622098A (en) Multi-source software-defined science and education resource recommendation method and system based on knowledge graph
CN113515941A (en) Named entity recognition method, training method, device, equipment and medium
CN114328805A (en) Text processing method, system, storage medium and terminal equipment
CN114329068B (en) Data processing method and device, electronic equipment and storage medium
CN116956861A (en) Text matching method, device, equipment and storage medium
CN115563203A (en) A model training method, system, storage medium, and terminal device
CN114328818A (en) Text corpus processing method, device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant