
CN114357122A - Text processing method and device - Google Patents

Publication number
CN114357122A
Authority
CN
China
Prior art keywords
text
written
target
spoken
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210257335.6A
Other languages
Chinese (zh)
Inventor
弓源
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202210257335.6A (publication CN114357122A)
Priority to CN202210590875.6A (publication CN114880436A)
Publication of CN114357122A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a text processing method and device. The text processing method includes: obtaining a target spoken text; classifying the target spoken text to obtain the text type corresponding to the target spoken text; when the text type is a standard text type, selecting the corresponding written-language rewriting model according to the standard text type; and inputting the target spoken text into the written-language rewriting model for processing to obtain the target written text corresponding to the target spoken text. The written-language rewriting model is trained on written text and on spoken text obtained by back-translating and converting that written text. A rewriting model suited to the target spoken text is thus selected according to its text type, which makes the written-language rewriting more targeted and improves its accuracy.

Description

Text processing method and device

Technical Field

The present application relates to the field of artificial intelligence within information technology, and in particular to a text processing method. The present application also relates to a text processing apparatus, a computing device, and a computer-readable storage medium.

Background

Artificial intelligence (AI) refers to the ability of an engineered (that is, designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key technologies in the field of artificial intelligence include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, and virtual reality/augmented reality. Natural language processing (NLP) refers to using computers to process the form, sound, meaning, and other information of natural language, that is, to input, output, recognize, analyze, understand, and generate characters, words, sentences, and passages. NLP is an important direction in computer science and artificial intelligence; it studies the theories and methods that enable effective communication between humans and computers in natural language.

Text generation is an important research direction in natural language processing; machine translation, summary generation, and text style transfer are among its important tasks. Rewriting spoken text into written text is one such task and has important applications in daily work and life. For example, in scenarios that involve analyzing spoken text, such as analysis of recording transcripts, minutes produced from meeting speech, and transcription of important written-language documents, the quality of the conversion from spoken text to written text is critical. However, because the quality of spoken text is uneven and the output of text generation tasks is highly uncertain and discontinuous, converting spoken text into written text has always been a major challenge.

Summary of the Invention

In view of this, embodiments of the present application provide a text processing method. The present application also relates to a text processing apparatus, a computing device, and a computer-readable storage medium, so as to address the technical defects of the prior art.

According to a first aspect of the embodiments of the present application, a text processing method is provided, including:

obtaining written text;

performing back-translation on the written text to obtain back-translated written text corresponding to the written text;

converting the sentence constituent units of the written text and of the back-translated written text, respectively, to obtain spoken text;

constructing a sample corpus based on the correspondence between the written text and the back-translated written text, on the one hand, and the spoken text, on the other.
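
As a non-limiting illustration, the four steps of the first aspect can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in the application: the substitution table `COLLOQUIAL_SUBSTITUTIONS`, the helper names `to_spoken` and `build_sample_corpus`, and the stand-in `fake_back_translate` are all hypothetical, and pairing each spoken text with the text it was derived from is only one possible reading of the stated correspondence.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical written-to-colloquial substitutions (assumption for illustration).
COLLOQUIAL_SUBSTITUTIONS: Dict[str, str] = {
    "therefore": "so",
    "however": "but",
    "approximately": "about",
}


def to_spoken(text: str) -> str:
    """Convert sentence constituent units (here, single words) to colloquial forms."""
    words = text.split()
    return " ".join(COLLOQUIAL_SUBSTITUTIONS.get(w.lower(), w) for w in words)


def build_sample_corpus(
    written_texts: List[str],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (spoken, written) pairs for each written text and its back-translation."""
    corpus: List[Tuple[str, str]] = []
    for written in written_texts:
        paraphrase = back_translate(written)  # back-translated written text
        for target in (written, paraphrase):
            corpus.append((to_spoken(target), target))
    return corpus


def fake_back_translate(text: str) -> str:
    """Stand-in for a real round-trip translation system (assumption)."""
    return text.replace("approximately", "roughly")


pairs = build_sample_corpus(
    ["however the cost is approximately ten yuan"], fake_back_translate
)
for spoken, written in pairs:
    print(f"{spoken!r} -> {written!r}")
```

In a real system the back-translation callable would wrap a machine translation model (source language to a pivot language and back), and the conversion rules would be far richer than a word table.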

According to a second aspect of the embodiments of the present application, a text processing apparatus is provided, including:

an acquisition module configured to acquire a target spoken text;

a classification module configured to classify the target spoken text to obtain the text type corresponding to the target spoken text;

a selection module configured to select, when the text type is a standard text type, the corresponding written-language rewriting model according to the standard text type;

a processing module configured to input the target spoken text into the written-language rewriting model for processing to obtain the target written text corresponding to the target spoken text, wherein the written-language rewriting model is trained on written text and on spoken text obtained by back-translating and converting the written text.

According to a third aspect of the embodiments of the present application, a computing device is provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the text processing method when executing the computer instructions.

According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the text processing method.

According to a fifth aspect of the embodiments of the present application, a chip is provided, which stores computer instructions that, when executed by the chip, implement the steps of the text processing method or the spoken language generation method.

In the embodiments of the present application, a target spoken text is obtained and classified to obtain its text type. When the text type is a standard text type, the corresponding written-language rewriting model is selected according to the standard text type, so that a rewriting model suited to the target spoken text is chosen according to its text type. The target spoken text is then input into the written-language rewriting model for processing to obtain the corresponding target written text, which makes the rewriting more targeted and improves its accuracy. The written-language rewriting model is trained on written text and on spoken text obtained by back-translating and converting the written text. Preprocessing written text through back-translation and conversion supplies a large sample corpus of spoken-text/written-text pairs for model training, which makes the model easier to train, avoids the time-consuming and laborious manual collection and processing of large amounts of text data, and saves time and labor costs.

Brief Description of the Drawings

FIG. 1 is a structural block diagram of a computing device provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a text processing method provided by an embodiment of the present application;

FIG. 3 is a flowchart of a text processing method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of constructing a sample corpus in a text processing method provided by an embodiment of the present application;

FIG. 5 is a processing flowchart of a text processing method applied to an actual scenario, provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a text processing apparatus provided by an embodiment of the present application.

Detailed Description

Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present application; the present application is therefore not limited by the specific implementations disclosed below.

The terminology used in one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of the present application. The singular forms "a", "said", and "the" used in one or more embodiments of the present application and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application covers any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, and so on may be used in one or more embodiments of the present application to describe various kinds of information, the information should not be limited by these terms. These terms serve only to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, the first could be termed the second and, similarly, the second could be termed the first. Depending on the context, the word "if" as used herein may be interpreted as "in response to determining".

First, the terms used in one or more embodiments of the present invention are explained.

Seq2Seq (Sequence to Sequence) model: a family of machine learning methods for natural language processing, commonly used in applications such as machine translation, image captioning, dialogue models, and text summarization.

Transformer model: a deep learning model that uses an attention mechanism to weight the importance of each part of the input data differently; it is widely used in natural language processing tasks.

Text classification: assigning a text to one or more categories within a given classification system.

Natural Language Generation (NLG): the part of natural language processing that generates natural language text from machine representation systems such as knowledge bases or logical forms.

Text style transfer: rewriting text in one style to generate text in another style.

Summary generation: the process of compressing and summarizing a long text into a short text that conveys its general meaning.

Machine translation: the process of using a computer to convert one natural language (the source language) into another natural language (the target language).

Entity: a word or phrase in a text that has a specific meaning.

Part-of-speech tagging: the process of determining the grammatical category of each word in a given sentence and labeling the word with its part of speech; it is an important basic task in natural language processing (NLP).

Syntactic analysis: one of the key underlying technologies in natural language processing (NLP); its basic task is to determine the syntactic structure of a sentence or the dependency relations between the words in a sentence.
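
To make the Transformer entry above more concrete, the following is a minimal sketch of how an attention mechanism weights parts of the input differently. It shows generic scaled dot-product attention for a single query and is included for illustration only; the function names `softmax` and `attend` are hypothetical and nothing here is taken from the application.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> List[float]:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# A key that matches the query more closely receives a larger weight.
print(attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]]))
```

When all keys are identical the weights are uniform and the output reduces to the plain average of the values.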

The present application provides a text processing method and also relates to a text processing apparatus, a computing device, and a computer-readable storage medium, each of which is described in detail in the following embodiments.

FIG. 1 shows a structural block diagram of a computing device 100 provided according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 through a bus 130, and a database 150 is used to store data.

The computing device 100 also includes an access device 140, which enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near-field communication (NFC) interface.

In one embodiment of the present application, the above components of the computing device 100, as well as other components not shown in FIG. 1, may also be connected to one another, for example through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 1 is for illustrative purposes only and does not limit the scope of the present application. Those skilled in the art can add or replace components as needed.

The computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, or netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses), another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 100 may also be a mobile or stationary server.

In practice, rewriting spoken text into written text has important applications in daily work and life. In the prior art, this rewriting can be done manually, by having people rewrite colloquial text into written text; by rule-based rewriting, which replaces those colloquial expressions that the rules can handle; or by text translation, which directly translates colloquial text into written text.

Manual rewriting consumes a great deal of labor, and the quality and results of the rewriting are inconsistent. Rule-based rewriting can handle only a limited number of colloquial expressions and fixed text forms, and the rewriting rules are complex to process. Translation-based rewriting places high demands on the size and quality of the text corpus; it can achieve a certain degree of conversion but is, on the whole, not well suited to converting spoken language into written language.

Therefore, to rewrite spoken text into written text accurately, a pre-trained rewriting model can be used. However, because the quality of spoken text is uneven, processing all spoken texts with a single unified rewriting model may not produce accurate rewriting. An effective solution to this problem is therefore urgently needed.

Referring to FIG. 2, FIG. 2 shows a schematic diagram of a text processing method provided according to an embodiment of the present application. After the target spoken text is obtained, it is input into a text classification model, which is obtained by training an initial text classification model on pre-collected text-corpus data annotated with semantic clarity. The text classification model classifies the input target spoken text and outputs its predicted text type. The rewriting model for the target spoken text is then determined from the predicted text type: if the predicted text type is the standard text type, the rewriting model is the written-language rewriting model; if the predicted text type is the fuzzy text type, the rewriting model is the written-language conversion model. The rewriting model is obtained by training on a sample corpus constructed from aligned spoken-text/written-text data. The rewriting model rewrites the input target spoken text into written language and outputs the written text corresponding to the target spoken text.

The embodiments of the present application classify the target spoken text according to its quality and select a rewriting model suited to that text to rewrite it, which makes the written-language rewriting more targeted and improves its accuracy.
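
The classify-then-route flow of FIG. 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the type names `STANDARD` and `FUZZY`, the function `rewrite_spoken_text`, and the toy classifier and rewriters are hypothetical stand-ins for the trained text classification model, written-language rewriting model, and written-language conversion model.

```python
from typing import Callable, Dict

STANDARD = "standard"  # semantically clear spoken text
FUZZY = "fuzzy"        # semantically vague spoken text


def rewrite_spoken_text(
    text: str,
    classify: Callable[[str], str],
    models: Dict[str, Callable[[str], str]],
) -> str:
    """Classify the spoken text, pick the model for its type, and rewrite it."""
    text_type = classify(text)
    model = models.get(text_type)
    if model is None:
        raise ValueError(f"no rewriting model registered for type {text_type!r}")
    return model(text)


# Toy stand-ins for the trained models (assumptions for the demo only).
def toy_classifier(text: str) -> str:
    return FUZZY if "um" in text.split() else STANDARD


def toy_rewriting_model(text: str) -> str:
    return text.capitalize() + "."


def toy_conversion_model(text: str) -> str:
    kept = " ".join(w for w in text.split() if w != "um")
    return kept.capitalize() + "."


MODELS = {STANDARD: toy_rewriting_model, FUZZY: toy_conversion_model}

print(rewrite_spoken_text("the meeting starts at nine", toy_classifier, MODELS))
print(rewrite_spoken_text("um the meeting um starts at nine", toy_classifier, MODELS))
```

Keeping the models in a dictionary keyed by text type means a further type can be supported by registering one more entry, without changing the routing function.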

FIG. 3 shows a flowchart of a text processing method provided according to an embodiment of the present application, which includes the following steps:

Step 302: Obtain the target spoken text.

The target spoken text is the spoken text to be rewritten. In practice it can be spoken text from any field, such as medicine, chemistry, sales, daily life, or tourism. There may be one target spoken text or several.

This embodiment describes the text processing method using an acquired target spoken text TST as an example. Other target spoken texts are processed in the same or a similar way as described in this embodiment, which is not limited here.

Step 304: Classify the target spoken text to obtain the text type corresponding to the target spoken text.

Specifically, once the target spoken text has been obtained, the texts to be rewritten may vary in quality, and rewriting them directly may not guarantee the quality of the result. To safeguard the rewriting effect, the target spoken text can therefore be classified so that a corresponding rewriting measure is applied to each type of target spoken text.

Further, classifying the target spoken text to obtain its text type is implemented as follows:

The target spoken text is input into a text classification model for classification to obtain the text type corresponding to the target spoken text, wherein training the text classification model includes:

obtaining sample spoken texts and the semantic clarity label corresponding to each sample spoken text;

constructing training sample pairs from the sample spoken texts and the semantic clarity labels;

training the initial text classification model on the training sample pairs until a text classification model that satisfies the classification training stop condition is obtained.

The text classification model is a pre-trained model that classifies the target spoken text. It can be a binary classification model that recognizes the semantic clarity of the target spoken text and assigns it to the standard text type or the fuzzy text type. The standard text type means that the semantics of the target spoken text are expressed relatively clearly; the fuzzy text type means that the semantics of the target spoken text are expressed relatively vaguely. A semantic clarity label is a label annotated according to the semantic clarity of a sample spoken text.

In addition, the target spoken text may contain no semantic information at all, and rewriting such text into written language is meaningless. A three-class model can therefore be used to assign the target spoken text to the standard text type, the fuzzy text type, or the invalid text type.

In practice, if the text classification model is a binary classification model, the semantic clarity labels include the standard text type and the fuzzy text type. If the text classification model is a three-class model, the semantic clarity labels include the standard text type, the fuzzy text type, and the invalid text type.
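
One way to represent the semantic clarity labels and to build the training sample pairs is sketched below. This is a minimal sketch under stated assumptions: the enum `Clarity`, its integer values, and the helper `build_training_pairs` are hypothetical names chosen for illustration, not identifiers from the application.

```python
from enum import Enum
from typing import List, Tuple


class Clarity(Enum):
    """Semantic clarity labels; INVALID is used only by the three-class model."""
    STANDARD = 0
    FUZZY = 1
    INVALID = 2


def build_training_pairs(
    texts: List[str], labels: List[Clarity], three_class: bool = False
) -> List[Tuple[str, int]]:
    """Pair each sample spoken text with the integer id of its clarity label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    allowed = set(Clarity) if three_class else {Clarity.STANDARD, Clarity.FUZZY}
    pairs: List[Tuple[str, int]] = []
    for text, label in zip(texts, labels):
        if label not in allowed:
            raise ValueError(f"label {label.name} is not valid for this classifier")
        pairs.append((text, label.value))
    return pairs


samples = ["the report is due on friday", "so like that thing you know", "uh uh hmm"]
labels = [Clarity.STANDARD, Clarity.FUZZY, Clarity.INVALID]
print(build_training_pairs(samples, labels, three_class=True))
```

With `three_class=False` the helper rejects the invalid label, which mirrors the binary setting in which only the standard and fuzzy types exist.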

需要说明的是,样本口语文本的语义清晰度标签需要预先进行标注,再将样本口语文本以及语义清晰度标签作为文本语料语义清晰度标注数据对初始文本类型模型进行模型训练。其中,初始文本分类模型,可以是预先通过CNN(卷积神经网络)、RNN(循环神经网络)、LSTM(长短时记忆网络)、FastText、TextCNN、HAN模型等构建的待训练的文本分类模型。It should be noted that the semantic clarity label of the sample spoken text needs to be marked in advance, and then the sample spoken text and the semantic clarity label are used as the text corpus semantic clarity label data to train the initial text type model. Among them, the initial text classification model can be a text classification model to be trained constructed in advance through CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory Network), FastText, TextCNN, HAN model, etc.

In a specific implementation, during training the initial text classification model classifies a sample spoken text and outputs a predicted text type. A loss value is then computed between the predicted text type and the sample text type recorded in the semantic clarity label of that sample spoken text. In practice, the loss function may be a 0-1 loss function, an absolute-value loss function, a squared loss function, a cross-entropy loss function, and so on. The 0-1 loss function is used here as an example; see Formula 1:

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
Formula 1

Here L denotes the loss value, f(X) the predicted text type, and Y the sample text type. This application does not limit the choice of loss function; it depends on the actual application.
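The 0-1 loss described above can be sketched in a few lines. This is a minimal illustration, not the applicant's implementation: the function names and the label strings ("standard", "fuzzy", "invalid") are assumptions chosen for the example.

```python
from typing import Sequence


def zero_one_loss(predicted: str, label: str) -> int:
    """Return 0 when the predicted text type equals the labelled type, else 1."""
    return 0 if predicted == label else 1


def mean_zero_one_loss(predicted: Sequence[str], labels: Sequence[str]) -> float:
    """Average the 0-1 loss over a batch of (prediction, label) pairs."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    if not labels:
        raise ValueError("the batch must not be empty")
    total = sum(zero_one_loss(p, y) for p, y in zip(predicted, labels))
    return total / len(labels)
```

The 0-1 loss is not differentiable, so gradient-based training typically relies on a differentiable loss such as cross-entropy, which the text also lists as an option.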

After the model loss value is computed, the model parameters of the initial text classification model are adjusted by backpropagation according to the loss value, and the next batch of semantic-clarity annotation data is sampled to continue training until a classification training stop condition is reached. The stop condition may be, for example, that the model loss value falls below a preset threshold or that the number of training iterations reaches a preset number; it is not limited here.
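The two stop conditions named above (loss below a preset threshold, or a preset number of iterations reached) can be sketched as a loop. The parameter update itself is abstracted behind a hypothetical `train_step` callable that takes one batch and returns its loss; that callable, the `Batch` type, and the function name are assumptions for illustration only.

```python
from typing import Callable, Iterable, List, Tuple

# One batch of annotation data: (sample spoken text, semantic clarity label) pairs.
Batch = List[Tuple[str, str]]


def train_until_stop(
    batches: Iterable[Batch],
    train_step: Callable[[Batch], float],
    loss_threshold: float,
    max_iterations: int,
) -> Tuple[int, float]:
    """Run train_step on successive batches until a stop condition is met.

    Stops when the loss drops below loss_threshold, when max_iterations
    steps have been run, or when the batches are exhausted.
    Returns (number of steps run, last loss value).
    """
    iterations = 0
    last_loss = float("inf")
    for batch in batches:
        if iterations >= max_iterations:
            break
        last_loss = train_step(batch)  # hypothetical: forward, loss, parameter update
        iterations += 1
        if last_loss < loss_threshold:
            break
    return iterations, last_loss
```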

In summary, classifying the semantic clarity of the target spoken text with a pre-trained text classification model effectively identifies the semantic quality of the target spoken text, so that different types of target spoken text can be rewritten appropriately, which safeguards the quality of the written-language rewriting.

Step 306: When the text type is the standard text type, select the corresponding written-language rewriting model according to the standard text type.

Specifically, once the text type of the target spoken text has been determined, if its semantics are expressed relatively clearly (that is, the text type is the standard text type), a relatively complex rewriting approach can be used while still preserving the semantics. For target spoken text of the standard text type, a written-language rewriting model that performs more complex rewriting can therefore be selected.

The written-language rewriting model is the model used to rewrite spoken text into written text. It may be built on a Seq2Seq model whose encoder and decoder are both built from Transformer models.

Continuing the example above, the target spoken text TST is input into the text classification model, which outputs the standard text type as the text type of TST. TST is then input into the written-language rewriting model, which outputs the target written text TLT1.

Step 308: Input the target spoken text into the written-language rewriting model for processing to obtain the target written text corresponding to the target spoken text.

The written-language rewriting model is trained on written text and on spoken text obtained by back-translating and converting the written text.

Specifically, once the written-language rewriting model has been selected, it rewrites the target spoken text into written language, producing the target written text.

In a specific implementation, to ensure the accuracy of the rewriting and to keep the generated text controllable, a character-level mask operation can be applied during rewriting. This operation ensures that the generated result comes mainly from the input text. In this embodiment, the written-language rewriting model includes an encoding layer and a decoding layer, and the target spoken text is rewritten as follows:

Split the target spoken text into sentences to obtain the sentence sequence contained in the target spoken text;

Input the spoken sentence units of the sentence sequence, in order, into the encoding layer of the written-language rewriting model for encoding, obtaining the sentence feature vector and the vocabulary vector of each spoken sentence unit, where the vocabulary vector is obtained by mapping the spoken sentence unit onto the vocabulary;

Compute the vector product of the sentence feature vector and the vocabulary vector, and input the vector product into the decoding layer of the written-language rewriting model for decoding, obtaining the target written text corresponding to the target spoken text.

The encoding layer is a layer in the text generation model that re-expresses information in another form for processing inside the model. Accordingly, the sentence feature vector is the vector representation obtained by encoding a spoken sentence unit. The decoding layer is the layer in the written-language rewriting model that converts the sentence feature vector into a decoding vector. In practice, after the decoder outputs the decoding vector, the decoding vector is fed to an output layer, which outputs the target written text.

The vocabulary is a list of words. It may be generated while training the written-language rewriting model by counting the frequency of the words/characters in the sample corpus (for example, adding to the vocabulary every character/word whose frequency in the training sample corpus exceeds a threshold), it may ship with the model itself, or it may be generated in some other way. The sentence sequence is the sequence formed by arranging the spoken sentences contained in the target spoken text in the order in which they appear. Accordingly, a spoken sentence unit is a spoken sentence contained in the sentence sequence.

In a specific implementation, mapping a spoken sentence unit onto the vocabulary means matching the characters/words of the spoken sentence against the characters/words of the vocabulary. The vector bit of every vocabulary entry hit by a character/word of the spoken sentence is set to 1, and the vector bit of every entry that is not hit is set to 0, which yields the vocabulary vector. For example, suppose the vocabulary contains 5000 characters and the 4 characters of spoken sentence 1 are mapped onto it: the 1st character maps to the 3rd vocabulary entry, the 2nd to the 6th, the 3rd to the 9th, and the 4th to the 5th. The resulting vocabulary vector is 00101100100...0.
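The mapping described above amounts to building a multi-hot vector over the vocabulary. The sketch below assumes the vocabulary is given as a dictionary from token to a 0-based position covering 0 to len(vocab) - 1; the function name and that representation are illustrative assumptions, not part of the application.

```python
from typing import Dict, Iterable, List


def build_vocab_vector(sentence_tokens: Iterable[str], vocab_index: Dict[str, int]) -> List[int]:
    """Set bit i to 1 if the vocabulary entry at position i occurs in the sentence."""
    vector = [0] * len(vocab_index)
    for token in sentence_tokens:
        position = vocab_index.get(token)
        if position is not None:  # tokens outside the vocabulary are ignored
            vector[position] = 1
    return vector


# Toy 11-entry vocabulary; the four sentence characters hit entries 3, 6, 9 and 5
# (1-based), mirroring the example in the text.
vocab = {f"c{i}": i - 1 for i in range(1, 12)}
bits = build_vocab_vector(["c3", "c6", "c9", "c5"], vocab)
print("".join(str(b) for b in bits))
```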

Further, the vector product of the vocabulary vector and the sentence feature vector is computed, and decoding is performed on that product. In this way the characters/words of the input text constrain the decoder's output. The vocabulary-based operation described above can also be called a character-level mask operation.
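One way to picture the constraint is an element-wise product between the multi-hot vocabulary vector and a vector of per-entry scores, so that entries absent from the input score zero. The text does not specify the exact form of the vector product or where in the model it is applied, so the sketch below is only one possible reading; it assumes non-negative scores (such as probabilities) of vocabulary size, and all names are hypothetical.

```python
from typing import List


def mask_scores(scores: List[float], vocab_vector: List[int]) -> List[float]:
    """Element-wise product: entries not present in the input text are zeroed."""
    if len(scores) != len(vocab_vector):
        raise ValueError("scores and vocab_vector must have the same length")
    return [s * m for s, m in zip(scores, vocab_vector)]


def pick_allowed(scores: List[float], vocab_vector: List[int]) -> int:
    """Return the index of the highest-scoring entry among those the mask allows."""
    masked = mask_scores(scores, vocab_vector)
    allowed = [i for i, m in enumerate(vocab_vector) if m == 1]
    if not allowed:
        raise ValueError("the mask allows no vocabulary entry")
    return max(allowed, key=lambda i: masked[i])
```

With scores `[0.5, 0.1, 0.3, 0.1]` and mask `[0, 1, 1, 0]`, the unmasked best entry (index 0) is excluded because it does not occur in the input, and index 2 is chosen instead.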

In summary, the character-level mask operation ensures that the characters generated by the written-language rewriting model come mainly from the input text, which greatly reduces semantic deviation in the rewriting result.

In a specific implementation, the written-language rewriting model is trained through the following steps 30802 to 30810:

Step 30802: Obtain written text.

Written text is text in the language people use when writing and reading articles, and consists mainly of written characters. It may come from any field, such as medicine, chemistry, sales, daily life, or tourism. Written text can also be divided by style into literary style, official-document style, scientific and technical style, and so on.

For example, the obtained written text is a written text LT in the literary style.

Step 30804: Back-translate the written text to obtain the back-translated written text corresponding to the written text.

Specifically, once the written text has been obtained, simply converting it into the corresponding spoken text may expand the sample corpus only to a limited extent. To expand the sample corpus further, the written text can first be augmented by back-translation, and the augmented written text can then be converted into the corresponding spoken text.

Back-translation is the process of translating text in language A into language B and then translating the language-B text back into language A. Because back-translation can produce wording that differs from the original written text, the back-translated written text can be used to augment the written text.

Further, the back-translated written text may differ considerably from the original written text and may lose its intended meaning. To keep the key information of the written text unchanged in the back-translated written text, key words from the written text can be substituted into the back-translated text. In this embodiment, this is implemented as follows:

Translate the written text into translated written text in a preset language;

Translate the translated written text back into the target language to which the written text belongs, obtaining the initial back-translated written text;

Replace the target key words in the initial back-translated written text that correspond to key words in the written text with those key words, obtaining the back-translated written text.

The preset language may be any one or more of English, French, Korean, German, and so on, and is not limited here. Accordingly, the target language is the language in which the written text is written.

In practice, the written text is first translated into another language, giving the translated written text. The translated written text is then translated back into the language of the written text, giving the initial back-translated written text. As a result of the back-translation, the initial back-translated written text may differ considerably from the written text, or even depart from its meaning. To keep the key information of the two texts the same, the words in the initial back-translated written text that correspond to key words of the written text (that is, the target key words) can be replaced with those key words, producing a back-translated written text whose key information is consistent with the written text.

A key word is a word selected in advance from the written text as being relatively important to it. In practice, key words can be selected according to a preset selection rule, for example by part of speech or by the entity type of the word. Key words can also be selected with a pre-built keyword lexicon, by taking the words of the written text that appear in the lexicon as key words.

In a specific implementation, before the replacement can be made, the target key word corresponding to each key word must be determined. This can be done in various ways. For example, it can be determined from the positional correspondence between the key word and the target key word in the sentence; by finding a near-synonym of the key word in the initial back-translated written text and taking that near-synonym as the target key word; or by taking the word that fills the same sentence component as the key word (for example, the subject, predicate, object, attributive, adverbial, or complement of a sentence can be taken as a key word, and the word filling the same component in the initial back-translated written text is selected as the target key word). In practice, a suitable method can be chosen according to the actual scenario.

Once the target key word corresponding to each key word has been determined, replacing the target key words in the initial back-translated written text with the key words yields the back-translated written text.

Continuing the example above, suppose the written text LT is in Chinese and the preset language is German. The Chinese written text LT is translated into German to obtain the German translated written text LT1, and LT1 is then translated back into Chinese to obtain the Chinese initial back-translated written text LT2. Suppose LT contains the written sentence S1, "我的故乡是山西,那里很美" ("My hometown is Shanxi; it is beautiful there"), whose key word is the geographic location entity "山西" (Shanxi). If the sentence S11 in LT2 that corresponds to S1 is "我的家乡是陕西,那里非常漂亮" ("My home region is Shaanxi; it is very beautiful there"), the target key word in S11 is the geographic location entity "陕西" (Shaanxi). Replacing "陕西" in S11 with "山西" yields the back-translated written text LT3, which contains the replaced sentence S12, "我的家乡是山西,那里非常漂亮" ("My home region is Shanxi; it is very beautiful there").
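The Shanxi/Shaanxi example can be traced with a small sketch. No real translation system is called: `fake_zh_to_de` and `fake_de_to_zh` are hypothetical stand-ins that simply simulate the drift described in the example, and the mapping from target key word to key word is supplied by hand rather than determined automatically.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into the preset (pivot) language and back into the original one."""
    return from_pivot(to_pivot(text))


def restore_keywords(back_translated: str, keyword_for_target: Dict[str, str]) -> str:
    """Replace each target key word with the original key word it corresponds to."""
    restored = back_translated
    for target, keyword in keyword_for_target.items():
        restored = restored.replace(target, keyword)
    return restored


def fake_zh_to_de(text: str) -> str:
    # Hypothetical stand-in for Chinese-to-German translation: only tags the text.
    return "<de>" + text


def fake_de_to_zh(text: str) -> str:
    # Hypothetical stand-in for German-to-Chinese translation: simulates drift.
    drifted = text[len("<de>"):]
    for before, after in (("故乡", "家乡"), ("山西", "陕西"), ("很美", "非常漂亮")):
        drifted = drifted.replace(before, after)
    return drifted


s1 = "我的故乡是山西,那里很美"
s11 = back_translate(s1, fake_zh_to_de, fake_de_to_zh)   # initial back-translation
s12 = restore_keywords(s11, {"陕西": "山西"})              # key word restored
```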

In summary, replacing the target key words in the back-translated written text with the key words of the written text keeps the key information of the two texts consistent while the written corpus is being augmented, which improves the accuracy of the back-translated written text.

In a specific implementation, correctly determining the target key word that corresponds to each key word is essential for keeping the meaning of the back-translated written text consistent with the written text. To avoid replacing the wrong target key word, position markers can be added to the key words in the written text so that the target key words can be located and replaced accurately. In this embodiment, before the written text is translated into translated written text in the preset language, the method further includes:

Perform part-of-speech analysis on the written text to identify the key words whose part of speech is a preset part of speech;

Mark the positions of the key words in the written text with position markers;

Accordingly, replacing the target key words in the initial back-translated written text that correspond to key words in the written text with those key words, to obtain the back-translated written text, includes:

Based on the position markers, replace the corresponding target key words in the initial back-translated written text with the key words, obtaining the back-translated written text.

Specifically, part-of-speech analysis of the written text can be performed by part-of-speech tagging the words in the written text to determine the part of speech of each word. The tagging may use a rule-based method, a statistical-model-based method, or a method combining the two. Part of speech is the classification of words by their characteristics, such as noun, verb, adjective, or numeral. In practice, written texts may belong to different fields, and the parts of speech regarded as important (that is, the key words) may differ from field to field. For example, numerals may be regarded as key words in the chemistry field, whereas nouns may be regarded as key words in the daily-life field.
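Selecting key words by a field-specific preset part of speech can be sketched as a simple filter. The sketch assumes the sentence has already been tagged by some tagger and is supplied as (word, tag) pairs; the tag names, the `DOMAIN_KEY_POS` table, and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mapping from field to the parts of speech treated as key words,
# following the chemistry / daily-life example in the text.
DOMAIN_KEY_POS: Dict[str, Set[str]] = {
    "chemistry": {"numeral"},
    "daily_life": {"noun"},
}


def select_keywords(tagged_tokens: List[Tuple[str, str]], domain: str) -> List[str]:
    """Keep the words whose part-of-speech tag is a preset tag for the field."""
    key_pos = DOMAIN_KEY_POS[domain]
    return [word for word, pos in tagged_tokens if pos in key_pos]


# Pre-tagged tokens of "我的故乡是山西"; the tags are supplied by hand here.
tagged = [("我", "pronoun"), ("的", "particle"), ("故乡", "noun"),
          ("是", "verb"), ("山西", "noun")]
daily_life_keywords = select_keywords(tagged, "daily_life")
```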

In a specific implementation, the position of a key word can be marked with symbols such as braces "{}" or asterisks "*". In practice, the position marker can be added immediately before and after the key word. For example, marking the key word "手机" (mobile phone) with braces gives {手机}.

Note that the key words in the written text are marked before the written text is translated. The position markers are then preserved in the initial back-translated written text obtained by back-translation, and the words they enclose there are the target key words. Position markers thus make it possible to locate the target key word for each key word accurately and replace it accurately. When a written sentence contains several key words, the target key word for each key word can be determined by the similarity between the marked words and the key word, or by the sentence component (such as subject, predicate, or object) of the marked word, taking the marked word with the same sentence component as the target key word.

In addition, to facilitate subsequent processing, the position markers can be deleted from both the written text and the replaced text after the replacement; deleting the position markers from the replaced text yields the back-translated written text.

Take the written sentence S1 in the written text LT as an example. Part-of-speech analysis of S1 identifies the key words whose part of speech is the preset part of speech, noun: "故乡" (hometown) and "山西" (Shanxi). Marking both with the position marker {} gives the marked sentence S1, "我的{故乡}是{山西},那里很美" ("My {hometown} is {Shanxi}; it is beautiful there"). Suppose the sentence S11 in the initial back-translated written text LT2 that corresponds to the marked S1 is "我的{家乡}是{陕西},那里非常漂亮" ("My {home region} is {Shaanxi}; it is very beautiful there"). Then "故乡" replaces the marked target key word "家乡" in S11, and "山西" replaces the marked target key word "陕西", giving the replaced sentence S11, "我的{故乡}是{山西},那里非常漂亮". Deleting the position markers from the replaced S11 gives the sentence S12, "我的故乡是山西,那里非常漂亮" ("My hometown is Shanxi; it is very beautiful there").
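The marker-based replacement in this example can be sketched as follows. The sketch makes a simplifying assumption that is not in the text: marked words are paired with key words purely by order of occurrence, whereas the text also allows pairing by similarity or by sentence component. It also assumes the back-translation preserves the braces. Function names are hypothetical.

```python
import re
from typing import List

# Matches one brace-marked span such as {山西} and captures the enclosed word.
MARKED = re.compile(r"\{([^{}]*)\}")


def mark_keywords(sentence: str, keywords: List[str]) -> str:
    """Wrap the first occurrence of each key word in braces."""
    marked = sentence
    for keyword in keywords:
        marked = marked.replace(keyword, "{" + keyword + "}", 1)
    return marked


def replace_by_markers(marked_source: str, marked_back_translation: str) -> str:
    """Put the source key words into the marked slots, then drop the markers.

    Assumption: the i-th marked word in the back-translation corresponds to
    the i-th marked key word in the source sentence.
    """
    keywords = MARKED.findall(marked_source)
    slots = MARKED.findall(marked_back_translation)
    if len(keywords) != len(slots):
        raise ValueError("marker counts differ between source and back-translation")
    remaining = iter(keywords)
    return MARKED.sub(lambda match: next(remaining), marked_back_translation)


marked_s1 = mark_keywords("我的故乡是山西,那里很美", ["故乡", "山西"])
s12 = replace_by_markers(marked_s1, "我的{家乡}是{陕西},那里非常漂亮")
```

Here the substitution and the deletion of the markers happen in one pass, because the whole `{...}` span is replaced by the bare key word.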

In summary, marking the positions of the key words in the written text before translation, and then using the markers to identify and replace the target key words, improves both the accuracy and the efficiency of the replacement.

Step 30806: Convert the sentence constituent units of the written text and of the back-translated written text, respectively, to obtain spoken text.

Specifically, once the back-translated written text has been obtained, the written text and the back-translated written text can each be converted to obtain the corresponding spoken text, which further augments the spoken text.

In a specific implementation, the spoken text obtained by conversion can also be filtered to keep the spoken text whose semantics are expressed relatively accurately, which further improves the accuracy of converting written text into spoken text.

Optionally, the sentence constituent units include at least one of the following: clause units, word units, character units, and symbol units.

In practice, written text usually consists of written sentences, and a written sentence usually consists of several kinds of sentence constituent units: clause units (clauses), word units (words), character units (characters), and symbol units (punctuation marks). Each kind of unit may be expressed differently in written and spoken language. A written sentence can therefore be converted at the level of each kind of unit, so that its clauses, words, characters, and symbols all become more characteristic of spoken expression.

A clause unit is a clause in a written sentence. For example, the written sentence "今天天气晴朗,万里无云,适合出门游玩" ("The weather is fine today, there is not a cloud in the sky, it is a good day to go out") contains three clauses separated by commas: clause 1 is "今天天气晴朗", clause 2 is "万里无云", and clause 3 is "适合出门游玩". Accordingly, a word unit is a word in a written sentence. A character unit is a character in a written sentence, which may be understood as a word in English or a single character in Chinese; this is not limited here. A symbol unit is a punctuation mark in a written sentence, such as a comma, quotation mark, or dash; this is not limited here. In a specific implementation, the written sentences in the written text can be adjusted or rewritten at the clause level, and/or at the word level, and/or at the character level, and/or at the symbol level, so that the written text reads more like spoken language.
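Splitting a written sentence into clause units at commas, as in the example above, can be sketched like this. The function name is hypothetical, and splitting on commas only (ASCII or full-width) is a simplifying assumption; a real system would likely handle other delimiters as well.

```python
import re
from typing import List


def split_clauses(sentence: str) -> List[str]:
    """Split a sentence into clause units at ASCII or full-width commas."""
    parts = re.split(r"[,,]", sentence)
    return [part.strip() for part in parts if part.strip()]


clauses = split_clauses("今天天气晴朗,万里无云,适合出门游玩")
for number, clause in enumerate(clauses, start=1):
    print(f"clause {number}: {clause}")
```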

In a specific implementation, the written text and the back-translated written text serve as separate written corpora for building the sample corpus. Their sentence constituent units must therefore be converted separately to obtain the corresponding spoken texts. In this embodiment, this is implemented as follows:

performing sentence-constituent-unit conversion on the written text to obtain a first spoken text corresponding to the written text;

performing sentence-constituent-unit conversion on the back-translated written text to obtain a second spoken text corresponding to the back-translated written text;

taking the first spoken text and the second spoken text as the spoken text.

The first spoken text is the spoken text obtained by converting the written text. The second spoken text is the spoken text obtained by converting the back-translated written text.

Continuing the above example, sentence-constituent-unit conversion is performed on the written text LT to obtain the corresponding first spoken text ST1, and on the back-translated written text LT3 to obtain the corresponding second spoken text ST2; the first spoken text ST1 and the second spoken text ST2 are then taken as the spoken text.

In summary, performing sentence-constituent-unit conversion on the written text and on the back-translated written text separately yields two spoken texts, both of which are taken as the spoken text, thereby expanding the spoken text.

In practice, although spoken and written expression may differ in many ways, these differences do not show up in every sentence; they occur with some probability that depends on the speaker's habits. To make the converted written text better match the characteristics of spoken language, a conversion probability can be set for each conversion strategy, and whether a strategy is executed is decided according to that probability. Specifically:

determining the conversion probability corresponding to each conversion strategy for the written text to be processed;

determining, based on the conversion probabilities, the target conversion strategy to be executed from among the conversion strategies;

performing sentence-constituent-unit conversion on the written text to be processed by executing the target conversion strategy, to obtain the spoken text corresponding to the written text to be processed.

A conversion strategy is a preset method (strategy) for converting the written text to be processed. Specifically, the conversion strategies may include at least one of the following: a clause conversion strategy (a strategy for converting the clause units of a written sentence), a word conversion strategy (a strategy for converting the word units of a written sentence), a character conversion strategy (a strategy for converting the character units of a written sentence), and a symbol conversion strategy (a strategy for converting the symbol units of a written sentence).

The clause conversion strategy may be duplication of a clause (the clause-duplication strategy), reordering, and/or inversion. The word conversion strategy may be adding, repeating, and/or reordering words. The character conversion strategy may be reordering characters. The symbol conversion strategy may be deleting, adding, and/or modifying symbols.

Specifically, the conversion probability of a conversion strategy is the probability that the strategy is executed. In practice, each conversion strategy may have its own conversion probability. The target conversion strategy to be executed is then determined from among the conversion strategies based on these probabilities. Take conversion strategy A, whose conversion probability is 10%, as an example. A numeric range such as 1-100 (or 1-10, etc.) can be set, and within it a sub-interval whose probability of being hit equals the conversion probability of strategy A, such as 1-10 (or 91-100). A value within 1-100 is then generated at random. If the generated value is 9, it falls within 1-10, meaning that the 10% condition is met, that is, the conversion probability for executing strategy A is satisfied; strategy A is therefore executed and is taken as a target conversion strategy. If the generated value is 50, it falls within 11-100, meaning that the 10% condition is not met, so strategy A is not executed. Other conversion strategies can be handled in the same way.
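The draw-and-compare procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the strategy names used as dictionary keys, and the restriction to whole-percent probabilities on a 1-100 range are all hypothetical and not taken from the application.

```python
import random

def is_selected(draw: int, probability_percent: int) -> bool:
    """A draw from 1..100 selects the strategy when it lies in 1..probability_percent."""
    return 1 <= draw <= probability_percent

def choose_target_strategies(probabilities: dict[str, int],
                             rng: random.Random) -> list[str]:
    """Return the strategies whose independent draw falls inside their interval."""
    targets = []
    for name, percent in probabilities.items():
        draw = rng.randint(1, 100)  # one independent draw per strategy
        if is_selected(draw, percent):
            targets.append(name)
    return targets

# Strategy A with a 10% conversion probability: a draw of 9 selects it,
# a draw of 50 does not.
selected_with_9 = is_selected(9, 10)
selected_with_50 = is_selected(50, 10)
```

A probability finer than one percent would need a wider range (for example 1-1000) or a floating-point draw; the sketch deliberately keeps to the 1-100 range used in the text.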

Further, one or more target conversion strategies may be determined. When there are several, they can be executed in sequence, in a preset order, to convert the written text to be processed.

Note that because each conversion strategy has its own conversion probability, and each strategy is itself somewhat random, converting the same written text several times is likely to produce different spoken texts. Therefore, to further expand the corpus, sentence-constituent-unit conversion can be performed multiple times on at least one written text to be processed, yielding multiple spoken texts for that written text.

In addition, setting a higher conversion probability for the conversion strategies increases the complexity of the sample corpus, and the more complex the sample corpus, the more complex the written-language rewriting performed by the rewriting model trained on it. Where there are multiple rewriting models for different text types, complex rewriting is not appropriate for target spoken text of the fuzzy text type. Therefore, when building the sample corpus for the rewriting model corresponding to target spoken text of the fuzzy text type, a lower conversion probability can be set for the conversion strategies.

Continuing the above example, suppose there are four conversion strategies for written text, and their conversion probabilities for the written text LT are determined to be 2%, 6%, 0.8%, and 8%, respectively. For each conversion strategy, a numeric range can be set together with a sub-interval corresponding to its conversion probability. A number is generated at random; if it falls within the sub-interval, the corresponding strategy is determined to be a target conversion strategy and is executed to perform sentence-constituent-unit conversion on the written text LT, obtaining the spoken text corresponding to LT.

In summary, a conversion probability is set for each conversion strategy, so each strategy is executed with a certain probability rather than being applied every time, which keeps the conversion of written language natural and reasonable.

In a specific implementation, because the back-translated written text is generated in order to expand the written text, spoken-language conversion must be performed on both the written text and the back-translated written text. Accordingly, either of them can serve as the written text to be processed, on which sentence-constituent-unit conversion is performed. When the sentence constituent unit is the clause unit, this is implemented through steps 30806-2 to 30806-6 below:

Step 30806-2: perform sentence recognition on the written text to be processed to obtain the written sentences it contains.

Sentence recognition on the written text to be processed can be understood as splitting the text into sentences. In practice, the sentence delimiters in the text (such as periods, question marks, semicolons, and other marks used for sentence splitting) are identified, and the text is divided at these delimiters, yielding at least one written sentence contained in the text.
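The delimiter-based splitting described above might look like the following sketch. The delimiter set, the function name `split_sentences`, and the sample text are assumptions for illustration; a production splitter would also need to handle cases such as decimal points and abbreviations, which this sketch ignores.

```python
import re

# Assumed delimiter set: Chinese and ASCII period, question mark,
# exclamation mark and semicolon.
_SENTENCE = re.compile(r"[^。?!;.?!;]+[。?!;.?!;]?")

def split_sentences(text: str) -> list[str]:
    """Split text at sentence delimiters, keeping each delimiter with its sentence."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]

sentences = split_sentences(
    "My hometown is Shanxi, it is beautiful there. "
    "Have you been there? I go back every year."
)
```

Each returned item keeps its closing delimiter, so the sentences can later be rejoined in their original order.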

Step 30806-4: perform clause-unit conversion on the written sentences to obtain converted written sentences.

Further, clause-unit conversion is performed on each recognized written sentence to obtain the corresponding converted written sentence.

Specifically, the written sentences contained in the written text to be processed can be converted at the clause level in many ways. In this embodiment of the application, the written sentences may be converted by either of the following two methods or by a combination of them:

Method 1: sample clauses from the written sentence according to a preset clause sampling rule to obtain a target clause in the written sentence; convert the target clause within the written sentence to obtain the converted written sentence.

In practice, a written sentence may contain several clauses, not all of which necessarily differ between written and spoken expression. Therefore, the clauses that need clause-unit conversion can first be selected from the written sentence, and the selected clauses are then converted.

The preset clause sampling rule is a preset rule for sampling clauses from a written sentence. It may be random sampling, sampling by position (for example, sampling the clause in the first position of the written sentence), or sampling by character count (for example, sampling clauses with fewer than 5 characters); no limitation is imposed here. Correspondingly, the target clause is the clause obtained by sampling the written sentence with the preset clause sampling rule.
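The three kinds of sampling rule mentioned above can be sketched as small functions that return the indices of the sampled clauses. The function names and the convention of returning index lists are assumptions for illustration only.

```python
import random

def sample_random(clauses: list[str], rng: random.Random) -> list[int]:
    """Random sampling: pick one clause index uniformly at random."""
    return [rng.randrange(len(clauses))]

def sample_by_position(clauses: list[str], position: int = 0) -> list[int]:
    """Positional sampling: pick the clause at a fixed position (first by default)."""
    return [position] if 0 <= position < len(clauses) else []

def sample_by_length(clauses: list[str], max_chars: int = 5) -> list[int]:
    """Length-based sampling: pick clauses with fewer than max_chars characters."""
    return [i for i, clause in enumerate(clauses) if len(clause) < max_chars]

example = ["今天天气晴朗", "万里无云", "适合出门游玩"]
short_clauses = sample_by_length(example, max_chars=5)
```

With the example sentence, only the four-character clause "万里无云" is shorter than five characters, so `short_clauses` contains its index.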

Once the target clause is obtained, it can be converted within the written sentence. The selected target clause can be converted in many ways. To make the converted written sentences more natural and varied, the target clause may be converted by any of the following three modes or by any combination of them:

Mode A: duplicate the target clause to obtain a duplicate target clause, and insert the duplicate into the written sentence at a preset clause insertion position to obtain the converted written sentence.

In practice, spoken expression sometimes contains colloquial utterances that do not occur in written sentences, such as "yes, yes, yes" (对对对) or "okay, okay, okay" (好好好). To make written language better match the characteristics of spoken language, colloquial clauses can be added to written sentences.

Specifically, the preset clause insertion position is a preset position at which the target clause is inserted into the written sentence. It can be set according to the characteristics of actual speech: for example, it may be the beginning or end of the written sentence, or the position immediately before or after the target clause in the written sentence.

Continuing the above example, suppose the written text LT is the written text to be processed. Sentence recognition on LT yields the n written sentences it contains: written sentence S1, written sentence S2, ..., written sentence Sn. Take written sentence S1, "My hometown is Shanxi, it is beautiful there", as an example. Clauses are sampled from S1 at random, and the target clause obtained is "My hometown is Shanxi". This target clause is duplicated within S1 to obtain the duplicate target clause "My hometown is Shanxi". With the preset clause insertion position being the position before the target clause, the duplicate is inserted into S1, and the converted written sentence S13 is: "My hometown is Shanxi, my hometown is Shanxi, it is beautiful there".
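Mode A, as walked through in the example above, amounts to a list insertion once the sentence has been split into clauses. The following sketch assumes the clauses are already available as a list; the function name and the `before` flag are hypothetical.

```python
def duplicate_clause(clauses: list[str], target_index: int,
                     before: bool = True) -> list[str]:
    """Insert a copy of the target clause directly before (or after) the original."""
    target = clauses[target_index]
    position = target_index if before else target_index + 1
    return clauses[:position] + [target] + clauses[position:]

s1 = ["my hometown is Shanxi", "it is beautiful there"]
s13 = duplicate_clause(s1, target_index=0, before=True)
```

Joining `s13` with commas gives the duplicated sentence of the example; other insertion positions such as the sentence start or end could be handled by passing a different position.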

Mode B: delete the target clause from the written sentence; insert the target clause into the resulting sentence according to a preset clause insertion rule to obtain the converted written sentence.

In practice, speakers sometimes pay little attention to the order of clauses, so the clause order in a spoken sentence may differ from that of the corresponding written sentence. To make the converted written language better match the characteristics of spoken language, the positions of some clauses in a written sentence can be adjusted.

Specifically, the preset clause insertion rule is a preset rule for inserting the target clause, which can be set based on practical experience. For example, it may be random insertion (the target clause is inserted before or after an arbitrary clause of the written sentence), insertion after the first clause, or insertion at the end of the sentence.
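A minimal sketch of the delete-then-insert operation of Mode B is given below, with random insertion as one possible insertion rule. The function names are assumptions; the clause list reuses the three-clause weather sentence from earlier in the description.

```python
import random

def move_clause(clauses: list[str], target_index: int, new_index: int) -> list[str]:
    """Delete the target clause, then insert it at new_index of the shortened list."""
    remaining = clauses[:target_index] + clauses[target_index + 1:]
    return remaining[:new_index] + [clauses[target_index]] + remaining[new_index:]

def move_clause_randomly(clauses: list[str], target_index: int,
                         rng: random.Random) -> list[str]:
    """Random insertion rule: any slot of the shortened sentence may be chosen."""
    new_index = rng.randint(0, len(clauses) - 1)
    return move_clause(clauses, target_index, new_index)

weather = ["the weather is fine today", "there is not a cloud in the sky",
           "it is a good day for an outing"]
moved_to_end = move_clause(weather, target_index=0, new_index=2)
```

Because the random rule may choose the clause's original slot, a caller that always wants a visible change would need to redraw in that case.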

Note that the conversions of Mode A and Mode B can be applied selectively. The Mode B conversion may be applied to the converted written sentence produced by Mode A, or directly to the original written sentence; likewise, the Mode A conversion may be applied to the converted written sentence produced by Mode B. The other conversions can also be applied selectively and/or in sequence.

Continuing the above example, again take written sentence S1, "My hometown is Shanxi, it is beautiful there". Clauses are sampled from S1 at random, and the target clause obtained is "My hometown is Shanxi". The target clause is deleted from S1 and, with the preset clause insertion rule being random insertion, the target clause "My hometown is Shanxi" is inserted after an arbitrary clause of S1, and the converted written sentence S13 is: "My hometown is Shanxi, it is beautiful there, my hometown is Shanxi".

Mode C: perform syntactic analysis on the target clause to obtain its syntactic structure; convert the target clause to the target syntactic structure corresponding to that syntactic structure to obtain the converted written sentence.

In practice, clauses with different grammatical (syntactic) structures can still express the same meaning, so the word order of a clause in a spoken sentence may differ from the grammatical structure used in the written sentence. Therefore, to make the converted written language better match the characteristics of spoken language, the grammatical structure of some clauses in a written sentence can be changed, for example by inversion.

Specifically, the sampled target clause can be analyzed with a rule-based or statistics-based syntactic analysis method to obtain its syntactic structure, which may be a subject-predicate-object structure, an object-predicate-subject structure, and so on; no limitation is imposed here. Correspondingly, the target syntactic structure is a preset syntactic structure corresponding to the syntactic structure of the target clause. In a specific implementation, a clause can be converted between its syntactic structure and the target syntactic structure. For example, if the syntactic structure is the active subject-predicate-object structure, its target syntactic structure may be the passive object-predicate-subject structure.

Continuing the above example, again suppose that clauses are sampled at random from written sentence S1, "My hometown is Shanxi, it is beautiful there", and the target clause obtained is "My hometown is Shanxi". The syntactic structure of this target clause is subject-predicate-object, and the corresponding target syntactic structure is object-predicate-subject. The target clause is therefore converted to the object-predicate-subject structure and becomes "Shanxi is my hometown". Accordingly, the converted written sentence S13 is: "Shanxi is my hometown, it is beautiful there".

Method 2: determine the clause position probability distribution corresponding to the preset clauses contained in a preset clause set; based on this distribution, determine from the preset clauses a target preset clause and its clause addition position; add the target preset clause to the written sentence at the clause addition position to obtain the converted written sentence.

The preset clause set is a preset set containing at least one colloquial clause. Correspondingly, a preset clause is a clause contained in the preset clause set. The clause position probability distribution is the position probability distribution of each preset clause, obtained in advance by counting where the preset clauses occur (for example at the beginning, end, or middle of a sentence) in a spoken-language corpus. In practice, the number of times each preset clause occurs at each position can be counted, and the position probability distribution is then computed from the counts.

Suppose the preset clause set contains three preset clauses: preset clause 1, preset clause 2, and preset clause 3. According to statistics over a spoken-language corpus from the sales domain, preset clause 1 occurs 60 times at the beginning of a sentence, preset clause 2 occurs 20 times at the end of a sentence, and preset clause 3 occurs 20 times at the beginning of a sentence. The probability of adding preset clause 1 at the beginning of the sentence is then 60/(60+20+20) = 60%, the probability of adding preset clause 2 at the end of the sentence is 20/(60+20+20) = 20%, and the probability of adding preset clause 3 at the beginning of the sentence is also 20/(60+20+20) = 20%. These three probabilities constitute the clause position probability distribution of the preset clauses.

Further, based on the clause position probability distribution, the target preset clause and its clause addition position (the position in the written sentence at which the target preset clause is added) can be determined from the preset clauses. In a specific implementation, a numeric range such as 1-100 (or 1-10, etc.) can likewise be preset, with sub-intervals whose probabilities match the clause position probability distribution, such as 1-60, 61-80, and 81-100; a value within 1-100 is then generated at random. If the generated value is 9, it falls within 1-60, meaning that the 60% condition is met, that is, the condition for adding preset clause 1 at the beginning of the sentence. The target preset clause is therefore preset clause 1, and its clause addition position is the beginning of the sentence.

Further still, if the target preset clause is "yes, yes, yes" (对对对), it is added at the beginning of written sentence S1, and the converted written sentence S13 is: "Yes, yes, yes, my hometown is Shanxi, it is beautiful there".
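The counting, interval lookup, and insertion steps of Method 2 can be illustrated with the sketch below. It works directly on the occurrence counts so that the intervals 1-60, 61-80, and 81-100 are exact; the function names, the `(clause, position)` key format, and the "start"/"end" labels are assumptions made for this example.

```python
def position_distribution(counts: dict[tuple[str, str], int]) -> dict[tuple[str, str], float]:
    """Turn (clause, position) occurrence counts into probabilities."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def pick_by_draw(counts: dict[tuple[str, str], int], draw: int) -> tuple[str, str]:
    """Map a draw from 1..total onto consecutive intervals sized by the counts."""
    upper = 0
    for key, n in counts.items():
        upper += n
        if draw <= upper:
            return key
    raise ValueError("draw is outside 1..total")

def add_preset_clause(clauses: list[str], text: str, position: str) -> list[str]:
    """Add the preset clause at the sentence start or end."""
    return [text] + clauses if position == "start" else clauses + [text]

counts = {("preset clause 1", "start"): 60,
          ("preset clause 2", "end"): 20,
          ("preset clause 3", "start"): 20}
target, where = pick_by_draw(counts, draw=9)  # 9 lies in 1-60
converted = add_preset_clause(["my hometown is Shanxi", "it is beautiful there"],
                              "Yes, yes, yes", where)
```

In use, the draw would come from something like `random.randint(1, sum(counts.values()))`; it is fixed at 9 here to mirror the worked example.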

Step 30806-6: determine the spoken text based on the converted written sentences.

When there are multiple converted written sentences, they can be combined in the order in which the original written sentences appear in the written text to generate the spoken text.

Continuing the above example, at least one of the above conversions is performed on at least one of the n written sentences contained in the written text LT, yielding n converted written sentences: written sentence S13, written sentence S23, ..., written sentence Sn3. These n converted written sentences are combined to generate the spoken text ST1.

In summary, converting written sentences at the clause level by clause duplication, clause reordering, and/or clause addition rewrites them into spoken style, so that the converted written sentences better match the characteristics of spoken language.

In addition, when the sentence constituent unit is the word unit, the written text to be processed can likewise be converted at the word level in many ways. The first implementation provided by this embodiment of the application is as follows:

performing sentence recognition on the written text to be processed to obtain the written sentences it contains;

determining the word position probability distribution corresponding to the preset words contained in a preset word set;

determining, from the preset words and according to the word position probability distribution, a target preset word and its word addition position, and adding the target preset word to the written sentence at the word addition position to obtain the converted written sentence;

determining the spoken text based on the converted written sentences.

In practice, spoken expression often includes colloquial words added more or less at random, such as conjunctions, modal particles, or other colloquial words, for example "wow" (哇塞) or "actually" (其实); such words usually do not appear in written sentences. To make written language better match the characteristics of spoken language, colloquial words can be added to written sentences.

Specifically, the preset word set is a preset set containing at least one colloquial word. Correspondingly, a preset word is a word contained in the preset word set. The word position probability distribution is the position probability distribution of each preset word, obtained in advance by counting where the preset words occur (for example at the beginning, end, or middle of a sentence) in a spoken-language corpus. In practice, the number of times each preset word occurs at each position can be counted, and the position probability distribution is then computed from the counts.

Suppose the preset word set contains two preset words: preset word 1 and preset word 2. According to statistics on a spoken language corpus from the sales domain, preset word 1 appears 80 times at the beginning of a sentence and preset word 2 appears 20 times at the end of a sentence. The probability of adding preset word 1 to the beginning of a sentence is then 80/(80+20) = 80%, and the probability of adding preset word 2 to the end of a sentence is 20/(80+20) = 20%. These two probabilities form the word position probability distribution corresponding to the preset words.
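The counting described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `position_distribution`, the identifiers `word1` and `word2`, and the position labels `start` and `end` are assumptions introduced here, and the probabilities are normalised over all counted (word, position) pairs, as in the worked example.

```python
from collections import Counter


def position_distribution(observations):
    """Turn (word, position) observations into a probability distribution.

    `observations` is an iterable of (word, position) pairs counted in a
    spoken corpus. Probabilities are normalised over all pairs, which
    mirrors the 80/(80+20) calculation in the worked example.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical corpus statistics matching the example in the text.
observations = [("word1", "start")] * 80 + [("word2", "end")] * 20
dist = position_distribution(observations)
print(dist)
```

A distribution keyed by (word, position) pairs allows the word and its addition position to be sampled together in one draw.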

Specifically, for the implementation of determining, among the preset words and according to the word position probability distribution, the target preset word and its corresponding word addition position (the position in the written language sentence at which the target preset word is added), refer to the implementation described above for determining the target preset clause and its corresponding clause addition position among the preset clauses based on the clause position probability distribution. Details are not repeated here.

Once the target preset word and its corresponding word addition position have been determined, the target preset word can be added to the written language sentence at the word addition position to obtain the converted written language sentence. For the implementation of determining the spoken text based on the converted written language sentence, refer to the corresponding implementation in the clause unit conversion processing described above, which is not limited here.

Continuing the above example, the written language text LT contains n written language sentences, and the written language sentence S1 is taken as an example. The preset word set contains two preset words, preset word 1 and preset word 2, with the following word position probability distribution: the probability of adding preset word 1 to the beginning of a sentence is 80%, and the probability of adding preset word 2 to the end of a sentence is 20%. Suppose that, according to this distribution, preset word 1 is determined as the target preset word and the beginning of the sentence is determined as the corresponding word addition position. If preset word 1 is "其实" ("actually"), it is added to the beginning of S1, and the converted written language sentence S13 is obtained as "其实我的故乡是山西,那里很美" ("Actually, my hometown is Shanxi; it is very beautiful there"). The n converted written language sentences are then combined to generate the spoken text ST1.
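The sampling and insertion in this example might look like the sketch below. The function `add_preset_word`, the position labels, and the sentence-final particle "吧" used for preset word 2 are hypothetical; the source only fixes preset word 1 as "其实". The sketch also assumes the sentence carries no final punctuation, as in S1.

```python
import random


def add_preset_word(sentence, dist, words, rng):
    """Sample a (word_id, position) pair from `dist` and add the word.

    `dist` maps (word_id, position) to a probability, `words` maps
    word_id to the actual colloquial word, and `rng` is a random.Random
    instance so that runs can be reproduced.
    """
    pairs = list(dist)
    weights = [dist[p] for p in pairs]
    word_id, position = rng.choices(pairs, weights=weights, k=1)[0]
    word = words[word_id]
    if position == "start":
        return word + sentence
    # Any other label is treated as "end" in this sketch.
    return sentence + word


s1 = "我的故乡是山西,那里很美"
dist = {("word1", "start"): 0.8, ("word2", "end"): 0.2}
words = {"word1": "其实", "word2": "吧"}  # "吧" is an assumed example
print(add_preset_word(s1, dist, words, random.Random(0)))
```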

In summary, adding words at the word unit level rewrites the written language sentence into a spoken form, so that the converted written language sentence better matches the characteristics of spoken language.

In a specific implementation, speakers who add a colloquial word may habitually repeat it. Therefore, to make the converted written language better match the characteristics of spoken language, the word added to the written language sentence can be duplicated. In this embodiment of the present application, this is implemented as follows:

Copy the target preset word added to the converted written language sentence to obtain a copied word;

Insert the copied word into the converted written language sentence according to a preset word insertion rule to obtain an inserted written language sentence;

Determine the spoken text based on the inserted written language sentence.

Specifically, the preset word insertion rule is a preset rule for inserting the copied word into the written language sentence. The rule can be set according to the characteristics of actual spoken language. For example, it may insert the copied word immediately before or after the target preset word, or at another position in the written language sentence, which is not limited here.

Continuing the above example, given the converted written language sentence S13, "其实我的故乡是山西,那里很美" ("Actually, my hometown is Shanxi; it is very beautiful there"), the target preset word "其实" is copied to obtain the copied word "其实". If the preset word insertion rule is to insert before the target preset word, the copied word "其实" is inserted into S13 to obtain the inserted written language sentence S14: "其实其实我的故乡是山西,那里很美" ("Actually, actually, my hometown is Shanxi; it is very beautiful there"). The n inserted written language sentences are then combined to generate the spoken text ST1.
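A minimal sketch of the duplication step, assuming the rule is restricted to "before" or "after" the first occurrence of the added word. The function name `repeat_added_word` and the `before` flag are illustrative and are not part of the application.

```python
def repeat_added_word(sentence, word, before=True):
    """Insert a copy of `word` next to its first occurrence in `sentence`.

    With before=True the copy goes immediately before the word, otherwise
    immediately after it. The sentence is returned unchanged if the word
    is not present.
    """
    idx = sentence.find(word)
    if idx < 0:
        return sentence
    insert_at = idx if before else idx + len(word)
    return sentence[:insert_at] + word + sentence[insert_at:]


s13 = "其实我的故乡是山西,那里很美"
s14 = repeat_added_word(s13, "其实", before=True)
print(s14)
```

Because the copy is identical to the original word, inserting it directly before or directly after yields the same string; the two rules only differ once other positions are allowed.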

In summary, on top of the word addition processing at the word unit level, the added word is repeated, so that the converted written language sentence better matches the characteristics of spoken language.

Where the sentence composition unit is a word unit, in addition to the word unit conversion processing described above, the second implementation provided by this embodiment of the present application is carried out as follows:

Perform sentence recognition on the written language text to be processed to obtain the written language sentences contained in it;

Sample the words in the written language sentence according to a preset word sampling rule to obtain a target word in the written language sentence;

Delete the target word from the written language sentence, and insert the target word into the resulting sentence within the preset insertion range corresponding to the target word, to obtain a converted written language sentence;

Determine the spoken text based on the converted written language sentence.

Because speakers sometimes pay little attention to word order, the order of words in a spoken sentence may differ from that in the corresponding written language sentence. To make the converted written language better match the characteristics of spoken language, the positions of some words in the written language sentence can be adjusted.

Specifically, the preset word sampling rule is a preset rule for sampling, from the written language sentence, the words to be reordered. The rule may be random sampling, or sampling according to a preset number of characters, for example randomly sampling words that are three characters long. Correspondingly, the target word is the word sampled from the written language sentence by the word sampling rule.

The preset insertion range is a preset range within which the insertion is performed. It can be set in advance according to practical experience or spoken language habits. Specifically, the preset insertion range corresponding to the target word may be the character interval from 3 characters before to 3 characters after the position of the target word in the written language sentence, abbreviated as [-3, 3]. Alternatively, the preset insertion range may be the clause to which the word belongs. The target word is then inserted at a random position within the preset insertion range.

Continuing the above example, the written language text LT contains n written language sentences, and the written language sentence S1, "我的故乡是山西,那里很美" ("My hometown is Shanxi; it is very beautiful there"), is taken as an example. Words are randomly sampled in S1, and the target word "那里" ("there") is obtained. The target word is deleted from S1, giving "我的故乡是山西,很美". If the preset insertion range is the clause to which the target word belongs, the target word is inserted within that range, and the converted written language sentence S13 is obtained as "我的故乡是山西,很美那里" ("My hometown is Shanxi; very beautiful, there"). The n converted written language sentences are then combined to generate the spoken text ST1.
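The delete-and-reinsert step at the word level could be sketched as below. The sketch assumes the sentence has already been segmented into tokens (the source does not name a segmenter), that the clause is delimited by the punctuation tokens listed in `CLAUSE_PUNCT`, and that the target is a word rather than a punctuation token. All names are hypothetical.

```python
import random

CLAUSE_PUNCT = {",", "。", "!", "?", ";"}


def move_word_within_clause(tokens, target_index, rng):
    """Delete tokens[target_index] and reinsert it elsewhere in its clause."""
    # Locate the clause that contains the target word.
    start = target_index
    while start > 0 and tokens[start - 1] not in CLAUSE_PUNCT:
        start -= 1
    end = target_index
    while end + 1 < len(tokens) and tokens[end + 1] not in CLAUSE_PUNCT:
        end += 1

    word = tokens[target_index]
    remaining = tokens[:target_index] + tokens[target_index + 1:]

    # Valid insertion slots inside the clause, excluding the original slot
    # so that the word order actually changes.
    slots = [p for p in range(start, end + 1) if p != target_index]
    if not slots:
        return list(tokens)
    remaining.insert(rng.choice(slots), word)
    return remaining


tokens = ["我", "的", "故乡", "是", "山西", ",", "那里", "很", "美"]
print("".join(move_word_within_clause(tokens, 6, random.Random(0))))
```

Restricting the slots to the containing clause keeps the reordering local, which corresponds to the "clause to which the word belongs" variant of the preset insertion range.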

In summary, the word unit conversion processing makes the converted written language sentence better match the characteristics of spoken language.

Where the sentence composition unit is a character unit, speakers sometimes do not strictly follow the order of characters, so the order of characters in a spoken sentence may differ from that in the corresponding written language sentence. To make the converted written language better match the characteristics of spoken language, the positions of some characters in the written language sentence can be adjusted. In this embodiment of the present application, this is implemented as follows:

Perform sentence recognition on the written language text to be processed to obtain the written language sentences contained in it;

Sample the characters in the written language sentence according to a preset character sampling rule to obtain a target character in the written language sentence;

Delete the target character from the written language sentence, and insert the target character into the resulting sentence within the preset character insertion range corresponding to the target character, to obtain a converted written language sentence;

Determine the spoken text based on the converted written language sentence.

Specifically, the preset character sampling rule is a preset rule for sampling, from the written language sentence, the characters to be reordered. The rule may be random sampling, or sampling according to a preset character position, for example sampling the character at the 5th position in the written language sentence, which is not limited here.

Correspondingly, the preset character insertion range corresponding to the target character may be the character interval from 3 characters before to 3 characters after the position of the target character in the written language sentence, abbreviated as [-3, 3]. Alternatively, the preset character insertion range may be the clause in which the target character is located, which is not limited here.

Continuing the above example, the written language text LT contains n written language sentences, and the written language sentence S1 is taken as an example. Characters are randomly sampled in S1, and the target character "美" ("beautiful") is obtained. The target character is deleted from S1, giving "我的故乡是山西,那里很". If the preset character insertion range corresponding to the target character is the clause to which the target character belongs, the target character "美" is inserted within that range, and the converted written language sentence S13 is obtained as "我的故乡是山西,那里美很", in which the last two characters are swapped relative to S1. The n converted written language sentences are then combined to generate the spoken text ST1.
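A character-level version of the same idea, using the [-3, 3] window variant, might be sketched as follows. The function `move_char` and its `offset` parameter are assumptions made for illustration; in a randomised setting the offset could be drawn with `random.randint(-3, 3)`.

```python
def move_char(sentence, index, offset, window=3):
    """Delete the character at `index` and reinsert it `offset` slots away.

    `offset` must lie in [-window, window]; the new position is clamped
    to the bounds of the sentence.
    """
    if not -window <= offset <= window:
        raise ValueError("offset outside the preset insertion range")
    ch = sentence[index]
    rest = sentence[:index] + sentence[index + 1:]
    new_index = max(0, min(len(rest), index + offset))
    return rest[:new_index] + ch + rest[new_index:]


s1 = "我的故乡是山西,那里很美"
# Move the last character "美" one position to the left.
print(move_char(s1, len(s1) - 1, -1))
```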

In summary, reordering the character units of a written language sentence makes the converted written language sentence better match the characteristics of spoken language.

Where the sentence composition unit is a symbol unit, sentence breaks and connections in spoken language may not be clearly marked, or may be marked arbitrarily, so the symbols appearing in a spoken sentence may differ from those in the corresponding written language sentence. To make the converted written language better match the characteristics of spoken language, symbol unit conversion processing can be applied to the written language text to be processed using either of the following two methods or a combination of both:

Conversion method 1: perform sentence recognition on the written language text to be processed to obtain the written language sentences contained in it; sample symbols in the written language sentence according to a preset symbol sampling rule to obtain a target punctuation mark; delete the target punctuation mark from the written language sentence to obtain a converted written language sentence; and determine the spoken text based on the converted written language sentence.

Specifically, the preset symbol sampling rule is a preset rule for sampling, from the written language sentence, the symbol to be deleted. The rule may be random sampling, or sampling according to a preset position, for example sampling the punctuation mark after the first clause of the written language sentence, which is not limited here. Correspondingly, the target punctuation mark is the punctuation mark sampled from the written language sentence according to the preset symbol sampling rule.

Continuing the above example, the written language text LT contains n written language sentences, and the written language sentence S1 is taken as an example. Symbols are randomly sampled in S1, and the target punctuation mark obtained is the comma after the clause "我的故乡是山西" ("My hometown is Shanxi"). The target punctuation mark is deleted from S1, and the converted written language sentence S13 is obtained as "我的故乡是山西那里很美" ("My hometown is Shanxi it is very beautiful there"). The n converted written language sentences are then combined to generate the spoken text ST1.
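Conversion method 1 with random sampling can be sketched as below. The set of marks in `PUNCT` and the function name `delete_punctuation` are assumptions; the source does not enumerate which punctuation marks are eligible.

```python
import random

# Assumed set of full-width punctuation marks eligible for deletion.
PUNCT = ",。!?;、:"


def delete_punctuation(sentence, rng):
    """Randomly pick one punctuation mark in `sentence` and delete it."""
    positions = [i for i, ch in enumerate(sentence) if ch in PUNCT]
    if not positions:
        return sentence
    i = rng.choice(positions)
    return sentence[:i] + sentence[i + 1:]


s1 = "我的故乡是山西,那里很美"
print(delete_punctuation(s1, random.Random(0)))
```

S1 contains a single comma, so the random choice has only one candidate here; for sentences with several marks the seed determines which one is removed.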

Conversion method 2: perform sentence recognition on the written language text to be processed to obtain the written language sentences contained in it; sample clauses in the written language sentence according to a preset symbol clause sampling rule to obtain a target symbol clause; insert a preset punctuation mark into the target symbol clause to obtain a converted written language sentence; and determine the spoken text based on the converted written language sentence.

Specifically, the preset symbol clause sampling rule is a preset rule for sampling, from the written language sentence, the clause to which a symbol is to be added. The rule may be random sampling, or sampling according to the number of characters in a clause, for example sampling the clause with the most characters in the written language sentence, which is not limited here. Correspondingly, the preset punctuation mark is a punctuation mark set in advance for insertion. In practical applications, the preset punctuation mark may be inserted into the target symbol clause at a random position or at a preset position, which is not limited here. The target symbol clause is the clause sampled from the written language sentence according to the preset symbol clause sampling rule.

Continuing the above example, the written language text LT contains n written language sentences, and the written language sentence S1 is taken as an example. If the preset symbol clause sampling rule is to sample the clause with the most characters, symbol clause sampling on S1 yields the target symbol clause "我的故乡是山西" ("My hometown is Shanxi"). If the preset punctuation mark is "!", it is inserted into S1, and the converted written language sentence S13 is obtained as "我的故乡是山西,!那里很美" ("My hometown is Shanxi,! it is very beautiful there"). The n converted written language sentences are then combined to generate the spoken text ST1.
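Conversion method 2 with the "longest clause" rule could be sketched as follows. To reproduce the example output, the sketch treats a clause as including its trailing punctuation mark and appends the preset mark at the end of that clause; this placement is an assumption, as are the function name and the delimiter set.

```python
import re

# A clause is a run of non-delimiter characters plus an optional trailing
# delimiter. The delimiter set is an assumption of this sketch.
CLAUSE_RE = re.compile(r"[^,。!?;]+[,。!?;]?")


def insert_mark_after_longest_clause(sentence, mark="!"):
    """Append `mark` to the longest clause of `sentence`."""
    clauses = CLAUSE_RE.findall(sentence)
    if not clauses:
        return sentence
    # Index of the clause with the most characters (first one on ties).
    longest = max(range(len(clauses)), key=lambda i: len(clauses[i]))
    clauses[longest] += mark
    return "".join(clauses)


s1 = "我的故乡是山西,那里很美"
print(insert_mark_after_longest_clause(s1))
```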

In summary, deleting and adding symbols at the symbol unit level rewrites the written language sentence into a spoken form, so that the converted written language sentence better matches the characteristics of spoken language.

Step 30808: Construct a sample corpus based on the correspondence between the written language text, the back-translated written language text, and the spoken text.

Specifically, because the spoken text obtained above is produced by converting the written language text or the back-translated written language text, there is a correspondence between the spoken text and the written language text or back-translated written language text. Based on this correspondence, a sample corpus of aligned written and spoken text can be generated.

The sample corpus is the set of training sample pairs used for model training. In practical applications, the generated pairs of written language text and spoken text can be used to train a written language rewriting model that rewrites spoken text into written language text. When training the written language rewriting model, the spoken text in the sample corpus is used as the training sample, and the written language text in the sample corpus is used as the sample label corresponding to that spoken text.

In practical applications, the spoken text obtained through the conversion processing may contain abnormal data, which seriously degrades the quality of the spoken text. To ensure the quality of the generated spoken text, the abnormal data in the spoken text can be cleaned. In this embodiment of the present application, this is implemented as follows:

Identify abnormal information in the spoken text;

Clean the spoken text according to the abnormal information to obtain a cleaned spoken text;

Construct the sample corpus based on the correspondence between the written language text, the back-translated written language text, and the cleaned spoken text.

The abnormal information may be typos, repeated punctuation marks, Chinese punctuation mixed with English punctuation, special symbols, stop words, and similar anomalies. The abnormal information may also be semantically vague or semantically unreasonable content, which is not limited here. In practical applications, abnormal information in the spoken text can be identified by preset anomaly identification rules or by a pre-trained text cleaning model. In a specific implementation, the text cleaning model may be a deep context model for grammatical error correction, used to perform grammar checking.

Further, after the abnormal information in the spoken text has been identified, if there are multiple spoken texts, the spoken texts containing abnormal information can be deleted directly, leaving spoken texts without abnormal information (that is, the cleaned spoken texts). Alternatively, the abnormal information in a spoken text can be deleted or corrected to obtain the cleaned spoken text, which is not limited here. Note that if any spoken text is deleted, its corresponding written language text or back-translated written language text must also be deleted.

In a specific implementation, because the written language text may also contain abnormal information, data cleaning can be performed on both the written language text and the spoken text.

Continuing the above example, the written language text LT is converted as the text to be processed to obtain the spoken text ST1, and the back-translated written language text LT3 is converted as the text to be processed to obtain the spoken text ST2. The anomaly identification rules identify the abnormal information ",!" in ST1, while ST2 contains no anomaly. ST1 is therefore cleaned according to the abnormal information to obtain the cleaned spoken text ST1, and ST2 is used directly as the cleaned spoken text ST2. Based on the correspondence between the written language text LT and the cleaned spoken text ST1, the two are combined into sample corpus pair 1. Based on the correspondence between the back-translated written language text LT3 and the cleaned spoken text ST2, the two are combined into sample corpus pair 2. Sample corpus pair 1 and sample corpus pair 2 together form the sample corpus.
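One possible rule-based cleaning pass, together with the pairing of spoken inputs and written labels, is sketched below. The rule shown (collapse any run of consecutive punctuation marks to its first mark) is only one assumed way to handle an anomaly such as ",!"; the source does not specify how the anomaly is corrected. The names `clean_spoken` and `build_sample_corpus`, and the placeholder strings standing in for LT3 and ST2, are hypothetical.

```python
import re

# Runs of two or more punctuation marks (full-width or ASCII).
REPEATED_PUNCT = re.compile(r"[,。!?;、:,.!?;:]{2,}")


def clean_spoken(text):
    """Keep only the first mark of each run of consecutive punctuation."""
    return REPEATED_PUNCT.sub(lambda m: m.group(0)[0], text)


def build_sample_corpus(pairs):
    """Build training pairs from (written_text, spoken_text) tuples.

    The cleaned spoken text is the model input and the written text is
    its label, as described for training the rewriting model.
    """
    return [{"input": clean_spoken(spoken), "label": written}
            for written, spoken in pairs]


corpus = build_sample_corpus([
    ("我的故乡是山西,那里很美", "其实我的故乡是山西,!那里很美"),
    ("placeholder for LT3", "placeholder for ST2"),
])
for sample in corpus:
    print(sample)
```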

In summary, cleaning the converted spoken text and then constructing the sample corpus from the cleaned spoken text ensures the quality of the sample corpus and further improves the accuracy of model training.

Referring to FIG. 4, FIG. 4 shows a schematic diagram of constructing a sample corpus in the text processing method provided by an embodiment of the present application. After the written language text is obtained, it can be back-translated in order to expand it further. During back-translation, lexical and syntactic analysis is performed on the sentences in the written language text, and the key entity words in the sentences are matched back (replaced) according to the analysis results, so that the key information in the back-translated corpus remains consistent with that in the written language text. The back-translated corpus and the written language text are then input together, as the data source, into a colloquial data generation module for spoken conversion. The colloquial data generation module performs clause-level, word-level, character-level, and symbol-level conversion processing on the written language sentences in the data source.

The clause-level conversion processing includes clause repetition, clause generation, and clause reordering. The word-level conversion processing includes word addition, word repetition, and word reordering. The character-level conversion processing includes character reordering. The symbol-level conversion processing includes symbol deletion and symbol insertion.

After the data source has been converted by the colloquial data generation module, an initial spoken text is obtained. Data cleaning is performed on the initial spoken text to remove abnormal information (that is, content containing erroneous information, erroneous data, or erroneous punctuation), and the spoken text corresponding to the data source is then output.

This embodiment of the present application studies and summarizes the text structure and the syntactic and grammatical characteristics of colloquial text. It back-translates standard written language text to expand the written language text, and then converts the expanded written language text to generate colloquial expressions of the corresponding written language text. This expands the sample corpus of the written language rewriting model and improves both the efficiency of obtaining such a sample corpus and its richness.

Step 30810: Train an initial written language rewriting model on the sample corpus until a written language rewriting model that satisfies a second training stop condition is obtained.

Specifically, once the sample corpus has been constructed, it can be used to train the initial written language rewriting model. After training is complete, a written language rewriting model that can be used for written language rewriting is obtained.

The initial written language rewriting model may be a to-be-trained written language rewriting model built on a Seq2Seq model, in which both the encoder and the decoder may be built with the Transformer model. Correspondingly, the second training stop condition is the condition for stopping the training of the initial written language rewriting model on the sample corpus. The second training stop condition may be that the loss value between the sample written language text and the predicted written language text, which the model generates by rewriting the spoken text in the sample corpus, is smaller than a preset loss value. It may also be that the number of training iterations reaches a preset number, such as 5 or 6, which is not limited here. Correspondingly, the written language rewriting model can be understood as the trained model that rewrites spoken text into written language.

In specific implementation, during training the initial written language rewriting model rewrites the input sample spoken text into written language and outputs a predicted written text, and the loss value between the predicted written text and the sample written text is calculated. In practical applications, the loss function used to calculate the model loss value may be a 0-1 loss function, an absolute value loss function, a squared loss function, a cross-entropy loss function, and so on. Here the absolute value loss function is taken as an example; see Formula 2 below:

$$L(Y, f(X)) = |Y - f(X)|$$
Formula 2

Here, L denotes the loss value, f(X) denotes the predicted written text, and Y denotes the sample written text. The present application does not limit the choice of loss function, which depends on the actual application.

After the model loss value is calculated, the model parameters of the initial written language rewriting model are adjusted backward according to the model loss value, and the next batch of sample corpus is sampled to continue training the initial written language rewriting model. When the training stop condition is reached, the trained written language rewriting model is obtained.
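The two stop conditions described above (loss below a preset value, or a preset number of iterations reached) can be sketched as a small control loop. This is only an illustrative sketch: `absolute_loss`, `train_until_stop`, the `step_fn` callback, and the default thresholds are hypothetical names and values, and the texts are assumed to be already encoded as numeric vectors of equal length.

```python
def absolute_loss(predicted, target):
    """Mean absolute difference between two equal-length numeric vectors.

    Assumption: the predicted and sample written texts have already been
    encoded as numeric vectors; the patent does not specify the encoding.
    """
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def train_until_stop(step_fn, batches, loss_threshold=0.05, max_iterations=5):
    """Run training steps until either stop condition holds.

    step_fn(batch) is a hypothetical callable that updates the model on one
    batch of sample corpus and returns the loss value for that batch.
    Returns the number of iterations performed and the last loss value.
    """
    iterations = 0
    loss = float("inf")
    for batch in batches:
        loss = step_fn(batch)
        iterations += 1
        # Stop when the loss is small enough or the iteration budget is used up.
        if loss < loss_threshold or iterations >= max_iterations:
            break
    return iterations, loss


# Usage with a stand-in step function that replays a fixed loss sequence.
fake_losses = iter([0.9, 0.4, 0.03, 0.01])
print(train_until_stop(lambda batch: next(fake_losses), range(10)))  # (3, 0.03)
```

In a real system `step_fn` would wrap the forward pass, the loss computation, and the parameter update of the Seq2Seq model; the loop only shows how the two stop conditions interact.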

In specific implementation, because the written language rewriting model is trained on a rich sample corpus, it can handle relatively complex sentence rewriting. The written language rewriting model can therefore be used to perform relatively complex rewriting on target spoken text of the standard text type.

In addition, after the text type corresponding to the target spoken text has been obtained, the text type may turn out to be the fuzzy text type. In this case, to ensure that the written language rewriting is reasonable and accurate, the embodiments of the present application proceed as follows:

when the text type is the fuzzy text type, selecting the corresponding written language conversion model according to the fuzzy text type;

inputting the target spoken text into the written language conversion model for processing, and obtaining the converted written text corresponding to the target spoken text;

wherein the written language conversion model is trained on written text and on the basic spoken text obtained by performing conversion processing on the written text.

A target spoken text of the fuzzy text type is one whose semantic expression is relatively vague. Such a text needs to be input into the written language conversion model, which rewrites spoken text only slightly, to obtain the written text corresponding to the target spoken text (the converted written text). The reason is that the semantic expression of this type of target spoken text is already vague, and complex rewriting may make its meaning even vaguer or introduce deviations. For target spoken text of the fuzzy text type, it is therefore sufficient to use the written language conversion model to perform simple rewriting of colloquial words, modal particles, and the like.

Assume that the text type output by the text classification model for the target spoken text TST is the fuzzy text type. The target spoken text TST is then input into the written language conversion model, and the target written text TLT2 output by the written language conversion model is obtained.

In summary, slightly rewriting target spoken text of the fuzzy text type with the written language conversion model allows different types of spoken text to be rewritten appropriately and safeguards the quality of the written language rewriting.

In specific implementation, the written language conversion model is trained through the following steps:

acquiring written text;

performing conversion processing on the sentence composition units of the written text to obtain basic spoken text;

constructing a basic sample corpus based on the correspondence between the written text and the basic spoken text;

training the initial written language conversion model with the basic sample corpus until a written language conversion model that satisfies the first training stop condition is obtained.

The basic spoken text is the spoken text generated by performing conversion processing on the acquired written text. In practical applications, acquiring the written text and performing conversion processing on its sentence composition units to obtain the basic spoken text are implemented in a manner similar to that described above for acquiring written text and performing sentence-composition-unit conversion on the written text and the back-translated written text to obtain spoken text. Reference may be made to that implementation, which is not repeated here.

Correspondingly, the basic sample corpus is the sample corpus constructed by using the basic spoken text as the training sample and the written text corresponding to that spoken text as the sample label. The first training stop condition is the condition for stopping the training of the initial written language conversion model on the basic sample corpus. Similarly, the first training stop condition may be that the loss value between the predicted written text, generated when the model rewrites a spoken text in the basic sample corpus, and the sample written text is smaller than a preset loss value. It may also be that the number of training iterations reaches a preset number of iterations, such as 5 or 6, which is not limited here. Correspondingly, the written language conversion model can be understood as the model, trained on the basic sample corpus, that rewrites spoken text into written language.

In specific implementation, training the initial written language conversion model with the basic sample corpus until a written language conversion model that satisfies the first training stop condition is obtained is implemented in a manner similar to that described above for training the initial written language rewriting model with the sample corpus until a written language rewriting model that satisfies the second training stop condition is obtained. Reference may be made to that implementation, which is not repeated here.

It should be noted that the written text in the basic sample corpus is not expanded through back-translation, so the basic sample corpus is smaller than the sample corpus constructed above. For the same reason, the rewriting performed by the written language conversion model trained on the basic sample corpus is simpler than that performed by the written language rewriting model.

In summary, training the written language conversion model on the basic sample corpus enables it to slightly rewrite target spoken text of the fuzzy text type, which makes the written language rewriting more reasonable.

In addition, the text type may be the invalid text type, in which case the target spoken text is deleted. A target spoken text of the invalid text type is a spoken text that contains no semantic information, and rewriting such a text into written language would produce a result that also carries no semantic information. Target spoken text of the invalid text type can therefore be deleted directly, without any written language rewriting.

In summary, target spoken text of the invalid text type is deleted directly, which avoids wasting computing resources on invalid spoken text and thereby reduces the computational cost.
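The handling of the three text types described above can be summarized as a simple dispatch. The sketch below is only illustrative: the type labels, the function name `process_spoken_text`, and the three callables passed in are hypothetical stand-ins for the text classification model, the written language rewriting model, and the written language conversion model.

```python
from typing import Callable, Optional

# Hypothetical labels for the three text types named in the description.
STANDARD = "standard"
FUZZY = "fuzzy"
INVALID = "invalid"


def process_spoken_text(
    text: str,
    classify: Callable[[str], str],
    rewriting_model: Callable[[str], str],
    conversion_model: Callable[[str], str],
) -> Optional[str]:
    """Route a target spoken text to the model that matches its text type.

    Returns the written text, or None when the text is invalid and dropped.
    """
    text_type = classify(text)
    if text_type == STANDARD:
        # Standard text: relatively complex rewriting.
        return rewriting_model(text)
    if text_type == FUZZY:
        # Fuzzy text: only slight rewriting, to avoid distorting the meaning.
        return conversion_model(text)
    if text_type == INVALID:
        # Invalid text carries no semantic information, so it is deleted.
        return None
    raise ValueError(f"unknown text type: {text_type!r}")
```

Returning `None` for invalid text is one possible way to represent deletion; a caller would simply skip such entries.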

The text processing method provided by the embodiments of the present application acquires a target spoken text and classifies it to obtain the corresponding text type. When the text type is the standard text type, the corresponding written language rewriting model is selected according to the standard text type, so that a written language rewriting model suited to the target spoken text is chosen on the basis of its text type. The target spoken text is then input into the written language rewriting model for processing to obtain the corresponding target written text, which makes the written language rewriting more targeted and more accurate. The written language rewriting model is trained on written text and on the spoken text obtained by back-translating and converting that written text. Preprocessing the written text through back-translation and conversion provides a large spoken text-written text sample corpus for model training, which reduces the difficulty of training the model, avoids the time-consuming and laborious manual collection and processing of large amounts of text data, and saves time and labor costs.

The text processing method is further described below with reference to FIG. 5, taking its application in an actual scenario as an example. FIG. 5 shows a processing flowchart of a text processing method applied to an actual scenario according to an embodiment of the present application, which specifically includes the following steps:

Step 502: Acquire written text.

Specifically, the written text may come from any field, such as medicine, chemistry, sales, daily life, or tourism, which is not limited here. There may be one written text or several, which is likewise not limited here.

Taking the sales field as an example, a written text T from the sales field is acquired.

Step 504: Perform part-of-speech analysis on the written text to identify the key words in the written text whose part of speech is a preset part of speech.

Part-of-speech analysis is performed on each word contained in the written text T to obtain the part of speech of each word. When the preset part of speech is the noun, the nouns in the written text T are identified as key words.

Step 506: Mark the positions of the key words in the written text.

On this basis, assume that the key words identified in the written text T are "计算机" (computer) and "速度" (speed), and that the written language sentence SS containing them in the written text T is: "我使用计算机,速度很快,而且是非常便捷的" (I use a computer; the speed is fast, and it is very convenient). The positions are marked in the written text T with asterisks "*", which yields the marked written text T. In the marked written text T, the written language sentence SS becomes: "我使用*计算机*,*速度*很快,而且是非常便捷的".
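The position marking in step 506 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the key words are taken as already identified by a part-of-speech tagger (not shown), and `mark_keywords` is a hypothetical helper name.

```python
def mark_keywords(sentence, keywords, marker="*"):
    """Wrap every occurrence of each key word in the marker character.

    Assumptions: the key words come from an earlier part-of-speech analysis,
    none of them is a substring of another, and the marker character does
    not otherwise occur in the sentence.
    """
    marked = sentence
    for keyword in keywords:
        marked = marked.replace(keyword, f"{marker}{keyword}{marker}")
    return marked


sentence_ss = "我使用计算机,速度很快,而且是非常便捷的"
print(mark_keywords(sentence_ss, ["计算机", "速度"]))
# 我使用*计算机*,*速度*很快,而且是非常便捷的
```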

Step 508: Translate the marked written text into a translated written text in a preset language.

Specifically, the preset language may be any one or more languages, such as English, French, or Korean, which is not limited here.

On this basis, when the preset language is English, the marked written text T is translated into English to obtain the corresponding English translated written text T1.

Step 510: Translate the translated written text into the target language to which the written text belongs to obtain the initial back-translated written text.

Specifically, because the target language of the text content in the written text T is Chinese, the English translated written text T1 is translated into Chinese to obtain the corresponding initial back-translated written text T2. In the initial back-translated written text T2, the written language sentence SS2 corresponding to the written language sentence SS reads: "我采用*电脑*,*效率*很快,而且非常方便" (I use a *computer* [a different Chinese word], the *efficiency* is fast, and it is very convenient).

Step 512: Replace the target key words at the position marks in the initial back-translated written text with the key words to obtain the back-translated written text.

Specifically, the target key word corresponding to a position mark is the word enclosed by that position mark in the initial back-translated written text; each target key word also corresponds to one of the key words. In practical applications, part-of-speech analysis of the written text is combined with position marking and replacement of words of a specific part of speech during back-translation, so that the key information in the back-translated written text stays as close as possible to that in the written text.

On this basis, the target key words at the marked positions in the initial back-translated written text T2 are "电脑" and "效率". The key word "计算机" replaces "电脑" in the initial back-translated written text T2, and the key word "速度" replaces "效率", which yields the back-translated written text T3. In the back-translated written text T3, the written language sentence SS3 corresponding to the written language sentence SS reads: "我采用计算机进行运算,速度很快,而且非常方便" (I use a computer to perform calculations; the speed is fast, and it is very convenient).
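The replacement in step 512 can be sketched as follows. The sketch assumes that the asterisk marks survive both translation passes and that the marked spans come back in the same order as the original key words; the patent does not state how the correspondence is established, so `restore_keywords` and its order-based matching are assumptions made only for illustration.

```python
import re


def restore_keywords(back_translated, keywords, marker="*"):
    """Replace each marked span with the original key word and drop the marks.

    Assumption: marked spans appear in the same order as `keywords`.
    """
    escaped = re.escape(marker)
    pattern = re.compile(escaped + r"(.+?)" + escaped)
    marked_spans = pattern.findall(back_translated)
    if len(marked_spans) != len(keywords):
        raise ValueError("number of marked spans does not match the key words")
    remaining = iter(keywords)
    # re.sub visits matches left to right, so key words are consumed in order.
    return pattern.sub(lambda match: next(remaining), back_translated)


sentence_ss2 = "我采用*电脑*,*效率*很快,而且非常方便"
restored = restore_keywords(sentence_ss2, ["计算机", "速度"])
```

Raising an error on a count mismatch is a conservative choice; a production pipeline might instead discard such a back-translation.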

Step 514: Treat each of the written text and the back-translated written text as a to-be-processed written text, and perform sentence recognition on each to-be-processed written text to obtain the written language sentences it contains.

Specifically, the written text T and the back-translated written text T3 are each treated as a to-be-processed written text. Sentence recognition is performed on each to-be-processed text in turn to obtain the written language sentences it contains. The following steps 516 to 522 are then performed on each written language sentence in each to-be-processed text.

On this basis, assume that the written text T is taken as the to-be-processed text T and that sentence recognition on it yields n written language sentences, namely written language sentence 1, written language sentence 2, ..., written language sentence n. The following steps 516 to 522 are performed on each of these n written language sentences.

Step 516: Perform clause-unit conversion processing on the written language sentence to obtain the converted A4-th written language sentence.

Specifically, clause-unit conversion processing of any written language sentence is implemented by performing the following steps 516-1 to 516-18:

Step 516-1: Among the clause conversion processing strategies for the written language sentence, determine the copy clause conversion processing probability corresponding to the copy clause conversion processing strategy.

The copy clause conversion processing strategy is a processing strategy that copies a clause in the written language sentence. Correspondingly, the copy clause conversion processing probability is the preset probability of executing the copy clause conversion processing strategy. It may be set in advance according to practical experience or spoken-language habits, for example 10%, 20%, or 30%, which is not limited here. A copy clause conversion processing probability of 10% means that there is a 10% probability of executing the copy clause conversion processing strategy on a written language sentence.

On this basis, assume that written language sentence 1 is the written language sentence SS above, "我使用计算机,速度很快,而且是非常便捷的", and that the copy clause conversion processing probability corresponding to the copy clause conversion processing strategy is 10%. The probability of executing the copy clause conversion processing strategy on written language sentence 1 is then 10%.

Step 516-2: Based on the copy clause conversion processing probability, determine whether to execute the copy clause conversion processing strategy on the written language sentence.

Specifically, if it is determined based on the copy clause conversion processing probability that the copy clause conversion processing strategy is to be executed, the following step 516-3 is performed. If it is determined that the copy clause conversion processing strategy is not to be executed, the written language sentence is used directly as the A1-th written language sentence and the following step 516-5 is performed.

Step 516-3: When it is determined that the copy clause conversion processing strategy is to be executed, sample a clause from the written language sentence according to a first preset sampling rule to obtain the first target clause in the written language sentence.

Specifically, the first preset sampling rule is a preset rule for sampling, from the written language sentence, the clause to be copied. It may be random sampling, or sampling by position, such as sampling the clause in the first position of the written language sentence. It may also be sampling by character count, such as sampling a clause with fewer than 5 characters. The first preset sampling rule may be the same as the preset clause sampling rule in the foregoing method embodiments, or may be understood as one of the preset clause sampling rules in those embodiments. Correspondingly, the first target clause is the clause sampled from the written language sentence according to the first preset sampling rule, and may also be understood as the target clause in the foregoing method embodiments.

On this basis, when it is determined that the copy clause conversion processing strategy is to be executed, a clause is sampled at random from written language sentence 1, and the first target clause obtained is "速度很快" (the speed is fast).

Step 516-4: Copy the first target clause to obtain a copied target clause, and insert the copied target clause into the written language sentence at a preset clause insertion position to obtain the converted A1-th written language sentence.

Specifically, the preset clause insertion position is the preset position at which the copied target clause is inserted into the written language sentence. It can be set according to the characteristics of actual spoken language. For example, the preset clause insertion position may be the beginning or the end of the written language sentence, or immediately before or after the position of the first target clause in the written language sentence, which is not limited here.

On this basis, the first target clause "速度很快" is copied, and the copied target clause is likewise "速度很快". When the preset clause insertion position is immediately before the first target clause, the copied target clause is inserted before the first target clause in written language sentence 1, and the converted A1-th written language sentence is: "我使用计算机,速度很快,速度很快,而且是非常便捷的" (I use a computer; the speed is fast, the speed is fast, and it is very convenient).
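Steps 516-1 to 516-4 can be sketched as a single function. The sketch below makes several assumptions that the description leaves open: clauses are delimited by the full-width comma ",", the first preset sampling rule is random sampling, and the copy is inserted immediately before the sampled clause. The function name `duplicate_clause` and its parameters are hypothetical.

```python
import random


def duplicate_clause(sentence, probability=0.1, rng=None, separator=","):
    """With the given probability, copy one randomly sampled clause.

    Returns the sentence unchanged when the strategy is not executed, which
    corresponds to using the sentence directly as the A1-th sentence.
    """
    if rng is None:
        rng = random.Random()
    # rng.random() is in [0, 1), so the strategy fires with `probability`.
    if rng.random() >= probability:
        return sentence
    clauses = sentence.split(separator)
    index = rng.randrange(len(clauses))
    # Insert the copy immediately before the sampled clause.
    clauses.insert(index, clauses[index])
    return separator.join(clauses)


sentence_1 = "我使用计算机,速度很快,而且是非常便捷的"
converted = duplicate_clause(sentence_1, probability=1.0, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the augmentation reproducible; other sampling rules (by position or by character count) would only change how `index` is chosen.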

Step 516-5: Among the clause conversion processing strategies for the A1-th written language sentence, determine the add clause conversion processing probability corresponding to the add clause conversion processing strategy.

The add clause conversion processing strategy is a processing strategy that adds a clause to the written language sentence. Correspondingly, the add clause conversion processing probability is the preset probability of executing the processing related to the add clause conversion processing strategy. It may likewise be set in advance according to practical experience or spoken-language habits, for example 15% or 20%, which is not limited here. An add clause conversion processing probability of 15% means that there is a 15% probability of executing the add clause conversion processing strategy on a written language sentence.

On this basis, the add clause conversion processing probability corresponding to the add clause conversion processing strategy for the A1-th written language sentence is determined to be 15%.

Step 516-6: Based on the add clause conversion processing probability, determine whether to execute the add clause conversion processing strategy on the A1-th written language sentence.

Specifically, determining, based on the add clause conversion processing probability, whether to execute the add clause conversion processing strategy on the A1-th written language sentence is implemented in a manner similar to determining, based on the copy clause conversion processing probability, whether to execute the copy clause conversion processing strategy on the written language sentence. Reference may be made to that implementation, which is not repeated here.

In specific implementation, if it is determined that the add clause conversion processing strategy is to be executed, the following step 516-7 is performed. If it is determined that the add clause conversion processing strategy is not to be executed, the A1-th written language sentence is used directly as the A2-th written language sentence and the following step 516-9 is performed.

On this basis, assume that, based on the add clause conversion processing probability of 15%, it is determined that the add clause conversion processing strategy is to be executed on the A1-th written language sentence.

Step 516-7: When it is determined that the add clause conversion processing strategy is to be executed, determine the clause position probability distribution corresponding to the preset clauses contained in the preset clause set.

Specifically, the preset clause set contains three preset clauses: preset clause 1, preset clause 2, and preset clause 3. The clause position probability distribution corresponding to these three preset clauses is as follows: the probability of adding preset clause 1 at the beginning of the sentence is 60%, the probability of adding preset clause 2 at the end of the sentence is 20%, and the probability of adding preset clause 3 at the beginning of the sentence is 20%.

Step 516-8: Based on the clause position probability distribution, determine a target preset clause among the preset clauses together with its clause addition position, and add the target preset clause to the A1-th written language sentence at the clause addition position to obtain the converted A2-th written language sentence.

Specifically, assume that, based on the clause position probability distribution, the target preset clause is determined to be preset clause 1 and its clause addition position is the beginning of the sentence. Preset clause 1 is then added at the beginning of the A1-th written language sentence. When preset clause 1 is "对对对" (right, right, right), the converted A2-th written language sentence is: "对对对,我使用计算机,速度很快,速度很快,而且是非常便捷的" (Right, right, right, I use a computer; the speed is fast, the speed is fast, and it is very convenient).
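The selection in steps 516-7 and 516-8 amounts to weighted sampling over (clause, position) pairs. The following sketch is illustrative only: `add_preset_clause` is a hypothetical name, the texts of preset clauses 2 and 3 are placeholders because the description gives only preset clause 1, and the full-width comma is assumed as the joining separator.

```python
import random

# (clause text, addition position, probability). Only "对对对" appears in the
# description; the other two clause texts are placeholders.
PRESET_CLAUSES = [
    ("对对对", "start", 0.6),
    ("<preset clause 2>", "end", 0.2),
    ("<preset clause 3>", "start", 0.2),
]


def add_preset_clause(sentence, presets, rng=None, separator=","):
    """Pick one preset clause by its probability and attach it to the sentence."""
    if rng is None:
        rng = random.Random()
    weights = [weight for _, _, weight in presets]
    clause, position, _ = rng.choices(presets, weights=weights, k=1)[0]
    if position == "start":
        return clause + separator + sentence
    return sentence + separator + clause


sentence_a1 = "我使用计算机,速度很快,速度很快,而且是非常便捷的"
sentence_a2 = add_preset_clause(sentence_a1, PRESET_CLAUSES, rng=random.Random(0))
```

The decision whether to run this step at all (the 15% add clause conversion processing probability) is separate and would wrap this call in the same kind of probability check used for the copy strategy.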

Step 516-9: Among the clause conversion processing strategies for the A2-th written language sentence, determine the out-of-order clause conversion processing probability corresponding to the out-of-order clause conversion processing strategy.

The out-of-order clause conversion processing strategy is a processing strategy that shuffles the order of the clauses in the written language sentence. Correspondingly, the out-of-order clause conversion processing probability is the preset probability of executing the processing related to the out-of-order clause conversion processing strategy. It may likewise be set in advance according to practical experience or spoken-language habits, for example 5% or 10%, which is not limited here. An out-of-order clause conversion processing probability of 5% means that there is a 5% probability of executing the out-of-order clause conversion processing strategy on a written language sentence.

基于此,确定第A2书面语语句的乱序子句转换处理策略对应的乱序子句转换处理概率为5%。Based on this, it is determined that the out-of-order clause conversion processing probability corresponding to the out-of-order clause conversion processing strategy of the A2-th written language sentence is 5%.

Step 516-10: Based on the out-of-order clause conversion processing probability, determine whether to execute the out-of-order clause conversion strategy for written language sentence A2.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the out-of-order clause conversion strategy is to be executed, step 516-11 below is performed. If it is determined that the strategy is not to be executed, written language sentence A2 is used directly as written language sentence A3, and step 516-13 below is performed.

Based on this, assume that, given the out-of-order clause conversion processing probability of 5%, it is determined that the out-of-order clause conversion strategy is executed for written language sentence A2.
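The decision in step 516-10 can be read as a single random trial against the preset probability. The following is a minimal sketch under that assumption; the function name `should_apply` and the use of a uniform draw are illustrative and are not taken from the text, which defers the exact mechanism to the earlier description of the copied clause strategy.

```python
import random


def should_apply(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability (one Bernoulli trial).

    Assumption: a uniform draw in [0, 1) is compared with the preset
    probability; the source does not prescribe this mechanism here.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return rng.random() < probability


# Empirical check: with p = 5%, roughly 5% of sentences are selected.
rng = random.Random(0)
trials = 100_000
hits = sum(should_apply(0.05, rng) for _ in range(trials))
print(f"empirical rate: {hits / trials:.3f}")
```

Passing an explicit `random.Random` instance keeps the decision reproducible when a seed is fixed, which is convenient when generating training data.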

Step 516-11: If it is determined that the out-of-order clause conversion strategy is to be executed, sample a clause from written language sentence A2 according to the second preset sampling rule to obtain the second target clause in written language sentence A2.

Specifically, the second preset sampling rule is a preset rule for sampling, from a written language sentence, the clause to be reordered. It may be random sampling; it may be position-based, for example sampling the clause in the last position of the sentence; or it may be based on character count, for example sampling clauses with fewer than 5 characters. It is not limited here. In practice, the second preset sampling rule may be the same as or different from the first preset sampling rule described above. It may also be the same as the preset clause sampling rule in the method embodiments above, or be understood as one of those rules. Correspondingly, the second target clause is the clause sampled from the written language sentence according to the second preset sampling rule, and can also be understood as the target clause in the method embodiments above.

Based on this, when it is determined that the out-of-order clause conversion strategy is to be executed, a clause is sampled at random from written language sentence A2, and the second target clause obtained is “对对对”.

Step 516-12: Delete the second target clause from written language sentence A2, and insert the second target clause into the resulting sentence according to the preset clause insertion rule to obtain the converted written language sentence A3.

Specifically, the second target clause “对对对” is deleted from written language sentence A2, giving “我使用计算机,速度很快,速度很快,而且是非常便捷的”. The second target clause “对对对” is then inserted at a random position in that sentence, giving the converted written language sentence A3: “我使用计算机,速度很快,速度很快,而且是非常便捷的,对对对” ("I use a computer, it is fast, it is fast, and it is very convenient, yes, yes, yes").
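The delete-then-reinsert operation of steps 516-11 and 516-12 can be sketched as follows. This is an illustrative sketch only: splitting clauses on the full-width comma, the helper names `move_clause` and `shuffle_one_clause`, and excluding the clause's original slot from the random choice are assumptions, not details given in the text.

```python
import random

SEP = ","  # full-width comma, assumed to be the clause delimiter


def move_clause(sentence: str, src: int, dst: int) -> str:
    """Remove the clause at index src and reinsert it at index dst.

    dst is an index into the clause list after the removal.
    """
    clauses = sentence.split(SEP)
    clause = clauses.pop(src)
    clauses.insert(dst, clause)
    return SEP.join(clauses)


def shuffle_one_clause(sentence: str, rng: random.Random) -> str:
    """Randomly sample one clause and reinsert it at another slot."""
    count = len(sentence.split(SEP))
    if count < 2:
        return sentence  # nothing to reorder
    src = rng.randrange(count)
    # After removal there are `count` possible slots (0..count-1);
    # skip the original slot, which would restore the input.
    dst = rng.choice([i for i in range(count) if i != src])
    return move_clause(sentence, src, dst)


a2 = "对对对,我使用计算机,速度很快,速度很快,而且是非常便捷的"
# Moving the first clause to the end reproduces sentence A3 from the example.
print(move_clause(a2, 0, 4))
```

Because the example sentence contains the clause “速度很快” twice, a random move can still yield an unchanged sentence when one duplicate is moved next to the other; the sketch does not try to prevent that.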

Step 516-13: Determine the inverted clause conversion processing probability corresponding to the inverted clause conversion strategy among the clause conversion strategies of written language sentence A3.

The inverted clause conversion strategy is a processing strategy that inverts the word order of a clause in a written language sentence, for example from a subject-verb-object structure to an object-verb-subject structure. In practice, a clause with a different word order can still express the same meaning, so the word order of a clause in a spoken sentence may differ from that in the corresponding written sentence. To make the converted written language better match the characteristics of spoken language, the word order of some clauses in the written language sentence can be inverted. Correspondingly, the inverted clause conversion processing probability is the preset probability of executing the inverted clause conversion strategy. It can be set in advance according to practical experience or spoken-language habits, for example 3% or 5%, and is not limited here. A probability of 3% means that the inverted clause conversion strategy is executed on a written language sentence with a probability of 3%.

Based on this, the inverted clause conversion processing probability corresponding to the inverted clause conversion strategy of written language sentence A3 is determined to be 3%.

Step 516-14: Based on the inverted clause conversion processing probability, determine whether to execute the inverted clause conversion strategy for written language sentence A3.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the inverted clause conversion strategy is to be executed, step 516-15 below is performed. If it is determined that the strategy is not to be executed, written language sentence A3 is used directly as written language sentence A4, and step 518 below is performed.

Based on this, assume that, given the inverted clause conversion processing probability of 3%, it is determined that the inverted clause conversion strategy is executed for written language sentence A3.

Step 516-15: If it is determined that the inverted clause conversion strategy is to be executed, sample a clause from written language sentence A3 according to the third preset sampling rule to obtain the third target clause in written language sentence A3.

Specifically, the third preset sampling rule is a preset rule for sampling, from a written language sentence, the clause to be inverted. It may be random sampling; it may be position-based, for example sampling the clause at the beginning of the sentence; or it may be based on character count, for example sampling clauses with more than 5 characters. It is not limited here. In practice, the third preset sampling rule may be the same as or different from the first or second preset sampling rule described above. It may also be the same as the preset clause sampling rule in the method embodiments above, or be understood as one of those rules. Correspondingly, the third target clause is the clause sampled from the written language sentence according to the third preset sampling rule, and can also be understood as the target clause in the method embodiments above.

Based on this, when it is determined that the inverted clause conversion strategy is to be executed, a clause is sampled at random from written language sentence A3, and the third target clause obtained is “我使用计算机” ("I use a computer").

Step 516-17: Perform syntactic analysis on the third target clause to obtain its syntactic structure.

Specifically, syntactic analysis of the third target clause shows that its syntactic structure is subject-verb-object.

Step 516-18: Convert the third target clause according to the target syntactic structure that corresponds to its syntactic structure to obtain the converted written language sentence A4.

Specifically, when the target syntactic structure corresponding to the subject-verb-object structure is the object-verb-subject structure, the third target clause is converted according to the object-verb-subject structure, giving the converted written language sentence A4: “计算机被我使用,速度很快,速度很快,而且是非常便捷的,对对对” ("The computer is used by me, it is fast, it is fast, and it is very convenient, yes, yes, yes").
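The inversion in steps 516-17 and 516-18 can be illustrated with a toy sketch. It assumes the syntactic analysis has already produced a (subject, verb, object) triple, so the parser itself is not shown, and it assumes the inverted form is built with the passive marker “被”, which is inferred from the example output rather than stated as a rule in the text. The names `to_inverted` and `invert_clause` are hypothetical.

```python
from typing import Tuple

SEP = ","  # full-width comma, assumed to be the clause delimiter


def to_inverted(subject: str, verb: str, obj: str) -> str:
    """Rewrite a subject-verb-object clause as object + 被 + subject + verb."""
    return f"{obj}被{subject}{verb}"


def invert_clause(sentence: str, clause_index: int,
                  parse: Tuple[str, str, str]) -> str:
    """Replace the clause at clause_index with its inverted form.

    `parse` stands in for the output of a syntactic parser (hypothetical).
    """
    clauses = sentence.split(SEP)
    clauses[clause_index] = to_inverted(*parse)
    return SEP.join(clauses)


a3 = "我使用计算机,速度很快,速度很快,而且是非常便捷的,对对对"
a4 = invert_clause(a3, 0, ("我", "使用", "计算机"))
print(a4)
```

In a real pipeline the triple would come from a dependency or constituency parser, and clauses whose structure has no configured target structure would be left unchanged.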

Step 518: Perform word-unit conversion processing on written language sentence A4 to obtain written language sentence B3.

Specifically, after the clause-unit conversion processing above has produced written language sentence A4, word-unit conversion processing is performed on written language sentence A4 through steps 518-1 to 518-12 below:

Step 518-1: Determine the added word conversion processing probability corresponding to the added word conversion strategy among the word conversion strategies of written language sentence A4.

Specifically, the added word conversion strategy is a processing strategy that adds words to a written language sentence. Correspondingly, the added word conversion processing probability is the preset probability of executing the added word conversion strategy. It can be set in advance according to practical experience or spoken-language habits, for example 10% or 13%, and is not limited here. A probability of 10% means that the added word conversion strategy is executed on a written language sentence with a probability of 10%.

Based on this, for written language sentence A4, “计算机被我使用,速度很快,速度很快,而且是非常便捷的,对对对”, the added word conversion processing probability corresponding to the added word conversion strategy is determined to be 10%; that is, the added word conversion strategy is executed for written language sentence A4 with a probability of 10%.

Step 518-2: Based on the added word conversion processing probability, determine whether to execute the added word conversion strategy for written language sentence A4.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the added word conversion strategy is to be executed, step 518-3 below is performed. If it is determined that the strategy is not to be executed, written language sentence A4 is used directly as written language sentence B1, and step 518-5 below is performed.

Based on this, assume that, given the added word conversion processing probability of 10%, it is determined that the added word conversion strategy is executed for written language sentence A4.
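Each strategy in this embodiment follows the same pattern: either the transformation is applied, or the sentence passes through unchanged and becomes the input of the next strategy. A minimal sketch of that chaining is shown below. The function `run_stages`, the `Stage` type, and the two placeholder transformations are hypothetical, and the probabilities are simply the example values used in the text.

```python
import random
from typing import Callable, List, Tuple

# A stage is a (probability, transformation) pair.
Stage = Tuple[float, Callable[[str], str]]


def run_stages(sentence: str, stages: List[Stage],
               rng: random.Random) -> str:
    """Apply each stage with its own probability, in order.

    A skipped stage passes the sentence through unchanged, mirroring
    "use sentence A4 directly as sentence B1" in the text.
    """
    for probability, transform in stages:
        if rng.random() < probability:
            sentence = transform(sentence)
    return sentence


# Placeholder transformations, for illustration only.
stages: List[Stage] = [
    (0.10, lambda s: "哇塞," + s),   # stands in for the added word strategy
    (0.06, lambda s: s[::-1]),       # stands in for a reordering strategy
]
print(run_stages("速度很快", stages, random.Random(0)))
```

Keeping the probabilities next to the transformations in one list makes it straightforward to tune them per domain, which matches the statement that they are preset from experience or spoken-language habits.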

Step 518-3: If it is determined that the added word conversion strategy is to be executed, determine the word position probability distribution corresponding to the preset words contained in the preset word set.

Specifically, the preset word set contains two preset words, preset word 1 and preset word 2. Based on statistics over a spoken-language corpus from the sales domain, the word position probability distribution for the two preset words is: preset word 1 is added at the beginning of the sentence with a probability of 80%, and preset word 2 is added at the end of the sentence with a probability of 20%.

Step 518-4: According to the word position probability distribution, determine the target preset word among the preset words and the word addition position corresponding to the target preset word, and add the target preset word to written language sentence A4 at that position to obtain the converted written language sentence B1.

Specifically, based on the word position probability distribution, preset word 1 is determined to be the target preset word, and its preset addition position is the beginning of the sentence, so preset word 1 is added at the sentence-initial position of written language sentence A4. When preset word 1 is “哇塞” ("wow"), the converted written language sentence B1 is “计算机被我使用,哇塞,速度很快,速度很快,而且是非常便捷的,对对对”.
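Selecting a preset word together with its position from the word position probability distribution amounts to weighted sampling. The sketch below uses the 80% / 20% distribution from step 518-3. The text does not name preset word 2, so “嗯” is a hypothetical stand-in, and the sketch places the word literally at the start or end of the whole sentence, joined by a full-width comma; both choices are assumptions.

```python
import random
from typing import List, Tuple

# (word, position, probability); "嗯" is a hypothetical preset word 2.
PRESET_WORDS: List[Tuple[str, str, float]] = [
    ("哇塞", "start", 0.8),
    ("嗯", "end", 0.2),
]


def add_preset_word(sentence: str, rng: random.Random,
                    presets: List[Tuple[str, str, float]] = PRESET_WORDS) -> str:
    """Sample one (word, position) pair by weight and add it to the sentence."""
    weights = [p for _, _, p in presets]
    word, position, _ = rng.choices(presets, weights=weights, k=1)[0]
    if position == "start":
        return f"{word},{sentence}"
    return f"{sentence},{word}"


a4 = "计算机被我使用,速度很快,速度很快,而且是非常便捷的,对对对"
print(add_preset_word(a4, random.Random(0)))
```

The same pattern applies to the clause position probability distribution used when adding a preset clause, with clauses in place of words.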

Step 518-5: Determine the copied word conversion processing probability corresponding to the copied word conversion strategy among the word conversion strategies of written language sentence B1.

The copied word conversion strategy is a processing strategy that copies the target preset word added to the written language sentence in step 518-4 above. Correspondingly, the copied word conversion processing probability is the preset probability of executing the copied word conversion strategy. It can be set in advance according to practical experience or spoken-language habits, for example 8% or 12%, and is not limited here. A probability of 8% means that the copied word conversion strategy is executed on a written language sentence with a probability of 8%.

Based on this, the copied word conversion processing probability corresponding to the copied word conversion strategy of written language sentence B1 is determined to be 8%.

Step 518-6: Based on the copied word conversion processing probability, determine whether to execute the copied word conversion strategy for written language sentence B1.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the copied word conversion strategy is to be executed, step 518-7 below is performed. If it is determined that the strategy is not to be executed, written language sentence B1 is used directly as written language sentence B2, and step 518-8 below is performed.

Based on this, assume that, given the copied word conversion processing probability of 8%, it is determined that the copied word conversion strategy is executed for written language sentence B1.

Step 518-7: If it is determined that the copied word conversion strategy is to be executed, copy the target preset word added to written language sentence B1 to obtain a copied word, and insert the copied word into written language sentence B1 according to the preset word insertion rule to obtain written language sentence B2.

Based on this, when it is determined that the copied word conversion strategy is to be executed, the target preset word “哇塞” in written language sentence B1 is copied, so the copied word is also “哇塞”. When the preset word insertion rule is to insert the copy immediately before the position of the target preset word, the copied word is inserted before the target preset word in written language sentence B1, giving written language sentence B2: “计算机被我使用,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对”.
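The copy-and-insert operation of step 518-7 can be sketched as a simple string operation. The helper name `duplicate_before` is hypothetical, and joining the copy to the original with a full-width comma is an assumption taken from the form of the example output.

```python
SEP = ","  # full-width comma, assumed to separate the copy from the original


def duplicate_before(sentence: str, word: str) -> str:
    """Insert a copy of `word` immediately before its first occurrence."""
    idx = sentence.find(word)
    if idx == -1:
        return sentence  # the word was never added; nothing to copy
    return sentence[:idx] + word + SEP + sentence[idx:]


b1 = "计算机被我使用,哇塞,速度很快,速度很快,而且是非常便捷的,对对对"
print(duplicate_before(b1, "哇塞"))
```

Other insertion rules, such as inserting the copy after the target preset word, would only change the slice positions.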

Step 518-8: Determine the out-of-order word conversion processing probability corresponding to the out-of-order word conversion strategy among the word conversion strategies of written language sentence B2.

The out-of-order word conversion strategy is a processing strategy that shuffles the order of words in a written language sentence. Correspondingly, the out-of-order word conversion processing probability is the preset probability of executing the out-of-order word conversion strategy. It can be set in advance according to practical experience or spoken-language habits, for example 6% or 9%, and is not limited here. A probability of 6% means that the out-of-order word conversion strategy is executed on a written language sentence with a probability of 6%.

Based on this, the out-of-order word conversion processing probability corresponding to the out-of-order word conversion strategy of written language sentence B2 is determined to be 6%.

Step 518-9: Based on the out-of-order word conversion processing probability, determine whether to execute the out-of-order word conversion strategy for written language sentence B2.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the out-of-order word conversion strategy is to be executed, step 518-10 below is performed. If it is determined that the strategy is not to be executed, written language sentence B2 is used directly as written language sentence B3, and step 520 below is performed.

Based on this, assume that, given the out-of-order word conversion processing probability of 6%, it is determined that the out-of-order word conversion strategy is executed for written language sentence B2.

Step 518-10: If it is determined that the out-of-order word conversion strategy is to be executed, sample a word from written language sentence B2 according to the preset word sampling rule to obtain the target word in written language sentence B2.

Based on this, when it is determined that the out-of-order word conversion strategy is to be executed, a two-character word is sampled at random from written language sentence B2, and the target word obtained is “使用” ("use").

Step 518-11: Delete the target word from written language sentence B2, and insert the target word within the preset insertion range corresponding to the target word in the resulting sentence to obtain the converted written language sentence B3.

Specifically, the target word “使用” is deleted from written language sentence B2, giving “计算机被我,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对”. The target word “使用” is then inserted at a random position within the character interval [-3, 3] of that sentence, giving the converted written language sentence B3: “计算机使用被我,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对”.
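Steps 518-10 and 518-11 can be sketched as removing the word and reinserting it a few characters away. The sketch interprets the interval [-3, 3] as a character offset relative to the word's original position, which is an assumption; the text does not define the reference point. The names `move_word` and `shuffle_word` are hypothetical.

```python
import random


def move_word(sentence: str, word: str, offset: int) -> str:
    """Remove the first occurrence of `word` and reinsert it `offset`
    characters away from its original position (clamped to the sentence)."""
    idx = sentence.index(word)  # raises ValueError if the word is absent
    remainder = sentence[:idx] + sentence[idx + len(word):]
    new_idx = max(0, min(idx + offset, len(remainder)))
    return remainder[:new_idx] + word + remainder[new_idx:]


def shuffle_word(sentence: str, word: str, rng: random.Random,
                 max_shift: int = 3) -> str:
    """Move `word` by a random non-zero offset in [-max_shift, max_shift]."""
    offsets = [d for d in range(-max_shift, max_shift + 1) if d != 0]
    return move_word(sentence, word, rng.choice(offsets))


b2 = "计算机被我使用,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对"
# An offset of -2 reproduces sentence B3 from the example.
print(move_word(b2, "使用", -2))
```

Restricting the shift to a small window keeps the perturbed sentence close to the original, so the word stays near its clause instead of landing anywhere in the sentence.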

Step 520: Perform character-unit conversion processing on written language sentence B3 to obtain written language sentence C.

Specifically, after the word-unit conversion processing above has produced written language sentence B3, character-unit conversion processing is performed on written language sentence B3 through steps 520-1 to 520-4 below:

Step 520-1: Determine the out-of-order character conversion processing probability corresponding to the out-of-order character conversion strategy of written language sentence B3.

Specifically, the out-of-order character conversion strategy is a processing strategy that shuffles the order of characters in a written language sentence. Correspondingly, the out-of-order character conversion processing probability is the preset probability of executing the out-of-order character conversion strategy. It can be set in advance according to practical experience or spoken-language habits, for example 5% or 9%, and is not limited here. A probability of 5% means that the out-of-order character conversion strategy is executed on a written language sentence with a probability of 5%.

Based on this, the out-of-order character conversion processing probability corresponding to the out-of-order character conversion strategy of written language sentence B3 is determined to be 5%.

Step 520-2: Based on the out-of-order character conversion processing probability, determine whether to execute the out-of-order character conversion strategy for written language sentence B3.

Specifically, the way of making this determination is similar to the way, described above, of determining whether to execute the copied clause conversion strategy for a written language sentence based on the copied clause conversion processing probability. Refer to that description; the details are not repeated here.

In a specific implementation, if it is determined that the out-of-order character conversion strategy is to be executed, step 520-3 below is performed. If it is determined that the strategy is not to be executed, written language sentence B3 is used directly as written language sentence C, and step 522 below is performed.

Based on this, assume that, given the out-of-order character conversion processing probability of 5%, it is determined that the out-of-order character conversion strategy is executed for written language sentence B3.

Step 520-3: When it is determined that the out-of-order character conversion processing strategy is to be executed, sample the characters of written-language sentence B3 according to a preset character sampling rule to obtain a target character in written-language sentence B3. Based on this, the characters of written-language sentence B3 are sampled at random, and the target character obtained is "使".

Step 520-4: Delete the target character from written-language sentence B3, and insert it within the preset character insertion range corresponding to the target character in the resulting sentence, to obtain the converted written-language sentence C. Specifically, deleting the target character "使" from written-language sentence B3 gives: "计算机用被我,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对". The target character "使" is then inserted at a random position within the clause to which it belongs, giving the converted written-language sentence C: "计算机用被我使,哇塞,哇塞,速度很快,速度很快,而且是非常便捷的,对对对".
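Steps 520-1 to 520-4 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `shuffle_one_char`, the uniform sampling of the target character, and the use of the full-width comma as the only clause separator are all assumptions.

```python
import random


def shuffle_one_char(sentence, prob=0.05, rng=None, sep=","):
    """Move one randomly sampled character to a random slot in its own clause."""
    rng = rng or random.Random()
    # Probability gate: the strategy is skipped with probability 1 - prob.
    if rng.random() >= prob:
        return sentence
    # Sample the target character uniformly among non-separator characters
    # (assumed sampling rule; the source only says "preset").
    positions = [i for i, ch in enumerate(sentence) if ch != sep]
    if not positions:
        return sentence
    src = rng.choice(positions)
    # Clause bounds [start, end) around the target character.
    start = sentence.rfind(sep, 0, src) + 1
    end = sentence.find(sep, src)
    if end == -1:
        end = len(sentence)
    clause = list(sentence[start:end])
    target = clause.pop(src - start)
    # Reinsert anywhere in the clause; the original slot is allowed,
    # so the sentence can occasionally come back unchanged.
    clause.insert(rng.randrange(len(clause) + 1), target)
    return sentence[:start] + "".join(clause) + sentence[end:]
```

Because the character never leaves its clause, every clause keeps the same set of characters and only their order can change.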

Step 522: Perform symbol-unit conversion processing on written-language sentence C to obtain written-language sentence D2.

Specifically, after the character-unit conversion processing above has produced written-language sentence C, symbol-unit conversion processing is performed on written-language sentence C through the following steps 522-1 to 522-8:

Step 522-1: Determine the symbol-deletion conversion processing probability corresponding to the symbol-deletion conversion processing strategy among the symbol conversion processing strategies for written-language sentence C.

Specifically, the symbol-deletion conversion processing strategy is a strategy that deletes a symbol from a written-language sentence. Correspondingly, the symbol-deletion conversion processing probability is the preset probability with which that strategy is executed. It can be set in advance according to practical experience or spoken-expression habits, for example 8% or 12%, and is not limited here. A symbol-deletion conversion processing probability of 8% means that the strategy is executed on a written-language sentence with a probability of 8%.

Based on this, the symbol-deletion conversion processing probability for written-language sentence C is determined to be 8%, so the symbol-deletion conversion processing strategy is executed on written-language sentence C with a probability of 8%.

Step 522-2: Based on the symbol-deletion conversion processing probability, determine whether to execute the symbol-deletion conversion processing strategy on written-language sentence C.

Specifically, this determination is implemented in a manner similar to the determination, described above, of whether to execute the copy-clause conversion processing strategy on a written-language sentence based on the copy-clause conversion processing probability. Refer to that description; the details are not repeated here.

In specific implementation, if it is determined that the symbol-deletion conversion processing strategy is to be executed, the following step 522-3 is performed; if it is determined that the strategy is not to be executed, written-language sentence C is taken directly as written-language sentence D1, and the following step 522-5 is performed.

Based on this, assume that, with the symbol-deletion conversion processing probability of 8%, it is determined that the strategy is executed on written-language sentence C.

Step 522-3: When it is determined that the symbol-deletion conversion processing strategy is to be executed, sample the symbols of written-language sentence C according to a preset symbol sampling rule to obtain a target punctuation mark in written-language sentence C. Based on this, the symbols of written-language sentence C are sampled at random, and the target punctuation mark obtained is the comma after the first "速度很快" clause.

Step 522-4: Delete the target punctuation mark from written-language sentence C to obtain the converted written-language sentence D1.

Specifically, deleting the comma after the first "速度很快" clause from written-language sentence C gives written-language sentence D1: "计算机用被我使,哇塞,哇塞,速度很快速度很快,而且是非常便捷的,对对对".
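Steps 522-1 to 522-4 can be sketched as follows. This is only an illustration under stated assumptions: the names `delete_one_symbol`, `delete_at` and `PUNCT`, the punctuation set, and uniform sampling are hypothetical choices, since the source leaves the sampling rule as "preset".

```python
import random

# Assumed set of punctuation marks eligible for deletion.
PUNCT = ",。!?;、"


def delete_at(sentence, index):
    """Remove the single character at the given index."""
    return sentence[:index] + sentence[index + 1:]


def delete_one_symbol(sentence, prob=0.08, rng=None, punct=PUNCT):
    """With probability `prob`, delete one randomly sampled punctuation mark."""
    rng = rng or random.Random()
    if rng.random() >= prob:
        return sentence  # strategy not executed
    positions = [i for i, ch in enumerate(sentence) if ch in punct]
    if not positions:
        return sentence  # nothing to delete
    return delete_at(sentence, rng.choice(positions))
```

Calling `delete_at` on the index of the fourth comma of written-language sentence C reproduces written-language sentence D1 from the example above.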

Step 522-5: Determine the symbol-addition conversion processing probability corresponding to the symbol-addition conversion processing strategy among the symbol conversion processing strategies for written-language sentence D1.

Here, the symbol-addition conversion processing strategy is a strategy that adds a symbol to a written-language sentence. Correspondingly, the symbol-addition conversion processing probability is the preset probability with which that strategy is executed. It can be set in advance according to practical experience or spoken-expression habits, for example 2% or 5%, and is not limited here. A symbol-addition conversion processing probability of 2% means that the strategy is executed on a written-language sentence with a probability of 2%.

Based on this, the symbol-addition conversion processing probability corresponding to the symbol-addition conversion processing strategy for written-language sentence D1 is determined to be 2%.

Step 522-6: Based on the symbol-addition conversion processing probability, determine whether to execute the symbol-addition conversion processing strategy on written-language sentence D1.

Specifically, this determination is implemented in a manner similar to the determination, described above, of whether to execute the copy-clause conversion processing strategy on a written-language sentence based on the copy-clause conversion processing probability. Refer to that description; the details are not repeated here.

In specific implementation, if it is determined that the symbol-addition conversion processing strategy is to be executed, the following step 522-7 is performed; if it is determined that the strategy is not to be executed, written-language sentence D1 is taken directly as written-language sentence D2, and the following step 524 is performed.

Based on this, assume that, with the symbol-addition conversion processing probability of 2%, it is determined that the strategy is executed on written-language sentence D1.

Step 522-7: When it is determined that the symbol-addition conversion processing strategy is to be executed, sample the clauses of written-language sentence D1 according to a preset symbol clause sampling rule to obtain a target symbol clause in written-language sentence D1.

Based on this, the clauses with the largest number of characters in written-language sentence D1 are "速度很快速度很快" and "而且是非常便捷的". One of these two clauses is then selected at random as the target symbol clause; here the target symbol clause is "速度很快速度很快".

Step 522-8: Randomly insert a preset punctuation mark into the target symbol clause to obtain the converted written-language sentence D2, and take written-language sentence D2 as the target spoken sentence.

Specifically, when the preset punctuation mark is "!", it is randomly inserted into the target symbol clause "速度很快速度很快", giving the converted written-language sentence D2: "计算机用被我使,哇塞,哇塞,速度很快速度很快,!而且是非常便捷的,对对对". Written-language sentence D2 is taken as the target spoken sentence.
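Steps 522-5 to 522-8 can be sketched as follows. The function name `add_one_symbol` and the full-width comma as clause separator are assumptions. The sketch inserts the mark at a random offset inside the chosen clause, including its two ends, so it will not necessarily reproduce the exact placement shown in the example above.

```python
import random


def add_one_symbol(sentence, prob=0.02, mark="!", rng=None, sep=","):
    """With probability `prob`, insert `mark` into one of the longest clauses."""
    rng = rng or random.Random()
    if rng.random() >= prob:
        return sentence  # strategy not executed
    clauses = sentence.split(sep)
    # Candidate clauses are those with the largest character count;
    # ties are broken at random, as in the example above.
    longest = max(len(c) for c in clauses)
    ci = rng.choice([i for i, c in enumerate(clauses) if len(c) == longest])
    # Random insertion offset within the chosen clause.
    pos = rng.randrange(len(clauses[ci]) + 1)
    clauses[ci] = clauses[ci][:pos] + mark + clauses[ci][pos:]
    return sep.join(clauses)
```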

Step 524: Combine the written-language sentences D2 corresponding to the written-language sentences in each text to be processed, to obtain the spoken text corresponding to each text to be processed.

Specifically, after steps 516 to 522 above have been performed on each of the n written-language sentences contained in the text to be processed T, the target spoken sentence corresponding to each written-language sentence in T is obtained. These n target spoken sentences are then combined in the order in which the n written-language sentences appear in T, to obtain the spoken text corresponding to T. The spoken text includes the target spoken sentence corresponding to written-language sentence 1 above: "计算机用被我使,哇塞,哇塞,速度很快速度很快,!而且是非常便捷的,对对对".

Step 526: Identify abnormal information in the spoken text.

Specifically, using a preset abnormality identification rule, the abnormal information in the target spoken sentence corresponding to written-language sentence 1 in the spoken text is identified as ",!".

Step 528: Perform data cleaning on the spoken text according to the abnormal information to obtain the cleaned spoken text.

In practical applications, performing data cleaning on the spoken text according to the abnormal information may consist of filtering out or adjusting the abnormal information identified in the spoken text.

Specifically, data cleaning is performed on the target spoken sentence corresponding to written-language sentence 1 in the spoken text according to the abnormal information ",!". In the cleaned spoken text, that target spoken sentence becomes: "计算机用被我使,哇塞,哇塞,速度很快速度很快!而且是非常便捷的,对对对".
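Steps 526 and 528 can be sketched with a regular expression. The rule used here, treating any run of two or more full-width punctuation marks as abnormal and keeping only the last mark of the run, is an assumption chosen because it reproduces the ",!" example; the source does not specify the preset rule. The names `PUNCT_RUN`, `find_abnormal` and `clean` are hypothetical.

```python
import re

# Assumed abnormality rule: two or more consecutive full-width punctuation marks.
PUNCT_RUN = re.compile(r"[,。!?;、]{2,}")


def find_abnormal(text):
    """Return every run of consecutive punctuation marks found in the text."""
    return PUNCT_RUN.findall(text)


def clean(text):
    """Collapse each abnormal run to its last punctuation mark."""
    return PUNCT_RUN.sub(lambda m: m.group(0)[-1], text)
```

Other cleaning policies, such as dropping the whole run or keeping the first mark, would fit the same structure by changing the replacement function.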

Step 530: Construct the sample corpus based on the correspondence between the written-language text and the back-translated written-language text on the one hand and the cleaned spoken text on the other.

In practical applications, a corresponding cleaned spoken text can be obtained for every written-language text and every back-translated written-language text. A large number of written-language texts can therefore be acquired, and steps 502 to 528 above performed on each of them, to obtain the back-translated written-language text corresponding to each written-language text together with the cleaned spoken texts corresponding to these texts. Each written-language text and its corresponding cleaned spoken text form a sample corpus pair, and these pairs together form the sample corpus.

Further, because each conversion processing strategy has its own conversion processing probability, whether a given strategy is executed in any one pass through steps 516 to 522 is not fixed. The spoken texts generated by different passes through steps 516 to 522 are therefore very likely to differ. Based on this, steps 516 to 522 can be performed several times on each text to be processed, generating multiple spoken texts for each text to be processed and further expanding the sample corpus.

Specifically, m written-language texts in the sales domain are acquired (these include the written-language text T above), and the back-translation processing of steps 504 to 512 is performed on each of them to obtain m back-translated written-language texts. The m written-language texts and the m back-translated written-language texts are then taken as texts to be processed, and steps 514 to 528 are performed to obtain the m cleaned spoken texts corresponding to the m written-language texts and the m cleaned spoken texts corresponding to the m back-translated written-language texts. The sample corpus is constructed by taking the m written-language texts as training samples with their m cleaned spoken texts as sample labels, and the m back-translated written-language texts as training samples with their m cleaned spoken texts as sample labels.
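The corpus construction described above can be sketched as follows. The function `build_corpus` and its callable parameters `back_translate`, `to_spoken` and `clean` are hypothetical placeholders for the back-translation, conversion and cleaning stages; the `rounds` parameter models running the probabilistic conversion several times per text.

```python
from typing import Callable, Iterable, List, Tuple


def build_corpus(
    written_texts: Iterable[str],
    back_translate: Callable[[str], str],
    to_spoken: Callable[[str], str],
    clean: Callable[[str], str],
    rounds: int = 1,
) -> List[Tuple[str, str]]:
    """Return (written text, cleaned spoken text) pairs."""
    pairs = []
    for text in written_texts:
        # Both the original and its back-translation are texts to be processed.
        for source in (text, back_translate(text)):
            # Each round may yield a different spoken text because every
            # conversion strategy fires only with its own probability.
            for _ in range(rounds):
                pairs.append((source, clean(to_spoken(source))))
    return pairs
```

With m written-language texts and one round, the result contains 2m pairs, matching the m plus m pairs described above.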

Step 532: Train the initial written-language rewriting model on the sample corpus until a written-language rewriting model satisfying the second training stop condition is obtained.

Specifically, the initial written-language rewriting model is trained on the sample corpus, and training stops once the preset number of iterations i has been reached, yielding the written-language rewriting model M1.

In addition, lower probability values can be set for the various conversion processing probabilities in steps 516 to 522 above, and steps 516 to 528 then performed again on the written-language sentences contained in the m written-language texts, to obtain m cleaned spoken texts SST corresponding to these m written-language texts. A basic sample corpus is constructed by taking the m written-language texts as training samples and their m cleaned spoken texts SST as sample labels. The initial written-language rewriting model is then trained on the basic sample corpus until a written-language conversion model M2 satisfying the first training stop condition is obtained.

In specific implementation, the first training stop condition is the stop condition for training the initial written-language rewriting model on the basic sample corpus. It may be the same as or different from the second training stop condition above, and is not limited here. The written-language conversion model can be understood as a trained model that performs light written-language rewriting of spoken text. Because the written-language conversion model is trained on a simpler sample corpus than the written-language rewriting model, it can be used to handle lighter sentence rewriting.

Step 534: Obtain the target spoken text.

Specifically, the target spoken text T4 is obtained. The target spoken text T4 can be any spoken text in the sales domain.

Step 536: Classify the target spoken text with the text classification model to obtain its text type.

Here, the text classification model is a pre-trained model that classifies spoken text. In practical applications it can be a CNN (convolutional neural network), an RNN (recurrent neural network), an LSTM (long short-term memory network), FastText, TextCNN, an HAN model, or the like, and is not limited here.

In practical applications, a large number of spoken texts can be acquired and annotated to obtain a text label for each spoken text. The text labels include the invalid text type, the fuzzy text type, the standard text type, and so on, and are not limited here. Training samples are then constructed from the spoken texts and their text labels, and the initial text classification model is trained on these samples to obtain the trained text classification model described above.

Specifically, the target spoken text T4 is classified by the pre-trained text classification model, and the text type that the model outputs for T4 is obtained.

Step 538: When the text type is the invalid text type, delete the target spoken text.

Specifically, if the text type is the invalid text type, the target spoken text T4 is simply deleted.

Step 540: When the text type is the standard text type, input the target spoken text into the written-language rewriting model for written-language rewriting, and obtain the target written-language text output by the written-language rewriting model.

Specifically, if the text type is the standard text type, the target spoken text T4 is input into the written-language rewriting model M1 for written-language rewriting, and the target written-language text T5 output by M1 is obtained.

Step 542: When the text type is the fuzzy text type, input the target spoken text into the written-language conversion model for written-language rewriting, and obtain the converted written-language text output by the written-language conversion model.

Specifically, if the text type is the fuzzy text type, the target spoken text T4 is input into the written-language conversion model M2 for written-language rewriting, and the converted written-language text T6 output by M2 is obtained.
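The dispatch in steps 536 to 542 can be sketched as follows. The enum `TextType`, the function `process`, and the three callables are hypothetical stand-ins for the trained text classification model, the written-language rewriting model M1 and the written-language conversion model M2.

```python
from enum import Enum
from typing import Callable, Optional


class TextType(Enum):
    INVALID = "invalid"
    FUZZY = "fuzzy"
    STANDARD = "standard"


def process(
    spoken_text: str,
    classify: Callable[[str], TextType],
    rewrite_model: Callable[[str], str],
    conversion_model: Callable[[str], str],
) -> Optional[str]:
    """Route a spoken text to the model that matches its text type."""
    text_type = classify(spoken_text)
    if text_type is TextType.INVALID:
        return None  # step 538: the text is discarded
    if text_type is TextType.STANDARD:
        return rewrite_model(spoken_text)  # step 540: model M1
    return conversion_model(spoken_text)  # step 542: model M2
```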

In summary, the text processing method provided by the embodiments of the present application performs back-translation processing on the written-language text to obtain back-translated written-language text, so that the original written-language text is expanded with the back-translated written-language text. On this basis, clause-level, word-level, character-level and symbol-level conversion processing is applied to the written-language text with preset conversion processing probabilities, further expanding the spoken text corresponding to the written-language text. The expanded written-language text and spoken text are combined to generate the sample corpus. This realizes automatic generation of the sample corpus for the written-language rewriting model and enriches that corpus, which improves the efficiency of generating the sample corpus and, through the richer corpus, indirectly improves the rewriting accuracy of the written-language rewriting model.

Corresponding to the method embodiments above, the present application further provides an embodiment of a text processing apparatus. FIG. 6 shows a schematic structural diagram of a text processing apparatus provided by an embodiment of the present application. As shown in FIG. 6, the apparatus includes:

an obtaining module 602, configured to obtain target spoken text;

a classification module 604, configured to classify the target spoken text to obtain the text type corresponding to the target spoken text;

a selection module 606, configured to, when the text type is the standard text type, select the corresponding written-language rewriting model according to the standard text type;

a processing module 608, configured to input the target spoken text into the written-language rewriting model for processing, to obtain the target written-language text corresponding to the target spoken text; wherein the written-language rewriting model is trained on written-language text and on spoken text obtained by performing back-translation and conversion processing on the written-language text.

Optionally, the text processing apparatus further includes:

a model selection module, configured to, when the text type is the fuzzy text type, select the corresponding written-language conversion model according to the fuzzy text type;

and to input the target spoken text into the written-language conversion model for processing, to obtain the converted written-language text corresponding to the target spoken text; wherein the written-language conversion model is trained on the written-language text and on basic spoken text obtained by performing conversion processing on the written-language text.

Optionally, the training of the written-language conversion model is implemented by running the following modules:

a first obtaining module, configured to obtain written-language text;

a first conversion module, configured to perform conversion processing of sentence composition units on the written-language text to obtain basic spoken text;

a first construction module, configured to construct a basic sample corpus based on the correspondence between the written-language text and the basic spoken text;

a first training module, configured to train an initial written-language conversion model on the basic sample corpus until the written-language conversion model satisfying the first training stop condition is obtained.

Optionally, the classification module 604 is further configured to:

input the target spoken text into a text classification model for classification, to obtain the text type corresponding to the target spoken text; wherein the training of the text classification model is implemented by running the following modules:

a sample obtaining module, configured to obtain sample spoken text and the semantic clarity label corresponding to the sample spoken text;

a sample pair construction module, configured to construct training sample pairs based on the sample spoken text and the semantic clarity label;

a model training module, configured to train an initial text classification model on the training sample pairs until the text classification model satisfying the classification training stop condition is obtained.

Optionally, the processing module 608 is further configured to:

perform sentence segmentation on the target spoken text to obtain the sentence sequence contained in the target spoken text;

input the spoken sentence units in the sentence sequence, in order, into the encoding layer of the written-language rewriting model for encoding, to obtain the sentence feature vector and the vocabulary vector corresponding to each spoken sentence unit, wherein the vocabulary vector is obtained by mapping the spoken sentence unit onto a vocabulary;

calculate the vector product of the sentence feature vector and the vocabulary vector, and input the vector product into the decoding layer of the written-language rewriting model for decoding, to obtain the target written-language text corresponding to the target spoken text.

Optionally, the training of the written-language rewriting model is implemented by running the following modules:

a second obtaining module, configured to obtain written-language text;

a back-translation module, configured to perform back-translation processing on the written-language text to obtain the back-translated written-language text corresponding to the written-language text;

a second conversion module, configured to perform conversion processing of sentence composition units on the written-language text and on the back-translated written-language text respectively, to obtain spoken text;

a second construction module, configured to construct a sample corpus based on the correspondence between the written-language text and the back-translated written-language text on the one hand and the spoken text on the other;

a second training module, configured to train an initial written-language rewriting model on the sample corpus until a written-language rewriting model satisfying the second training stop condition is obtained.

Optionally, the back-translation module includes:

a translation submodule, configured to translate the written-language text into translated written-language text in a preset language;

a back-translation submodule, configured to translate the translated written-language text back into the target language to which the written-language text belongs, to obtain initial back-translated written-language text;

a replacement submodule, configured to replace, with the key words in the written-language text, the target key words corresponding to those key words in the initial back-translated written-language text, to obtain the back-translated written-language text.

Optionally, the back-translation module further comprises:

a part-of-speech analysis sub-module, configured to perform part-of-speech analysis on the written text and identify the keywords in the written text whose part of speech is a preset part of speech;

a marking sub-module, configured to mark, in the written text, the positions at which the keywords occur;

Correspondingly, the replacement sub-module is further configured to:

replace, with the keywords, the target keywords at the positions indicated by the position marks in the initial back-translated written text, to obtain the back-translated written text.
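The keyword-preserving back-translation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the position marks are realized as placeholder tokens such as `<K0>` that the translation systems pass through unchanged, and the `translate` and `translate_back` callables are hypothetical stand-ins for whatever machine translation service is used.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Tokens = List[str]


def mark_keywords(tokens: Sequence[str], keywords: Set[str]) -> Tuple[Tokens, Dict[str, str]]:
    """Replace each keyword with a position marker and remember the mapping."""
    marked: Tokens = []
    mapping: Dict[str, str] = {}
    for tok in tokens:
        if tok in keywords:
            marker = f"<K{len(mapping)}>"  # hypothetical marker format
            mapping[marker] = tok
            marked.append(marker)
        else:
            marked.append(tok)
    return marked, mapping


def restore_keywords(tokens: Sequence[str], mapping: Dict[str, str]) -> Tokens:
    """Put the original keywords back at the marked positions."""
    return [mapping.get(tok, tok) for tok in tokens]


def back_translate(
    tokens: Sequence[str],
    keywords: Set[str],
    translate: Callable[[Tokens], Tokens],       # assumed: target language -> preset language
    translate_back: Callable[[Tokens], Tokens],  # assumed: preset language -> target language
) -> Tokens:
    marked, mapping = mark_keywords(tokens, keywords)
    initial = translate_back(translate(marked))  # initial back-translated text
    return restore_keywords(initial, mapping)
```

In practice the keyword set would come from a part-of-speech tagger (for example, keeping proper nouns), and a real translation system may alter or drop the markers, so a production pipeline would need to verify that every marker survives the round trip.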

Optionally, the sentence composition unit includes at least one of the following: a clause unit, a word unit, a character unit, and a symbol unit.

Optionally, where the sentence composition unit is a clause unit, the second conversion module comprises:

a first identification sub-module, configured to perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed;

a clause conversion sub-module, configured to perform clause unit conversion on the written sentences, to obtain converted written sentences;

a first determination module, configured to determine the spoken text based on the converted written sentences.

Optionally, the clause conversion sub-module is further configured to:

sample clauses from the written sentence according to a preset clause sampling rule, to obtain a target clause in the written sentence, and convert the target clause within the written sentence, to obtain a converted written sentence; and/or

determine the clause position probability distribution of the preset clauses contained in a preset clause set; determine, based on the clause position probability distribution, a target preset clause among the preset clauses and the clause addition position corresponding to the target preset clause; and add the target preset clause to the written sentence at the clause addition position, to obtain a converted written sentence.
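The second alternative, adding a preset clause at a position drawn from its position probability distribution, might look like the sketch below. The preset clauses, the three position buckets (`start`, `middle`, `end`), and the probabilities are all illustrative assumptions; the application does not specify them.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical preset clause set: clause -> probability of each position bucket.
PRESET_CLAUSES: Dict[str, Dict[str, float]] = {
    "you know": {"start": 0.2, "middle": 0.5, "end": 0.3},
    "how should I put it": {"start": 0.6, "middle": 0.4, "end": 0.0},
}


def add_preset_clause(clauses: Sequence[str], rng: random.Random) -> List[str]:
    """Pick a preset clause, sample its position bucket, and insert it."""
    target = rng.choice(sorted(PRESET_CLAUSES))
    dist = PRESET_CLAUSES[target]
    bucket = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    if bucket == "start":
        idx = 0
    elif bucket == "end":
        idx = len(clauses)
    else:
        # Any interior boundary; falls back to the end for one-clause sentences.
        idx = rng.randint(1, max(1, len(clauses) - 1))
    return list(clauses[:idx]) + [target] + list(clauses[idx:])
```

Passing an explicit `random.Random` instance keeps corpus generation reproducible, which helps when the same written text must yield the same spoken variant across runs.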

Optionally, the clause conversion sub-module is further configured to:

duplicate the target clause to obtain a duplicated target clause, and insert the duplicated target clause into the written sentence at a preset clause insertion position, to obtain a converted written sentence; and/or

delete the target clause from the written sentence, and insert the target clause into the resulting written sentence according to a preset clause insertion rule, to obtain a converted written sentence; and/or

perform syntactic analysis on the target clause to obtain the syntactic structure of the target clause, and convert the target clause according to the target syntactic structure corresponding to that syntactic structure, to obtain a converted written sentence.
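The first two alternatives (repeating a clause and moving a clause) reduce to simple list operations once a sentence is split into clauses. The sketch below assumes the target clause index and insertion position have already been chosen by the sampling and insertion rules, which the application leaves open; the syntactic conversion alternative requires a parser and is not shown.

```python
from typing import List, Sequence


def duplicate_clause(clauses: Sequence[str], target: int, insert_at: int) -> List[str]:
    """Copy the target clause and insert the copy at the given position."""
    out = list(clauses)
    out.insert(insert_at, clauses[target])
    return out


def move_clause(clauses: Sequence[str], target: int, insert_at: int) -> List[str]:
    """Delete the target clause, then re-insert it at a position in the shortened sentence."""
    out = list(clauses)
    clause = out.pop(target)
    out.insert(insert_at, clause)
    return out
```

Note that `insert_at` in `move_clause` is interpreted against the sentence after deletion, matching the wording "insert the target clause into the resulting written sentence".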

Optionally, where the sentence composition unit is a word unit, the second conversion module comprises:

a second identification sub-module, configured to perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed;

a distribution determination sub-module, configured to determine the word position probability distribution of the preset words contained in a preset word set;

a word addition sub-module, configured to determine, according to the word position probability distribution, a target preset word among the preset words and the word addition position corresponding to the target preset word, and to add the target preset word to the written sentence at the word addition position, to obtain a converted written sentence;

a second determination module, configured to determine the spoken text based on the converted written sentence.

Optionally, where the sentence composition unit is a word unit, the second conversion module is further configured to:

perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed;

sample words from the written sentence according to a preset word sampling rule, to obtain a target word in the written sentence;

delete the target word from the written sentence, and insert the target word into the resulting written sentence within the preset insertion range corresponding to the target word, to obtain a converted written sentence;

determine the spoken text based on the converted written sentence.
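The delete-and-reinsert step can be sketched as a bounded displacement of one word. Here the "preset insertion range" is assumed to be a window of `max_shift` positions on either side of the word's original position; the application does not define the range, so this parameter is illustrative.

```python
import random
from typing import List, Sequence


def displace_word(words: Sequence[str], index: int, max_shift: int,
                  rng: random.Random) -> List[str]:
    """Remove words[index] and re-insert it at most max_shift positions away."""
    out = list(words)
    word = out.pop(index)
    lo = max(0, index - max_shift)
    hi = min(len(out), index + max_shift)  # len(out) is a valid insert position
    out.insert(rng.randint(lo, hi), word)
    return out
```

Keeping the displacement small imitates the local word-order slips of speech without destroying the meaning of the sentence. The same routine applies unchanged to the character unit case if the sentence is passed as a list of characters.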

Optionally, the second conversion module further comprises:

a word duplication sub-module, configured to duplicate the target preset word added to the converted written sentence, to obtain a duplicated word;

a word insertion sub-module, configured to insert the duplicated word into the converted written sentence according to a preset word insertion rule, to obtain an inserted written sentence;

a third determination module, configured to determine the spoken text based on the inserted written sentence.
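Adding a preset word and then repeating it produces the kind of stutter common in speech ("um, um, we start now"). A minimal sketch follows; it assumes the word addition position has already been drawn from the word position probability distribution, and it interprets the preset word insertion rule as "insert the copy `gap` tokens after the added word", which is an assumption made only for illustration.

```python
from typing import List, Sequence


def add_word(words: Sequence[str], word: str, position: int) -> List[str]:
    """Add a preset word (for example a filler) at the chosen position."""
    out = list(words)
    out.insert(position, word)
    return out


def repeat_added_word(words: Sequence[str], position: int, gap: int = 0) -> List[str]:
    """Copy the word at `position` and insert the copy `gap` tokens after it."""
    out = list(words)
    out.insert(min(len(out), position + 1 + gap), words[position])
    return out
```

For example, adding `"um"` at position 1 of `["we", "start", "now"]` and then repeating it with `gap=0` yields `["we", "um", "um", "start", "now"]`.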

Optionally, where the sentence composition unit is a character unit, the second conversion module is further configured to:

perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed;

sample characters from the written sentence according to a preset character sampling rule, to obtain a target character in the written sentence;

delete the target character from the written sentence, and insert the target character into the resulting written sentence within the preset character insertion range corresponding to the target character, to obtain a converted written sentence;

determine the spoken text based on the converted written sentence.

Optionally, where the sentence composition unit is a symbol unit, the second conversion module is further configured to:

perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed; sample symbols from the written sentence according to a preset symbol sampling rule, to obtain a target punctuation mark in the written sentence; delete the target punctuation mark from the written sentence, to obtain a converted written sentence; and determine the spoken text based on the converted written sentence; and/or

perform sentence identification on the written text to be processed, to obtain the written sentences contained in the written text to be processed; sample symbol clauses from the written sentence according to a preset symbol clause sampling rule, to obtain a target symbol clause in the written sentence; insert a preset punctuation mark into the target symbol clause, to obtain a converted written sentence; and determine the spoken text based on the converted written sentence.
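Both symbol-level conversions can be sketched in a few lines. The punctuation set, the per-mark drop probability used as the "symbol sampling rule", and the choice of a uniformly random interior position for the inserted mark are illustrative assumptions rather than details taken from the application.

```python
import random
from typing import List, Sequence

# Assumed punctuation inventory (ASCII plus common full-width marks).
PUNCT = set(",.;:!?,。;:!?、")


def drop_punctuation(sentence: str, drop_prob: float, rng: random.Random) -> str:
    """Delete each punctuation mark independently with probability drop_prob."""
    return "".join(
        ch for ch in sentence
        if not (ch in PUNCT and rng.random() < drop_prob)
    )


def insert_punctuation(clause_words: Sequence[str], mark: str,
                       rng: random.Random) -> List[str]:
    """Insert a preset mark at a random interior position of the sampled clause."""
    if len(clause_words) < 2:
        return list(clause_words)  # no interior position available
    idx = rng.randint(1, len(clause_words) - 1)
    return list(clause_words[:idx]) + [mark] + list(clause_words[idx:])
```

Dropping marks imitates unpunctuated speech transcripts, while inserting a mark inside a clause imitates the spurious breaks that pauses introduce in transcribed speech.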

Optionally, the second conversion module is further configured to:

determine the conversion processing probability corresponding to each conversion processing strategy for the written text to be processed;

determine, based on the conversion processing probabilities, the target conversion processing strategy to be executed among the conversion processing strategies;

perform sentence composition unit conversion on the written text to be processed by executing the target conversion processing strategy, to obtain the spoken text corresponding to the written text to be processed.
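Selecting a conversion strategy by probability amounts to weighted sampling over a strategy table. In the sketch below the strategy names, weights, and toy conversion functions are hypothetical; only the select-then-execute flow reflects the text above.

```python
import random
from typing import Callable, Dict, Tuple

Strategy = Callable[[str], str]


def choose_strategy(strategies: Dict[str, Tuple[float, Strategy]],
                    rng: random.Random) -> Tuple[str, Strategy]:
    """Sample one strategy name in proportion to its conversion probability."""
    names = list(strategies)
    weights = [strategies[name][0] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    return name, strategies[name][1]


def convert(text: str, strategies: Dict[str, Tuple[float, Strategy]],
            rng: random.Random) -> str:
    """Execute the sampled target strategy on the written text to be processed."""
    _, strategy = choose_strategy(strategies, rng)
    return strategy(text)
```

A table such as `{"drop_punct": (0.7, f), "add_filler": (0.3, g)}` would apply `f` to roughly 70% of inputs. The weights need not sum to one, since `random.choices` normalizes them.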

Optionally, the text processing apparatus further comprises:

an information identification module, configured to identify abnormal information in the spoken text;

a cleaning module, configured to perform data cleaning on the spoken text according to the abnormal information, to obtain cleaned spoken text;

a sample corpus construction module, configured to construct a sample corpus based on the correspondence between the written text and the back-translated written text, on the one hand, and the cleaned spoken text, on the other.

Optionally, the second conversion module is further configured to:

perform sentence composition unit conversion on the written text, to obtain a first spoken text corresponding to the written text;

perform sentence composition unit conversion on the back-translated written text, to obtain a second spoken text corresponding to the back-translated written text;

take the first spoken text and the second spoken text together as the spoken text.
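Assembling the sample corpus from the two spoken texts might be sketched as follows. The pairing convention is an assumption: each spoken text is paired with the written text it was derived from, which is one reading of "the correspondence between the written text and the back-translated written text and the spoken text". The `to_spoken` callable is a hypothetical stand-in for the conversion step.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (spoken source, written target)


def build_sample_corpus(
    written: Sequence[str],
    back_translated: Sequence[str],
    to_spoken: Callable[[str], str],  # assumed sentence composition unit conversion
) -> List[Pair]:
    """Pair every written and back-translated text with its spoken variant."""
    pairs: List[Pair] = []
    for original, paraphrase in zip(written, back_translated):
        pairs.append((to_spoken(original), original))      # first spoken text
        pairs.append((to_spoken(paraphrase), paraphrase))  # second spoken text
    return pairs
```

Because back-translation paraphrases the original, each written text contributes two training pairs, which is how the approach enlarges the corpus without manual annotation.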

Optionally, the text processing apparatus further comprises:

a deletion module, configured to delete the target spoken text where the text type is an invalid text type.

The text processing apparatus provided by the embodiments of the present application acquires a target spoken text, classifies the target spoken text to obtain its text type, and, where the text type is a standard text type, selects the written language rewriting model corresponding to that standard text type. A written language rewriting model suited to the target spoken text is thereby selected according to the text type of the target spoken text. The target spoken text is then input into the written language rewriting model for processing, to obtain the target written text corresponding to the target spoken text, which makes the rewriting more targeted and improves its accuracy. The written language rewriting model is trained on written text and on spoken text obtained by back-translating and converting that written text. Preprocessing written text by back-translation and conversion supplies a large spoken-text-to-written-text sample corpus for model training, which makes the model easier to train, avoids the time-consuming and laborious manual collection and processing of large amounts of text data, and saves time and labor costs.
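The classify-then-route flow summarized above can be sketched as a small dispatcher. The type labels `"standard"`, `"fuzzy"`, and `"invalid"` and the three callables are hypothetical; the handling of the fuzzy and invalid types follows the optional embodiments and dependent claims of this application rather than the summary paragraph itself.

```python
from typing import Callable, Optional


def route_spoken_text(
    text: str,
    classify: Callable[[str], str],          # assumed text classification model
    rewrite_model: Callable[[str], str],     # assumed written language rewriting model
    conversion_model: Callable[[str], str],  # assumed written language conversion model
) -> Optional[str]:
    """Select a model by text type; return None when the text is deleted."""
    text_type = classify(text)
    if text_type == "standard":
        return rewrite_model(text)
    if text_type == "fuzzy":
        return conversion_model(text)
    return None  # invalid text type: the target spoken text is deleted
```

Routing before rewriting means the heavier rewriting model only sees input it was trained for, while semantically unclear input goes to a model trained without back-translated data.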

The above is a schematic solution of a text processing apparatus according to this embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method described above belong to the same concept; for details not described in the technical solution of the text processing apparatus, refer to the description of the technical solution of the text processing method.

It should be noted that each component in the apparatus claims should be understood as a functional module that must be established to implement the steps of the program flow or of the method, and that the functional modules are not defined by an actual division or separation of functions. An apparatus claim defined by such a set of functional modules should be understood as a functional module architecture that implements the solution mainly through the computer program described in the specification, and not as a physical apparatus that implements the solution mainly through hardware.

An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, where the processor implements the steps of the text processing method when executing the computer instructions.

The above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method described above belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the text processing method.

An embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the text processing method described above.

The above is a schematic solution of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the text processing method described above belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the text processing method.

An embodiment of the present application discloses a chip storing computer instructions which, when executed by a processor, implement the steps of the text processing method described above.

The above is a schematic solution of a chip according to this embodiment. It should be noted that the technical solution of the chip and the technical solution of the text processing method described above belong to the same concept; for details not described in the technical solution of the chip, refer to the description of the technical solution of the text processing method.

The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

It should be noted that, for brevity, the foregoing method embodiments are each described as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. In addition, those skilled in the art should understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the present application.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the relevant descriptions of the other embodiments.

The preferred embodiments of the present application disclosed above are intended only to help illustrate the present application. The optional embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the content of the present application. These embodiments were selected and described in detail in order to better explain the principles and practical applications of the present application, so that those skilled in the art can understand and make good use of it. The present application is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A text processing method, comprising:
acquiring a target spoken text;
classifying the target spoken text to obtain a text type corresponding to the target spoken text;
where the text type is a standard text type, selecting the corresponding written language rewriting model according to the standard text type;
inputting the target spoken text into the written language rewriting model for processing, to obtain a target written text corresponding to the target spoken text;
wherein the written language rewriting model is trained on written text and on spoken text obtained by back-translating and converting the written text.

2. The text processing method according to claim 1, wherein, after obtaining the text type corresponding to the target spoken text, the method further comprises:
where the text type is a fuzzy text type, selecting the corresponding written language conversion model according to the fuzzy text type;
inputting the target spoken text into the written language conversion model for processing, to obtain a converted written text corresponding to the target spoken text;
wherein the written language conversion model is trained on the written text and on basic spoken text obtained by converting the written text.

3. The text processing method according to claim 2, wherein the training of the written language conversion model comprises:
acquiring written text;
performing sentence composition unit conversion on the written text, to obtain basic spoken text;
constructing a basic sample corpus based on the correspondence between the written text and the basic spoken text;
training an initial written language conversion model on the basic sample corpus until the written language conversion model that satisfies the second training stop condition is obtained.

4. The text processing method according to claim 1, wherein classifying the target spoken text to obtain the text type corresponding to the target spoken text comprises:
inputting the target spoken text into a text classification model for classification, to obtain the text type corresponding to the target spoken text;
wherein the training of the text classification model comprises:
acquiring sample spoken text and the semantic clarity labels corresponding to the sample spoken text;
constructing training sample pairs based on the sample spoken text and the semantic clarity labels;
training an initial text classification model on the training sample pairs until the text classification model that satisfies the classification training stop condition is obtained.

5. The text processing method according to claim 1, wherein inputting the target spoken text into the written language rewriting model for processing, to obtain the target written text corresponding to the target spoken text, comprises:
splitting the target spoken text into sentences, to obtain the sentence sequence contained in the target spoken text;
inputting the spoken sentence units in the sentence sequence, in order, into the encoding layer of the written language rewriting model for encoding, to obtain the sentence feature vector and the vocabulary vector corresponding to each spoken sentence unit, wherein the vocabulary vector is obtained by mapping the spoken sentence unit onto a vocabulary;
computing the vector product of the sentence feature vector and the vocabulary vector, and inputting the vector product into the decoding layer of the written language rewriting model for decoding, to obtain the target written text corresponding to the target spoken text.

6. The text processing method according to claim 1, wherein training the written language rewriting model on written text and on spoken text obtained by back-translating and converting the written text comprises:
acquiring written text;
back-translating the written text, to obtain a back-translated written text corresponding to the written text;
performing sentence composition unit conversion on the written text and on the back-translated written text, respectively, to obtain spoken text;
constructing a sample corpus based on the correspondence between the written text and the back-translated written text, on the one hand, and the spoken text, on the other;
training an initial written language rewriting model on the sample corpus until the written language rewriting model that satisfies the second training stop condition is obtained.

7. The text processing method according to claim 6, wherein back-translating the written text to obtain the back-translated written text corresponding to the written text comprises:
translating the written text into a translated written text in a preset language;
translating the translated written text back into the target language to which the written text belongs, to obtain an initial back-translated written text;
replacing, in the initial back-translated written text, the target keywords that correspond to the keywords of the written text with those keywords, to obtain the back-translated written text.

8. The text processing method according to claim 7, wherein, before translating the written text into the translated written text in the preset language, the method further comprises:
performing part-of-speech analysis on the written text, and identifying the keywords in the written text whose part of speech is a preset part of speech;
marking, in the written text, the positions at which the keywords occur;
and correspondingly, replacing the target keywords in the initial back-translated written text with the keywords of the written text, to obtain the back-translated written text, comprises:
replacing, based on the position marks, the corresponding target keywords in the initial back-translated written text with the keywords, to obtain the back-translated written text.

9. The text processing method according to claim 3 or 6, wherein the sentence composition unit includes at least one of the following: a clause unit, a word unit, a character unit, and a symbol unit.

10. The text processing method according to claim 6, wherein either one of the written text and the back-translated written text is taken as the written text to be processed, and performing sentence composition unit conversion on the written text to be processed comprises:
determining the conversion processing probability corresponding to each conversion processing strategy for the written text to be processed;
determining, based on the conversion processing probabilities, the target conversion processing strategy to be executed among the conversion processing strategies;
performing sentence composition unit conversion on the written text to be processed by executing the target conversion processing strategy, to obtain the spoken text corresponding to the written text to be processed.

11. The text processing method according to claim 1, wherein, after obtaining the text type corresponding to the target spoken text, the method further comprises:
where the text type is an invalid text type, deleting the target spoken text.

12. A text processing apparatus, comprising:
an acquisition module, configured to acquire a target spoken text;
a classification module, configured to classify the target spoken text to obtain a text type corresponding to the target spoken text;
a selection module, configured to select, where the text type is a standard text type, the corresponding written language rewriting model according to the standard text type;
a processing module, configured to input the target spoken text into the written language rewriting model for processing, to obtain a target written text corresponding to the target spoken text; wherein the written language rewriting model is trained on written text and on spoken text obtained by back-translating and converting the written text.

13. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 11 when executing the computer instructions.

14. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.
CN202210257335.6A 2022-03-16 2022-03-16 Text processing method and device Pending CN114357122A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210257335.6A CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device
CN202210590875.6A CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210257335.6A CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210590875.6A Division CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Publications (1)

Publication Number Publication Date
CN114357122A true CN114357122A (en) 2022-04-15

Family

ID=81094603

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210590875.6A Pending CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device
CN202210257335.6A Pending CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210590875.6A Pending CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Country Status (1)

Country Link
CN (2) CN114880436A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115358215A (en) * 2022-04-29 2022-11-18 南京信息工程大学 A generative summary error correction method for factual consistency
CN119558270A (en) * 2025-01-24 2025-03-04 人民中科(北京)智能技术有限公司 Spoken text rewriting method and device based on counterfactual reasoning

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119294364B (en) * 2024-12-13 2025-03-28 贵阳康养职业大学 Natural language processing-based public health investigation text analysis method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140358519A1 (en) * 2013-06-03 2014-12-04 Xerox Corporation Confidence-driven rewriting of source texts for improved translation
CN104731775A (en) * 2015-02-26 2015-06-24 北京捷通华声语音技术有限公司 Method and device for converting spoken languages to written languages
CN106354716A (en) * 2015-07-17 2017-01-25 华为技术有限公司 Method and device for converting text
CN110287461A (en) * 2019-05-24 2019-09-27 北京百度网讯科技有限公司 Text conversion method, device and storage medium
CN111666775A (en) * 2020-05-21 2020-09-15 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN111737983A (en) * 2020-06-22 2020-10-02 网易(杭州)网络有限公司 Text writing style processing method, device, equipment and storage medium
CN113822052A (en) * 2020-06-18 2021-12-21 上海流利说信息技术有限公司 A text error detection method, device, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7552045B2 (en) * 2006-12-18 2009-06-23 Nokia Corporation Method, apparatus and computer program product for providing flexible text based language identification
CN105843811B (en) * 2015-01-13 2019-12-06 华为技术有限公司 method and apparatus for converting text
CN111414732A (en) * 2019-01-07 2020-07-14 北京嘀嘀无限科技发展有限公司 Text style conversion method and device, electronic equipment and storage medium
CN110188327B (en) * 2019-05-30 2021-05-14 北京百度网讯科技有限公司 Method and device for removing spoken language of text
CN112733554B (en) * 2020-12-23 2021-09-07 深圳市爱科云通科技有限公司 Spoken text processing method, apparatus, server and readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140358519A1 (en) * 2013-06-03 2014-12-04 Xerox Corporation Confidence-driven rewriting of source texts for improved translation
CN104731775A (en) * 2015-02-26 2015-06-24 北京捷通华声语音技术有限公司 Method and device for converting spoken languages to written languages
CN106354716A (en) * 2015-07-17 2017-01-25 华为技术有限公司 Method and device for converting text
CN110287461A (en) * 2019-05-24 2019-09-27 北京百度网讯科技有限公司 Text conversion method, device and storage medium
CN111666775A (en) * 2020-05-21 2020-09-15 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113822052A (en) * 2020-06-18 2021-12-21 上海流利说信息技术有限公司 A text error detection method, device, electronic device and storage medium
CN111737983A (en) * 2020-06-22 2020-10-02 网易(杭州)网络有限公司 Text writing style processing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU CHENG et al.: "Contextual Text Style Transfer", arXiv *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115358215A (en) * 2022-04-29 2022-11-18 南京信息工程大学 A generative summary error correction method for factual consistency
CN115358215B (en) * 2022-04-29 2025-06-27 南京信息工程大学 A generative summary error correction method for factual consistency
CN119558270A (en) * 2025-01-24 2025-03-04 人民中科(北京)智能技术有限公司 Spoken text rewriting method and device based on counterfactual reasoning

Also Published As

Publication number Publication date
CN114880436A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN109840331B (en) A Neural Machine Translation Method Based on User Dictionary
CN110390103B (en) Automatic short text summarization method and system based on double encoders
CN109117483B (en) Training method and device of neural network machine translation model
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN111832275A (en) Text creation method, apparatus, device and storage medium
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
CN114357122A (en) Text processing method and device
CN110134954B (en) Named entity recognition method based on Attention mechanism
JP2010250814A (en) Part-of-speech tagging system, training device and method of part-of-speech tagging model
CN105261358A (en) N-gram grammar model constructing method for voice identification and voice identification system
CN116796744A (en) Entity relation extraction method and system based on deep learning
CN115392259A (en) Microblog text sentiment analysis method and system based on adversarial training fused with BERT
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN116187282A (en) Training method of text review model, text review method and device
CN115587590A (en) Training corpus construction method, translation model training method, translation method
CN116049387A (en) Short text classification method, device and medium based on graph convolution
Yazar et al. Low-resource neural machine translation: A systematic literature review
CN107168953A (en) The new word discovery method and system that word-based vector is characterized in mass text
CN116340502A (en) Information retrieval method and device based on semantic understanding
Sun [Retracted] Analysis of Chinese Machine Translation Training Based on Deep Learning Technology
WO2022227166A1 (en) Word replacement method and apparatus, electronic device, and storage medium
CN114298037A (en) A deep learning-based text summary acquisition method
CN117573096B (en) Intelligent code completion method integrating abstract syntax tree structure information
CN117290515B (en) Training method of text annotation model, method and device for generating text graph
CN118503411A (en) Outline generation method, model training method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination