CN115705474A

CN115705474A - Document translation method, apparatus, storage medium, and program product

Info

Publication number: CN115705474A
Application number: CN202110895692.0A
Authority: CN
Inventors: 邓敏捷
Original assignee: Beijing Zitiao Network Technology Co Ltd
Current assignee: Beijing Zitiao Network Technology Co Ltd
Priority date: 2021-08-05
Filing date: 2021-08-05
Publication date: 2023-02-17

Abstract

Embodiments of the present disclosure provide a document translation method, device, storage medium, and program product. An original document to be translated is parsed to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, where the node holding each paragraph text in the initial syntax tree is marked with that paragraph text's identifier. Each paragraph text is converted, according to its style information, into a paragraph text carrying first style tags. A translation is obtained for each paragraph text, in which second style tags of the same type are marked at the positions corresponding to the first style tags. Each paragraph text in the initial syntax tree is then replaced with its translation according to its identifier, and the document is restored from the resulting syntax tree to obtain the translated document corresponding to the original document. Translating at paragraph granularity keeps the semantics of the translation complete and accurate, and replacing paragraph texts with their translations inside the syntax tree keeps the style of the translated document consistent with the original, improving the quality of document translation.

Description

Document translation method, device, storage medium and program product

Technical Field

The embodiments of the present disclosure relate to the field of computer technology, and in particular to a document translation method, device, storage medium, and program product.

Background

Document translation is the text-processing task of translating a document in one language into a document in another, target language. As globalization deepens and the Internet develops rapidly, demand for document translation keeps growing.

Some machine translation approaches on the market extract the lowest-level text units from a document, and may even convert the document from its source format into another format (as Google's document translation does), before performing machine translation. The whole process is limited by the granularity of the smallest extracted unit and by the accuracy of the translation engine, which can lead to uncontrollable results such as disfluent sentences or uncontrolled deviations in formatting. These problems can be mitigated by raising the threshold for the smallest unit, but because machine translation changes the grammatical order of the text, the formatting then cannot be restored precisely. The approach is therefore confined to a single sentence or a few short passages, and the final translation may still be unidiomatic or lose semantic context.

Summary

Embodiments of the present disclosure provide a document translation method, device, storage medium, and program product, so as to improve the accuracy of document translation while preserving the document format.

In a first aspect, an embodiment of the present disclosure provides a document translation method, including:

parsing an original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, wherein the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text;

converting each paragraph text into a paragraph text having first style tags according to the style information of the paragraph text;

obtaining a translation corresponding to each paragraph text, wherein second style tags of the same type are marked in the translation at the positions corresponding to the first style tags; and

replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of the paragraph text, and restoring the document from the resulting syntax tree to obtain a translated document corresponding to the original document.

In a second aspect, an embodiment of the present disclosure provides a document translation device, including:

a parsing unit, configured to parse an original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, wherein the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text;

a conversion unit, configured to convert each paragraph text into a paragraph text having first style tags according to the style information of the paragraph text;

a translation unit, configured to obtain a translation corresponding to each paragraph text, wherein second style tags of the same type are marked in the translation at the positions corresponding to the first style tags; and

a restoration unit, configured to replace each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of the paragraph text, and to restore the document from the resulting syntax tree to obtain a translated document corresponding to the original document.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including at least one processor and a memory;

the memory stores computer-executable instructions; and

the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the document translation method described in the first aspect and in the various possible designs of the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the document translation method described in the first aspect and in the various possible designs of the first aspect.

In a fifth aspect, an embodiment of the present disclosure provides a computer program product including computer-executable instructions which, when executed by a processor, implement the document translation method described in the first aspect and in the various possible designs of the first aspect.

The document translation method, device, storage medium, and program product provided by the embodiments parse the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, where the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text; convert each paragraph text into a paragraph text having first style tags according to its style information; obtain a translation corresponding to each paragraph text, where second style tags of the same type are marked in the translation at the positions corresponding to the first style tags; and replace each paragraph text in the initial syntax tree with its corresponding translation according to its identifier, then restore the document from the resulting syntax tree to obtain the translated document corresponding to the original document. Translating at paragraph granularity keeps the semantics of the translation complete and accurate. Replacing each paragraph text with its translation inside the syntax tree allows the document style to be restored from the syntax tree, so the style of the translated document stays consistent with the original document and the quality of document translation improves.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1a is an example of an original document to be translated;

FIG. 1b and FIG. 1c are schematic diagrams of machine translation results in the prior art;

FIG. 2 is a schematic flowchart of a document translation method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of a document translation method provided by an embodiment of the present disclosure;

FIG. 4a is a schematic diagram of a paragraph text having first style tags provided by an embodiment of the present disclosure;

FIG. 4b is a schematic diagram of a translation having second style tags provided by an embodiment of the present disclosure;

FIG. 5a is a schematic diagram of paragraph texts and translations in the translation of an Office document provided by an embodiment of the present disclosure;

FIG. 5b is a schematic diagram of the translated document corresponding to an Office document provided by an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of paragraph texts and translations in the translation of a cloud document provided by an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of paragraph texts and translations in the translation of a Markdown document provided by an embodiment of the present disclosure;

FIG. 8 is a structural block diagram of a document translation device provided by an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.

Some machine translation approaches on the market extract the lowest-level text units from a document, and may even convert the document from its source format into another format (as Google's document translation does), before performing machine translation. The whole process is limited by the granularity of the smallest extracted unit and by the accuracy of the translation engine, which can lead to uncontrollable results such as disfluent sentences or uncontrolled deviations in formatting. These problems can be mitigated by raising the threshold for the smallest unit, but because machine translation changes the grammatical order of the text, the formatting then cannot be restored precisely. The approach is therefore confined to a single sentence or a few short passages, and the final translation may still be unidiomatic or lose semantic context. FIG. 1a shows an original document to be translated, and FIG. 1b and FIG. 1c show the results of existing machine translation, whose quality is visibly poor.

To solve the above technical problem, the embodiments of the present disclosure provide a document translation method that parses the original document to be translated and extracts paragraph texts from it at paragraph granularity, so that the semantics of the translation remain complete and accurate. Replacing each paragraph text with its translation inside the syntax tree allows the document style to be restored from the syntax tree, so the style of the translated document stays consistent with the original document. The embodiments are also general-purpose: they are applicable to Office documents, cloud documents, Markdown documents, and the like.

Specifically, the original document to be translated is parsed to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, where the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text. Each paragraph text is converted into a paragraph text having first style tags according to its style information. A translation corresponding to each paragraph text is obtained, where second style tags of the same type are marked in the translation at the positions corresponding to the first style tags. Each paragraph text in the initial syntax tree is replaced with its corresponding translation according to its identifier, and the document is restored from the resulting syntax tree to obtain the translated document corresponding to the original document.

The document translation method of the present disclosure is described in detail below with reference to specific embodiments and the drawings.

Embodiment 1

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a document translation method provided by an embodiment of the present disclosure. The method of this embodiment may be applied in a terminal device or a server, and includes:

S201. Parse the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts, where the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text.

In this embodiment, the original document to be translated may be an Office document, a cloud document, a Markdown document, or a document in another format such as HTML, XLIFF, PDF, or CSV.

The accuracy of existing machine translation is limited by the granularity of the smallest extracted unit, and that unit is usually small, so semantic context is lost and the translation is inaccurate. This embodiment therefore uses the paragraph as the smallest unit. Paragraph texts obtained from the original document may be delimited by line breaks, and a paragraph text may contain one sentence or several sentences.

In this embodiment, parsing the original document includes splitting the original document, locating the paragraph texts to be translated, and obtaining an initial syntax tree containing the paragraph texts. The syntax tree may be an abstract syntax tree (Abstract Syntax Tree, AST). An abstract syntax tree is an abstract representation of the syntactic structure of the original document's source code: it expresses the syntactic structure of a programming language as a tree, and each node of the tree represents a construct in the source code. When the initial syntax tree is obtained, the node holding each paragraph text may be marked with the identifier of that paragraph text, so that the paragraph text can later be located by its identifier.

Optionally, as shown in FIG. 3, parsing the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree containing the paragraph texts may specifically include:

S2011. Determine the part of the original document that contains the paragraph texts, parse that part, extract the paragraph texts, and generate an initial syntax tree containing the paragraph texts;

S2012. Mark the node holding each paragraph text in the initial syntax tree with the identifier of that paragraph text;

S2013. Store the initial syntax tree.

In this embodiment, an original document usually consists of a collection of data. An Office document, for example, is essentially a Zip archive whose extracted content is a folder of XML files; cloud documents and Markdown documents likewise contain a collection of data. This embodiment is concerned only with the part that contains the paragraph texts to be translated. The original document can therefore be disassembled to determine the part containing the paragraph texts to be translated, and that part is parsed to extract the paragraph texts and generate the initial syntax tree containing them. Translation is not completed instantaneously, so a mapping identifier is needed when each paragraph text is later replaced with its translation, and these identifiers must be stored. For this reason, when the original document is parsed to obtain the paragraph texts, the node holding each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text, and the initial syntax tree marked with these identifiers is stored as a temporary file. When the document is restored, the marked paragraph texts are retrieved from the temporary file and replaced with their translations.
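By way of illustration only, steps S2011 to S2013 might be sketched as follows for a plain-text input whose paragraphs are delimited by line breaks. The node layout (`type`, `id`, `text`), the use of `uuid` for identifiers, and the JSON temporary file are assumptions made for this sketch; the disclosure does not prescribe a concrete tree format or storage format.

```python
import json
import tempfile
import uuid
from pathlib import Path


def parse_document(source: str) -> dict:
    """S2011/S2012: build a minimal syntax tree and mark paragraph nodes."""
    children = []
    for line in source.split("\n"):
        if line.strip():
            # Paragraph node, marked with an identifier for later lookup.
            children.append(
                {"type": "paragraph", "id": uuid.uuid4().hex, "text": line}
            )
        else:
            # Blank lines are kept as nodes so the layout can be restored.
            children.append({"type": "blank", "text": line})
    return {"type": "document", "children": children}


def extract_paragraphs(tree: dict) -> dict:
    """Return {identifier: paragraph text} for every paragraph node."""
    return {
        node["id"]: node["text"]
        for node in tree["children"]
        if node["type"] == "paragraph"
    }


def store_tree(tree: dict) -> Path:
    """S2013: persist the marked tree as a temporary JSON file."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(tree, fh, ensure_ascii=False)
        return Path(fh.name)
```

A real Office or Markdown parser would produce a deeper tree; the flat list of children here is only meant to show where the identifiers are attached and why the tree is stored before translation begins.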

S202. Convert each paragraph text into a paragraph text having first style tags according to the style information of the paragraph text.

In this embodiment, a paragraph text has certain styles, such as bold, italic, underline, strikethrough, background color, and font color, and may also contain images, formulas, and the like. The style information of the paragraph text can be identified, and the paragraph text can be converted according to that style information into a paragraph text having first style tags, as shown in FIG. 4a.

A style tag may be a paired tag or a single tag. A paired tag consists of two tags that mark the start position and the end position of a format. For bold, for example, usually one or several characters are bold; one format tag can be inserted before the bold text and another after it, indicating that the text between the two tags is bold. A single tag marks a format that exists on its own, and can be used, for example, to mark an image or a formula. In this embodiment, the style information of the paragraph text is obtained, and the corresponding style tags are inserted in the paragraph text at the positions corresponding to that style information. For example, if a paragraph text is found to contain bold formatting and the bold text content is identified, the style tag corresponding to bold is inserted before the bold text content and again after it.
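The distinction between paired tags and single tags can be illustrated with a minimal sketch. The `Run` structure and the angle-bracket tag syntax (`<b>…</b>`, `<img/>`) are hypothetical choices for this example; the disclosure does not fix a concrete tag syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """A contiguous piece of a paragraph (hypothetical structure)."""
    text: str = ""
    style: Optional[str] = None    # e.g. "b" or "i"; None means unstyled
    element: Optional[str] = None  # e.g. "img" or "formula"; standalone item


def to_tagged_text(runs: list[Run]) -> str:
    """Serialize runs into one string carrying style tags."""
    parts = []
    for run in runs:
        if run.element is not None:
            # Single tag: marks a format that exists on its own.
            parts.append(f"<{run.element}/>")
        elif run.style is not None:
            # Paired tag: marks the start and end of the styled text.
            parts.append(f"<{run.style}>{run.text}</{run.style}>")
        else:
            parts.append(run.text)
    return "".join(parts)


tagged = to_tagged_text([
    Run("Read the "),
    Run("manual", style="b"),
    Run(" first "),
    Run(element="img"),
])
# tagged == "Read the <b>manual</b> first <img/>"
```

Keeping the whole paragraph in one tagged string is what lets a translator, human or machine, see the full sentence while the style boundaries travel with it.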

In an optional embodiment, converting each paragraph text into a paragraph text having first style tags according to the style information of the paragraph text may specifically include:

determining the common style of the paragraph text according to the style information of the paragraph text; and, based on the common style, identifying the partial text and/or target elements in the paragraph text that have a non-common style, and adding the style tags corresponding to that non-common style to the partial text and/or target elements in the paragraph text.

In this embodiment, the common style of the paragraph text is determined first. The common style may be the style that occurs most in the paragraph text: for example, if most of the text in a paragraph is italic and only a small part has other styles, italic is determined to be the common style of the paragraph text. Alternatively, one of the styles in the paragraph text may be designated as its common style. Based on the common style, the content with a non-common style in the paragraph text can be identified. Text content with a non-common style is partial text, whereas images, formulas, and the like are target elements; because target elements occupy space, they may also be called local placeholder elements. Style tags corresponding to its style are then added before and after each partial text, and a style tag corresponding to its style is added at each target element.
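One possible reading of "the style that occurs most" is the style covering the largest number of characters. The sketch below takes that reading as an assumption, represents a paragraph as non-empty `(text, style)` pairs, and uses hypothetical style names and tag syntax.

```python
from collections import Counter


def common_style(runs):
    """Return the style that covers the most characters in the paragraph."""
    weight = Counter()
    for text, style in runs:
        weight[style] += len(text)
    return weight.most_common(1)[0][0]


def tag_non_common(runs):
    """Tag only the runs whose style differs from the common style."""
    base = common_style(runs)
    parts = []
    for text, style in runs:
        if style == base:
            parts.append(text)  # common style: left untagged
        else:
            parts.append(f"<{style}>{text}</{style}>")
    return base, "".join(parts)


base, tagged = tag_non_common([
    ("Most of this sentence is italic, ", "italic"),
    ("except this", "bold"),
    (".", "italic"),
])
# base == "italic"
# tagged == "Most of this sentence is italic, <bold>except this</bold>."
```

Tagging only the deviations keeps the number of tags a translator has to carry over small; the common style can be reapplied to the whole paragraph when the document is restored.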

S203. Obtain a translation corresponding to each paragraph text, where second style tags of the same type are marked in the translation at the positions corresponding to the first style tags.

In this embodiment, the translation corresponding to each paragraph text may be obtained by human translation or by machine translation. For human translation, the paragraph text is displayed in a terminal interface and the translation entered by the user for that paragraph text is received. For machine translation, the paragraph text is input into a machine translation tool, which outputs the translation of the paragraph text.

Further, because the paragraph text contains first style tags, second style tags of the same type must be inserted at the corresponding positions in its translation. For example, if a partial text of the paragraph text is italic and style tags marking italic have been inserted before and after it, style tags marking italic must also be inserted in the translation before and after the content corresponding to that partial text, as shown in FIG. 4b.

Optionally, when style tags are inserted into the translation, the user may enter a tag insertion instruction. The tag insertion instruction may include an insertion position and the type of tag to insert: the insertion position is the position in the translation corresponding to a first style tag of the paragraph text, and the tag type is a second style tag of the same type as that first style tag. In response to the user's tag insertion instruction, a second style tag of the same type is inserted in the translation at the position corresponding to the first style tag of the paragraph text.
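Whether a translation carries the same types of style tags as its source paragraph can be checked mechanically. The sketch below is one such check, assuming the hypothetical angle-bracket tag syntax; it compares how often each tag occurs rather than where, since word order may legitimately change in translation. The disclosure itself does not describe a validation step, so this is purely illustrative.

```python
import re
from collections import Counter

# Matches opening, closing, and single tags such as <b>, </b>, <img/>.
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*\s*/?>")


def tag_counts(text: str) -> Counter:
    """Count every style tag that occurs in the text."""
    return Counter(m.group(0) for m in TAG.finditer(text))


def tags_match(source: str, translation: str) -> bool:
    """True if both texts carry exactly the same tags, in any order."""
    return tag_counts(source) == tag_counts(translation)


source = "Read the <b>manual</b> first."
good = "Lisez d'abord le <b>manuel</b>."
bad = "Lisez d'abord le manuel."
# tags_match(source, good) is True; tags_match(source, bad) is False
```

A check like this could run before the restoration step, so that a translation with a missing or extra tag is sent back for correction instead of producing a document whose styling silently differs from the original.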

S204. Replace each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of the paragraph text, and restore the document from the resulting syntax tree to obtain the translated document corresponding to the original document.

In this embodiment, after the translations of all paragraph texts to be translated have been obtained, the document can be restored. Following a substitute-in-place approach, each paragraph text is located in the initial syntax tree by its identifier and replaced with its corresponding translation, yielding the replaced syntax tree. The document is then restored from the replaced syntax tree to obtain the translated document corresponding to the original document, while the style of the translated document is kept consistent with the style of the original document.
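The substitution in S204 can be sketched as a lookup by identifier followed by serialization of the tree. The flat tree layout and the newline-joined output are assumptions carried over for illustration; restoring a real Office or Markdown document would serialize back to that format instead.

```python
import copy


def replace_paragraphs(tree: dict, translations: dict) -> dict:
    """Return a copy of the tree with paragraph texts swapped for translations."""
    new_tree = copy.deepcopy(tree)
    for node in new_tree["children"]:
        pid = node.get("id")
        if pid in translations:
            # Locate the paragraph by its identifier and substitute in place.
            node["text"] = translations[pid]
    return new_tree


def restore_document(tree: dict) -> str:
    """Serialize the (replaced) tree back into document text."""
    return "\n".join(node["text"] for node in tree["children"])


tree = {
    "type": "document",
    "children": [
        {"type": "paragraph", "id": "AA", "text": "Hello"},
        {"type": "blank", "text": ""},
        {"type": "paragraph", "id": "BB", "text": "World"},
    ],
}
translated = replace_paragraphs(tree, {"AA": "Bonjour", "BB": "Monde"})
restored = restore_document(translated)
# restored == "Bonjour\n\nMonde"
```

Because only the `text` of marked nodes changes, every other node, and therefore the structure and styling encoded in the tree, passes through untouched.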

Optionally, when the part containing the paragraph texts is parsed and the paragraph texts are extracted in S201, the paragraph texts may be extracted from that part of the original document to generate a first paragraph unit data list. The first paragraph unit data list records the paragraph texts to be translated in the form of a list, with each paragraph text displayed in its own style, as shown in Table 1:

Table 1

Identifier | Paragraph text
AA | Hello打工人 (Hello, workers)
BB | 吃了吗？您呐！对方放向你发出干饭邀请！是否接受 (Have you eaten? The other party is inviting you to a meal! Accept?)
CC | 走 (Let's go)

进一步的，在S202根据各段落文案的样式信息，将各段落文案转化为具有第一样式标签的段落文案时，可根据所述第一段落单元数据列表中的各段落文案的样式信息，确定段落文案的公共样式；根据所述公共样式，识别所述第一段落单元数据列表中的各段落文案中包括的非公共样式的局部文案和/或目标元素，并在段落文案中对非公共样式的局部文案和/或目标元素添加非公共样式对应的样式标签。Further, in S202, when converting each paragraph copy into a paragraph copy with a first style tag according to the style information of each paragraph copy, the paragraph can be determined according to the style information of each paragraph copy in the first paragraph unit data list. The public style of copywriting; according to the public style, identify the partial copywriting and/or target elements of the non-public style included in each paragraph copy in the first paragraph unit data list, and modify the partial copywriting of the non-public style in the paragraph copy Copywriting and/or target elements add style tags corresponding to non-public styles.

进一步的，所述根据各段落文案的标识将所述初始语法树中的每一段落文案替换为其对应的译文时，可先将各段落文案对应的译文，转化为第二段落单元数据列表，第二段落单元数据列表中译文样式与对应的段落文案样式相对应，如表2所示：Further, when replacing each paragraph text in the initial grammar tree with its corresponding translation according to the identification of each paragraph text, the translation corresponding to each paragraph text can be converted into a second paragraph unit data list, the first The translation style in the two-paragraph unit data list corresponds to the corresponding paragraph copywriting style, as shown in Table 2:

表2Table 2

进一步的，可根据各段落文案的标识、以及第二段落单元数据列表，将所述初始语法树中的每一段落文案替换为其对应的译文，并删除段落文案的标识。Further, each paragraph text in the initial syntax tree may be replaced with its corresponding translation according to the identification of each paragraph text and the second paragraph unit data list, and the identification of the paragraph text may be deleted.

Specifically, once the corresponding paragraph text node is found, the translation of the paragraph text exported from the paragraph editor is split, according to the style tags in the translation, into partial texts and partial placeholders of the translation. The partial texts or partial placeholders in the first paragraph unit data list of the original paragraph text are then traversed in order and replaced one by one with the translated data node information, so that the paragraph texts extracted during parsing are translated as required.

The document translation method provided in this embodiment parses the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree that includes the paragraph text, where the node of each paragraph text in the initial syntax tree is marked with the identifier of that paragraph text. Each paragraph text is converted into a paragraph text with first style tags according to its style information. A translation is obtained for each paragraph text, in which second style tags of the same type are marked at the positions corresponding to the first style tags. Each paragraph text in the initial syntax tree is replaced with its translation according to its identifier, and the document is restored from the replaced syntax tree to obtain the translated document corresponding to the original document. Translating at the granularity of paragraph text keeps the semantics of the translation complete and accurate, and replacing the paragraph texts with their translations in the syntax tree allows the document style to be restored from the syntax tree, so that the translated document has the same style as the original and the quality of document translation is improved.

Building on the above embodiment, the document translation method of the present disclosure is described below for three document types: Office documents, cloud documents, and Markdown documents.

Embodiment Two

Building on the above embodiment, this embodiment describes the document translation method using Office documents as an example.

An Office document, such as a Docx or Xlsx document, is essentially a Zip archive whose decompressed content is a folder of xml files. Parsing and restoration in this embodiment therefore only need to deal with the xml files that include the paragraph text to be translated: the paragraph text is extracted from them and translated, and xml files of the translation are then generated that correspond to the original xml files, with the paragraph text content replaced by the translation. These files are put back at the locations of the original xml files, and the folder is compressed into a Zip archive again.

Accordingly, the document translation method for Office documents in this embodiment includes the following processes: format disassembly, document parsing, translation, and document restoration. Each process is described in detail below.

2.1) Format disassembly

Format disassembly is a preliminary step of document parsing; it determines which parts of the original document include the paragraph text to be translated.

For a Docx document, the folder obtained after converting the document to Zip format and decompressing it roughly includes: [Content_Types].xml, the _rels/ folder, the docProps/ folder, and the word/ folder.

Here, [Content_Types].xml describes the ContentType of each part of the document (for example, document.xml), so that a program knows how to parse that part when displaying the document.

The files of interest when parsing a Docx document are the xml files related to paragraph text, such as the main body, headers, footers, endnotes, footnotes, and comments. These xml files can be obtained after the Docx archive is decompressed.

For an Xlsx document, the folder obtained after converting the document to Zip format and decompressing it roughly includes: [Content_Types].xml, the _rels/ folder, the docProps/ folder, and the xl/ folder.

Here, [Content_Types].xml describes each part of the document, so that a program knows how to parse that part when displaying the document.

For the text inside an Xlsx document, all text present in the sheets is first extracted from the sharedStrings.xml file, with each si tag as a unit, into one large array object. Combined with files such as the styles information, this array is then injected into the text rendering of sheet[n] through index mapping of the form <c><v>index</v></c>. Note that in some older Xlsx versions sharedStrings.xml contains no actual text and the text content is stored directly as <c><v>content</v></c>; in that case index mapping should not be used.
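The index mapping described above can be sketched as follows. This is a minimal illustration in Python, not the embodiment's own implementation; the function names, the sample XML, and the use of the t="s" cell attribute to tell shared-string cells from directly stored values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by both parts in this sketch.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

SHARED_STRINGS = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <si><t>Hello</t></si>
  <si><r><t>rich </t></r><r><t>text</t></r></si>
</sst>"""

SHEET = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData><row>
    <c r="A1" t="s"><v>1</v></c>
    <c r="B1" t="s"><v>0</v></c>
    <c r="C1"><v>42</v></c>
  </row></sheetData>
</worksheet>"""


def load_shared_strings(xml_text):
    """Collect one string per <si> unit, joining the text of all runs inside it."""
    root = ET.fromstring(xml_text)
    t_tag = f"{{{NS['m']}}}t"
    return ["".join(t.text or "" for t in si.iter(t_tag)) for si in root.findall("m:si", NS)]


def resolve_cells(sheet_xml, shared):
    """Map each cell reference to its text; follow the index only for t="s" cells."""
    cells = {}
    for c in ET.fromstring(sheet_xml).iter(f"{{{NS['m']}}}c"):
        value = c.findtext("m:v", default="", namespaces=NS)
        cells[c.get("r")] = shared[int(value)] if c.get("t") == "s" else value
    return cells


shared = load_shared_strings(SHARED_STRINGS)
cells = resolve_cells(SHEET, shared)
print(cells)
```

Cells whose value is stored directly, as in the older files mentioned above, pass through unchanged because no index lookup is attempted for them.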

In this embodiment, the preliminary step disassembles the Docx or Xlsx document and extracts the paragraph texts to generate the first paragraph unit data list, which records the paragraph texts to be translated in list form. This can be implemented as follows:

A Docx or Xlsx document is passed in as a Buffer and loaded by jszip. After the file is parsed, its contents (for example, the individual xml files) are available as zipObject entries in (key, value) form, keyed by path.

Further, the parsed contents are traversed to find the files of interest, that is, the parts that include paragraph text, as follows:

For Docx documents: document.xml, header1.xml, footer1.xml, and other xml files that include paragraph text to be translated;

For Xlsx documents: sharedStrings.xml, sheet1.xml, and other xml files that include paragraph text to be translated.
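The loading and filtering steps above can be sketched as follows. The sketch uses Python's standard zipfile module in place of the jszip library named in the text; the helper names, the path patterns, and the tiny stand-in archive (which is not a valid Word file) are assumptions for illustration only.

```python
import io
import re
import zipfile

# Hypothetical path patterns for parts that may contain paragraph text.
DOCX_PARTS = re.compile(r"^word/(document|header\d+|footer\d+|footnotes|endnotes|comments)\.xml$")
XLSX_PARTS = re.compile(r"^xl/(sharedStrings|worksheets/sheet\d+)\.xml$")


def find_text_parts(data, pattern):
    """Open an Office file held in memory and return {path: xml bytes} for matching parts."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name) for name in archive.namelist() if pattern.match(name)}


def make_demo_docx():
    """Build a tiny stand-in archive for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/header1.xml", "<hdr/>")
        archive.writestr("word/styles.xml", "<styles/>")
    return buffer.getvalue()


parts = find_text_parts(make_demo_docx(), DOCX_PARTS)
print(sorted(parts))
```

Keeping the result keyed by path makes it straightforward to write each translated xml file back to the location it came from during restoration.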

2.2) Document parsing

Office documents usually have the concept of a paragraph, and their rendering is implemented from the perspective of paragraphs. Xlsx and PPT documents have no paragraph concept as such, but their underlying data is largely the same. Document parsing can therefore raise the extraction granularity to the paragraph level, which keeps the semantics complete and accurate.

In this embodiment, attention must be paid to the style information of the paragraph text and to the partial text that each style applies to, so that they can be shown to the translator.

Parsing methods and parsed data differ between document types, so a different parser is constructed for each type to convert data that looks similar but differs in detail into general data. The input of a parser is an AST and its output is the general data, as follows:

The following describes how to obtain the required AST data.

For both Docx and Xlsx, document parsing is concerned only with paragraph text, so it is sufficient to parse the xml files that include the paragraph text to be translated and convert them into AST data for the corresponding files.

During parsing, the partial texts and partial placeholders contained in each paragraph text are traversed in order. This makes it possible to extract the overall information of that text and to record, in the form of tags, the relative position at the start and end of each traversal step. In addition, when each paragraph text is parsed, identification information paragraphId is generated for it in the source data (the AST).

The AST data objects obtained from the above file types share one property: the segmented paragraph text (paragraph) is the unit of text, and it consists of partial texts and partial placeholder elements delimited by differences in text style, as shown below.

Finally, the source AST is stored as temporary data, referred to here as a temporary file. The data extracted from the AST is stored locally or remotely so that translation can be performed on it.

Specifically, a sax (Simple API for XML) parser may be used: the xml file that includes the paragraph text is converted to plain text and fed into the sax parser, and AST node information is created whenever a tag element is encountered. AST nodes are divided into element nodes and text nodes. This yields a paragraph unit data list from which paragraph text can be extracted; the paragraph unit data are traversed and each is processed into recognizable paragraph text, in which partial texts and partial placeholders carry tag information. That is, the partial texts and partial placeholders in textRunList are converted into recognizable tag information.
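A minimal sketch of the sax-based AST construction is given below, using Python's standard xml.sax module. The Node class, the "w:p" paragraph tag, the "p1", "p2" identifier format, and the sample XML are assumptions chosen for illustration; the embodiment does not prescribe them.

```python
import xml.sax
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    kind: str                                     # "element" or "text"
    name: str = ""                                # tag name for element nodes
    text: str = ""                                # character data for text nodes
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class AstBuilder(xml.sax.ContentHandler):
    """Builds a small AST and stamps each paragraph element with a paragraphId."""

    def __init__(self, paragraph_tag):
        super().__init__()
        self.paragraph_tag = paragraph_tag
        self.root = Node("element", "#root")
        self.stack = [self.root]
        self.ids = count(1)

    def startElement(self, name, attrs):
        node = Node("element", name, attrs=dict(attrs.items()))
        if name == self.paragraph_tag:
            node.attrs["paragraphId"] = f"p{next(self.ids)}"
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        siblings = self.stack[-1].children
        if siblings and siblings[-1].kind == "text":
            siblings[-1].text += content          # SAX may deliver one text in several chunks
        elif content.strip():
            siblings.append(Node("text", text=content))


def parse_to_ast(xml_text, paragraph_tag="w:p"):
    builder = AstBuilder(paragraph_tag)
    xml.sax.parseString(xml_text.encode("utf-8"), builder)
    return builder.root


SAMPLE = (
    '<w:body xmlns:w="urn:example">'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Go</w:t></w:r></w:p>"
    "</w:body>"
)
ast = parse_to_ast(SAMPLE)
paragraphs = ast.children[0].children
print([p.attrs["paragraphId"] for p in paragraphs])
```

Because the identifier is stored on the AST node itself, the same tree can later be used to locate the paragraph when the translation is written back.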

Starting from the way docx and xlsx themselves split each paragraph into partial texts, the common style is extracted from the style nodes that apply to the partial texts. The original split list is then traversed, and paired tags are placed at the start and end of each text that has a style beyond the common style. Elements that are not text, such as pictures, are treated as placeholders: their original information is retained and a single tag is placed at their position. This yields the paragraph text with first style tags, as shown in the left part of Fig. 5a.
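The tagging step can be sketched as follows. The run dictionaries, the `<gN>` paired-tag and `<xN/>` single-tag syntax, and the function names are hypothetical; the source does not specify the tag format, so this is only one way the described behaviour might look.

```python
def common_style(runs):
    """Return the style properties shared by every text run of the paragraph."""
    styles = [run["style"] for run in runs if "text" in run]
    if not styles:
        return {}
    shared = set(styles[0].items())
    for style in styles[1:]:
        shared &= set(style.items())
    return dict(shared)


def to_tagged_text(runs):
    """Render runs as one string; also return a tag-id to original-run lookup."""
    base = common_style(runs)
    parts, lookup = [], {}
    for index, run in enumerate(runs, start=1):
        if "text" not in run:                     # image or other non-text element
            lookup[index] = run
            parts.append(f"<x{index}/>")
        elif run["style"] == base:                # common style only: plain text
            parts.append(run["text"])
        else:                                     # extra style: wrap in a paired tag
            lookup[index] = run
            parts.append(f"<g{index}>{run['text']}</g{index}>")
    return "".join(parts), lookup


runs = [
    {"text": "Have you eaten? ", "style": {"font": "Arial"}},
    {"text": "You", "style": {"font": "Arial", "bold": True}},
    {"text": "!", "style": {"font": "Arial"}},
    {"placeholder": "image1.png"},
]
tagged, lookup = to_tagged_text(runs)
print(tagged)
```

The lookup table keeps the original run for each tag, so the style or placeholder can be reattached after translation without the translator ever seeing the style details.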

2.3) Translation

Once document parsing is complete, it is apparent that parsing depends strongly on how the underlying data of the document itself splits the paragraph text. In other words, paragraph texts with the same appearance may yield different partial-text tag information (for example, Office styles also take fonts and characters into account). Whatever the file format, however, the result is a data object with a uniform content format, and translation itself requires the extracted paragraph text to be translated in full.

In this embodiment, the goal of paragraph text translation is not only to translate the paragraph text into the target language, but also to restore the original styles (for example, bold or italic) at the corresponding positions, especially when subtle subject-predicate structures are involved.

The translation itself may be performed manually or by machine. For manual translation, the paragraph text is displayed in a terminal interface and the translation entered by the user is received. For machine translation, the paragraph text is input into a machine translation tool, which outputs its translation.

For manual translation, the data may be imported into a visual editor, in which the translator can flexibly translate each parsed paragraph text. During translation, the translator only needs to look at the first style tags displayed in the parsed paragraph text data and insert second style tags of the same type at the corresponding positions in the translation, as shown in the right part of Fig. 5a. The translator does not need to know which styles the paragraph text originally carried and can concentrate on translating the text.

Optionally, each time the translator translates, the translated paragraph text may be recorded so that it can be reused in later translation work, improving translation efficiency. The editor may also provide various translation engines, so that the translator can quickly fill in translations of common short phrases.

Optionally, since every translation made by a translator is recorded, a large volume of translation records can be consolidated and used for machine training, which helps improve the quality and efficiency of translation.

2.4) Document restoration

When translation is complete, the translator has translated the paragraph texts extracted during document parsing as required. The translation data can then be converted into an AST in which the text content has been replaced with the translation in the target language.

Document parsing not only yields the paragraph texts to be translated but also records a paragraphId on the corresponding node of the initial AST during extraction, and the initial AST with the recorded paragraphIds is saved as a temporary file. A one-to-one relationship with the text files in the original document format can therefore be established through paragraphId, producing a replaced AST whose text content is in the target language. Finally, restoration is performed in the way that corresponds to how the document was parsed.

Specifically, this can be implemented as follows:

Once the corresponding paragraph text node is found, the translation of the paragraph text exported from the paragraph editor is split, according to the style tags in the translation, into partial texts and partial placeholders of the translation. The partial texts or partial placeholders in the first paragraph unit data list of the original paragraph text are then traversed in order and replaced one by one with the translated data node information, so that the paragraph texts extracted during parsing are translated as required.

More specifically, translating a paragraph text yields translation data with style tags, which must be converted into the second paragraph unit data list before docx or xlsx can recognize it. That is, the translation data with style tags must be restored, according to the tags, into content with the corresponding styles. The tagged translation data is traversed. If a paired tag is recognized, the text before the paired tag is treated as ordinary text and given the common style, while the text inside the paired tag has the style corresponding to that tag in addition to the common style. If a single tag is recognized, a partial placeholder element is present and is placed at that position. If no style tag is recognized, the style is set to the common style.
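The traversal of the tagged translation can be sketched as below. The `<gN>…</gN>` paired-tag and `<xN/>` single-tag syntax and the segment dictionaries are hypothetical, since the source does not define a concrete tag format; a tag value of None stands for "common style only".

```python
import re

# Hypothetical tag syntax: <gN>...</gN> for a styled span, <xN/> for a placeholder.
TAG = re.compile(r"<g(\d+)>(.*?)</g\1>|<x(\d+)/>", re.S)


def split_translation(tagged):
    """Split a tagged translation into ordered segments for the second list."""
    segments, cursor = [], 0
    for match in TAG.finditer(tagged):
        if match.start() > cursor:                # plain text before the tag
            segments.append({"text": tagged[cursor:match.start()], "tag": None})
        if match.group(1) is not None:            # paired tag: styled text
            segments.append({"text": match.group(2), "tag": int(match.group(1))})
        else:                                     # single tag: placeholder element
            segments.append({"placeholder": True, "tag": int(match.group(3))})
        cursor = match.end()
    if cursor < len(tagged):                      # trailing plain text
        segments.append({"text": tagged[cursor:], "tag": None})
    return segments


segments = split_translation("<g2>You</g2>, have you eaten?<x4/>")
print(segments)
```

In the example the styled word has moved to the front of the sentence in the translation; because the style travels with the tag rather than with a character offset, the reordering does not affect where the style is applied.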

Further, the restored second paragraph unit data list is matched against the first paragraph unit data list by paragraphId, and the data in textRunList is replaced. Because js objects are references, the replacement acts directly on the initial AST data. In addition, when the paragraph text was extracted and the initial AST generated, the raw value recorded the paragraphId of the original node data, so the paragraphId can be deleted once a paragraph unit has been restored.
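The matching and in-place replacement can be sketched as follows. Python dictionaries are used here in place of js objects; like js objects they are held by reference, so updating a node found through the index also updates the tree. The node layout and function names are assumptions for illustration.

```python
def index_paragraphs(node, found=None):
    """Walk the AST and collect paragraph nodes keyed by their paragraphId."""
    found = {} if found is None else found
    if "paragraphId" in node:
        found[node["paragraphId"]] = node
    for child in node.get("children", []):
        index_paragraphs(child, found)
    return found


def apply_translations(ast, translated_units):
    """Replace each paragraph's textRunList in place, then drop the temporary id."""
    nodes = index_paragraphs(ast)
    for unit in translated_units:
        node = nodes[unit["paragraphId"]]
        node["textRunList"] = unit["textRunList"]  # mutates the node inside the AST
        del node["paragraphId"]


ast = {
    "type": "body",
    "children": [
        {"type": "paragraph", "paragraphId": "AA", "textRunList": [{"text": "Hello打工人"}]},
        {"type": "paragraph", "paragraphId": "CC", "textRunList": [{"text": "走"}]},
    ],
}
apply_translations(ast, [
    {"paragraphId": "CC", "textRunList": [{"text": "Go"}]},
    {"paragraphId": "AA", "textRunList": [{"text": "Hello, workers"}]},
])
print(ast["children"][1]["textRunList"])
```

Matching by identifier rather than by position means the translated units may arrive in any order.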

For both Docx and Xlsx, parsing is concerned only with the xml files that include paragraph text. During restoration, it is therefore only necessary to create new xml files from the translated AST, put them back at the locations from which the originals were extracted, recompress the whole decompressed folder into a Zip archive, and rename it with the .docx or .xlsx extension.
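The repackaging step can be sketched with Python's standard zipfile module. The helper names and the stand-in archive contents are hypothetical; a real document would contain many more parts, all of which are copied through unchanged.

```python
import io
import zipfile


def repack(original, replacements):
    """Copy every entry of the archive, swapping in translated XML where provided."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(original)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = replacements.get(info.filename, source.read(info.filename))
            target.writestr(info.filename, data)
    return output.getvalue()


def make_archive(entries):
    """Build a small in-memory archive for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return buffer.getvalue()


original = make_archive({"word/document.xml": "<p>走</p>", "word/styles.xml": "<styles/>"})
translated = repack(original, {"word/document.xml": "<p>Go</p>".encode("utf-8")})

with zipfile.ZipFile(io.BytesIO(translated)) as check:
    restored = {name: check.read(name).decode("utf-8") for name in check.namelist()}
print(restored)
```

The returned bytes can be written to a file with the .docx or .xlsx extension, which corresponds to the renaming step described above.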

Note that the requirements of the Docx format differ from those of other document types. Docx renders fonts differently for different languages; Arabic, for example, is rendered from right to left. Depending on the target language of the translation, the language-specific font style of the target language can be injected into the partial texts during document restoration.

The above process yields the translated document shown in Fig. 5b. Its style is consistent with that of the original document, and translation quality is improved.

Embodiment Three

Building on the above embodiments, this embodiment describes the document translation method using cloud documents as an example.

Cloud documents, such as Lark cloud documents, support real-time online collaborative editing by multiple users. The rich text data object of the original document can be obtained from the cloud document server and processed in the same way as an Office document is parsed. Note that the obtained data cannot simply be written back unchanged, because some information elements do not support direct creation and unsupported content must be discarded. The unsupported information elements can be listed in advance and filtered out with a blocklist; an entry is removed from the blocklist once the element becomes supported.
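The blocklist filtering can be sketched as follows. The element type names in the blocklist and the shape of the rich text tree are invented for illustration; the actual element types of a cloud document are not given in the source.

```python
# Hypothetical element type names that the creation interface cannot accept.
UNSUPPORTED_TYPES = {"poll", "embedded_sheet"}


def strip_unsupported(node, blocked=UNSUPPORTED_TYPES):
    """Return a copy of the rich text tree without blocked element types, or None."""
    if node.get("type") in blocked:
        return None
    kept = [strip_unsupported(child, blocked) for child in node.get("children", [])]
    result = {key: value for key, value in node.items() if key != "children"}
    if "children" in node:
        result["children"] = [child for child in kept if child is not None]
    return result


document = {
    "type": "doc",
    "children": [
        {"type": "paragraph", "children": [{"type": "text", "value": "Hi"}]},
        {"type": "poll", "title": "Lunch?"},
    ],
}
cleaned = strip_unsupported(document)
print([child["type"] for child in cleaned["children"]])
```

Supporting a new element type later only requires removing its name from the set; the traversal itself does not change.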

Accordingly, the document translation method for cloud documents in this embodiment includes the following processes: requesting data, document parsing, translation, document restoration, and uploading data. Each process is described in detail below.

3.1) Requesting data

For a cloud document, identity verification is performed before the rich text data acquisition request is sent to the server. Authentication uses the authentication logic provided by the API (Application Programming Interface); the token identifier of the cloud document is passed in, and the rich text data of the cloud document is obtained according to that token.

Specifically, the application authorization credential code is verified first, followed by the user identity user_access_token. The request then carries the lark_doc_token used to obtain the document information, with the token in the header, and the rich text data of the document is finally obtained.

3.2) Document parsing

Parsing and extraction for machine translation usually only needs the extracted text itself and can ignore the styles the text carries and the relationships between pieces of text. In this embodiment, by contrast, attention must be paid to the style information of the paragraph text and to the partial text that each style applies to, so that they can be shown to the translator.

For a cloud document, the rich text data object of the cloud document link is requested from the server. Loosely speaking, the rich text data object can be understood as a pseudo-AST.

In addition, the rich text data object of the cloud document is recorded as the first paragraph unit data list, and the recorded data is converted into paragraph text data that the paragraph editor can recognize. The data is processed, with secondary processing applied wherever styles are present (as for Markdown; see Embodiment Four), finally yielding paragraph text data with first style tags, as shown in the left part of Fig. 6.

3.3) Translation

The goal of paragraph text translation is not only to translate the paragraph text into the target language, but also to restore the original styles (for example, bold or italic) at the corresponding positions, especially when subtle subject-predicate structures are involved.

For manual translation, the data may be imported into a visual editor, in which the translator can flexibly translate each parsed paragraph text. During translation, the translator only needs to look at the first style tags displayed in the parsed paragraph text data and insert second style tags of the same type at the corresponding positions in the translation, as shown in the right part of Fig. 6. The translator does not need to know which styles the paragraph text originally carried and can concentrate on translating the text. The translation process is the same as in Embodiment Two and is not repeated here.

3.4) Document restoration

After the translation data that includes the second style tags is obtained for the paragraph text, it must be converted into the second paragraph unit data list. That is, the translation data with style tags must be restored, according to the tags, into content with the corresponding styles. The tagged translation data is traversed. If a paired tag is recognized, the text before the paired tag is treated as ordinary text and given the common style, while the text inside the paired tag has the style corresponding to that tag in addition to the common style. If a single tag is recognized, a partial placeholder element is present and is placed at that position. If no style tag is recognized, the style is set to the common style.

Finally, the data of the second paragraph unit data list is matched against the first paragraph unit data list and backfilled, by paragraphId mark, into the marked temporary rich text data object (the pseudo-AST described above). After the paragraphId marks are removed, a new rich text data object is generated. This can be implemented as follows:

The new rich text data object is sent to the interface provided by the server for creating cloud documents, so that a cloud document of the translation is created from it and a link to that cloud document is returned. The translated cloud document can be obtained through the link.

Embodiment Four

Building on the above embodiments, this embodiment describes the document translation method using Markdown documents as an example.

First, note that for both Office documents and cloud documents the parsed paragraph text data is obtained by a single traversal from the start of the text to its end; the styles of partial texts are not nested, which makes it easy to establish paired tags. In a Markdown document, by contrast, the styles of partial texts in the parsed paragraph text data may be nested. Additional logic is therefore needed to flatten the nesting in the document parsing stage and to recreate it in the document restoration stage.

Accordingly, the document translation method for Markdown documents in this embodiment includes the following processes: preliminary step, document parsing, translation, and document restoration. Each process is described in detail below.

4.1) Preliminary step

Markdown content is plain text that follows a particular Markdown specification. The plain text must be split according to the token marks defined by the applicable specification to obtain the paragraph texts to be translated.

Consider the following sample Markdown document:

#Hello打工人

*吃了吗？**您**呐！*对方向你发出**干饭**邀请！是否接收

走！[img](http://tosv.boe.byted.org/obj/tostest/dagongren.png)

A Markdown file is in fact one long string. It is traversed and put through lexical and syntactic analysis, for example with marked or remark, proceeding from block parsing (finding headings, paragraphs, and so on) to inline parsing (parsing the content of headings, paragraphs, and so on). This finally generates the initial AST, in which each paragraph text is marked with its paragraphId.

Parsing with remark yields the initial AST, from which the paragraph unit data list can be obtained. The initial AST of Markdown represents paragraph text styles with nesting. In the sample Markdown document above, for example, the character "您" ("you") in "吃了吗？您呐！" ("Have you eaten?") is bold, and in the corresponding text styles the bold style is nested inside the italic style. That is, the bold "您" node is nested in the children object of the italic node, while the nodes on either side of it remain ordinary text nodes. The nesting relationship of the children styles is recorded during parsing and returned unchanged during restoration.

Note that in Markdown syntax only some styles support nesting, such as bold, italic, and strikethrough. Items such as inline code and linked images also exist in the form of styles (that is, they are recorded in the formating data field) but do not support nesting.

In the parsed AST data, partial texts are nested within the paragraph text, so the styles themselves are nested and partial texts cannot be distinguished accurately by their styles alone. This nesting must be resolved before document parsing can proceed and general data can be output.

Therefore, after the Markdown AST is obtained, the start and end positions of the styled content within each paragraph copy must be flattened. Optionally, the paragraph copy is split into parent-style parts and child-style parts according to the style nesting relationship, and the two kinds of parts are marked with different first style tags: the first style tag of a parent-style part identifies the style information of the parent style, and the first style tag of a child-style part identifies the style information of both the parent style and the child style.

4.2) Document parsing

For a Markdown document, the paragraph copies to be translated that were extracted above are recorded in a first paragraph unit data list. This list is then processed so that, for each paragraph copy, the style of every partial copy is recorded as a style tag at its original position. Specifically, this can be done as follows:

Break the paragraph copy down into individual partial copies. If no style tags exist at all, return early; otherwise continue with the steps below;

Obtain the common style of the paragraph copy, and use it to locate the parts of the paragraph copy that have a non-common style;

Among the non-common-style parts, locate those that contain a nesting relationship, split each of them into parent-style parts and child-style parts according to the style nesting relationship, and mark the parent-style and child-style parts with different first style tags. The first style tag of a parent-style part identifies the style information of the parent style, and the first style tag of a child-style part identifies the style information of both the parent style and the child style. In “吃了吗？您呐！” above, for example, the character “您” has the bold style nested inside the italic style, so italic is the parent style and bold is the child style. The text is split into three parts, “吃了吗？”, “您”, and “呐！”. The first style tags of “吃了吗？” and “呐！” identify the italic style, while the first style tag of “您” identifies both italic and bold;

For parts whose style does not support nesting, and/or parts that are partial placeholders, insert the corresponding first style tag.

The above process finally yields paragraph copy data carrying first style tags, as shown in the left part of FIG. 7.
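The flattening of nested styles described in these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the tree uses mdast-style dictionaries, the style abbreviations i, b, and s are hypothetical, and the tag syntax (for example `<i+b>…</i+b>` for a child-style part carrying both italic and bold) is invented here because the actual tag format is only shown in FIG. 7.

```python
# Hypothetical short names for the nestable styles.
NESTABLE = {"emphasis": "i", "strong": "b", "delete": "s"}


def flatten(node, active=()):
    """Return [(text, styles)], where styles lists the enclosing styles, outermost first."""
    if node["type"] == "text":
        return [(node["value"], active)]
    if node["type"] in NESTABLE:
        active = active + (NESTABLE[node["type"]],)
    segments = []
    for child in node.get("children", []):
        segments.extend(flatten(child, active))
    return segments


def to_tagged(segments):
    """Render flat segments as text carrying (assumed) first style tags."""
    parts = []
    for text, styles in segments:
        if styles:
            tag = "+".join(styles)
            parts.append(f"<{tag}>{text}</{tag}>")
        else:
            parts.append(text)
    return "".join(parts)


paragraph = {"type": "paragraph", "children": [
    {"type": "emphasis", "children": [
        {"type": "text", "value": "吃了吗？"},
        {"type": "strong", "children": [{"type": "text", "value": "您"}]},
        {"type": "text", "value": "呐！"},
    ]},
]}
segments = flatten(paragraph)
tagged = to_tagged(segments)
```

For the sample sentence this yields three segments: the two parent-style parts carry only the italic tag, and the child-style part carries a tag naming both italic and bold, matching the three-way split described above.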

4.3) Translation

For manual translation, the data can be imported into a visual editor, in which the translator can flexibly translate each parsed paragraph copy. During translation, the translator only needs to look at the first style tags displayed in the parsed paragraph copy data and insert second style tags of the same type at the corresponding positions in the translation, as shown in the right part of FIG. 7. The translator never needs to know what style the paragraph text originally carried and can therefore concentrate on translating the copy. The translation process is the same as in Embodiment 2 and is not repeated here.
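Because the translation must carry second style tags of the same type as the first style tags, an editor could check this before accepting a translation. The sketch below is one possible check and is not described in the disclosure; it assumes a hypothetical tag syntax in which a tag looks like `<i>…</i>` or `<i+b>…</i+b>`, and it compares only the types and counts of tags, not their positions.

```python
import re
from collections import Counter

# Matches opening tags only, e.g. <i> or <i+b> (assumed tag syntax).
OPENING_TAG = re.compile(r"<([a-z+]+)>")


def tag_types(text):
    """Count how often each tag type opens in the text."""
    return Counter(OPENING_TAG.findall(text))


def tags_match(source, translation):
    """True if the translation uses the same tag types, the same number of times."""
    return tag_types(source) == tag_types(translation)


source = "<i>吃了吗？</i><i+b>您</i+b><i>呐！</i>"
good = "<i>Have </i><i+b>you</i+b><i> eaten yet?</i>"
bad = "Have you eaten yet?"
```

Here tags_match(source, good) holds, while tags_match(source, bad) does not, because the untagged translation has lost the style information.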

4.4) Document restoration

Parsing a Markdown file differs from parsing files in other formats: because the copy styles within a paragraph copy can be nested, the nesting has to be flattened during parsing, and correspondingly it has to be restored during restoration. The nesting can be restored by detecting copy that is enclosed by double tags (carrying the same style information) before and after it, which yields a new AST; this new AST is finally used to generate the translated Markdown file.

That is, according to the style nesting relationship, the identifier of each paragraph copy, and the translation of each paragraph copy, each paragraph copy in the initial syntax tree is replaced with its corresponding translation and the style nesting relationship is restored, producing the replaced syntax tree; the Markdown document corresponding to the translated document is then generated from the replaced syntax tree.

Specifically, once the translation data for the paragraph copies has been obtained, the translation data carrying second style tag information is recorded as a string from left to right, so the nested style relationships recorded when the document was submitted have to be restored. Optionally, document restoration can be implemented as follows:

Traverse the translation data carrying second style tag information and, on the basis of the common style, identify each second style tag in the translation in turn; locate the parts that contain a nesting relationship, locate the parent-style and child-style parts within them, and restore the nesting of parent and child styles;

For the other second style tags, restore the corresponding style.

Further, the corresponding paragraph copy is restored according to the temporary paragraphId information recorded when the Markdown document was submitted. Specifically, after the nesting relationship has been restored, the paragraph copies in the initial syntax tree are replaced with their translations according to the paragraphId of each paragraph copy, and remark then generates the translated Markdown document from the replaced syntax tree.
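Restoring the nesting from a flat, tagged translation can be sketched as below. This is an illustrative reconstruction under assumptions rather than the disclosed algorithm: the tag syntax (`<i>…</i>`, `<i+b>…</i+b>`, outermost style first) and the names parse_tagged and nest are hypothetical, the output uses mdast-style node types, and the translated text is assumed to contain no literal `<` character.

```python
import re

# A tagged span such as <i+b>you</i+b>, or a run of untagged text (assumed syntax).
SEGMENT = re.compile(r"<([a-z+]+)>(.*?)</\1>|([^<]+)", re.S)
NODE_TYPE = {"i": "emphasis", "b": "strong", "s": "delete"}


def parse_tagged(text):
    """Split a tagged string into [(text, styles)] from left to right."""
    segments = []
    for match in SEGMENT.finditer(text):
        if match.group(1):
            segments.append((match.group(2), tuple(match.group(1).split("+"))))
        else:
            segments.append((match.group(3), ()))
    return segments


def nest(segments, depth=0):
    """Group consecutive segments that share the style at `depth` under one parent node."""
    nodes, i = [], 0
    while i < len(segments):
        text, styles = segments[i]
        if len(styles) <= depth:
            nodes.append({"type": "text", "value": text})
            i += 1
            continue
        style, j = styles[depth], i
        while (j < len(segments) and len(segments[j][1]) > depth
               and segments[j][1][depth] == style):
            j += 1
        nodes.append({"type": NODE_TYPE[style],
                      "children": nest(segments[i:j], depth + 1)})
        i = j
    return nodes


translation = "<i>Have </i><i+b>you</i+b><i> eaten yet?</i>"
children = nest(parse_tagged(translation))
```

Adjacent segments that share the parent style are merged back under a single parent node, so the bold node ends up inside the italic node again, as in the original tree.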

After the above process, the translated Markdown document reads as follows:

# Hello commuter

*Have **you** eaten yet?* I would like to invite you **to dinner**! Accept the invitation?

Yes, with pleasure! [img](http://tosv.boe.byted.org/obj/tostest/dagongren.png)

As can be seen, the quality of the translated document is considerably higher than that of plain machine translation.

Embodiment 5

Corresponding to the document translation method of the above embodiments, FIG. 8 is a structural block diagram of a document translation device provided by an embodiment of the present disclosure. For ease of description, only the parts related to the embodiments of the present disclosure are shown. Referring to FIG. 8, the document translation device 800 includes a parsing unit 801, a conversion unit 802, a translation unit 803, and a restoration unit 804.

The parsing unit 801 is configured to parse the original document to be translated and obtain at least one paragraph copy to be translated and an initial syntax tree that includes the paragraph copy, wherein the node of each paragraph copy in the initial syntax tree is marked with the identifier of that paragraph copy;

The conversion unit 802 is configured to convert each paragraph copy into a paragraph copy with first style tags according to the style information of that paragraph copy;

The translation unit 803 is configured to obtain the translation corresponding to each paragraph copy, wherein the translation is marked with second style tags of the same type at the positions corresponding to the first style tags;

The restoration unit 804 is configured to replace each paragraph copy in the initial syntax tree with its corresponding translation according to the identifier of each paragraph copy, and to restore the document from the replaced syntax tree to obtain the translated document corresponding to the original document.

In a possible embodiment, when parsing the original document to be translated and obtaining at least one paragraph copy to be translated and an initial syntax tree that includes the paragraph copy, the parsing unit 801 is configured to:

determine the part of the original document that includes the paragraph copy, parse that part, extract the paragraph copy, and generate the initial syntax tree that includes the paragraph copy;

mark the node of each paragraph copy in the initial syntax tree with the identifier of that paragraph copy;

store the initial syntax tree.

In a possible embodiment, when obtaining the translation corresponding to each paragraph copy, the translation unit 803 is configured to:

receive the translation corresponding to each paragraph copy, either input by a user or output by machine translation;

in response to a tag insertion instruction from the user, insert a second style tag of the same type at the position in the translation that corresponds to the first style tag of the paragraph copy.

In a possible embodiment, when parsing the part that includes the paragraph copy and extracting the paragraph copy, the parsing unit 801 is configured to:

extract the paragraph copy from the part of the original document that includes it, and generate a first paragraph unit data list;

When converting each paragraph copy into a paragraph copy with first style tags according to the style information of that paragraph copy, the conversion unit 802 is configured to:

determine the common style of each paragraph copy according to the style information of the paragraph copies in the first paragraph unit data list;

according to the common style, identify the partial copy and/or target elements with a non-common style in each paragraph copy of the first paragraph unit data list, and add the style tag corresponding to the non-common style to that partial copy and/or those target elements within the paragraph copy.
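The two steps above can be sketched as follows. This passage does not state how the common style is computed, so the sketch assumes one plausible definition: the set of style attributes shared by every run of the paragraph. The run structure (a list of dictionaries with text and style keys), the style names, and the tag syntax are all hypothetical.

```python
def common_style(runs):
    """Style attributes shared by every run of the paragraph (assumed definition)."""
    styles = [set(run["style"]) for run in runs]
    return set.intersection(*styles) if styles else set()


def tag_non_common(runs):
    """Wrap each run that deviates from the common style in a tag naming the deviation."""
    common = common_style(runs)
    parts = []
    for run in runs:
        extra = sorted(set(run["style"]) - common)
        if extra:
            tag = "+".join(extra)
            parts.append(f"<{tag}>{run['text']}</{tag}>")
        else:
            parts.append(run["text"])
    return "".join(parts)


runs = [
    {"text": "I would like to invite you ", "style": ["size12"]},
    {"text": "to dinner", "style": ["size12", "bold"]},
    {"text": "!", "style": ["size12"]},
]
tagged = tag_non_common(runs)
```

Only the run that deviates from the common style receives a tag, so the translator sees a single tagged span rather than the full style of every run.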

In a possible embodiment, when replacing each paragraph copy in the initial syntax tree with its corresponding translation according to the identifier of each paragraph copy, the restoration unit 804 is configured to:

convert the translation corresponding to each paragraph copy into a second paragraph unit data list;

according to the identifier of each paragraph copy and the second paragraph unit data list, replace each paragraph copy in the initial syntax tree with its corresponding translation, and delete the identifier of the paragraph copy.
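Replacement by identifier can be sketched as below, assuming the initial syntax tree is a tree of dictionaries whose paragraph nodes carry a paragraphId key and the second paragraph unit data list has been turned into a mapping from identifier to translated text. The function name replace_paragraphs and the "p1" identifier are hypothetical, and for brevity each translation is inserted as a single plain text node.

```python
def replace_paragraphs(node, translations):
    """Swap in translations by paragraphId and remove the temporary identifiers."""
    pid = node.pop("paragraphId", None)  # the identifier is deleted in either case
    if pid in translations:
        node["children"] = [{"type": "text", "value": translations[pid]}]
    else:
        for child in node.get("children", []):
            replace_paragraphs(child, translations)
    return node


tree = {"type": "root", "children": [
    {"type": "paragraph", "paragraphId": "p1",
     "children": [{"type": "text", "value": "吃了吗？您呐！"}]},
]}
replace_paragraphs(tree, {"p1": "Have you eaten yet?"})
```

Because everything outside the replaced children is left as it was, the structure of the original document carries over to the translated document.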

In a possible embodiment, when determining the part of the original document that includes the paragraph copy, parsing that part, extracting the paragraph copy, and generating the initial syntax tree that includes the paragraph copy, the parsing unit 801 is configured to:

if the original document is an Office document, decompress the Office document package and obtain the xml file that includes the paragraph copy;

parse the xml file that includes the paragraph copy with a SAX parser, extract the paragraph copy, and generate the initial syntax tree.
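For a Word document, these two steps might look like the following sketch, which uses the Python standard library modules zipfile and xml.sax. It covers only the extraction of paragraph text, not the construction of the syntax tree. It assumes the main part is word/document.xml and that the WordprocessingML namespace is bound to the conventional prefix w, so that paragraphs appear as w:p elements and text runs as w:t elements; the helper make_sample builds a tiny stand-in package for demonstration only.

```python
import io
import xml.sax
import zipfile


class ParagraphHandler(xml.sax.ContentHandler):
    """Collect the text of each w:p element from its w:t descendants."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_text = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "w:p":
            self._buffer = []
        elif name == "w:t":
            self._in_text = True

    def characters(self, content):
        if self._in_text:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "w:t":
            self._in_text = False
        elif name == "w:p" and self._buffer:
            self.paragraphs.append("".join(self._buffer))


def extract_paragraphs(package_bytes, part="word/document.xml"):
    """Read one XML part out of the package and return its paragraph texts."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        xml_bytes = archive.read(part)
    handler = ParagraphHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.paragraphs


def make_sample():
    """Build a minimal stand-in package containing only the main document part."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    body = ("<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
            "<w:p><w:r><w:t>Second</w:t></w:r></w:p>")
    xml_doc = f'<w:document xmlns:w="{ns}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml_doc)
    return buffer.getvalue()


paragraphs = extract_paragraphs(make_sample())
```

A streaming SAX handler avoids loading the whole XML part into a DOM, which matters for large documents; runs belonging to the same paragraph are joined so that translation operates on whole paragraphs.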

In a possible embodiment, when restoring the document from the replaced syntax tree to obtain the translated document corresponding to the original document, the restoration unit 804 is configured to:

create a new xml file from the replaced syntax tree;

replace the xml file that includes the paragraph copy, among the files decompressed from the Office document package, with the new xml file;

recompress the decompressed files of the Office document package to obtain the Office document package corresponding to the translated document.
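The replace-and-recompress step can be sketched with the Python standard library zipfile module as follows. Rather than unpacking to disk, this sketch copies every entry of the package into a new archive in memory and substitutes the new content for one part; the function name replace_part and the sample part names are illustrative assumptions.

```python
import io
import zipfile


def replace_part(package_bytes, part_name, new_content):
    """Return a new package in which `part_name` holds `new_content`; other parts are copied."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = new_content if info.filename == part_name else source.read(info.filename)
            target.writestr(info.filename, data)
    return output.getvalue()


# Demonstration with a tiny stand-in package of two parts.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as archive:
    archive.writestr("word/document.xml", "<old/>")
    archive.writestr("word/styles.xml", "<styles/>")

translated = replace_part(original.getvalue(), "word/document.xml", "<new/>")
with zipfile.ZipFile(io.BytesIO(translated)) as archive:
    document_part = archive.read("word/document.xml")
    styles_part = archive.read("word/styles.xml")
```

All parts other than the replaced one, such as styles, are copied through unchanged, which is what keeps the style of the translated document consistent with the original.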

In a possible embodiment, when determining the part of the original document that includes the paragraph copy, parsing that part, extracting the paragraph copy, and generating the initial syntax tree that includes the paragraph copy, the parsing unit 801 is configured to:

if the original document is a cloud document, send a rich text data acquisition request to a server, and receive the rich text data object of the cloud document that the server sends after authentication;

obtain the paragraph copy from the rich text data object, and use the rich text data object as the initial syntax tree.

In a possible embodiment, when replacing each paragraph copy in the initial syntax tree with its corresponding translation according to the identifier of each paragraph copy and restoring the document from the replaced syntax tree to obtain the translated document corresponding to the original document, the restoration unit 804 is configured to:

generate a rich text data object of the translation according to the identifier of each paragraph copy, the translation of each paragraph copy, and the initial syntax tree;

send the rich text data object of the translation to the server, so that the server creates a cloud document of the translation from it and returns a link to that cloud document;

obtain the cloud document of the translation via the cloud document link.

In a possible embodiment, when determining the part of the original document that includes the paragraph copy, parsing that part, extracting the paragraph copy, and generating the initial syntax tree that includes the paragraph copy, the parsing unit 801 is configured to:

if the original document is a Markdown document, perform block parsing and inline parsing on the Markdown document to obtain the paragraph copy, and generate the initial syntax tree.

In a possible embodiment, when converting each paragraph copy into a paragraph copy with first style tags according to the style information of that paragraph copy, the conversion unit 802 is configured to:

obtain the style nesting relationship in the paragraph copy from the initial syntax tree;

split the paragraph copy into parent-style parts and child-style parts according to the style nesting relationship, and mark the parent-style and child-style parts with different first style tags;

wherein the first style tag of a parent-style part identifies the style information of the parent style, and the first style tag of a child-style part identifies the style information of both the parent style and the child style.

according to the style nesting relationship, the identifier of each paragraph copy, and the translation of each paragraph copy, replace each paragraph copy in the initial syntax tree with its corresponding translation and restore the style nesting relationship to generate the replaced syntax tree;

generate the Markdown document corresponding to the translated document from the replaced syntax tree.

The device provided in this embodiment can be used to carry out the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar and are not repeated here.

Embodiment 6

Referring to FIG. 9, a schematic structural diagram of an electronic device 900 suitable for implementing embodiments of the present disclosure is shown. The electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDA), tablet computers (Portable Android Device, PAD), portable multimedia players (Portable Media Player, PMP), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 9, the electronic device 900 may include a processing device 901 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores the various programs and data required for the operation of the electronic device 900. The processing device 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Generally, the following devices can be connected to the I/O interface 905: an input device 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 907 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage device 908 including, for example, a magnetic tape or hard disk; and a communication device 909. The communication device 909 allows the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 9 shows an electronic device 900 with various devices, it should be understood that not all of the devices shown need to be implemented or present; more or fewer devices may be implemented or present instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to electrical wire, optical cable, RF (radio frequency), or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or it may exist separately without being incorporated into the electronic device.

The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, a first obtaining unit may also be described as "a unit for obtaining at least two Internet Protocol addresses".

The functions described above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In a first aspect, according to one or more embodiments of the present disclosure, a document translation method is provided, including:

According to one or more embodiments of the present disclosure, parsing the original document to be translated and obtaining at least one paragraph copy to be translated and an initial syntax tree that includes the paragraph copy includes:

storing the initial syntax tree.

According to one or more embodiments of the present disclosure, obtaining the translation corresponding to each paragraph copy includes:

According to one or more embodiments of the present disclosure, parsing the part that includes the paragraph copy and extracting the paragraph copy includes:

Converting each paragraph copy into a paragraph copy with first style tags according to the style information of that paragraph copy includes:

根据本公开的一个或多个实施例，所述根据各段落文案的标识将所述初始语法树中的每一段落文案替换为其对应的译文，包括：According to one or more embodiments of the present disclosure, the replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identification of each paragraph text includes:

根据本公开的一个或多个实施例，所述确定原文档中包括所述段落文案的部分，对原文档中包括所述段落文案的部分进行解析，提取段落文案，并生成包括所述段落文案的初始语法树，包括：According to one or more embodiments of the present disclosure, determining the part of the original document that includes the paragraph text, parsing the part of the original document that includes the paragraph text, extracting the paragraph text, and generating The initial syntax tree of , including:

根据本公开的一个或多个实施例，所述根据替换后的语法树进行文档还原，得到原文档对应的译文文档，包括：According to one or more embodiments of the present disclosure, the document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document includes:

根据本公开的一个或多个实施例，所述根据各段落文案的标识将所述初始语法树中的每一段落文案替换为其对应的译文，并根据替换后的语法树进行文档还原，得到原文档对应的译文文档，包括：According to one or more embodiments of the present disclosure, the text of each paragraph in the initial syntax tree is replaced with its corresponding translation according to the identification of each paragraph text, and the document is restored according to the replaced syntax tree to obtain the original The translation document corresponding to the document, including:

根据本公开的一个或多个实施例，所述根据各段落文案的样式信息，将各段落文案转化为具有第一样式标签的段落文案，包括：According to one or more embodiments of the present disclosure, converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text includes:

In a second aspect, according to one or more embodiments of the present disclosure, a document translation device is provided, including:

According to one or more embodiments of the present disclosure, when parsing the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree including the paragraph text, the parsing unit is configured to:

store the initial syntax tree.

According to one or more embodiments of the present disclosure, when obtaining the translation corresponding to each paragraph text, the translation unit is configured to:

According to one or more embodiments of the present disclosure, when parsing the part including the paragraph text and extracting the paragraph text, the parsing unit is configured to:

When converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text, the conversion unit is configured to:

According to one or more embodiments of the present disclosure, when replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, the restoration unit is configured to:

According to one or more embodiments of the present disclosure, the parsing unit determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree including the paragraph text includes:

According to one or more embodiments of the present disclosure, when performing document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document, the restoration unit is configured to:

According to one or more embodiments of the present disclosure, when determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree including the paragraph text, the parsing unit is configured to:

According to one or more embodiments of the present disclosure, when replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, and performing document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document, the restoration unit is configured to:

According to one or more embodiments of the present disclosure, when determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree including the paragraph text, the parsing unit is configured to:

According to one or more embodiments of the present disclosure, when converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text, the conversion unit is configured to:

In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, including: at least one processor and a memory;

In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions, and when a processor executes the computer-executable instructions, the document translation method described in the first aspect and the various possible designs of the first aspect is implemented.

In a fifth aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, including computer-executable instructions. When a processor executes the computer-executable instructions, the document translation method described in the first aspect and the various possible designs of the first aspect is implemented.

The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features. It also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

In addition, while operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

1. A document translation method, comprising:

parsing an original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree comprising the paragraph text, wherein the node at which each paragraph text is located in the initial syntax tree is marked with the identifier of that paragraph text;

converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text;

obtaining a translation corresponding to each paragraph text, wherein a second style tag of the same type is marked in the translation at the position corresponding to the first style tag;

and replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, and performing document restoration according to the replaced syntax tree to obtain a translation document corresponding to the original document.
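The four steps of claim 1 can be illustrated with a minimal sketch. Everything named here (`Node`, `mark_paragraphs`, `replace_paragraphs`, `restore`, `translate_document`, the `p1`, `p2` identifier scheme, and the `[image:...]` rendering) is a hypothetical assumption made for illustration, not the structure used by the disclosure. The style-tag step and the translation step are passed in as callables so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Node:
    kind: str                       # "document", "paragraph", "image", ...
    text: str = ""
    para_id: Optional[str] = None   # identifier marked on paragraph nodes
    children: List["Node"] = field(default_factory=list)


def walk(root: Node) -> Iterator[Node]:
    """Yield every node of the tree in document order."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))


def mark_paragraphs(root: Node) -> Dict[str, str]:
    """Mark each paragraph node with an identifier and return {id: text}."""
    ids = count(1)
    found: Dict[str, str] = {}
    for node in walk(root):
        if node.kind == "paragraph":
            node.para_id = f"p{next(ids)}"
            found[node.para_id] = node.text
    return found


def replace_paragraphs(root: Node, translations: Dict[str, str]) -> None:
    """Replace each paragraph text by the translation stored under its identifier."""
    for node in walk(root):
        if node.para_id in translations:
            node.text = translations[node.para_id]


def restore(node: Node) -> str:
    """Rebuild a plain-text document from the (replaced) tree."""
    if node.kind == "paragraph":
        return node.text
    if node.kind == "image":
        return f"[image:{node.text}]"   # non-text nodes pass through untouched
    return "\n".join(restore(child) for child in node.children)


def translate_document(root: Node,
                       tag: Callable[[str], str],
                       translate: Callable[[str], str]) -> str:
    paragraphs = mark_paragraphs(root)                            # parse and mark identifiers
    tagged = {pid: tag(text) for pid, text in paragraphs.items()}  # add first style tags
    translations = {pid: translate(text) for pid, text in tagged.items()}
    replace_paragraphs(root, translations)                        # replace by identifier
    return restore(root)                                          # document restoration


# Usage with a stand-in glossary instead of a real translation service.
doc = Node("document", children=[
    Node("paragraph", "你好"),
    Node("image", "logo.png"),
    Node("paragraph", "世界"),
])
glossary = {"你好": "Hello", "世界": "World"}
print(translate_document(doc, tag=lambda s: s, translate=glossary.get))
```

Because only paragraph nodes are touched, the image node keeps its position in the restored output, which is the property the claim relies on to preserve the original layout.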

2. The method of claim 1, wherein parsing the original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree comprising the paragraph text comprises:

determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating an initial syntax tree comprising the paragraph text;

marking the identifier of each paragraph text on the node at which that paragraph text is located in the initial syntax tree;

and storing the initial syntax tree.

3. The method of claim 2, wherein obtaining the translation corresponding to each paragraph text comprises:

receiving a translation corresponding to each paragraph text that is input by a user or output by machine translation;

and, in response to a tag insertion instruction from the user, inserting a second style tag of the same type into the translation at the position corresponding to the first style tag of the paragraph text.
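One consequence of claim 3 is that a finished translation should carry the same types of style tags as its source paragraph. A minimal sketch of such a consistency check follows. The `<s1>...</s1>` tag syntax and the names `TAG`, `tag_types`, and `tags_match` are assumptions made for illustration, since the claims do not fix a concrete tag format.

```python
import re
from collections import Counter

# Hypothetical tag syntax: <s1>...</s1>, <s2>...</s2>, and so on.
TAG = re.compile(r"</?(s\d+)>")


def tag_types(text: str) -> Counter:
    """Count how often each tag name occurs (opening and closing forms together)."""
    return Counter(TAG.findall(text))


def tags_match(source: str, translation: str) -> bool:
    """True when the translation carries exactly the tag types of the source."""
    return tag_types(source) == tag_types(translation)


source = "点击<s1>保存</s1>完成。"
print(tags_match(source, "Click <s1>Save</s1> to finish."))
print(tags_match(source, "Click Save to finish."))
```

A check like this only compares tag types and counts. It does not verify that a tag wraps the right words, which still depends on the translator or the user who inserts the tag.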

4. The method of claim 2, wherein parsing the part including the paragraph text and extracting the paragraph text comprises:

extracting the paragraph text from the part of the original document that includes the paragraph text to generate a first paragraph unit data list;

and wherein converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text comprises:

determining the common style of the paragraph texts according to the style information of each paragraph text in the first paragraph unit data list;

and identifying, according to the common style, a local style and/or a target element of a non-common style in each paragraph text in the first paragraph unit data list, and adding a style tag corresponding to the non-common style to the local style and/or the target element of the non-common style in the paragraph text.
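The common-style step of claim 4 can be sketched as follows. The sketch assumes that a paragraph is a list of `(text, style)` runs, that the common style is the style covering the most characters, and that tags look like `<s1>...</s1>`. These choices, and the names `common_style` and `add_style_tags`, are illustrative assumptions and not details confirmed by the disclosure.

```python
from collections import Counter
from typing import Dict, List, Tuple

Run = Tuple[str, str]   # (text, style name); a hypothetical run representation


def common_style(paragraphs: List[List[Run]]) -> str:
    """Assumption: the common style is the one that covers the most characters."""
    weight: Counter = Counter()
    for runs in paragraphs:
        for text, style in runs:
            weight[style] += len(text)
    return weight.most_common(1)[0][0]


def add_style_tags(runs: List[Run], common: str) -> Tuple[str, Dict[str, str]]:
    """Wrap every run whose style differs from the common style in a numbered tag.

    Returns the tagged text and a {tag name: style} map for later restoration.
    """
    parts: List[str] = []
    tag_styles: Dict[str, str] = {}
    for text, style in runs:
        if style == common:
            parts.append(text)              # common style needs no tag
        else:
            tag = f"s{len(tag_styles) + 1}"
            tag_styles[tag] = style
            parts.append(f"<{tag}>{text}</{tag}>")
    return "".join(parts), tag_styles


paragraph = [("Click ", "normal"), ("Save", "bold"), (" to finish.", "normal")]
common = common_style([paragraph])
tagged, tag_styles = add_style_tags(paragraph, common)
print(tagged)
print(tag_styles)
```

Tagging only the non-common parts keeps the text sent for translation short and readable, since most of a paragraph usually shares one style.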

5. The method of claim 4, wherein replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text comprises:

converting the translation corresponding to each paragraph text into a second paragraph unit data list;

and replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text and the second paragraph unit data list, and deleting the identifier of the paragraph text.
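The replacement step of claim 5 can be sketched in two parts: turning a tagged translation back into styled runs, then writing those runs into the tree by identifier and deleting the identifier. The `<s1>...</s1>` tag syntax, the flat list of dictionary nodes, and the names `to_runs` and `replace_by_id` are all hypothetical simplifications chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

Run = Tuple[str, str]   # (text, style name); a hypothetical run representation

# Hypothetical tag syntax: <s1>...</s1>; the backreference pairs open and close tags.
TAGGED = re.compile(r"<(s\d+)>(.*?)</\1>", re.S)


def to_runs(tagged: str, tag_styles: Dict[str, str], common: str) -> List[Run]:
    """Convert a tagged translation into runs, restoring the style behind each tag."""
    runs: List[Run] = []
    pos = 0
    for match in TAGGED.finditer(tagged):
        if match.start() > pos:                      # untagged text keeps the common style
            runs.append((tagged[pos:match.start()], common))
        runs.append((match.group(2), tag_styles[match.group(1)]))
        pos = match.end()
    if pos < len(tagged):
        runs.append((tagged[pos:], common))
    return runs


def replace_by_id(nodes: List[dict], translated: Dict[str, List[Run]]) -> None:
    """Replace runs on nodes whose identifier has a translation, then drop the identifier."""
    for node in nodes:
        pid = node.get("para_id")
        if pid in translated:
            node["runs"] = translated[pid]
            del node["para_id"]


nodes = [{"para_id": "p1", "runs": [("点击", "normal"), ("保存", "bold"), ("完成。", "normal")]}]
translated = {"p1": to_runs("Click <s1>Save</s1> to finish.", {"s1": "bold"}, "normal")}
replace_by_id(nodes, translated)
print(nodes)
```

Deleting the identifier after replacement leaves the tree free of translation bookkeeping, so the restored document contains nothing that the original format would not.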

6. The method of any one of claims 2-5, wherein determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree comprising the paragraph text comprises:

if the original document is an Office document, decompressing the Office document compressed package and then obtaining an XML file comprising the paragraph text;

and parsing the XML file comprising the paragraph text with a SAX parser, extracting the paragraph text, and generating the initial syntax tree.

7. The method of claim 6, wherein performing document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document comprises:

creating a new XML file according to the replaced syntax tree;

replacing the XML file comprising the paragraph text in the decompressed files of the Office document compressed package with the new XML file;

and recompressing the decompressed files of the Office document compressed package to obtain an Office document compressed package corresponding to the translation document.
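Claims 6 and 7 describe a round trip through the Office package: unzip, parse the XML with SAX, and zip again with a new XML part. A minimal sketch with Python's standard `zipfile` and `xml.sax` modules follows. It assumes a Word-style package whose text lives in `word/document.xml` under `w:p` and `w:t` elements. The names `ParagraphHandler`, `extract_paragraphs`, and `replace_part` are hypothetical, and the sketch only collects paragraph strings rather than building the full syntax tree.

```python
import io
import xml.sax
import zipfile
from typing import List


class ParagraphHandler(xml.sax.ContentHandler):
    """Collect the text of each w:p paragraph (nested paragraphs are not handled)."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: List[str] = []
        self._buffer: List[str] = []
        self._in_text = False

    def startElement(self, name, attrs):
        if name == "w:p":
            self._buffer = []
        elif name == "w:t":
            self._in_text = True

    def characters(self, content):
        if self._in_text:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "w:t":
            self._in_text = False
        elif name == "w:p":
            self.paragraphs.append("".join(self._buffer))


def extract_paragraphs(package: bytes, part: str = "word/document.xml") -> List[str]:
    """Open the package in memory and SAX-parse one XML part."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        xml_bytes = archive.read(part)
    handler = ParagraphHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.paragraphs


def replace_part(package: bytes, new_xml: bytes, part: str = "word/document.xml") -> bytes:
    """Copy every entry into a new package, swapping in the new XML for one part."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = new_xml if item.filename == part else source.read(item.filename)
            target.writestr(item.filename, data)
    return output.getvalue()
```

Copying all other entries unchanged is what keeps styles, images, and metadata identical between the original and the translated package. Only the part that holds the paragraph text is regenerated.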

8. The method of any one of claims 2-5, wherein determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree comprising the paragraph text comprises:

if the original document is a cloud document, sending a rich text data acquisition request to a server, and receiving a rich text data object of the cloud document sent by the server after authentication;

and obtaining the paragraph text according to the rich text data object, and determining the rich text data object as the initial syntax tree.

9. The method of claim 8, wherein replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, and performing document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document, comprises:

generating a rich text data object of the translation according to the identifier of each paragraph text, the translation of each paragraph text, and the initial syntax tree;

sending the rich text data object of the translation to the server, so that the server creates a cloud document of the translation according to the rich text data object of the translation and returns a cloud document link of the translation;

and obtaining the cloud document of the translation according to the cloud document link.

10. The method of any one of claims 2-5, wherein determining the part of the original document that includes the paragraph text, parsing that part, extracting the paragraph text, and generating the initial syntax tree comprising the paragraph text comprises:

if the original document is a Markdown document, performing block parsing and inline parsing on the Markdown document to obtain the paragraph text, and generating the initial syntax tree.

11. The method of claim 10, wherein converting each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text comprises:

obtaining the style nesting relationship in the paragraph text according to the initial syntax tree;

and splitting the paragraph text into a parent style part and a child style part according to the style nesting relationship, and marking the parent style part and the child style part with different first style tags;

wherein the first style tag of the parent style part identifies the style information of the parent style, and the first style tag of the child style part identifies the style information of both the parent style and the child style.
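The splitting rule of claim 11 can be sketched by flattening a nested inline tree so that each segment carries the full set of styles it inherits. The inline tree shape (plain strings, or `(style, children)` tuples), the `<bold+italic>` tag naming, and the functions `flatten` and `to_tagged` are hypothetical choices made for illustration only.

```python
from typing import FrozenSet, List, Tuple

Segment = Tuple[str, FrozenSet[str]]   # (text, every style that applies to it)


def flatten(nodes: list, inherited: FrozenSet[str] = frozenset()) -> List[Segment]:
    """Split nested inline nodes into flat segments.

    A node is either a plain string or a (style, children) tuple. A child
    segment receives the parent's styles plus its own.
    """
    segments: List[Segment] = []
    for node in nodes:
        if isinstance(node, str):
            segments.append((node, inherited))
        else:
            style, children = node
            segments.extend(flatten(children, inherited | {style}))
    return segments


def to_tagged(segments: List[Segment]) -> str:
    """Give each styled segment its own non-nested tag naming all of its styles."""
    parts: List[str] = []
    for text, styles in segments:
        if styles:
            tag = "+".join(sorted(styles))
            parts.append(f"<{tag}>{text}</{tag}>")
        else:
            parts.append(text)
    return "".join(parts)


# Inline tree for the Markdown text: See **this *nested* part**.
tree = ["See ", ("bold", ["this ", ("italic", ["nested"]), " part"]), "."]
print(to_tagged(flatten(tree)))
```

Flat, non-nested tags are easier to carry through translation than nested markup, because a reordered sentence cannot leave a child tag stranded outside its parent.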

12. The method of claim 11, wherein replacing each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, and performing document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document, comprises:

replacing each paragraph text in the initial syntax tree with its corresponding translation according to the style nesting relationship, the identifier of each paragraph text, and the translation of each paragraph text, and restoring the style nesting relationship to generate the replaced syntax tree;

and generating a Markdown document corresponding to the translation document according to the replaced syntax tree.

13. A document translation device, comprising:

a parsing unit, configured to parse an original document to be translated to obtain at least one paragraph text to be translated and an initial syntax tree comprising the paragraph text, wherein the node at which each paragraph text is located in the initial syntax tree is marked with the identifier of that paragraph text;

a conversion unit, configured to convert each paragraph text into a paragraph text with a first style tag according to the style information of each paragraph text;

a translation unit, configured to obtain a translation corresponding to each paragraph text, wherein a second style tag of the same type is marked in the translation at the position corresponding to the first style tag;

and a restoration unit, configured to replace each paragraph text in the initial syntax tree with its corresponding translation according to the identifier of each paragraph text, and to perform document restoration according to the replaced syntax tree to obtain the translation document corresponding to the original document.

14. An electronic device, comprising: at least one processor and a memory;

the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the document translation method of any one of claims 1-12.

15. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the document translation method of any one of claims 1-12.

16. A computer program product comprising computer-executable instructions which, when executed by a processor, implement the document translation method of any one of claims 1-12.