CN106372065B - A method and system for developing a multilingual website - Google Patents
- Publication number
- CN106372065B (application number CN201610958116.5A)
- Authority
- CN
- China
- Prior art keywords
- translation
- website
- data
- multilingual
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
- G06F8/4441—Reducing the execution time required by the program code
- G06F8/4442—Reducing the number of cache misses; Data prefetching
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of natural language processing, and in particular to a method and system for developing a multilingual website. The method comprises: step a: developing the static web pages of the multilingual website; step b: calling a machine translation interface to translate the Chinese data dynamically added to the multilingual website into multiple languages; step c: reading the translation data, and loading and rendering the dynamic web pages of the multilingual website according to the translation data. The invention combines machine translation with manual correction, which greatly reduces translation errors and makes the displayed web pages more accurate; by selecting the UTF-8 encoding of Unicode, it avoids garbled characters during web page rendering; through a cache mechanism for dynamic loading, it solves the resource consumption and loading delay caused by re-calling the machine translation interface on every load during real-time translation, while also reducing manual intervention.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and system for developing a multilingual website.
Background
With the rapid commercialization of the Internet, e-commerce websites have emerged in large numbers and market competition has become increasingly fierce. In recent years, e-commerce in China has developed rapidly: its applications in various fields have continued to expand and deepen, transaction volumes have repeatedly hit new highs, related industries have flourished, the supporting systems have been steadily improved, and the drive and capacity for innovation have continued to grow.
Uyghur is an ancient written language with a long history, and a great many books, documents, and historical materials are written in it. These preserve a vast amount of information on Uyghur culture and daily life, and their historical significance and cultural value are precious. Information processing technology for minority scripts is therefore closely tied to the future development of the Uyghur language. As the cultural and educational level of the Uyghur people has gradually risen, the number of people able to create Uyghur web pages has also grown. Many individuals and groups have built Uyghur websites of various kinds to disseminate information. Like ordinary Chinese websites, these sites provide news browsing, information downloads, and similar functions. However, because the Uyghur software used to build them employs mutually different encodings, Uyghur web pages have long been fragmented and mutually incompatible. Most Uyghur web content cannot be shared, and converting between the different encodings consumes a great deal of working time and research resources.
The Xinjiang Uygur Autonomous Region is home to many ethnic groups and languages. Online shopping has become a popular trend, and the success of Taobao suggests that the trend will continue. However, most shopping platforms in Xinjiang are ordinary Chinese-language websites, which are very difficult to use for the many Uyghur users who are not familiar with Chinese. A standardized Uyghur-Chinese bilingual shopping platform is therefore urgently needed. Developing a standardized Uyghur e-commerce platform is not simply a matter of translating static web pages into Uyghur: a complete shopping system requires real-time dynamic management, with data being dynamically created, deleted, updated, and queried. Manual translation cannot keep up with this volume of dynamically changing data, so machine translation is needed to support the platform's dynamic content.
Machine translation is the process of using a computer to convert one natural language into another. Over the course of its development, many machine translation systems based on different principles have appeared. By method, they fall roughly into four categories: rule-based, example-based, statistical, and hybrid machine translation. Each kind of system has its own strengths. For example, rule-based systems are good at translating sentences that conform to their rules and produce high-quality output for them, while statistical systems are general-purpose and learn linguistic knowledge automatically from corpora.
Relevant references on Uyghur-Chinese machine translation include:
[1] Lan Boxiong, Zheng Xiaona, Xu Xin. Supply chain management in the e-commerce era [J]. China Management Science, 2000, 03: 2-8.
[2] Tang Lu. Analysis of the industrial organization of China's e-commerce online shopping platforms [D]. Wuhan University of Technology, 2012.
[3] Chen Yun, Zhang Penghua, Ren Lihua. A review of machine translation research [J]. Value Engineering, 2013, 01: 174-176.
[4] Zhu Hai. Research on machine translation system combination based on confusion networks [D]. University of Science and Technology of China, 2010.
[5] Nagao M. A framework of a mechanical translation between Japanese and English by analogy principle [M]. North Holland Publications, 1984.
[6] Mairehaba Aili. Research on several key issues of example-based Uyghur-Chinese machine translation [D]. Xinjiang University, 2014.
[7] Arif Kurban, Ablimiti Abdureyimu, Turgen Ibrayin. Design of an electronic dictionary for Uyghur-Chinese machine translation [J]. Computer Engineering and Applications, 2006, 20: 76-78.
[8] Kaharjan Abidurexiti. Research on an example-based Chinese-Uyghur and Uyghur-Chinese bidirectional machine translation system [D]. Shanghai Jiao Tong University, 2012.
[9] Gulisong Nasserdin, Maimaiti Saifudin. Research and design of an electronic dictionary for a Uyghur-Chinese machine translation system [J]. Journal of Xinjiang Normal University (Natural Science Edition), 1997, 01: 32-36.
To address Uyghur-Chinese machine translation, Chinese patent application No. 201310740830.3 discloses a method of applying a Uyghur translation engine to a self-service electricity bill payment terminal. The display language, such as Chinese or Uyghur, is selected at the terminal. If Chinese is selected, no machine translation is needed; if Uyghur is selected, the translation engine is started to translate the information in the database and display it on the terminal interface, which greatly reduces the cost and time of manual translation between Chinese and Uyghur. The drawback of that patent is that translation is performed in real time whenever Uyghur is selected: although the cost and time of manual translation are greatly reduced, there is neither a caching mechanism nor a Uyghur database prepared in advance to reduce the delay when pages are loaded.
Another Chinese patent application, No. 201310197369.1, discloses an integrated enterprise information management system. The client submits an information management request, which includes the choice of language and application mode, to an internationalization synchronization module. That module receives the request, manages it by language, and passes it to a unified information management module, which evaluates and classifies the different pieces of information in the request and passes the classified information to a history module. The history module receives the classified information and sends it to the client. That patent solves the problem of keeping pages synchronized across language environments, and lets users keep full track of the enterprise's personnel, payroll, files, tasks, property, and so on; every user operation is saved synchronously in the history module and can be restored and reviewed at any time. Its drawbacks are as follows: the internationalization synchronization module is managed per module and per language, so when the client updates a large amount of data every module must be updated synchronously in real time. On the one hand, there is no preprocessing, so returned data is subject to refresh delay; on the other hand, updated data may contain errors, and there is no manual correction step.
In summary, existing Uyghur-Chinese bilingual machine translation solutions use a single translation mode, generally dynamic real-time machine translation, with no caching mechanism or data preprocessing. In B/C mode, web page rendering may therefore suffer from garbled characters and loading delays.
Summary of the Invention
The present invention provides a method and system for developing a multilingual website, aiming to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, the present invention provides the following technical solution: a method for developing a multilingual website, comprising:
Step a: developing the static web pages of the multilingual website;
Step b: calling a machine translation interface to perform multilingual translation of the Chinese data dynamically added to the multilingual website;
Step c: reading the translation data, and loading and rendering the dynamic web pages of the multilingual website according to the translation data.
The technical solution adopted in the embodiment of the present invention further includes: in step a, the languages of the multilingual website include at least Chinese and Uyghur and/or Kazakh; developing the static web pages of the multilingual website specifically means developing them in the UTF-8 encoding of the Unicode character set.
The technical solution adopted in the embodiment of the present invention further includes: in step b, the multilingual translation of the Chinese data dynamically added to the multilingual website specifically includes:
Step b1: wrapping the translation interface, retrieving in batches the Chinese data dynamically added to the website database, storing the Chinese data in a document, reading the Chinese data in the document line by line, and calling the machine translation interface to translate each line automatically as it is read;
Step b2: manually correcting the stored translation data;
Step b3: storing the manually corrected translation data in the website database in the corresponding format.
The technical solution adopted in the embodiment of the present invention further includes: in step c, loading and rendering the dynamic web pages of the multilingual website according to the translation data specifically includes: when the translation data is stored, converting the code of each Uyghur or Kazakh character into a four-digit hexadecimal string, and, when the web page is rendered, performing a further encoding conversion on the Uyghur or Kazakh text read from the website database.
The technical solution adopted in the embodiment of the present invention further includes: step c further includes caching the loaded web pages; the web page caching includes file caching and memory caching.
Another technical solution adopted in the embodiment of the present invention is a multilingual website development system, comprising:
Static web page development module: used to develop the static web pages of the multilingual website;
Machine translation module: used to call a machine translation interface to perform multilingual translation of the Chinese data dynamically added to the multilingual website;
Web page rendering module: used to read the translation data, and to load and render the dynamic web pages of the multilingual website according to the translation data.
The technical solution adopted in the embodiment of the present invention further includes: the languages of the multilingual website include at least Chinese and Uyghur and/or Kazakh; the static web page development module develops the static web pages of the multilingual website in the UTF-8 encoding of the Unicode character set.
The technical solution adopted in the embodiment of the present invention further includes a website database module, used to store the Chinese data dynamically added to the multilingual website; the machine translation module further includes:
Translation unit: used to wrap the translation interface, retrieve in batches the Chinese data dynamically added to the website database module, store the Chinese data in a document, read the Chinese data in the document line by line, and call the machine translation interface to translate each line automatically as it is read;
Error correction unit: used to manually correct the stored translation data;
Storage unit: used to store the manually corrected translation data in the website database module in the corresponding format.
The technical solution adopted in the embodiment of the present invention further includes: the web page rendering module loading and rendering the dynamic web pages of the multilingual website according to the translation data specifically includes: when the translation data is stored, converting the code of each Uyghur or Kazakh character into a four-digit hexadecimal string, and, when the web page is rendered, performing a further encoding conversion on the Uyghur or Kazakh text read from the website database module.
The technical solution adopted in the embodiment of the present invention further includes a data cache module, used to cache the loaded web pages; the web page caching includes file caching and memory caching.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: the multilingual website development method and system combine template-based development of static web pages with calls to a machine translation interface for dynamic data, which greatly reduces the cost and time of manual translation; machine translation combined with manual correction greatly reduces translation errors and makes the displayed web pages more accurate; selecting the UTF-8 encoding of Unicode avoids garbled characters during web page rendering; and the cache mechanism for dynamic loading solves the resource consumption and loading delay caused by re-calling the machine translation interface on every load during real-time translation, while also reducing manual intervention.
Description of the Drawings
FIG. 1 is a flowchart of the multilingual website development method according to an embodiment of the present invention;
FIG. 2 is an overall framework diagram of the multilingual website according to an embodiment of the present invention;
FIG. 3 is a training flowchart of statistical machine translation;
FIG. 4 is a flowchart of multilingual human-assisted translation according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the multilingual website development system according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
Refer to FIG. 1 and FIG. 2. FIG. 1 is a flowchart of the multilingual website development method according to an embodiment of the present invention, and FIG. 2 is an overall framework diagram of the multilingual website according to an embodiment of the present invention. The multilingual website development method of the embodiment includes the following steps:
Step 10: Develop the template themes of the multilingual website, developing its static web pages in the UTF-8 encoding of the Unicode character set;
In step 10, the languages of the multilingual website in this embodiment include at least Chinese, Uyghur, and Kazakh. Unified encoding is a key technique in developing Uyghur and Kazakh web pages. Uyghur and Kazakh belong to the Altaic language family, and their scripts borrow the Arabic alphabet and some Persian letters; Uyghur has 32 letters and Kazakh has 33. Both are cursive scripts: depending on its position in a word, each letter has four forms (isolated, initial, medial, and final), and the form displayed is determined by the position of the character in the word. Uyghur and Kazakh characters therefore have some particularities in input and editing:
(1) The writing direction is right to left and lines run top to bottom, so during input the cursor moves in the direction opposite to Chinese and English, which complicates mixed editing of Uyghur or Kazakh with Chinese or English.
(2) Kazakh has 33 letters, of which 9 are vowels and 24 are consonants; its vowel harmony is fairly strict and consonant assimilation is common. Uyghur consists of 32 letters with more than 120 character forms. Each letter has four written forms: an initial form whose tail joins the next letter, a medial form joined to the adjacent letters on both sides, a final form whose head joins the preceding letter, and an isolated form joined to neither neighbor. Which form is used is determined by the letter's position in the word.
(3) Uyghur punctuation marks, such as the comma and the question mark, face the opposite direction to their Chinese and English counterparts.
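The positional forms and right-to-left behavior described above can be inspected through the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the choice of the Arabic letter BEH (U+0628) as the sample is an assumption made for illustration, and the presentation-form code points shown are Unicode compatibility characters, whereas text is normally stored as base letters and the renderer selects the form.

```python
import unicodedata

# Sample base letter (an illustrative choice): ARABIC LETTER BEH, U+0628.
BEH = "\u0628"

# U+FE8F..U+FE92 are the compatibility presentation forms of BEH.
# Each one decomposes to "<form> 0628", which exposes the four positional forms.
forms = {}
for cp in range(0xFE8F, 0xFE93):
    tag, base = unicodedata.decomposition(chr(cp)).split()
    forms[tag.strip("<>")] = (f"U+{cp:04X}", base)

# Bidirectional class: "AL" marks a right-to-left Arabic letter,
# "L" marks a left-to-right character such as Latin or Chinese.
bidi = {
    "beh": unicodedata.bidirectional(BEH),
    "latin a": unicodedata.bidirectional("a"),
    "chinese": unicodedata.bidirectional("\u4E2D"),
}

# The Arabic-script comma and question mark are separate, mirrored characters.
punctuation = [unicodedata.name("\u060C"), unicodedata.name("\u061F")]

for form, (code_point, base) in forms.items():
    print(f"{form:9s} {code_point} -> base U+{base}")
print(bidi)
print(punctuation)
```

Because the four forms share one base code point, a page only needs to store and transmit the base letters correctly; the shaping is done at display time.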
Many character sets are in use in computing, and users browsing web pages in different languages often see garbled characters because the character sets differ. Chinese website systems generally use the Simplified Chinese (GB2312) character set, which does not support Uyghur or Kazakh. A website offering Uyghur, Kazakh, and Chinese versions should therefore choose a character set that supports all three languages. The Unicode character set assigns every character of every language a unified and unique code, represented in two bytes (or in some cases four), which supports cross-language and cross-platform character decoding and conversion. However, the Unicode character set is not compatible with the ISO-8859-1 character set and takes up a lot of space (even an English letter needs two bytes), which led to the UTF encodings. There are two of these, UTF-8 and UTF-16. UTF-16 is consistent with the encoding specification of Unicode itself, whereas UTF-8 is different: it defines a set of "interval rules" that keep it as compatible as possible with ISO-8859-1 while still being able to represent the characters of all languages. UTF-8 is therefore the ideal choice when designing and developing a website with Uyghur, Kazakh, and Chinese versions. With UTF-8 encoding, the static pages of the multilingual website display normally, without garbled characters; when the client selects the Uyghur version mode, the system loads the Uyghur template theme.
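The encoding trade-offs discussed above can be checked directly. The sketch below uses only Python's built-in codecs; the three sample characters are illustrative choices, not taken from the patent. It compares byte lengths in UTF-8 and UTF-16 and tests whether GB2312 and ISO-8859-1 can represent a Uyghur (Arabic-script) letter.

```python
# Illustrative sample characters (assumptions for this sketch).
samples = {
    "English letter": "a",
    "Uyghur letter": "\u062A",      # ARABIC LETTER TEH
    "Chinese character": "\u4E2D",  # the character "zhong"
}


def byte_lengths(ch: str) -> dict:
    """Return the encoded size of one character in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids counting a BOM
    }


def can_encode(ch: str, encoding: str) -> bool:
    """Report whether the given encoding has a representation for ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for label, ch in samples.items():
    print(label, byte_lengths(ch),
          "gb2312:", can_encode(ch, "gb2312"),
          "iso-8859-1:", can_encode(ch, "iso-8859-1"))
```

In this sketch the English letter takes one byte in UTF-8 and is byte-identical to its ISO-8859-1 form, while neither GB2312 nor ISO-8859-1 can encode the Uyghur letter, which is consistent with the choice of UTF-8 for pages that mix the three languages.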
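Step 20: Select the language version mode at the client, and display the static web pages of the corresponding template theme according to the language version mode;
Step 30: Wrap the machine translation interface as a web API, call the machine translation interface, and perform multilingual translation of the Chinese data dynamically added to the website database;
In step 30, developing the multilingual website templates in this embodiment covers both the development of static web pages and the calling of the machine translation interface when dynamic data is loaded. Building an e-commerce platform in Uyghur, Kazakh, and other languages requires not only static pages: content added dynamically in batches must also be translated into each language version. This embodiment wraps the machine translation interface and calls it to handle multilingual translation of batch additions efficiently. After the encoding of the static pages has been handled, Chinese data added dynamically is stored in the website database, and Uyghur or Kazakh fields corresponding to the Chinese data are added to the database to store the translated Uyghur or Kazakh text for loading when dynamic pages are rendered.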
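One way to picture the added Uyghur and Kazakh fields is a table in which each Chinese column gains per-language companion columns. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name `products` and the column names `name_zh`, `name_ug`, and `name_kk` are hypothetical, and the patent does not specify a schema or a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing table holding the dynamically added Chinese data (hypothetical schema).
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name_zh TEXT NOT NULL)")

# Add one field per target language to hold the translated text.
conn.execute("ALTER TABLE products ADD COLUMN name_ug TEXT")  # Uyghur
conn.execute("ALTER TABLE products ADD COLUMN name_kk TEXT")  # Kazakh

conn.execute("INSERT INTO products (name_zh) VALUES (?)", ("手机",))
conn.execute("INSERT INTO products (name_zh) VALUES (?)", ("电脑",))


def pending_rows(connection):
    """Rows whose Uyghur field is still empty and therefore need translation."""
    return connection.execute(
        "SELECT id, name_zh FROM products WHERE name_ug IS NULL ORDER BY id"
    ).fetchall()


before = pending_rows(conn)

# Placeholder text only (not a real translation): store a value for row 1.
conn.execute("UPDATE products SET name_ug = ? WHERE id = ?", ("\u062A\u0648\u064A", 1))
after = pending_rows(conn)

print(before)
print(after)
```

Keeping the translations next to the source text means the rendering step can select the column matching the chosen language mode without calling the translation interface again.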
At present, machine translation approaches include the rule-based method and the statistical method.
1. Rule-based method: the text is analyzed according to linguistic rules and then translated by a computer program. Most commercial machine translation systems use the rule-based method. A rule-based system works in three successive stages: analysis, transfer, and generation. According to the complexity of these stages, there are three levels:
(1) Direct translation: word-for-word translation;
(2) Transfer translation: the translation process refers to and takes into account the lexical, syntactic, and semantic information of the source text. Because the range of information sources is very broad and the grammar rules are numerous and sometimes contradictory, transfer translation is relatively complex and error-prone;
(3) Interlingua translation.
2. Statistical method (SMT): FIG. 3 shows the training flowchart of statistical machine translation. A statistical translation model (vocabulary, alignment, or language model) is built through statistical analysis of a large parallel corpus and is then used for translation. In general, the entry with the highest probability in the statistics is chosen as the translation, and the probability calculation is based on Bayes' theorem. Suppose an English sentence A is to be translated into Chinese; every Chinese sentence B is a candidate translation of A, whether plausible or not. Pr(A) is the probability that an expression like A occurs, and Pr(B|A) is the probability that A is translated as B. Finding the maximum over these two parameters narrows the search over sentences and their corresponding translations, and so identifies the most suitable translation. SMT comes in two kinds according to the level of text analysis: word-based SMT and phrase-based SMT. The latter is the one in common use today and is the one used by Google. The text to be translated is automatically divided into fixed-length word sequences, and each word sequence is then analyzed statistically against the corpus to find the translation with the highest probability.
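The Bayes' theorem step can be written out using the same symbols A (source sentence) and B (candidate translation). The following is the standard noisy-channel formulation of statistical machine translation, given here as a worked form rather than as the patent's own equation; the names "translation model" for Pr(A|B) and "language model" for Pr(B) follow common usage and do not appear in the text above.

```latex
\begin{aligned}
\hat{B} &= \arg\max_{B} \Pr(B \mid A) \\
        &= \arg\max_{B} \frac{\Pr(A \mid B)\,\Pr(B)}{\Pr(A)} \\
        &= \arg\max_{B} \Pr(A \mid B)\,\Pr(B)
\end{aligned}
```

Since Pr(A) does not depend on B, it can be dropped from the maximization, leaving the product of the translation model Pr(A|B) and the language model Pr(B), both of which are estimated from corpora.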
Specifically, refer also to FIG. 4, which is a flowchart of multilingual human-assisted translation according to an embodiment of the present invention. The multilingual translation of the Chinese data dynamically added to the website database includes the following steps:
Step 31: Retrieve in batches the Chinese data (for example, product data) dynamically added to the website database and store it in a document; write a program that reads the Chinese data in the document line by line and, for each line read, calls the machine translation interface to translate it automatically into Uyghur or Kazakh; store the translated Uyghur or Kazakh data in a result document in Unicode encoding;
In step 31, the translation interface is first wrapped so that it takes strings as input one at a time. The Chinese data is taken from the website database and stored in a document, the document is read line by line and the translation interface is called for each line in turn, the list of returned results is stored in the result document, and the results are then inserted into the corresponding fields of the website database. This is an automated process that runs periodically, and a uniform character encoding must be used throughout all data transfer and calls to the translation interface.
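The line-by-line batch translation in step 31 can be sketched as follows. The function name `translate_file` and the stand-in `fake_translate` are hypothetical: the patent does not name its translation interface, so the translator is passed in as a plain callable and the demo uses a placeholder that only tags the input. Both files are opened as UTF-8 to keep the encoding uniform end to end.

```python
import os
import tempfile
from typing import Callable


def translate_file(src_path: str, dst_path: str,
                   translate: Callable[[str], str]) -> int:
    """Read src_path line by line, translate each line, write results to dst_path.

    Returns the number of lines processed. Both files use UTF-8.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.rstrip("\n")
            dst.write(translate(text) + "\n")  # one result line per source line
            count += 1
    return count


def fake_translate(text: str) -> str:
    """Placeholder for the wrapped machine translation interface (assumption)."""
    return f"[ug] {text}"


# Demo: export two Chinese strings to a document, then translate the document.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "chinese.txt")
dst_path = os.path.join(workdir, "uyghur.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("手机\n电脑\n")

translated_count = translate_file(src_path, dst_path, fake_translate)
with open(dst_path, encoding="utf-8") as f:
    results = f.read().splitlines()
print(translated_count, results)
```

Because each output line corresponds to the source line at the same position, the result document can be reviewed by a human corrector and then written back to the database row by row.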
Step 32: Manually correct the automatically translated Uyghur or Kazakh data;
In step 32, because the Uyghur and Kazakh lexicons are limited in size, not every Chinese phrase can be translated with complete accuracy. To improve translation accuracy, the invention uses human participation to correct the small number of errors that arise during translation, which greatly reduces translation errors and makes the displayed web pages more accurate.
Step 33: Read the manually corrected Uyghur or Kazakh data in the corresponding format and store it in the corresponding fields of the website database; the whole workflow is run automatically at regular intervals, completing both the static and the dynamic rendering processes.
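Step 40: Read the translation data from the website database, load and render the dynamic web page of the corresponding template theme according to the translation data, and cache the loaded web page.
In step 40, because computer support for Uyghur is limited, rendering a Uyghur website raises the problem of converting input and output between Unicode and the encodings supported by the operating system, the browser, and the website database. Almost all database drivers default to ISO-8859-1 when passing data between the program and the website database. As a result, when the website platform stores Uyghur data in the website database, the driver converts the Unicode data to ISO-8859-1 for storage, and the Uyghur data read back from the database during page rendering is garbled. To solve the garbling caused by the incompatible reading, writing, and storage of Uyghur, Kazakh, and Chinese, this embodiment of the present invention proposes an encoding conversion method for Uyghur and Kazakh: when the translation data is stored, each Uyghur or Kazakh character is converted to a four-digit hexadecimal string (for example, after conversion: "062A 0648 064A"); when the page is rendered, the Uyghur or Kazakh data read from the website database is converted back, so the garbling no longer occurs.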
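The hexadecimal conversion described above can be sketched as a pair of functions. The names `to_hex_codes` and `from_hex_codes` are illustrative, and the space separator is taken from the patent's example. Because four hexadecimal digits only cover code points up to U+FFFF, this sketch rejects characters outside that range instead of guessing how they should be handled.

```python
def to_hex_codes(text: str) -> str:
    """Encode each character as a four-digit uppercase hex code point, space separated."""
    codes = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError(f"U+{code_point:X} does not fit in four hex digits")
        codes.append(f"{code_point:04X}")
    return " ".join(codes)


def from_hex_codes(encoded: str) -> str:
    """Reverse to_hex_codes: turn space-separated hex code points back into text."""
    return "".join(chr(int(code, 16)) for code in encoded.split())


stored = to_hex_codes("\u062a\u0648\u064a")  # "062A 0648 064A"
restored = from_hex_codes(stored)
```

The stored form contains only ASCII digits, letters, and spaces, so it passes unchanged through a driver or column that assumes ISO-8859-1.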
Cache processing generally includes the following:
(1) Data cache: caching of website database query results. On every page access, the system first checks whether the corresponding cached data exists; if not, it connects to the database, fetches the data, serializes the query result, and saves it to a file, so that the same query can later be answered directly from the cache table or file.
(2) Page cache: on every page access, the system first checks whether the corresponding cached page file exists; if not, it connects to the database, fetches the data, displays the page, and generates the cached page file at the same time, so that the page file is used on the next visit (template engines and many common PHP caching classes provide this function).
(3) Time-triggered cache: the system checks whether the cache file exists and has not yet expired. If the file's modification timestamp is greater than the current timestamp minus the expiry period, the cache is used; otherwise the cache is refreshed.
(4) Content-triggered cache: when data is inserted or updated, the cache is forcibly refreshed.
(5) Static cache: static generation, in which text files such as HTML or XML are produced directly and regenerated when the content is updated; suitable for pages that change rarely.
(6) Memory cache: Memcached is a high-performance, distributed in-memory object caching system used to reduce database load and speed up access in dynamic applications.
(7) PHP cache. (8) MySQL cache. (9) Web cache based on a reverse proxy. (10) DNS round-robin.
The cache processing in this embodiment of the present invention mainly comprises file caching and memory caching. The main purpose of caching is to reduce the load on the database and the PHP interpreter, to reduce the latency of fetching machine-translated data from the website database during page rendering, to avoid the resource cost of calling the machine translation interface again on every load during real-time translation, and to reduce manual intervention. Query results are stored directly in the cache, so the website database need not be queried repeatedly and the load on MySQL is reduced; on the PHP side, the result of, for example, a complex recursive computation can be cached so that CPU time is not spent recomputing it each time. Uyghur text handled during caching still follows the Unicode encoding convention described above.
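The time-triggered file cache in item (3) can be illustrated with a short sketch. The patent describes the mechanism in PHP terms; the Python below is only an equivalent illustration, and `cached_load`, its parameters, and the JSON file format are assumptions. The freshness test follows the rule stated above: the cache is used when the file's modification time is greater than the current time minus the expiry period.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def cached_load(path: Path, ttl_seconds: float, compute: Callable[[], Any]) -> Any:
    """Return the cached value if the file is still fresh, otherwise recompute it."""
    if path.exists() and path.stat().st_mtime > time.time() - ttl_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    value = compute()
    # ensure_ascii=False keeps Uyghur and Chinese text readable in the cache file.
    path.write_text(json.dumps(value, ensure_ascii=False), encoding="utf-8")
    return value
```

Here `compute` would be the expensive step, for example a database query for translated product data, and it only runs when the cache file is missing or stale.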
The embodiments of the present invention are not limited to solving encoding compatibility between Uyghur and Chinese or between Kazakh and Chinese. A similar encoding approach applies to Kyrgyz and Chinese, and the overall process of calling the machine translation interface on website database data, followed by manual review, is equally applicable to other multilingual work.
Please refer to FIG. 5, a structural diagram of the multilingual website development system of an embodiment of the present invention. The system comprises a static webpage development module, a static webpage display module, a website database module, a machine translation module, a webpage rendering module, and a data cache module.
The static webpage development module is used to develop the multilingual website template themes; the static web pages of the multilingual website are developed in the UTF-8 encoding of the Unicode character set. The multilingual website in this embodiment includes at least Chinese, Uyghur, and Kazakh. Unified encoding is a key technique in developing Uyghur and Kazakh web pages. Uyghur and Kazakh belong to the Altaic language family, and both are written with letters borrowed from Arabic and, in part, Persian; Uyghur has 32 letters and Kazakh has 33. Both scripts are cursive: depending on its position in a word, each letter takes one of four forms (isolated, initial, medial, or final), and the form displayed is determined by the letter's position in the word as it is written. Uyghur and Kazakh characters therefore have several peculiarities in input and editing: (1) Text is written from right to left, with lines running from top to bottom, so the cursor moves in the opposite direction to Chinese and English during input, which complicates editing Uyghur or Kazakh mixed with Chinese or English. (2) Kazakh has 33 letters, of which 9 are vowels and 24 are consonants; its vowel harmony is fairly strict and consonant assimilation is common. Uyghur has 32 letters but more than 120 character forms: each letter has four written forms, namely an initial form joined at its tail to the following letter, a medial form joined to the adjacent letters on both sides, a final form joined at its head to the preceding letter, and an isolated form joined to neither neighbor, and the form used is determined by the letter's position in the word. (3) Uyghur punctuation marks such as the comma and the question mark face the opposite direction to their Chinese and English counterparts.
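The script properties listed above are visible in the Unicode character database, which Python exposes through the standard `unicodedata` module. The sketch below uses the Arabic letter TEH (U+062A) as an example: its four positional forms exist as separate compatibility code points, and NFKC normalization folds them back to the base letter. The helper names are illustrative. In practice, text is normally stored as base letters and the rendering engine chooses the positional form.

```python
import unicodedata

TEH = "\u062a"  # ARABIC LETTER TEH, base letter

# Compatibility code points for the four positional forms of TEH.
TEH_FORMS = {
    "isolated": "\ufe95",
    "final": "\ufe96",
    "initial": "\ufe97",
    "medial": "\ufe98",
}


def base_letter(ch: str) -> str:
    """Fold a positional presentation form back to its base letter."""
    return unicodedata.normalize("NFKC", ch)


def is_rtl(ch: str) -> bool:
    """True for characters with a strong right-to-left bidirectional class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


ARABIC_COMMA = "\u060c"          # mirrored counterpart of ","
ARABIC_QUESTION_MARK = "\u061f"  # mirrored counterpart of "?"
```

Checking `is_rtl` on the first strong character of a string is one simple way to decide whether a field should be displayed right to left.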
Many different character sets are in use in computing, and users who browse web pages in different languages often see garbled text because the character sets differ. Chinese website systems generally use the Simplified Chinese (GB2312) character set, which does not support Uyghur or Kazakh. A website that offers Uyghur, Kazakh, and Chinese versions should therefore choose a character set that supports all three languages. The Unicode character set assigns every character of every language a unified, unique code represented in two bytes (four bytes in some cases), which supports cross-language and cross-platform decoding and conversion. However, because the Unicode character set is not compatible with the ISO-8859-1 character set and takes more space (even English letters need two bytes), the UTF encodings were created. There are two kinds, UTF-8 and UTF-16. UTF-16 is consistent with Unicode's own encoding specification, whereas UTF-8 defines an "interval rule" that stays as compatible as possible with ISO-8859-1 while still being able to represent the characters of all languages. UTF-8 is therefore the best choice when designing and developing a website with Uyghur, Kazakh, and Chinese versions. With UTF-8 encoding, the static web pages of the multilingual website display normally without garbled text; when the client selects the Uyghur version mode, the system loads the Uyghur template theme.
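The difference between these encodings can be checked directly with Python's built-in codecs. The sketch below compares how many bytes one Latin, one Uyghur, and one Chinese character occupy in UTF-8 and UTF-16, and tests whether a given codec can represent a string at all. The helper names are illustrative.

```python
from typing import Dict


def byte_lengths(ch: str) -> Dict[str, int]:
    """Number of bytes a string occupies in UTF-8 and in UTF-16 (little endian, no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le")}


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


latin = byte_lengths("A")         # {'utf-8': 1, 'utf-16-le': 2}
uyghur = byte_lengths("\u062a")   # {'utf-8': 2, 'utf-16-le': 2}
chinese = byte_lengths("\u4e2d")  # {'utf-8': 3, 'utf-16-le': 2}
```

GB2312 can encode the Chinese sample but not the Arabic-script letter, while UTF-8 handles a string that mixes all three scripts.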
The static webpage display module displays the static web page of the corresponding template theme according to the language version mode selected by the client.
The website database module stores the Chinese data dynamically added to the multilingual website.
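Selecting a template theme from the client's language choice might look like the sketch below. The language codes, template paths, and the fallback to Chinese are all assumptions for illustration; the patent only states that the theme matching the selected language version is displayed. Uyghur and Kazakh are marked right to left because the description treats both as Arabic-script languages.

```python
from typing import Dict

# Hypothetical template paths, one theme per language version.
TEMPLATES = {
    "zh": "themes/zh/index.html",
    "ug": "themes/ug/index.html",
    "kk": "themes/kk/index.html",
}
RTL_LANGS = {"ug", "kk"}


def select_template(lang: str, default: str = "zh") -> Dict[str, str]:
    """Pick the template theme and text direction for the requested language."""
    code = lang if lang in TEMPLATES else default
    return {
        "lang": code,
        "path": TEMPLATES[code],
        "dir": "rtl" if code in RTL_LANGS else "ltr",
    }
```

The returned direction could be written into the `dir` attribute of the page's root element so that the browser lays out the page accordingly.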
The machine translation module wraps the machine translation interface as a web API and calls it to translate the Chinese data dynamically added to the website database module into multiple languages. Template development for the multilingual website in this embodiment covers both the development of static web pages and the calls to the machine translation interface made when dynamic data is loaded. Developing a multilingual e-commerce platform for Uyghur, Kazakh, and other languages requires more than static web pages: content added dynamically in bulk must also be translated into each language version. This embodiment wraps the machine translation interface and calls it to handle multilingual translation of bulk additions efficiently. After the encoding preprocessing of the static web pages, the dynamically added Chinese data is stored in the database module, and Uyghur or Kazakh fields corresponding to the Chinese data are added to the database module to hold the translated Uyghur or Kazakh text for loading when dynamic pages are rendered.
Specifically, the machine translation module comprises a translation unit, an error correction unit, and a storage unit.
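A thin wrapper around the translation interface could be sketched as follows. The class name `TranslationClient` and the backend signature `(text, target_language) -> str` are hypothetical, since the patent does not describe the actual interface. The wrapper remembers results so that a string repeated within a batch is sent to the backend only once.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class TranslationClient:
    """Wraps a translation backend and avoids repeated calls for the same input."""

    def __init__(self, backend: Callable[[str, str], str]) -> None:
        # backend is a hypothetical callable: (source text, target language) -> translation
        self._backend = backend
        self._memo: Dict[Tuple[str, str], str] = {}

    def translate(self, text: str, target: str) -> str:
        key = (text, target)
        if key not in self._memo:
            self._memo[key] = self._backend(text, target)
        return self._memo[key]

    def translate_batch(self, texts: Iterable[str], target: str) -> List[str]:
        """Translate texts in order, returning one result per input."""
        return [self.translate(text, target) for text in texts]
```

Exposing `translate_batch` behind an HTTP endpoint would be one way to realize the "web API" wrapping mentioned above; the choice of web framework is left open here.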
The translation unit extracts the Chinese data from the database module in batches and stores it in a document. A program reads the Chinese data in the document line by line and, for each line read, calls the machine translation interface to translate it automatically into Uyghur or Kazakh; the translated data is stored in a result document in Unicode encoding. The translation interface is first wrapped so that it accepts individual strings as input; the document is then read line by line, the translation interface is called for each line in turn, the returned list of results is stored in the result document, and the results are inserted, in order, into the corresponding fields of the database module. This process runs automatically on a regular schedule, and a uniform character encoding is enforced throughout all data transfer and all calls to the translation interface.
The error correction unit is used for manual correction of the automatically translated Uyghur or Kazakh data. Because the Uyghur and Kazakh lexicons are limited in size, not every Chinese phrase can be translated with complete accuracy. To improve translation accuracy, the present invention uses human review to correct the small number of errors introduced during translation, which greatly reduces translation errors and makes the displayed web pages more accurate.
The storage unit reads the manually corrected Uyghur or Kazakh data in the corresponding format and stores it in the corresponding fields of the database module. The whole workflow is run automatically at regular intervals, completing the two-way rendering of all static and dynamic content.
The webpage rendering module reads the translation data from the website database module and, according to that data, loads and renders the dynamic web page of the corresponding template theme. Because computer support for Uyghur is limited, rendering a Uyghur website raises the problem of converting input and output between Unicode and the encodings supported by the operating system, the browser, and the database. Almost all database drivers default to ISO-8859-1 when passing data between the program and the database. As a result, when the website platform stores Uyghur data in the database module, the driver converts the Unicode data to ISO-8859-1 for storage, and the Uyghur data read back from the database module during page rendering is garbled. To solve the garbling caused by the incompatible reading, writing, and storage of Uyghur, Kazakh, and Chinese, this embodiment of the present invention proposes an encoding conversion method for Uyghur and Kazakh: when the translation data is stored, each Uyghur or Kazakh character is converted to a four-digit hexadecimal string (for example, after conversion: "062A 0648 064A"); when the page is rendered, the Uyghur or Kazakh data read from the database module is converted back, so the garbling no longer occurs.
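The kind of garbling described above can be reproduced in a few lines. This sketch assumes the common failure mode in which text is sent as UTF-8 bytes but one layer interprets those bytes as ISO-8859-1; the patent does not spell out the exact byte-level mechanism, so this is an illustration rather than a description of its database driver. The function names are hypothetical.

```python
def simulate_latin1_misread(text: str) -> str:
    """Show what appears when UTF-8 bytes are decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_latin1_misread(garbled: str) -> str:
    """Undo the misread, valid only if no bytes were lost along the way."""
    return garbled.encode("iso-8859-1").decode("utf-8")


original = "\u062a"                            # one Arabic-script letter
garbled = simulate_latin1_misread(original)    # two unrelated Latin-1 characters
recovered = repair_latin1_misread(garbled)
```

A single Uyghur letter becomes two unrelated Latin-1 characters, which is why the stored text looks like noise when rendered.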
The data cache module caches the loaded web pages. Cache processing generally includes the following:
(1) Data cache: caching of database query results. On every page access, the system first checks whether the corresponding cached data exists; if not, it connects to the database, fetches the data, serializes the query result, and saves it to a file, so that the same query can later be answered directly from the cache table or file.
(2) Page cache: on every page access, the system first checks whether the corresponding cached page file exists; if not, it connects to the database, fetches the data, displays the page, and generates the cached page file at the same time, so that the page file is used on the next visit (template engines and many common PHP caching classes provide this function).
(3) Time-triggered cache: the system checks whether the cache file exists and has not yet expired. If the file's modification timestamp is greater than the current timestamp minus the expiry period, the cache is used; otherwise the cache is refreshed.
(4) Content-triggered cache: when data is inserted or updated, the cache is forcibly refreshed.
(5) Static cache: static generation, in which text files such as HTML or XML are produced directly and regenerated when the content is updated; suitable for pages that change rarely.
(6) Memory cache: Memcached is a high-performance, distributed in-memory object caching system used to reduce database load and speed up access in dynamic applications.
(7) PHP cache. (8) MySQL cache. (9) Web cache based on a reverse proxy. (10) DNS round-robin.
The cache processing in this embodiment of the present invention mainly comprises file caching and memory caching. The main purpose of caching is to reduce the load on the database and the PHP interpreter, to reduce the latency of fetching machine-translated data from the database during page rendering, to avoid the resource cost of calling the machine translation interface again on every load during real-time translation, and to reduce manual intervention. Query results are stored directly in the cache, so the database need not be queried repeatedly and the load on MySQL is reduced; on the PHP side, the result of, for example, a complex recursive computation can be cached so that CPU time is not spent recomputing it each time. Uyghur text handled during caching still follows the Unicode encoding convention described above.
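An in-memory cache with content-triggered refresh, as in item (4), might be sketched like this. A plain dictionary stands in for a memory cache such as Memcached, and the class name, the `load` callback, and the `on_update` hook are all hypothetical.

```python
from typing import Any, Callable, Dict, Hashable


class TranslationCache:
    """In-memory cache that is invalidated when the underlying data changes."""

    def __init__(self, load: Callable[[Hashable], Any]) -> None:
        # load is a hypothetical callback that fetches a value from the database.
        self._load = load
        self._store: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Any:
        """Return the cached value, loading it from the database on a miss."""
        if key not in self._store:
            self._store[key] = self._load(key)
        return self._store[key]

    def on_update(self, key: Hashable) -> None:
        """Call after an insert or update so the next get() reloads fresh data."""
        self._store.pop(key, None)
```

Calling `on_update` from the code path that writes corrected translations keeps rendered pages from showing stale text after a reviewer's fix.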
The multilingual website development method and system of the embodiments of the present invention combine template development for static web pages with machine translation interface calls for dynamic data, greatly reducing the cost and time of manual translation. Machine translation followed by manual correction greatly reduces translation errors and makes the displayed web pages more accurate. Choosing the UTF-8 encoding of Unicode avoids garbled text during page rendering. The caching mechanism for dynamically loaded content removes the resource consumption and loading delay caused by calling the machine translation interface again on every load during real-time translation, and also reduces manual intervention.
The above description of the disclosed embodiments enables any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610958116.5A CN106372065B (en) | 2016-10-27 | 2016-10-27 | A method and system for developing a multilingual website |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610958116.5A CN106372065B (en) | 2016-10-27 | 2016-10-27 | A method and system for developing a multilingual website |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106372065A (en) | 2017-02-01 |
| CN106372065B (en) | 2020-07-21 |
Family
ID=57893794
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610958116.5A Expired - Fee Related CN106372065B (en) | 2016-10-27 | 2016-10-27 | A method and system for developing a multilingual website |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106372065B (en) |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108021423B (en) * | 2017-12-15 | 2021-05-04 | 语联网(武汉)信息技术有限公司 | Method, system and computer-readable storage medium for generating a multilingual website |
| CN108280219B (en) * | 2018-02-07 | 2021-06-22 | 深圳壹账通智能科技有限公司 | Text translation method, apparatus, computer equipment and storage medium |
| CN108563645B (en) * | 2018-04-24 | 2022-03-22 | 成都智信电子技术有限公司 | Metadata translation method and device of HIS (hardware-in-the-system) |
| CN108664247B (en) * | 2018-04-26 | 2022-02-01 | 微梦创科网络科技(中国)有限公司 | Page template data interaction method and device |
| CN109088995B (en) * | 2018-10-17 | 2020-11-13 | 永德利硅橡胶科技(深圳)有限公司 | Method and mobile phone for supporting global language translation |
| CN109828775B (en) * | 2018-12-06 | 2021-12-07 | 中国电子进出口有限公司 | WEB management system and method for multilingual translation text content |
| CN109684096A (en) * | 2018-12-29 | 2019-04-26 | 北京超图软件股份有限公司 | A kind of software program recycling processing method and device |
| CN109783579B (en) * | 2019-01-22 | 2020-06-02 | 南京焦点领动云计算技术有限公司 | Method for quickly copying and translating website |
| CN114756795A (en) * | 2022-04-07 | 2022-07-15 | 平安资产管理有限责任公司 | Webpage translation method and device, computer equipment and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101957815A (en) * | 2009-07-13 | 2011-01-26 | 白劲实 | Automatic translation method and system based on correct translation result and corresponding relation |
| CN102193914A (en) * | 2011-05-26 | 2011-09-21 | 中国科学院计算技术研究所 | Computer aided translation method and system |
| CN102929865A (en) * | 2012-10-12 | 2013-02-13 | 广西大学 | PDA (Personal Digital Assistant) translation system for inter-translating Chinese and languages of ASEAN (the Association of Southeast Asian Nations) countries |
| CN103823796A (en) * | 2014-02-25 | 2014-05-28 | 武汉传神信息技术有限公司 | System and method for translation |
| CN104375808A (en) * | 2013-07-11 | 2015-02-25 | 携程计算机技术(上海)有限公司 | Method and device for displaying interfaces |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000330992A (en) * | 1999-05-17 | 2000-11-30 | Nec Software Shikoku Ltd | Multilinguistic www server system and its processing method |
| US7016977B1 (en) * | 1999-11-05 | 2006-03-21 | International Business Machines Corporation | Method and system for multilingual web server |
| CN102567384B (en) * | 2010-12-29 | 2017-02-01 | 上海掌门科技有限公司 | Webpage multi-language dynamic switching method and system based on webpage browser engine |
| CN102508878A (en) * | 2011-10-18 | 2012-06-20 | 深圳市共进电子股份有限公司 | Method for generating standard foreign language page by means of machine translation system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN106372065A (en) | 2017-02-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200721 |