CN104408047A

CN104408047A - Method for uploading text files to HDFS (Hadoop Distributed File System) in parallel from multiple machines based on an NFS (Network File System) file server

Info

Publication number: CN104408047A
Application number: CN201410584207.8A
Authority: CN
Inventors: 房体盈; 辛国茂
Original assignee: Langchao Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2015-03-11

Abstract

The present invention provides a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server. N hosts in the HDFS cluster are selected; any one node is chosen as the master node and the other N-1 nodes serve as slave nodes. On the master node, the files in the directory to be uploaded on the NFS file server are obtained. Each file is uploaded in parallel, that is, all machines in the cluster participate in the upload, and each host in the cluster is responsible for uploading one contiguous data block of 1/N of the size of each file. This achieves parallel upload and thereby increases upload speed.

Description

A method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server

Technical field

The present invention relates to the technical field of big data storage, and specifically to a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

Background art

With the development of computer networks, the era of massive data has arrived. The Internet Data Center predicts that global data usage will grow 44-fold by 2020, reaching 35.2 ZB.

Traditional technologies, including traditional relational databases, are not up to the task of storing, analyzing, managing and mining data sets of this size. How to analyze and understand this data as quickly and as well as possible is an urgent problem. Among the technologies and tools available today, the most mature and most successful big data solution is the Hadoop file storage and computing framework together with the related components built on top of it. For the large number of text files generated every day, how to upload them quickly to HDFS for subsequent processing is a problem currently faced. To solve the problem of fast upload of text files, the present invention proposes a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

HDFS uses a three-replica mechanism by default. When a user writes data to HDFS through a client, and a DataNode runs on that client machine, the NameNode gives first priority to storing one replica of the data being written on the client's own DataNode; the other two replicas are stored on other DataNodes in the cluster. As a result, if only one client in the whole cluster is writing, only 3 DataNodes in the cluster are working. The other DataNodes are idle, and the performance of the whole cluster cannot be brought into play.

Summary of the invention

The purpose of the present invention is to provide a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

The purpose of the present invention is achieved in the following manner. N hosts in the HDFS cluster are selected; any one node is then chosen as the master node and the other N-1 nodes serve as slave nodes. On the master node, the files under the directory to be uploaded on the NFS file server are obtained. For each file, a parallel upload method is adopted, that is, all machines in the cluster participate in the upload, and each host in the cluster is responsible for uploading one contiguous data block of 1/N of the size of each file. This achieves parallel upload and thereby increases upload speed. The specific steps are as follows:

1) The MainPut program on the master node calculates, for each of the N nodes, the start and end byte positions of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program BlockPut is installed on every node; it uploads the data block that its node is responsible for. MainPut then sends each slave node a command to start the BlockPut program;

2) The BlockPut program on each node is responsible for uploading its data block to HDFS. BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the InputStream at the start byte, then creates an independent file on HDFS and writes the bytes between the start and end positions into that independent HDFS file.
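The range calculation performed by MainPut in step 1 can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function name `split_ranges` is hypothetical, the end offset is treated as exclusive, and the last block is assumed to absorb any remainder when the file size is not divisible by N (the source does not say how the remainder is handled).

```python
from typing import List, Tuple


def split_ranges(file_size: int, n: int) -> List[Tuple[int, int]]:
    """Return n contiguous (start, end) byte ranges covering [0, file_size).

    `end` is exclusive. Each block gets file_size // n bytes; the last
    block also takes the remainder (an assumption, not from the source).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    base = file_size // n
    ranges = []
    for i in range(n):
        start = i * base
        # The final block runs to the end of the file.
        end = file_size if i == n - 1 else start + base
        ranges.append((start, end))
    return ranges
```

For example, a 10-byte file split across 3 nodes yields the ranges (0, 3), (3, 6) and (6, 10), so the blocks are contiguous and together cover the whole file.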

The directory to be uploaded is mounted at the same default directory on all N nodes.

N is not greater than the number of clients at which the NFS file server reaches its maximum bandwidth under parallel reads.
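One way to apply this bound on N is sketched below, assuming the aggregate read bandwidth of the NFS server has already been measured for several client counts. The source does not describe how the saturation point is determined; the function names, the 5% tolerance and the sample figures in the usage note are all hypothetical.

```python
from typing import Dict


def saturation_clients(bandwidth_by_clients: Dict[int, float],
                       tolerance: float = 0.05) -> int:
    """Smallest client count whose aggregate read bandwidth is within
    `tolerance` of the best bandwidth observed in the measurements."""
    if not bandwidth_by_clients:
        raise ValueError("at least one measurement is required")
    peak = max(bandwidth_by_clients.values())
    for clients in sorted(bandwidth_by_clients):
        if bandwidth_by_clients[clients] >= peak * (1.0 - tolerance):
            return clients
    raise AssertionError("unreachable: the peak itself satisfies the test")


def choose_n(cluster_hosts: int,
             bandwidth_by_clients: Dict[int, float],
             tolerance: float = 0.05) -> int:
    """Pick N: no more than the hosts available and no more than the
    client count at which the NFS server is already saturated."""
    limit = saturation_clients(bandwidth_by_clients, tolerance)
    return max(1, min(cluster_hosts, limit))
```

With made-up measurements of 100, 195, 380, 400 and 402 MB/s for 1, 2, 4, 8 and 16 clients, the server is treated as saturated at 8 clients, so a 20-host cluster would use N = 8 and a 5-host cluster would use N = 5.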

The beneficial effects of the present invention are as follows. The invention selects N nodes in the cluster as clients and divides a file into N data blocks that are uploaded at the same time. Each client is responsible for one block, and each block is stored on HDFS as an independent file, which makes the fullest possible use of the performance of the whole cluster. Uploading a text file in blocks and in parallel brings the performance of the cluster into full play and improves upload efficiency.
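The per-node work described above (seek to the block start, copy the byte range, store it as an independent file) can be sketched as follows. This is an illustration only: it writes to the local file system as a stand-in for HDFS, and the name `block_put`, the `.partN` naming scheme and the buffer size are assumptions that do not come from the source.

```python
import os
import shutil
import tempfile

CHUNK = 64 * 1024  # copy buffer size in bytes (arbitrary choice)


def block_put(src_path, start, end, dst_path):
    """Copy bytes [start, end) of src_path into a new, independent file.

    A local file stands in for the HDFS target here.
    Returns the number of bytes written.
    """
    remaining = end - start
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(start)  # position the input stream at the block start
        while remaining > 0:
            data = src.read(min(CHUNK, remaining))
            if not data:  # source ended earlier than expected
                break
            dst.write(data)
            remaining -= len(data)
    return (end - start) - remaining


def roundtrip_demo():
    """Split a small file into 3 blocks and check that the parts rejoin."""
    payload = b"line one\nline two\nline three\n"
    workdir = tempfile.mkdtemp()
    try:
        src = os.path.join(workdir, "input.txt")
        with open(src, "wb") as f:
            f.write(payload)
        size, n = len(payload), 3
        base = size // n
        joined = b""
        for i in range(n):
            start = i * base
            end = size if i == n - 1 else start + base
            part = os.path.join(workdir, "input.txt.part%d" % i)
            block_put(src, start, end, part)
            with open(part, "rb") as f:
                joined += f.read()
        return joined == payload
    finally:
        shutil.rmtree(workdir)
```

In a real deployment each call to `block_put` would run on a different node against the shared NFS mount, and the output would be created through an HDFS client rather than the local `open`.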

Description of the drawings

Figure 1 is a framework diagram of the multi-machine parallel upload process.

Detailed description of the embodiments

The method of the present invention is described in detail below with reference to the accompanying drawings.

N hosts in the HDFS cluster are selected; any one node is then chosen as the master node and the other N-1 nodes serve as slave nodes. On the master node, the files under the directory to be uploaded on the NFS file server are obtained. For each file, a parallel upload method is adopted, that is, all machines in the cluster participate in the upload, and each host in the cluster is responsible for uploading one contiguous data block of 1/N of the size of each file, which achieves parallel upload and thereby increases upload speed. The overall flow of the method of the present invention for uploading text files to HDFS in parallel from multiple machines based on an NFS file server is as follows:

1) The MainPut program on the master node calculates, for each of the N nodes, the start and end byte positions of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program BlockPut is installed on every node; it uploads the data block that its node is responsible for. MainPut then sends each slave node a command to start the BlockPut program;
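The master-side dispatch in this step can be sketched as follows. The source does not specify how the start command reaches each node, so the sketch takes a caller-supplied `launch` function and runs it for all nodes concurrently. The names `build_tasks` and `dispatch`, the task fields, the `.partN` target names and the sample mount path are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def build_tasks(hosts: List[str], file_path: str,
                file_size: int) -> List[Dict[str, Any]]:
    """Assign each host one contiguous byte range of the file.

    `end` is exclusive; the last host takes any remainder (assumption).
    """
    if not hosts:
        raise ValueError("at least one host is required")
    n = len(hosts)
    base = file_size // n
    tasks = []
    for i, host in enumerate(hosts):
        start = i * base
        end = file_size if i == n - 1 else start + base
        tasks.append({
            "host": host,
            "path": file_path,   # same path on every node (shared mount)
            "start": start,
            "end": end,
            "target": "%s.part%d" % (file_path, i),  # hypothetical naming
        })
    return tasks


def dispatch(tasks: List[Dict[str, Any]],
             launch: Callable[[Dict[str, Any]], Any]) -> List[Any]:
    """Call launch(task) for every task concurrently.

    Results are returned in task order. `launch` stands in for whatever
    mechanism starts BlockPut on a node.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(launch, tasks))
```

A call such as `dispatch(build_tasks(["node1", "node2", "node3"], "/mnt/nfs/a.txt", 10), launch)` hands each node one of the ranges (0, 3), (3, 6) and (6, 10); using the same path for every task relies on the directory being mounted at the same location on all nodes.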

Apart from the technical features described in the specification, everything is technology known to those skilled in the art.

Claims

1. A method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server, characterized in that N hosts in the HDFS cluster are selected, any one node is then selected as the master node, and the other N-1 nodes serve as slave nodes; on the master node, the files under the directory to be uploaded on the NFS file server are obtained; for each file, a parallel upload method is adopted, that is, all machines in the cluster participate in the upload, and each host in the cluster is responsible for uploading one contiguous data block of 1/N of the size of each file, achieving parallel upload and thereby increasing upload speed; the specific steps are:

1) the MainPut program on the master node calculates, for each of the N nodes, the start and end byte positions of the data block to be uploaded, and starts the BlockPut program on the N nodes to upload in parallel; on the first run, an executable program BlockPut is installed on every node for uploading the data block that the node is responsible for, and a command is then sent to each slave node to start the BlockPut program;

2) the BlockPut program on each node is responsible for uploading the data block to be uploaded to HDFS; BlockPut opens an input stream InputStream on the file to be uploaded, positions the InputStream at the start byte, then creates an independent file on HDFS and writes the bytes between the start and end positions into the independent HDFS file.

2. The method according to claim 1, characterized in that the directory to be uploaded is mounted at the same default directory on the N nodes.

3. The method according to claim 1, characterized in that N is not greater than the number of clients at which the NFS file server reaches its maximum bandwidth under parallel reads.