CN104408047A - Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server - Google Patents
- Publication number
- CN104408047A (application CN201410584207.8A)
- Authority
- CN
- China
- Prior art keywords
- node
- file
- hdfs
- uploaded
- uploading
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention provides a method for uploading text files from an NFS file server to HDFS in parallel from multiple machines. N hosts are selected from the HDFS cluster; any one of them serves as the master node and the other N-1 serve as slave nodes. The master node obtains the files in the directory to be uploaded on the NFS file server. Each file is uploaded in parallel: all machines in the cluster take part, and each host uploads one contiguous data block equal to 1/N of the file's size, which increases upload speed.
Description
Technical field
The invention relates to the technical field of big data storage, and specifically to a method for uploading text files from an NFS file server to HDFS in parallel from multiple machines.
Background
With the development of computer networks, the era of massive data has arrived. IDC predicts that global data volume will grow 44-fold by 2020, reaching 35.2 ZB.
Traditional technologies, including traditional relational databases, cannot cope with storing, analyzing, managing, and mining data sets of this size, so analyzing and understanding such data quickly and well is a pressing task. Among the technologies and tools available today, the most mature and successful big data solution is the Hadoop storage and computing framework together with the components built on it. Large numbers of text files are generated every day, and how to upload them to HDFS quickly for subsequent processing is a current problem. To solve it, the invention proposes a method for uploading text files from an NFS file server to HDFS in parallel from multiple machines.
By default HDFS keeps three replicas. When a user writes data to HDFS through a client and a DataNode runs on that client machine, the NameNode gives priority to storing one replica of the data on the client's local DataNode and stores the other two replicas on other DataNodes in the cluster. As a result, if only one client is writing, only three DataNodes in the whole cluster are working while the others sit idle, and the performance of the cluster as a whole goes unused.
Summary of the invention
The object of the invention is to provide a method for uploading text files from an NFS file server to HDFS in parallel from multiple machines.
The object is achieved as follows. N hosts are selected from the HDFS cluster; any one of them serves as the master node and the other N-1 serve as slave nodes. The master node obtains the files in the directory to be uploaded on the NFS file server. Each file is uploaded in parallel: all machines in the cluster take part, and each host uploads one contiguous data block equal to 1/N of the file's size, which increases upload speed. The specific steps are:
1) The MainPut program on the master node calculates, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program, BlockPut, is installed on every node; it uploads the data block that its node is responsible for. MainPut then sends each slave node a command to start BlockPut.
2) The BlockPut program on each node uploads that node's data block to HDFS. BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the stream at the start byte offset, creates a separate file on HDFS, and writes the bytes between the start and end offsets into that file.
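As a rough illustration of step 1, the sketch below shows one way the range calculation performed by MainPut could look. The function name `split_ranges`, the half-open `[start, end)` convention, and the integer rounding of block boundaries are assumptions made for this example; the source only states that each node uploads a contiguous block of 1/N of the file. The ranges are plain byte ranges and do not take line boundaries into account.

```python
from typing import List, Tuple


def split_ranges(file_size: int, n: int) -> List[Tuple[int, int]]:
    """Return n contiguous half-open byte ranges [start, end) covering the file.

    Hypothetical helper: block i spans floor(size*i/n) .. floor(size*(i+1)/n),
    so block sizes differ by at most one byte and no byte is skipped or repeated.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    bounds = [file_size * i // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]


if __name__ == "__main__":
    # A 10-byte file split across 3 nodes.
    for node, (start, end) in enumerate(split_ranges(10, 3)):
        print(f"node {node}: bytes [{start}, {end})")
```

Each node would then receive its own `(start, end)` pair as the arguments to BlockPut.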
The directory to be uploaded is mounted at the same default directory on all N nodes.
N is no greater than the number of clients at which the NFS file server reaches its maximum bandwidth under parallel reads.
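One way to read the bound on N is sketched below, assuming the aggregate read throughput of the NFS server has already been measured for several client counts. The function name `max_useful_clients`, the tolerance parameter, and the throughput figures are all hypothetical and are not taken from the source.

```python
from typing import Dict


def max_useful_clients(throughput_by_clients: Dict[int, float],
                       tolerance: float = 0.02) -> int:
    """Return the smallest client count whose aggregate read throughput is
    within `tolerance` (a fraction) of the best throughput observed."""
    if not throughput_by_clients:
        raise ValueError("at least one measurement is required")
    best = max(throughput_by_clients.values())
    threshold = best * (1.0 - tolerance)
    return min(clients for clients, rate in throughput_by_clients.items()
               if rate >= threshold)


if __name__ == "__main__":
    # Hypothetical measurements: number of clients -> aggregate MB/s.
    measured = {1: 110.0, 2: 215.0, 4: 400.0, 8: 410.0, 16: 405.0}
    print(max_useful_clients(measured))
```

Under this reading, choosing N above the returned value would add clients without adding read bandwidth.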
The beneficial effects of the invention are as follows. N nodes in the cluster are selected as clients, and a file is divided into N data blocks that are uploaded simultaneously, one block per client. Each block is stored on HDFS as a separate file. Uploading a text file in blocks in parallel makes the fullest use of the cluster's performance and improves upload efficiency.
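The overall effect can be simulated in a single process, as in the minimal sketch below. Threads stand in for the N client nodes and a dictionary stands in for the HDFS target directory; the names `upload_in_parallel` and `part-00000` are hypothetical, and no real HDFS or NFS access takes place.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def upload_in_parallel(data: bytes, n: int) -> Dict[str, bytes]:
    """Split `data` into n contiguous blocks and store each block under its
    own name, one worker per block, mimicking one separate file per client."""
    if n <= 0:
        raise ValueError("n must be positive")
    size = len(data)
    bounds = [size * i // n for i in range(n + 1)]

    def put_block(i: int) -> Tuple[str, bytes]:
        # Each worker handles exactly one contiguous byte range.
        return f"part-{i:05d}", data[bounds[i]:bounds[i + 1]]

    with ThreadPoolExecutor(max_workers=n) as pool:
        return dict(pool.map(put_block, range(n)))


if __name__ == "__main__":
    store = upload_in_parallel(b"abcdefghij", 3)
    for name in sorted(store):
        print(name, store[name])
```

Because the blocks are contiguous and the names sort in block order, concatenating the stored parts in name order gives back the original bytes.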
Description of the drawings
Figure 1 is a framework diagram of the multi-machine parallel upload process.
Detailed description
The method of the invention is described in detail below with reference to the accompanying drawing.
N hosts are selected from the HDFS cluster; any one of them serves as the master node and the other N-1 serve as slave nodes. The master node obtains the files in the directory to be uploaded on the NFS file server. Each file is uploaded in parallel: all machines in the cluster take part, and each host uploads one contiguous data block equal to 1/N of the file's size, which increases upload speed. The overall flow of the method is:
1) The MainPut program on the master node calculates, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program, BlockPut, is installed on every node; it uploads the data block that its node is responsible for. MainPut then sends each slave node a command to start BlockPut.
2) The BlockPut program on each node uploads that node's data block to HDFS. BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the stream at the start byte offset, creates a separate file on HDFS, and writes the bytes between the start and end offsets into that file.
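Step 2 can be pictured with the following minimal sketch. It uses an ordinary local file as a stand-in for the separate HDFS output file, so no Hadoop API is shown. The function `block_put`, its parameters, the buffer size, and the `.part1` file name are hypothetical and chosen only for this example.

```python
import os
import tempfile


def block_put(src_path: str, start: int, end: int, dst_path: str,
              buf_size: int = 1 << 20) -> int:
    """Copy bytes [start, end) of src_path into a new file at dst_path.

    Returns the number of bytes written. The destination is a local file
    here; in the described method it would be a separate file on HDFS.
    """
    remaining = max(0, end - start)
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(start)  # position the input stream at the start offset
        while remaining > 0:
            chunk = src.read(min(buf_size, remaining))
            if not chunk:  # source ended before `end` was reached
                break
            dst.write(chunk)
            remaining -= len(chunk)
            written += len(chunk)
    return written


def demo():
    """Copy bytes 4..6 of a 10-byte file into its own part file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        with open(src, "wb") as f:
            f.write(b"0123456789")
        dst = os.path.join(tmp, "input.txt.part1")
        count = block_put(src, 4, 7, dst)
        with open(dst, "rb") as f:
            return count, f.read()


if __name__ == "__main__":
    print(demo())
```

Reading in bounded chunks keeps memory use independent of the block size, which matters when 1/N of a large file is still large.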
Apart from the technical features described in the specification, everything else is known to those skilled in the art.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410584207.8A CN104408047A (en) | 2014-10-28 | 2014-10-28 | Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN104408047A (en) | 2015-03-11 |
Family
ID=52645679
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410584207.8A Pending CN104408047A (en) | 2014-10-28 | 2014-10-28 | Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104408047A (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105357280A (en) * | 2015-10-19 | 2016-02-24 | 福建新大陆软件工程有限公司 | Hadoop distributed file system (HDFS) based file tracing file transfer protocol (FTP) system |
| CN105357317A (en) * | 2015-12-07 | 2016-02-24 | 金蝶软件(中国)有限公司 | Data uploading method and system based on multi-client polling queuing |
| CN105610899A (en) * | 2015-12-10 | 2016-05-25 | 浪潮(北京)电子信息产业有限公司 | Text file parallel uploading method and device |
| CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
| CN107800691A (en) * | 2017-10-12 | 2018-03-13 | 云巅(上海)网络科技有限公司 | The system and method for building application program on demand and accessing data trnascription is realized based on distributed storage mechanism |
| CN108280214A (en) * | 2017-02-02 | 2018-07-13 | 马志强 | Rapid I/O system applied to distributed genome analysis |
| CN109325002A (en) * | 2018-09-03 | 2019-02-12 | 北京京东金融科技控股有限公司 | Text file processing method, device, system, electronic equipment, storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030110237A1 (en) * | 2001-12-06 | 2003-06-12 | Hitachi, Ltd. | Methods of migrating data between storage apparatuses |
| CN101227460A (en) * | 2007-01-19 | 2008-07-23 | 秦晨 | Method for uploading and downloading distributed document and apparatus and system thereof |
| CN103530388A (en) * | 2013-10-22 | 2014-01-22 | 浪潮电子信息产业股份有限公司 | Performance improving data processing method in cloud storage system |
| CN103544285A (en) * | 2013-10-28 | 2014-01-29 | 华为技术有限公司 | Data loading method and device |
| CN103970881A (en) * | 2014-05-16 | 2014-08-06 | 浪潮(北京)电子信息产业有限公司 | Method and system for achieving file uploading |
| CN103971066A (en) * | 2014-05-20 | 2014-08-06 | 浪潮电子信息产业股份有限公司 | Verification method for integrity of big data migration in HDFS |
-
2014
- 2014-10-28 CN CN201410584207.8A patent/CN104408047A/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| 杨锋 等: "基于Hadoop 的海量农业数据资源管理平台", 《计算机工程》 * |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105357280A (en) * | 2015-10-19 | 2016-02-24 | 福建新大陆软件工程有限公司 | Hadoop distributed file system (HDFS) based file tracing file transfer protocol (FTP) system |
| CN105357280B (en) * | 2015-10-19 | 2019-02-19 | 福建新大陆软件工程有限公司 | A kind of file based on HDFS is traced to the source FTP system |
| CN105357317A (en) * | 2015-12-07 | 2016-02-24 | 金蝶软件(中国)有限公司 | Data uploading method and system based on multi-client polling queuing |
| CN105357317B (en) * | 2015-12-07 | 2019-06-07 | 金蝶软件(中国)有限公司 | A kind of data uploading method and system based on multi-client repeating query queuing |
| CN105610899A (en) * | 2015-12-10 | 2016-05-25 | 浪潮(北京)电子信息产业有限公司 | Text file parallel uploading method and device |
| CN105610899B (en) * | 2015-12-10 | 2019-09-24 | 浪潮(北京)电子信息产业有限公司 | A kind of parallel method for uploading of text file and device |
| CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
| CN108280214A (en) * | 2017-02-02 | 2018-07-13 | 马志强 | Rapid I/O system applied to distributed genome analysis |
| CN107800691A (en) * | 2017-10-12 | 2018-03-13 | 云巅(上海)网络科技有限公司 | The system and method for building application program on demand and accessing data trnascription is realized based on distributed storage mechanism |
| CN109325002A (en) * | 2018-09-03 | 2019-02-12 | 北京京东金融科技控股有限公司 | Text file processing method, device, system, electronic equipment, storage medium |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN104408047A (en) | Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server | |
| Li et al. | Communication efficient distributed machine learning with the parameter server | |
| CN105144121B (en) | Cache content-addressable blocks for storage virtualization | |
| US10069909B1 (en) | Dynamic parallel save streams for block level backups | |
| CN105610899B (en) | A kind of parallel method for uploading of text file and device | |
| CN105404652A (en) | Mass small file processing method based on HDFS | |
| WO2016202123A1 (en) | File pushing method, apparatus, and system | |
| CN106657248A (en) | Docker container based network load balancing system and establishment method and operating method thereof | |
| WO2017028690A1 (en) | File processing method and system based on etl | |
| CN104915407A (en) | Resource scheduling method under Hadoop-based multi-job environment | |
| CN109726004B (en) | Data processing method and device | |
| CN105320773A (en) | Distributed duplicated data deleting system and method based on Hadoop platform | |
| WO2017101591A1 (en) | Method for constructing knowledge base, and controller | |
| CN103761146A (en) | Method for dynamically setting quantities of slots for MapReduce | |
| CN102780769A (en) | Cloud computing platform-based disaster recovery storage method | |
| CN103577245B (en) | Lightweight class virtual machine migration method | |
| CN104156381A (en) | Copy access method and device for Hadoop distributed file system and Hadoop distributed file system | |
| CN106101710A (en) | A kind of distributed video transcoding method and device | |
| CN104125165A (en) | Job scheduling system and method based on heterogeneous cluster | |
| CN103051673B (en) | A kind of construction method of cloud storage platform based on Xen and Hadoop | |
| Song et al. | Distributed video transcoding based on MapReduce | |
| CN103124295A (en) | Large attachment uploading and managing method based on cloud computing | |
| US10083121B2 (en) | Storage system and storage method | |
| US11558455B2 (en) | Capturing data in data transfer appliance for transfer to a cloud-computing platform | |
| US11481168B2 (en) | Data streams of production intents |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150311 |