CN109800208B

CN109800208B - Network traceability system and its data processing method, computer storage medium

Info

Publication number: CN109800208B
Application number: CN201910046934.1A
Authority: CN
Inventors: 张武斌; 彭闯; 袁敏洵; 袁小坊
Original assignee: Hunan Tomomichi Information Technology Co Ltd
Current assignee: Hunan Tomomichi Information Technology Co Ltd
Priority date: 2019-01-18
Filing date: 2019-01-18
Publication date: 2019-09-27
Anticipated expiration: 2039-01-18
Also published as: CN109800208A

Abstract

The present invention relates to the technical field of network traceability and discloses a network traceability system, a data processing method thereof, and a computer storage medium, so as to improve the resource utilization of HDFS and further improve data access efficiency. The method comprises: dividing the network traceability system into a client layer, a preprocessing layer and a storage layer, the preprocessing layer being arranged between the client layer and the storage layer; during storage of data files, the preprocessing layer merges at least two small data files into one large file and generates the corresponding SMI index and MDI index; correspondingly, the storage layer transfers the location metadata of the merged file blocks from NameNode memory to the memory of each corresponding distributed DataNode; when data is read, the target content is located quickly on the basis of the SMI index, the MDI index and the transferred location metadata of the merged file blocks.

Description

Network traceability system and its data processing method, computer storage medium

Technical field

The present invention relates to the technical field of network traceability, and in particular to a network traceability system, a data processing method thereof, and a computer storage medium.

Background art

As the Internet continues to grow in scale and its applications deepen, network security incidents are becoming more concealed, complex and diverse, so monitoring and analyzing massive volumes of real-time network data has become an increasingly important application.

In general, a network traceability system needs long-term mass data storage capability: it must save raw data packets in real time over long periods while also saving statistical data such as data flows, sessions and application logs. It needs fast data retrieval capability and must support backtracking analysis of past network behavior, application data and host data. It should allow data from any time period to be viewed and retrieved by category at any time and, when a problem is found, provide backtracking analysis over a certain time range (depending on the device's storage), giving a fuller basis for quickly locating the cause of the problem and a strong data-analysis guarantee for network security.

A network traceability system offers advantages such as mass data storage, convenient interface presentation, fine-grained fault location, comprehensive data backtracking and real-time performance monitoring. To guarantee the decoding efficiency of the system's data files, each data file is designed to be 40-50MB in size. Because the data volume is very large, HDFS can be used to store the data. However, HDFS was originally designed to handle large files (typically larger than one HDFS block of 120MB). Moreover, to bring the data transfer rate close to the disk transfer rate, HDFS keeps the seek time relatively small by using a large block size, so that the time spent reading or writing a block is much greater than the seek time and throughput approaches the disk transfer rate. Consequently, deploying a network traceability system directly on HDFS is inefficient, for the following reasons:

Metadata management in HDFS is time-consuming. For small-file I/O, most of the time is spent managing metadata and very little on data transfer, so a large number of small files increases the overhead of metadata operations in HDFS. In addition, file metadata is kept on the NameNode, block information is kept on the DataNodes, and all of this information is loaded into physical memory. As a result, memory usage grows rapidly as the number of small files increases.

Existing techniques for optimizing the HDFS small-file problem include:

the patent entitled "A storage optimization method for hierarchical indexing of small files based on Hadoop", disclosed in CN105183839A; and

the patent entitled "A method for reading and writing small files based on HDFS", disclosed in CN106909651A.

Both of the above patents disclose mechanisms for merging small files and building indexes on the files. When a file is read, however, the client must still interact with the NameNode, query the distributed file metadata cached on the NameNode to obtain the actual location of the file, and then interact with a DataNode to obtain the file data. A small-file processing server with a two-level index and a prefetching mechanism is also added to the interaction, which raises cost and complicates the file-reading interaction. For example, when the files queried are always different, the contents of memory must be replaced frequently, which lowers system efficiency. Data reading performance therefore needs further improvement.

Summary of the invention

The purpose of the present invention is to disclose a network traceability system, a data processing method thereof, and a computer storage medium, so as to improve the resource utilization of HDFS and further improve data access efficiency.

To achieve the above purpose, the present invention discloses a data processing method for a network traceability system that stores data on HDFS, the method comprising:

dividing the network traceability system into a client layer, a preprocessing layer and a storage layer, wherein the client layer is used to generate the data files captured at the bottom layer of the traceability system, and the storage layer comprises an HDFS-based NameNode and at least two HDFS-based DataNodes;

arranging the preprocessing layer between the client layer and the storage layer, the preprocessing layer performing the following steps during storage of data files:

Step S1: merging at least two small data files captured by the client layer into one large file and generating a corresponding SMI index, the SMI index characterizing the relationship between the name of the merged large file and the name, size and offset of each merged small file; and

Step S2: after the merged large file has been uploaded successfully, generating an MDI index from the location metadata of the merged file blocks, the MDI index characterizing the correspondence between the large file name and the DataNode storing the large file;

Step S3: caching the SMI index and the MDI index on the NameNode, and caching the SMI index on the corresponding DataNode;

correspondingly, during storage of data files, the storage layer also performs the following step in cooperation:

Step S10: transferring the location metadata of the merged file blocks from NameNode memory to the memory of each corresponding distributed DataNode;

when data is read, the method further comprises:

Step S100: the client layer sends a first file read request to the NameNode, the first file read request carrying the name of the target small file;

Step S200: according to the first file read request and the cached SMI index and MDI index, the NameNode returns to the client layer the address information of the DataNode corresponding to the target small file;

Step S300: the client layer sends a second file read request according to the DataNode address information, the second file read request carrying the name of the target small file;

Step S400: according to the second file read request, the cached SMI index and the location metadata of the merged file blocks, the DataNode looks up the content of the target small file and returns it to the client layer.

Correspondingly, the present invention also discloses a network traceability system that stores data on HDFS, comprising:

a client layer, used to generate the data files captured at the bottom layer of the traceability system;

a storage layer, comprising an HDFS-based NameNode (master node) and at least two HDFS-based DataNodes (slave nodes); and

a preprocessing layer, located between the client layer and the storage layer and used to perform steps S1 to S3 above during storage of data files.

Correspondingly, during storage of data files, the storage layer is also used to perform step S10 above in cooperation.

When data is read, the network traceability system is also used to perform steps S100 to S400 above.

To achieve the above purpose, the present invention also discloses a network traceability system that stores data on HDFS, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the above method when executing the computer program.

To achieve the above purpose, the present invention also discloses a computer storage medium on which a computer program is stored, characterized in that the steps of the above method are implemented when the program is executed by a processor.

The present invention has the following beneficial effects:

On the one hand, all file indexes are loaded into memory, so no time is wasted on prefetching. On the other hand, the present invention also improves the storage structure of HDFS: the location metadata of the merged file blocks is transferred from NameNode memory to the memory of each corresponding distributed DataNode, which reduces NameNode memory consumption, and the cooperation of the SMI index and the MDI index avoids the inconvenience that relocating the location metadata of the merged file blocks would otherwise cause. Meanwhile, a preprocessing layer is added to the data storage process, but it does not need to take part in the interaction during data reading. In these several respects, a small-file-based network traceability system can run efficiently on HDFS, improving the resource utilization of HDFS and further improving data access efficiency.

The present invention is described in further detail below with reference to the accompanying drawings.

Brief description of the drawings

The accompanying drawings, which form a part of this application, provide a further understanding of the present invention; the illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the system structure disclosed in an embodiment of the present invention.

Fig. 2 is a schematic diagram of the SMI index structure disclosed in an embodiment of the present invention.

Fig. 3 is a schematic diagram of the MDI index structure disclosed in an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention can be implemented in many different ways as defined and covered by the claims.

Embodiment one

This embodiment discloses a network traceability system which, as shown in Fig. 1, comprises:

a client layer, used to generate the data files captured at the bottom layer of the traceability system;

a storage layer, comprising an HDFS-based NameNode and at least two HDFS-based DataNodes; and

a preprocessing layer, located between the client layer and the storage layer.

The preprocessing layer is used to perform the following steps during storage of data files:

Step S1: merge at least two small data files captured by the client layer into one large file and generate a corresponding SMI index. The SMI index characterizes the relationship between the name of the merged large file and the name, size and offset of each merged small file.

As shown in Fig. 2, the SMI index may take the form "hash:<key, value>", where the key mainly records the name of the small file and the value has the form megerFileName_offset_length (that is, the merged file name, the offset and the size). To read the content of a file, the corresponding key is first looked up in the cache to obtain its value, and the start and end positions of the file are then determined from that value.
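The lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the value string is joined with underscores and is split from the right, so that a merged file name containing underscores still parses. The helper names `make_smi_value`, `parse_smi_value` and `locate`, and the sample file names, are hypothetical.

```python
from typing import Dict, Tuple


def make_smi_value(merged_name: str, offset: int, length: int) -> str:
    """Build an SMI value of the form mergedFileName_offset_length."""
    return f"{merged_name}_{offset}_{length}"


def parse_smi_value(value: str) -> Tuple[str, int, int]:
    """Split from the right so underscores inside the merged name are preserved."""
    merged_name, offset, length = value.rsplit("_", 2)
    return merged_name, int(offset), int(length)


def locate(smi: Dict[str, str], small_name: str) -> Tuple[str, int, int]:
    """Return (merged file name, start position, end position) for a small file."""
    merged_name, offset, length = parse_smi_value(smi[small_name])
    return merged_name, offset, offset + length


# Hypothetical SMI cache: key = small file name, value = mergedFileName_offset_length.
smi = {
    "cap_0001.dat": make_smi_value("merged_20190118_0001", 0, 45_000_000),
    "cap_0002.dat": make_smi_value("merged_20190118_0001", 45_000_000, 42_000_000),
}
```

Calling `locate(smi, "cap_0002.dat")` yields the merged file name together with the byte range in which the small file sits, which is all a reader needs to slice it out of the merged file.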

Step S2: after the merged large file has been uploaded successfully, generate an MDI index from the location metadata of the merged file blocks. The MDI index characterizes the correspondence between the large file name and the DataNode storing the large file.

As shown in Fig. 3, the MDI may take the form "hash:<merged file name, DataNode IP>". The MDI indicates the relationship between a merged file and a DataNode. As described later, obtaining the MDI information from the NameNode means obtaining the specific location of the merged file.

Step S3: cache the SMI index and the MDI index on the NameNode, and cache the SMI index on the corresponding DataNode.

Step S10: transfer the location metadata of the merged file blocks from NameNode memory to the memory of each corresponding distributed DataNode.

Step S100: the client layer sends a first file read request to the NameNode, the first file read request carrying the name of the target small file.

Step S200: according to the first file read request and the cached SMI index and MDI index, the NameNode returns to the client layer the address information of the DataNode corresponding to the target small file. If necessary, the NameNode also returns to the client layer, according to the cached SMI, the matching information between the small file and the merged file.

Step S300: the client layer sends a second file read request according to the DataNode address information, the second file read request carrying the name of the target small file.

Step S400: according to the second file read request, the cached SMI index and the location metadata of the merged file blocks, the DataNode looks up the content of the target small file and returns it to the client layer. In this process, the offset and size of the target small file are obtained from the matching information between the small file and the merged file, so the content required by the client layer is located quickly and returned.

Thus, in this embodiment, when a client in the client layer initiates a request to read a small file, it sends one request to the NameNode to obtain the SMI and MDI information, rather than querying the NameNode for distributed file location metadata as in original HDFS, which speeds up file reading.
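The read path of steps S100 to S400 can be simulated in memory as below. This is a sketch under stated assumptions, not HDFS code: the `NameNode` and `DataNode` classes, the `read` function, the file names and the IP address are all hypothetical stand-ins, and a plain `bytes` object takes the place of the merged file blocks and their location metadata.

```python
from typing import Dict, Tuple

SmiEntry = Tuple[str, int, int]  # (merged file name, offset, length)


class DataNode:
    """Holds a cached SMI copy plus the merged files it stores (steps S3 and S10)."""

    def __init__(self) -> None:
        self.smi: Dict[str, SmiEntry] = {}
        # Stand-in for the block location metadata and the blocks themselves.
        self.merged_files: Dict[str, bytes] = {}

    def read_small_file(self, small_name: str) -> bytes:
        # Step S400: resolve the small file locally and slice it out of the merged file.
        merged_name, offset, length = self.smi[small_name]
        return self.merged_files[merged_name][offset:offset + length]


class NameNode:
    """Caches the SMI and MDI indexes (step S3)."""

    def __init__(self) -> None:
        self.smi: Dict[str, SmiEntry] = {}
        self.mdi: Dict[str, str] = {}  # merged file name -> DataNode address

    def resolve(self, small_name: str) -> str:
        # Step S200: small file -> merged file (SMI) -> DataNode address (MDI).
        merged_name, _, _ = self.smi[small_name]
        return self.mdi[merged_name]


def read(name_node: NameNode, data_nodes: Dict[str, DataNode], small_name: str) -> bytes:
    address = name_node.resolve(small_name)                 # steps S100 and S200
    return data_nodes[address].read_small_file(small_name)  # steps S300 and S400


# Populate the indexes the way steps S1 to S3 would, using made-up names and addresses.
name_node = NameNode()
data_nodes = {"10.0.0.11": DataNode()}
entries = {"a.dat": ("merged-1", 0, 3), "b.dat": ("merged-1", 3, 4)}
name_node.smi.update(entries)
name_node.mdi["merged-1"] = "10.0.0.11"
data_nodes["10.0.0.11"].smi.update(entries)
data_nodes["10.0.0.11"].merged_files["merged-1"] = b"AAABBBB"
```

Note that the NameNode only performs two dictionary lookups and never touches block locations, which mirrors the point of step S10: the block-level metadata lives with the DataNode that serves the read.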

Based on the system of this embodiment, optionally, as shown in Fig. 1, the preprocessing layer comprises:

a file judging unit, used to judge whether the size of the merged large file meets the upload threshold; if so, the file is sent to the HDFS Client; otherwise, the file is sent to the file processing unit;

a file processing unit, used to calculate the size of each file received from the file judging unit, obtain its offset according to the merge order of the small files, generate a temporary index file, and then pass the index file and the data file to the file merging unit;

a file merging unit, used to merge the files, in the order given by the file processing unit, into a file of a special format, and to merge the temporary index files to generate the SMI index;

an HDFS Client, used to write the merged file to the HDFS cluster: it establishes connections with the NameNode and the DataNodes through a distributed file system instance, notifies the NameNode to allocate DataNodes for writing the data blocks, obtains the location metadata of the merged file blocks and generates the MDI index, caches the SMI index and the MDI index on the NameNode, and caches the SMI index on the corresponding DataNode.

Based on the system of this embodiment, the file writing process may specifically be:

Step 1: Set the threshold (the HDFS block size, 128MB). When the preprocessing layer receives a file write request from a client in the client layer, the file judging unit first checks the size of the current merged file. If the current merged file is not empty and is smaller than the HDFS block size, go to step 2. If the current merge sequence is empty, go to step 4. If the current merged file is larger than the HDFS block size, go to step 5.

Step 2: When the file processing unit receives a file from the client, it calculates the size of the file and builds a temporary index for it, then goes to step 3.

Step 3: The merging unit merges the file content into the current merged file and merges the file's index into the index file, then goes to step 1.

Step 4: Use a blank file as the current merged file so that it forms a complete data block, and submit it to the HDFS client. Create a new temporary index file and a new merged file, to which the temporary index and the small file can then be added, and go to step 2.

Step 5: Pass the current merged file to the HDFS client, which connects to the HDFS cluster through the distributed file system and stores the current merged file in the storage layer; at the same time, delete the current merged file locally, then go to step 4.
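The control flow of steps 1 to 5 can be sketched as a small in-memory merger. This is an illustrative simplification rather than the patented procedure: the `Merger` class and its `upload` callback are hypothetical, the merged file is named with a sequence counter instead of a creation time, the threshold check treats "equal to the block size" the same as "larger than" (the text leaves that case open), and an explicit `flush` is added so the last partial file can be uploaded.

```python
from typing import Callable, Dict, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # threshold from step 1: the HDFS block size

Index = Dict[str, Tuple[int, int]]  # small file name -> (offset, length)


class Merger:
    def __init__(self, upload: Callable[[str, bytes, Index], None],
                 threshold: int = BLOCK_SIZE) -> None:
        self.upload = upload  # stand-in for handing the file to the HDFS client
        self.threshold = threshold
        self.sequence = 0
        self._start_new()

    def _start_new(self) -> None:
        # Step 4: begin an empty merged file and an empty temporary index.
        self.sequence += 1
        self.name = f"merged-{self.sequence:06d}"
        self.buffer = bytearray()
        self.index: Index = {}

    def flush(self) -> None:
        # Step 5: upload the current merged file, drop the local copy, start a new one.
        if self.buffer:
            self.upload(self.name, bytes(self.buffer), dict(self.index))
        self._start_new()

    def write(self, small_name: str, data: bytes) -> None:
        # Step 1: check the size of the current merged file against the threshold.
        if len(self.buffer) >= self.threshold:
            self.flush()
        # Step 2: record the offset and size of the incoming file in the temporary index.
        self.index[small_name] = (len(self.buffer), len(data))
        # Step 3: append the file content to the current merged file.
        self.buffer.extend(data)
```

A caller would construct `Merger(upload=...)` with a function that writes to HDFS and records the resulting SMI and MDI entries; passing a small `threshold` makes the behavior easy to exercise in a test.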

Embodiment two

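Corresponding to the above system embodiment, this embodiment discloses a data processing method for a network traceability system, comprising steps S1 to S3 and step S10 of Embodiment one during storage of data files.

When data is read, the method further comprises steps S100 to S400 of Embodiment one.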

Preferably, the method of this embodiment further comprises:

after the merged large file has been uploaded successfully, the preprocessing layer deletes the large file locally.

Further, the method of this embodiment further comprises:

the preprocessing layer names the merged large file by its creation time, and saves and updates the timestamp information of the large file in an HBase database; when the timestamp information of the large file reaches the preset retention period, the large file and the related SMI index and MDI index information are deleted from the storage layer.

Such as: the data retention over time of setting the present embodiment network traceability system is 1 week, we are by combined file first It names to reach according to the time of creation orderly, at the end of the time, merge on the day of being recorded in HBase database One matched timestamp of addition is 1 while big file name, then one day timestamp+1 of every mistake, and introduces a timestamp Automatic detection module, when timestamp be greater than 7 when, then trigger corresponding deleting mechanism.

Further, the method of this embodiment further comprises:

The SMI index and the MDI index are backed up in the HBase database in the format key, value, where the key is the name of the merged large file and the value includes the small file names in merge order, their sizes, the name of the merged large file, and the location metadata of the merged file blocks. This makes restoring the indexes efficient and saves storage space.
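One way to see why this backup format saves space while still allowing the index to be restored: if the small files are packed back to back in merge order, their offsets need not be stored, because they can be recomputed as running sums of the sizes. The sketch below rests on that packing assumption, which the text does not state explicitly; the function name `rebuild_smi` and the sample record are hypothetical.

```python
from typing import Dict, List, Tuple


def rebuild_smi(merged_name: str, entries: List[Tuple[str, int]]) -> Dict[str, str]:
    """Recreate SMI entries from (small file name, size) pairs listed in merge order."""
    smi: Dict[str, str] = {}
    offset = 0
    for small_name, size in entries:
        smi[small_name] = f"{merged_name}_{offset}_{size}"
        offset += size  # assumes files are contiguous, with no padding between them
    return smi


# Hypothetical backup record: key = merged file name, value = ordered names and sizes.
backup_key = "merged-1"
backup_entries = [("a.dat", 5), ("b.dat", 7), ("c.dat", 2)]
restored = rebuild_smi(backup_key, backup_entries)
```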

Further, the method of this embodiment further comprises:

the preprocessing layer is provided with a cache pool and with a balancing policy for selecting files from the cache pool to merge.

For example, to ensure that files are merged in time, the size of the cache pool is usually set to 5. According to the size of the current merged file, a suitable file is chosen from the cache pool for merging so that the merged size is as close as possible to the HDFS block size of 128MB. In addition, if a file has been in the cache pool for more than 3 minutes, it is the next file to be merged. This keeps file storage balanced and makes file reading more efficient.
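The balancing policy in this example can be sketched as a selection function. The details below are assumptions made for illustration: "closest to the block size" is measured as absolute distance (so a slight overshoot is allowed), the oldest overdue file wins when several have waited more than 3 minutes, and the names `Pending` and `pick_next` are hypothetical.

```python
from typing import List, NamedTuple, Optional

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS block size used as the merge target
MAX_WAIT_SECONDS = 180.0        # 3 minutes


class Pending(NamedTuple):
    name: str
    size: int       # bytes
    arrived: float  # arrival time in seconds


def pick_next(pool: List[Pending], current_size: int, now: float,
              block_size: int = BLOCK_SIZE,
              max_wait: float = MAX_WAIT_SECONDS) -> Optional[Pending]:
    """Choose the next file to merge from the cache pool."""
    if not pool:
        return None
    # A file that has waited longer than max_wait is merged next, oldest first.
    overdue = [f for f in pool if now - f.arrived > max_wait]
    if overdue:
        return min(overdue, key=lambda f: f.arrived)
    # Otherwise pick the file that brings the merged size closest to the block size.
    return min(pool, key=lambda f: abs(block_size - (current_size + f.size)))
```

In use, the preprocessing layer would call `pick_next` each time it is ready to append a file, remove the returned entry from the pool, and refill the pool up to its configured size of 5.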

Embodiment three

This embodiment discloses a network traceability system that stores data on HDFS, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of Embodiment two when executing the computer program.

Embodiment four

This embodiment discloses a computer storage medium on which a computer program is stored, wherein the steps of the method of Embodiment two are implemented when the program is executed by a processor.

In summary, the network traceability system, its data processing method and the computer storage medium disclosed in the above embodiments of the present invention have at least the beneficial effects set out in the summary of the invention above.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data processing method for a network traceability system, the network traceability system storing data on HDFS, characterized in that the method comprises:

dividing the network traceability system into a client layer, a preprocessing layer and a storage layer, wherein the client layer is used to generate the data files captured at the bottom layer of the traceability system, and the storage layer comprises an HDFS-based NameNode and at least two HDFS-based DataNodes;

arranging the preprocessing layer between the client layer and the storage layer, the preprocessing layer performing the following steps during storage of data files:

Step S1: merging at least two small data files captured by the client layer into one large file and generating a corresponding SMI index, the SMI index characterizing the relationship between the name of the merged large file and the name, size and offset of each merged small file; and

Step S2: after the merged large file has been uploaded successfully, generating an MDI index from the location metadata of the merged file blocks, the MDI index characterizing the correspondence between the large file name and the DataNode storing the large file;

Step S3: caching the SMI index and the MDI index on the NameNode, and caching the SMI index on the corresponding DataNode;

Step S10: transferring the location metadata of the merged file blocks from NameNode memory to the memory of each corresponding distributed DataNode;

when data is read, the method further comprises:

Step S100: the client layer sends a first file read request to the NameNode, the first file read request carrying the name of the target small file;

Step S200: according to the first file read request and the cached SMI index and MDI index, the NameNode returns to the client layer the address information of the DataNode corresponding to the target small file;

Step S300: the client layer sends a second file read request according to the DataNode address information, the second file read request carrying the name of the target small file;

Step S400: according to the second file read request, the cached SMI index and the location metadata of the merged file blocks, the DataNode looks up the content of the target small file and returns it to the client layer.

2. The data processing method for a network traceability system according to claim 1, characterized by further comprising:

3. The data processing method for a network traceability system according to claim 1 or 2, characterized by further comprising:

the preprocessing layer names the merged large file by its creation time, and saves and updates the timestamp information of the large file in an HBase database; when the timestamp information of the large file reaches the preset retention period, the large file and the related SMI index and MDI index information are deleted from the storage layer.

4. The data processing method for a network traceability system according to claim 3, characterized by further comprising:

backing up the SMI index and the MDI index in the HBase database in the format key, value, wherein the key is the name of the merged large file and the value includes the small file names in merge order, their sizes, the name of the merged large file, and the location metadata of the merged file blocks.

5. The data processing method for a network traceability system according to claim 4, characterized by further comprising:

6. A network traceability system, the network traceability system storing data on HDFS, characterized by comprising:

a storage layer, comprising an HDFS-based NameNode and at least two HDFS-based DataNodes; and

a preprocessing layer, located between the client layer and the storage layer and used to perform the following steps during storage of data files:

7. The network traceability system according to claim 6, characterized in that the preprocessing layer comprises:

a file judging unit, used to judge whether the size of the merged large file meets the upload threshold; if so, the file is sent to the HDFS Client; otherwise, the file is sent to the file processing unit;

a file processing unit, used to calculate the size of each file received from the file judging unit, obtain its offset according to the merge order of the small files, generate a temporary index file, and then pass the index file and the data file to the file merging unit;

a file merging unit, used to merge the files, in the order given by the file processing unit, into a file of a special format, and to merge the temporary index files to generate the SMI index;

8. A network traceability system, the network traceability system storing data on HDFS and comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.

9. A computer storage medium on which a computer program is stored, characterized in that the steps of the method of any one of claims 1 to 5 are implemented when the program is executed by a processor.