CN118363527B - Distributed storage-based data intelligent management method and system

Info

Publication number: CN118363527B
Application number: CN202410398097.XA
Authority: CN
Inventors: 张腾; 谢作斌; 怀丹阳
Original assignee: Shenzhen Ai Rui Good Technology Co ltd
Current assignee: Shenzhen Ai Rui Good Technology Co ltd
Priority date: 2024-04-03
Filing date: 2024-04-03
Publication date: 2024-10-25
Anticipated expiration: 2044-04-03
Also published as: CN118363527A

Abstract

The invention discloses an intelligent data management method and system based on distributed storage, relating to the technical field of data storage. The method comprises acquiring data information and distributed storage partition information, acquiring data block storage planning information according to data block hotspot data information and distributed storage node information, and dynamically adjusting the distributed storage according to distributed storage node load information. The invention raises the degree of intelligence of data management by classifying standard data; improves distributed storage efficiency and ensures efficient access to frequently accessed data by means of a data block hotspot index; uses a node matching index for data block copies so that a copy remains accessible when a data block fails and responds faster; and dynamically adjusts the distributed storage by means of a distributed storage node load evaluation index, preventing excessive node load from slowing access to hotspot data.

Description

Distributed storage-based data intelligent management method and system

Technical Field

The invention relates to the technical field of data storage, in particular to an intelligent data management method and system based on distributed storage.

Background

With the advent of the big data era, data volumes have grown explosively. A traditional network storage system uses a centralized storage server to hold all data, so the storage server becomes the bottleneck of system performance as well as the focal point of reliability and security concerns, and it cannot meet the needs of large-scale storage applications. Distributed storage spreads data across multiple independent devices. A distributed network storage system adopts a scalable architecture, uses multiple storage servers to share the storage load, and uses location servers to locate stored information, which improves the reliability, availability and access efficiency of the system and makes it easy to expand.

Distributed storage technology is widely used for processing and managing massive data, but the degree of intelligence in data management is limited, and configuration and management still require manual intervention. A method and system for intelligently managing data based on distributed storage is therefore needed to achieve automatic optimization and management of data.

At present, data management for distributed storage on the market also has the following problems: data cannot be accurately classified; hotspot data cannot be identified from data access information, so hotspot data cannot be assigned to nodes with fast response and access efficiency drops; and the distributed storage cannot be dynamically adjusted according to the load of the distributed storage nodes, so idle resources of the nodes go unused and are wasted.

Disclosure of Invention

To solve the above technical problems, this technical scheme addresses the following: data cannot be accurately classified; hotspot data cannot be identified from data access information, so hotspot data cannot be assigned to nodes with fast response and access efficiency drops; and the distributed storage cannot be dynamically adjusted according to the load of the distributed storage nodes, so idle resources of the nodes go unused and are wasted.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A data intelligent management method based on distributed storage comprises the following steps:

Acquiring data information, wherein the data information comprises data attribute information and data characteristic information;

acquiring distributed storage partition information, wherein the distributed storage partition information comprises distributed storage buffer layer information and distributed storage node information;

Acquiring data block information based on data classification according to the data information;

Acquiring data block access information according to the data block information, wherein the data block access information comprises data access mode information and data access frequency information corresponding to the data access mode;

Acquiring hot spot data information of the data block according to the access information of the data block;

Acquiring data block storage planning information according to the data block hot spot data information and the distributed storage node information;

storing the data in a distributed mode according to the data block storage planning information;

acquiring distributed storage node load information, wherein the distributed storage node load information comprises distributed storage node load state information, distributed storage node load spare resource information and distributed storage node response speed information;

Dynamically adjusting the distributed storage according to the load information of the distributed storage nodes;

acquiring response information of the distributed storage nodes;

acquiring data response fault information based on a response time threshold according to the distributed storage node response information;

Acquiring node information of the data copy according to the data response fault information;

and outputting response data according to the node information of the data copy.
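The last four steps (response information, fault detection by a response time threshold, replica lookup, output) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the representation of response information as a dict of measured response times in seconds, and the rule that a missing measurement counts as a fault are all assumptions made for illustration.

```python
def is_faulty(response_time, threshold_s):
    """A node is treated as faulty if it gave no response (None)
    or its response time exceeds the threshold (assumed rule)."""
    return response_time is None or response_time > threshold_s


def pick_node(primary, replicas, response_times, threshold_s):
    """Return the first healthy node, trying the primary before the replicas.

    response_times maps node id -> last measured response time in seconds,
    or None when the node did not answer. Returns None if every node is faulty.
    """
    for node in [primary, *replicas]:
        if not is_faulty(response_times.get(node), threshold_s):
            return node
    return None
```

With a 1.0 s threshold, a primary that answers in 2.5 s and a silent first replica would both be skipped, and the read would be served from the next replica that responds in time.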

Preferably, the acquiring the data block information based on the data classification according to the data information specifically includes:

Acquiring data attribute information according to the data information, wherein the data attribute information comprises data type information and data format information;

according to the data attribute information, unifying data formats of the data to obtain corrected data information;

obtaining deduplicated data information from the corrected data information based on a hash deduplication method;

obtaining missing-data information of the deduplicated data according to the deduplicated data information;

acquiring a data missing threshold based on the data distributed storage requirement;

judging, according to the missing-data information of the deduplicated data and the data missing threshold, whether the missing-data information exceeds the data missing threshold; if so, the deduplicated data does not meet the distributed storage standard; if not, obtaining standard data information from the deduplicated data information based on data standardization;

Acquiring standard data characteristic information according to standard data information, wherein the standard data characteristic information comprises standard data keyword information and standard data timestamp information;

and classifying the standard data according to the standard data characteristic information to obtain data block information.
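The preprocessing steps above can be sketched as follows. This is an illustrative sketch only: the record layout (Python dicts), the field names `keyword` and `timestamp`, the choice of SHA-256 as the hash, and the definition of the missing measure as the fraction of empty fields are assumptions that the text does not specify.

```python
import hashlib
import json


def deduplicate(records):
    """Drop exact duplicates by hashing a canonical JSON form of each record."""
    seen, unique = set(), []
    for rec in records:
        payload = json.dumps(rec, sort_keys=True, default=str).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def missing_ratio(records, fields):
    """Fraction of expected field values that are absent or None (assumed metric)."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    missing = sum(1 for rec in records for f in fields if rec.get(f) is None)
    return missing / total


def prepare(records, fields, missing_threshold):
    """Return deduplicated records, or None when too much data is missing."""
    unique = deduplicate(records)
    if missing_ratio(unique, fields) > missing_threshold:
        return None  # does not meet the distributed storage standard
    return unique


def classify(records):
    """Group records into blocks keyed by (keyword, date part of the timestamp)."""
    blocks = {}
    for rec in records:
        key = (rec["keyword"], rec["timestamp"][:10])
        blocks.setdefault(key, []).append(rec)
    return blocks
```

Grouping by keyword and calendar day is just one possible reading of "classifying according to keyword and timestamp information"; any other bucketing rule could be substituted in `classify`.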

Preferably, the obtaining the data block storage planning information according to the data block hot spot data information and the distributed storage node information specifically includes:

Classifying the data blocks based on the data block hotspot indexes according to the data block hotspot data information to obtain data block classification information;

acquiring first hot spot data block information according to the data block classification information;

Acquiring distributed storage buffer layer information according to the distributed storage partition information;

Distributing the first hot spot data block to a distributed storage buffer layer according to the first hot spot data block information and the distributed storage buffer layer information, and obtaining buffer layer storage planning information;

acquiring second hot spot data block information and third hot spot data block information according to the data block classification information;

according to the second hot spot data block information and the third hot spot data block information, sorting the data blocks in descending order of data block hotspot index to obtain data block ordering information;

planning the data block storage according to the data block ordering information to obtain node storage information;

and acquiring data block storage planning information according to the distributed storage buffer layer information and the node storage information.

Preferably, the classifying the data blocks based on the data block hotspot indexes according to the data block hotspot data information to obtain data block classification information specifically includes:

Acquiring a data block hotspot index according to the data block hotspot data information;

Acquiring a first data block hotspot index threshold and a second data block hotspot index threshold based on the data block access requirement;

classifying the data blocks according to the data block hotspot index, the first data block hotspot index threshold and the second data block hotspot index threshold to obtain data block classification information;

if the data block hotspot index is higher than the first data block hotspot index threshold, classifying the data block as a first hotspot data block;

if the data block hotspot index is lower than the first data block hotspot index threshold and higher than the second data block hotspot index threshold, classifying the data block as a second hotspot data block;

if the data block hotspot index is lower than the second data block hotspot index threshold, classifying the data block as a third hotspot data block;

the calculation formula of the data block hotspot index is as follows:

wherein Q is the data block hotspot index, S_i is the size of the i-th data item of the data block, S is the size of the data block, ω_ij is the access frequency of the j-th access mode of the i-th data item of the data block, w_j is the hotspot coefficient of the j-th access mode of the data block, n is the total number of data items in the data block, and m is the total number of access modes of the data block.
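The formula itself is not reproduced in this text, so the sketch below uses one plausible form that is consistent with the symbol definitions: a size-weighted sum of coefficient-weighted access frequencies, Q = Σ_i (S_i / S) · Σ_j (w_j · ω_ij). This form is an assumption, not the patent's formula, as are the function names and the handling of values exactly equal to a threshold.

```python
def hotspot_index(item_sizes, access_freq, mode_coeffs):
    """Assumed form: Q = sum_i (S_i / S) * sum_j (w_j * omega_ij).

    item_sizes[i]     -> S_i, size of the i-th data item
    access_freq[i][j] -> omega_ij, access frequency of item i under mode j
    mode_coeffs[j]    -> w_j, hotspot coefficient of access mode j
    """
    block_size = sum(item_sizes)  # S, assumed equal to the sum of item sizes
    if block_size == 0:
        return 0.0
    return sum(
        (s_i / block_size) * sum(w * f for w, f in zip(mode_coeffs, freqs))
        for s_i, freqs in zip(item_sizes, access_freq)
    )


def classify_block(q, first_threshold, second_threshold):
    """Map a hotspot index to one of the three classes named in the text.

    The text does not say what happens when q equals a threshold; here an
    equal value falls into the lower class (assumption).
    """
    if q > first_threshold:
        return "first"
    if q > second_threshold:
        return "second"
    return "third"
```

For example, a block of two equally sized items, one read 10 times under a mode with coefficient 1.0 and the other written 10 times under a mode with coefficient 0.5, scores 0.5 · 10 + 0.5 · 5 = 7.5 under this assumed form.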

Preferably, the planning the data block storage according to the data block ordering information, to obtain node storage information, specifically includes:

Acquiring distributed storage node information according to the distributed storage partition information;

acquiring a distributed storage node matching index based on a distributed storage node matching evaluation model according to the second hot spot data block information, the third hot spot data block information and the distributed storage node information;

Planning data block storage according to the distributed storage node matching index and the data block ordering information, and obtaining node storage information;

according to the first hot spot data block information, the second hot spot data block information and the third hot spot data block information, sorting the data blocks in descending order of data block hotspot index to obtain data block copy information;

acquiring spare state information of the distributed nodes according to node storage information;

Acquiring a node matching index of a data block copy based on a distributed storage node matching evaluation model according to the spare state information, the first hot spot data block information, the second hot spot data block information and the third hot spot data block information of the distributed node;

according to the node matching index of the data block copy and the information of the data block copy, the data block copy is stored in a distributed mode;

the distributed storage node matching evaluation model is as follows:

where R(h, g) is the matching index of the h-th data block and the g-th distributed storage node, x_g represents the available capacity of the g-th distributed storage node, x_h represents the size of the h-th data block, T_j(h, g) represents the response time for the j-th access mode when the h-th data block is stored on the g-th distributed storage node, the access frequency of the j-th access mode is also an input, and m is the total number of access modes of the data block.
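Because the model's formula is not reproduced in this text, the following sketch substitutes a hypothetical score built from the same inputs: remaining capacity headroom divided by one plus the frequency-weighted response time. The scoring form, the function names and the node representation are all assumptions; only the inputs (x_g, x_h, T_j(h, g), per-mode access frequency) come from the text.

```python
def matching_index(block_size, node_free, response_times, mode_freqs):
    """Hypothetical stand-in for R(h, g); higher means a better match.

    block_size     -> x_h
    node_free      -> x_g, available capacity of the node
    response_times -> T_j(h, g) for each access mode j
    mode_freqs     -> access frequency of each access mode j
    """
    if block_size > node_free:
        return float("-inf")  # the block does not fit on this node
    headroom = (node_free - block_size) / node_free
    weighted_latency = sum(f * t for f, t in zip(mode_freqs, response_times))
    return headroom / (1.0 + weighted_latency)


def best_node(block_size, mode_freqs, nodes):
    """Pick the node with the highest matching index, or None if none fits.

    nodes maps node id -> (available capacity, list of response times).
    """
    scored = {
        node: matching_index(block_size, free, times, mode_freqs)
        for node, (free, times) in nodes.items()
    }
    fitting = {n: s for n, s in scored.items() if s != float("-inf")}
    return max(fitting, key=fitting.get) if fitting else None
```

Under this stand-in, a node with ample free space but slow responses for the dominant access mode can lose to a smaller, faster node, which matches the stated goal of placing hotspot data on nodes with fast response.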

Preferably, the dynamically adjusting the distributed storage according to the load information of the distributed storage node specifically includes:

acquiring a distributed storage node load evaluation index according to the distributed storage node load information;

acquiring a load evaluation index threshold of a distributed storage node based on the distributed storage requirement;

judging, according to the distributed storage node load evaluation index and the distributed storage node load evaluation index threshold, whether the load evaluation index exceeds the threshold; if not, the state of the distributed storage node is normal; if so, the load of the distributed storage node is too high, and information on the data blocks to be responded to is obtained according to the distributed storage node load information;

acquiring load vacant resource information of the distributed storage nodes according to the load information of the distributed storage nodes;

According to the information of the data block to be responded and the load spare resource information of the distributed storage node, acquiring a node matching index of the data block to be responded based on a distributed storage node matching evaluation model;

Dynamically adjusting the distributed storage according to the node matching index of the data block to be responded;

the calculation formula of the distributed storage node load evaluation index is as follows:

wherein D is the distributed storage node load evaluation index, α, β and γ are distributed storage node load evaluation coefficients, μ is the CPU utilization rate of the distributed storage node, μ_0 is the standard CPU utilization rate of the distributed storage node, τ is the disk utilization rate of the distributed storage node, τ_0 is the standard disk utilization rate of the distributed storage node, ρ is the network bandwidth of the distributed storage node, σ_k is the access frequency of the k-th data block of the distributed storage node, θ_k is the access load coefficient of the k-th data block on the distributed storage node, and E is the total number of data blocks of the distributed storage node.
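The load evaluation formula is likewise not reproduced in this text. The sketch below assumes a weighted sum of three normalized terms, D = α·μ/μ_0 + β·τ/τ_0 + γ·(Σ_k σ_k·θ_k)/ρ, which uses every symbol defined above but is an assumption rather than the patent's formula. Function and parameter names are hypothetical.

```python
def load_index(alpha, beta, gamma, cpu, cpu_std, disk, disk_std,
               bandwidth, block_freqs, block_load_coeffs):
    """Assumed form: D = a*mu/mu_0 + b*tau/tau_0 + g*sum_k(sigma_k*theta_k)/rho."""
    access_load = sum(s * t for s, t in zip(block_freqs, block_load_coeffs))
    return (alpha * cpu / cpu_std
            + beta * disk / disk_std
            + gamma * access_load / bandwidth)


def is_overloaded(d, threshold):
    """The node load is too high when the index exceeds the threshold."""
    return d > threshold
```

A node flagged by `is_overloaded` would then have some of its pending data blocks re-matched against nodes that report spare resources, as the steps above describe.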

Further, a data intelligent management system based on distributed storage is provided, which is used for implementing the intelligent management method, and includes:

The main control module is used for classifying standard data according to standard data characteristic information, acquiring data block information, distributing a first hot data block to a distributed storage buffer layer according to the first hot data block information and distributed storage buffer layer information, acquiring buffer layer storage planning information, planning data block storage according to data block ordering information, acquiring node storage information, judging whether the load of a distributed storage node is too high according to a distributed storage node load evaluation index and a distributed storage node load evaluation index threshold, acquiring data block information to be responded according to the distributed storage node load information, acquiring spare resource information of the distributed storage node load according to the distributed storage node load information, dynamically adjusting the distributed storage according to node matching index of the data block to be responded, and acquiring data copy node information according to data response fault information;

The information acquisition module is used for acquiring data information, data attribute information, data characteristic information, distributed storage partition information, distributed storage buffer layer information and distributed storage node information, acquiring data block access information according to the data block information, acquiring data block hot spot data information according to the data block access information, acquiring distributed storage node load information, distributed storage node load state information, distributed storage node load spare resource information and distributed storage node response speed information, and transmitting the data block hot spot data information to the calculation module;

The computing module is used for acquiring a data block hotspot index according to the data block hotspot data information, classifying the data block according to the data block hotspot index, a first threshold value of the data block hotspot index and a second threshold value of the data block hotspot index, acquiring data block classification information, acquiring a distributed storage node matching index according to the second hotspot data block information, the third hotspot data block information and the distributed storage node information, acquiring a data block copy node matching index according to the distributed node spare state information, the first hotspot data block information, the second hotspot data block information and the third hotspot data block information, acquiring a data block node matching index to be responded according to the data block information to be responded and the distributed storage node load spare resource information, and acquiring a distributed storage node load assessment index according to the distributed storage node load information;

And the display module is used for displaying the data block information, the data block hot spot data information, the data block storage planning information, the distributed storage node load assessment index and the data response fault information.

Optionally, the main control module specifically includes:

the control unit is used for classifying standard data according to standard data characteristic information, acquiring data block information, distributing first hot spot data blocks to the distributed storage buffer layers according to the first hot spot data block information and the distributed storage buffer layer information, acquiring buffer layer storage planning information, planning data block storage according to data block ordering information, acquiring node storage information, and acquiring data copy node information according to data response fault information;

The information receiving unit interacts with the information acquisition module and the calculation module, and is used for acquiring data and transmitting the data to the dynamic adjustment unit;

The dynamic adjustment unit is used for judging whether the load of the distributed storage nodes is too high according to the load evaluation index of the distributed storage nodes and the load evaluation index threshold of the distributed storage nodes, acquiring data block information to be responded according to the load information of the distributed storage nodes, acquiring spare resource information of the load of the distributed storage nodes according to the load information of the distributed storage nodes, and dynamically adjusting the distributed storage according to the node matching index of the data block to be responded.

Optionally, the information acquisition module specifically includes:

The first acquisition unit is used for acquiring data information, data attribute information, data characteristic information, distributed storage partition information, distributed storage buffer layer information and distributed storage node information, and acquiring data block access information according to the data block information;

The second acquisition unit is used for acquiring data block hot spot data information according to the data block access information, acquiring distributed storage node load information, distributed storage node load state information, distributed storage node load spare resource information and distributed storage node response speed information, and transmitting the information to the calculation module.

Optionally, the computing module specifically includes:

The hot spot index unit is used for acquiring a data block hot spot index according to the data block hot spot data information, classifying the data block according to the data block hot spot index, the first threshold value of the data block hot spot index and the second threshold value of the data block hot spot index, and acquiring data block classification information;

The node matching unit is used for acquiring a distributed storage node matching index according to the second hot spot data block information, the third hot spot data block information and the distributed storage node information, acquiring a data block copy node matching index according to the distributed node spare state information, the first hot spot data block information, the second hot spot data block information and the third hot spot data block information, and acquiring a data block node matching index to be responded according to the data block information to be responded and the distributed storage node load spare resource information;

the load evaluation unit is used for acquiring a distributed storage node load evaluation index according to the distributed storage node load information and transmitting the distributed storage node load evaluation index to the main control module.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides an intelligent data management method and system based on distributed storage. Classifying standard data raises the degree of intelligence of data management. Classifying data blocks by the data block hotspot index improves distributed storage efficiency and ensures efficient access to frequently accessed data. The data block copy node matching index ensures that a copy can be accessed when a data block fails and improves the response rate of that copy. The distributed storage node load evaluation index is used to dynamically adjust the distributed storage, preventing excessive node load from slowing access to hotspot data.

Drawings

FIG. 1 is a flow chart of a method for intelligently managing data based on distributed storage;

FIG. 2 is a flow chart of data block acquisition in the present invention;

FIG. 3 is a flow chart of a data block storage planning in accordance with the present invention;

FIG. 4 is a flow chart of distributed storage of copies of data blocks in accordance with the present invention;

FIG. 5 is a block diagram of a distributed storage-based intelligent data management system according to the present invention.

Detailed Description

The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art.

Referring to FIGS. 1-4, an intelligent data management method based on distributed storage in an embodiment of the present invention includes:

specifically, according to the data information, based on the data classification, the data block information is acquired, specifically including:

In this scheme, unifying the data format prevents differing data formats from interfering with later data classification and data access, which would reduce distributed storage efficiency. Duplicate data is removed by a hash deduplication method. Checking the amount of missing data guards against excessive missing data, which would reduce the reliability and accuracy of the data. The data is then classified according to standard data keyword information and standard data timestamp information so that it can subsequently be stored in a distributed manner, improving storage efficiency.

specifically, according to the hot spot data information of the data block and the distributed storage node information, the data block storage planning information is obtained, which specifically comprises:

In this scheme, a caching layer is arranged at the front end of the distributed storage system, and frequently accessed hotspot data is held in this cache, which reduces the number of accesses to back-end storage. User requests can therefore be answered quickly, and the access speed of hotspot data is improved.
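A front-end buffer layer of this kind can be sketched as a small read-through wrapper. The class name, the `pin` method for placing first hotspot data blocks, and the back-end read callback are hypothetical; eviction and write handling are deliberately left out.

```python
class BufferLayer:
    """Serves pinned hotspot blocks from memory and falls back to back-end storage."""

    def __init__(self, backend_read):
        self._hot = {}                     # block id -> data held in the buffer layer
        self._backend_read = backend_read  # callable: block id -> data
        self.backend_reads = 0             # how often back-end storage was accessed

    def pin(self, block_id, data):
        """Place a first hotspot data block in the buffer layer."""
        self._hot[block_id] = data

    def read(self, block_id):
        """Return block data, touching back-end storage only on a miss."""
        if block_id in self._hot:
            return self._hot[block_id]
        self.backend_reads += 1
        return self._backend_read(block_id)
```

The `backend_reads` counter makes the claimed effect observable: reads of pinned blocks leave it unchanged, while reads of other blocks increment it.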

Specifically, according to the data block hot spot data information, classifying the data blocks based on the data block hot spot indexes to obtain data block classification information, specifically including:

the calculation formula of the data block hotspot index is as follows:

Still further, according to the data block ordering information, planning the data block storage to obtain node storage information, which specifically includes:

the distributed storage node matching evaluation model is as follows:

In this scheme, the data blocks are classified according to the data block hotspot index, the first data block hotspot index threshold and the second data block hotspot index threshold into first hotspot data blocks, second hotspot data blocks and third hotspot data blocks. The first hotspot data blocks are allocated to the buffer layer to ensure the access rate. The second and third hotspot data blocks are stored in a distributed manner in descending order of hotspot index according to the data block ordering information: for each data block, the distributed storage node matching index between the block and each node is calculated; after a data block has been allocated, the distributed storage node information is updated to reflect the change in the node's resource occupation; and allocation continues until the entire data block ordering has been traversed.

Meanwhile, different numbers of data block copies are generated for different classes of data block. In this embodiment, for example, a first hotspot data block generates three data block copies, a second hotspot data block generates two, and a third hotspot data block generates one. When a node fails, the system can still obtain the data from other nodes through this replication mechanism, which ensures the reliability and durability of the data. Storing the same data copy on multiple nodes also improves read performance, because the data can be read from different nodes in parallel, reducing the pressure on any single node and achieving load balancing.
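The allocation loop described above (sort by hotspot index, score every node, allocate, update the node's resources, repeat) can be sketched as a greedy procedure. The scoring function here is a simple capacity-headroom placeholder, not the patent's matching evaluation model, and the data shapes and names are assumptions.

```python
def headroom_score(block_size, node_free):
    """Placeholder score: fraction of the node's free capacity left after placement."""
    return (node_free - block_size) / node_free


def plan_storage(blocks, nodes, score=headroom_score):
    """Greedy placement in descending order of hotspot index.

    blocks: list of (block id, size, hotspot index)
    nodes:  dict of node id -> available capacity
    Returns a dict of block id -> node id (None when no node can hold the block).
    """
    free = dict(nodes)  # working copy, updated after every allocation
    plan = {}
    for block_id, size, _q in sorted(blocks, key=lambda b: b[2], reverse=True):
        candidates = [n for n in free if free[n] >= size]
        if not candidates:
            plan[block_id] = None
            continue
        best = max(candidates, key=lambda n: score(size, free[n]))
        plan[block_id] = best
        free[best] -= size  # node resource occupation changes
    return plan
```

Because node capacities are updated after each placement, a hot block placed early can change which node wins for the blocks that follow, which is the behaviour the paragraph above describes.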
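The per-class copy counts in this embodiment (three, two, one) can be sketched as a lookup followed by placement on nodes other than the primary. The names, and the assumption that idle nodes arrive already ranked by their copy matching index, are hypothetical.

```python
# Copy counts taken from the embodiment described in the text.
REPLICA_COUNT = {"first": 3, "second": 2, "third": 1}


def place_replicas(block_class, primary, ranked_idle_nodes):
    """Choose nodes for the copies of one data block.

    ranked_idle_nodes is assumed to be sorted best match first. The primary
    node is skipped so that a single node failure cannot remove every copy.
    Fewer nodes than requested are returned when not enough are available.
    """
    wanted = REPLICA_COUNT[block_class]
    return [n for n in ranked_idle_nodes if n != primary][:wanted]
```

Keeping copies off the primary node is a design choice of this sketch; the text only states that data can still be obtained from other nodes when a node fails.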

specifically, according to the load information of the distributed storage nodes, the distributed storage is dynamically adjusted, and the method specifically comprises the following steps:

In this scheme, whether the load of a distributed storage node is too high is judged from the distributed storage node load evaluation index and its threshold, so that load abnormalities of distributed storage nodes are detected in time. The load spare resource information of the distributed storage nodes is obtained from the distributed storage node load information, and the distributed storage is dynamically adjusted according to the node matching index of the data block to be responded to. This improves the flexibility and scalability of the system, makes full use of the idle resources of the distributed storage nodes, and prevents an excessive load on certain nodes from slowing access to hotspot data.

Acquiring response information of the distributed storage nodes;

In the scheme, when a certain node fails, the system can quickly copy data from other nodes for recovery, so that the risk of data loss is reduced, and the availability of the system is ensured.

Referring to FIG. 5, further, in combination with the above intelligent data management method based on distributed storage, an intelligent data management system based on distributed storage is provided, which includes:

The main control module specifically comprises:

The information acquisition module specifically comprises:

The computing module specifically comprises:

In summary, the invention has the following advantages. Standard data is classified according to standard data characteristic information to obtain data block information, which raises the degree of intelligence of data management and improves data access efficiency. Data block hotspot data information is obtained from data block access information, the data block hotspot index is obtained from the hotspot data information, and the data blocks are classified by that index, which improves distributed storage efficiency while ensuring efficient access to frequently accessed data. Data block copies are stored according to the data block copy node matching index, which ensures that a copy can be accessed when a data block fails and improves the response rate of that copy. The distributed storage is dynamically adjusted according to the distributed storage node load evaluation index, which prevents the access speed of hotspot data from being affected.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the present invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. The intelligent data management method based on distributed storage is characterized by comprising the following steps of:

acquiring response information of the distributed storage nodes;

outputting response data according to the node information of the data copy;

the step of acquiring the data block storage planning information according to the data block hot spot data information and the distributed storage node information specifically comprises the following steps:

Acquiring data block storage planning information according to the distributed storage buffer layer information and the node storage information;

classifying the data blocks based on the data block hotspot indexes according to the data block hotspot data information to obtain data block classification information, wherein the method specifically comprises the following steps of:

the calculation formula of the data block hotspot index is as follows:

where Q is the data block hotspot index, S_i is the size of the i-th data item of the data block, S is the size of the data block, ω_ij is the access frequency of the j-th access mode of the i-th data item of the data block, w_j is the hotspot coefficient of the j-th access mode of the data block, n is the total number of data items in the data block, and m is the total number of access modes of the data block;

planning data block storage according to the data block ordering information to obtain node storage information, wherein the method specifically comprises the following steps:

the distributed storage node matching evaluation model is as follows:

In the formula, R(h, g) is the matching index of the h-th data block and the g-th distributed storage node, x_g represents the available capacity of the g-th distributed storage node, x_h represents the size of the h-th data block, T_j(h, g) represents the response time for the j-th access mode when the h-th data block is stored on the g-th distributed storage node, the access frequency of the j-th access mode is also an input, and m is the total number of access modes of the data block;

the dynamic adjustment of the distributed storage according to the load information of the distributed storage nodes specifically comprises:

where D is the distributed storage node load evaluation index, α, β and γ are distributed storage node load evaluation coefficients, μ is the CPU utilization rate of the distributed storage node, μ_0 is the standard CPU utilization rate of the distributed storage node, τ is the disk utilization rate of the distributed storage node, τ_0 is the standard disk utilization rate of the distributed storage node, ρ is the network bandwidth of the distributed storage node, σ_k is the access frequency of the k-th data block of the distributed storage node, θ_k is the access load coefficient of the k-th data block on the distributed storage node, and E is the total number of data blocks of the distributed storage node.

2. The intelligent data management method based on distributed storage according to claim 1, wherein the acquiring data block information based on data classification according to data information specifically comprises:

3. An intelligent data management system based on distributed storage, for implementing the intelligent management method according to any one of claims 1-2, comprising:

4. The intelligent data management system based on distributed storage according to claim 3, wherein the main control module specifically comprises:

5. The intelligent data management system based on distributed storage according to claim 3, wherein the information acquisition module specifically comprises:

6. The intelligent data management system based on distributed storage according to claim 3, wherein the computing module specifically comprises: