CN107330094B

CN107330094B - Bloom filter tree structure and key-value pair storage method for dynamically storing key-value pairs

Info

Publication number: CN107330094B
Application number: CN201710542207.5A
Authority: CN
Inventors: 潘海娜; 凌纯清; 谢鲲
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2020-06-16
Anticipated expiration: 2037-07-05
Also published as: CN107330094A

Abstract

The invention discloses a bloom filter tree structure for dynamically storing key value pairs and a key value pair storage method, wherein the bloom filter tree structure comprises a complete d-branch tree; the method is characterized in that each node of each complete d-ary tree is a bloom filter; each leaf node of each full d-ary tree represents a value; the storage unit size of each node is half of that of the parent node of the node, and the root node comprises d × k different hash functions, that is, the root node comprises d hash groups, and each group comprises k hash functions. The invention can greatly reduce the time for collecting query, reduce resource consumption, process dynamically arrived data and adapt to network environment in the application fields of generating a large amount of data and needing key value pair query, such as database interactive query, resource positioning in high-speed network, computer network monitoring and the like.

Description

Bloom filter tree structure and key-value pair storage method for dynamically storing key-value pairs

技术领域technical field

本发明涉及计算机网络领域和计算机系统存储领域，特别是高性能、高吞吐量的交互查询的应用领域，具体是一种可扩展的布鲁姆树对于键值对的存储结构及方法。The invention relates to the field of computer network and computer system storage, in particular to the application field of high-performance and high-throughput interactive query, in particular to an extensible Bloom tree storage structure and method for key-value pairs.

背景技术Background technique

近年来，随着计算机的飞速发展，数据库，网络和其他应用中的集合规模呈几何增长。存储和查询键值对(key，value)是计算机系统中常见的任务，这就需要设计对应的键值对存储数据结构，支持快速的键值对查询。键值对操作经常出现在网络和存储系统中，如key-value数据库MongoDB，CouchDB。每个放入到键值对存储系统中的唯一键key都对应一个value，例如(3，5)是一个键key为3，值value为5的键值对，将(3，5)存储到键值对存储系统后，可以通过查询键(key)为3，得到值(value)为5。In recent years, with the rapid development of computers, the size of collections in databases, networks and other applications has grown exponentially. Storing and querying key-value pairs (key, value) is a common task in computer systems, which requires designing a corresponding key-value pair storage data structure to support fast key-value pair query. Key-value pair operations often appear in networks and storage systems, such as key-value databases MongoDB and CouchDB. Each unique key key put into the key-value pair storage system corresponds to a value. For example, (3, 5) is a key-value pair with key key 3 and value value 5. Store (3, 5) in After the key-value pair is stored in the system, you can query the key (key) as 3 and get the value (value) as 5.

设计高效的键值对存储及查询结构带来了巨大的挑战。在一个2层交换机中，一个MAC地址关联一个唯一的端口。当要转发一帧时，搜索引擎会查询这一帧要转发的目的地址的MAC表，因此，将一个MAC地址映射到一个端口的问题就转换成了一个键值对查询问题，这个时候，MAC地址就被看成了key，而要查询的端口号就变成了value。由于MAC地址是持续添加到列表中的，因此，元素的大小是未知的。如果采用cell的结构来存储这些键值对，需要消耗大量的空间，而且在查找对应key的value时会消耗大量的时间；如果采用静态的布鲁姆过滤器结构存储键值对，只能处理静态数据，这在实际应用中很不现实。因此，在高速的计算机网络中，如何高效存储这些信息，快速查询对应的键值对成为挑战。Designing efficient key-value pair storage and query structures brings great challenges. In a Layer 2 switch, a MAC address is associated with a unique port. When a frame is to be forwarded, the search engine will query the MAC table of the destination address of the frame to be forwarded. Therefore, the problem of mapping a MAC address to a port is transformed into a key-value pair query problem. At this time, the MAC address The address is regarded as the key, and the port number to be queried becomes the value. Since MAC addresses are continuously added to the list, the size of the element is unknown. If the cell structure is used to store these key-value pairs, it will consume a lot of space, and it will consume a lot of time to find the value of the corresponding key; if the static Bloom filter structure is used to store the key-value pairs, it can only process Static data, which is very unrealistic in practical applications. Therefore, in a high-speed computer network, how to efficiently store this information and quickly query the corresponding key-value pair becomes a challenge.

布鲁姆过滤器(Bloom Filter)是一种空间节俭、查询高效的数据结构，它可以满足现如今生活中高效资源交互需求及查找需求，能够有效的表示数据集合。布鲁姆过滤器自1970年由B.Bloom提出以来，就被广泛应用于各种各样的计算机系统之中，用来表示庞大的数据集合，提高查询效率。布鲁姆过滤器结构实质是将集合中元素通过k个哈希函数映射到位向量中。布鲁姆过滤器在达到其高效表示集合的同时，进行元素查询时却存在一定的假阳性(某元素不属于集合而误判为属于集合中)误判率，而不存在假阴性(某元素属于集合而误判为不属于集合中)误判，在查询和存储上有较高的效率。Bloom filter is a space-saving and query-efficient data structure, which can meet the needs of efficient resource interaction and search in today's life, and can effectively represent data sets. Bloom filter has been widely used in various computer systems since it was proposed by B.Bloom in 1970 to represent huge data sets and improve query efficiency. The essence of the Bloom filter structure is to map the elements in the set to a bit vector through k hash functions. While the Bloom filter achieves its efficient representation of the set, there is a certain false positive (an element does not belong to the set but is misjudged as belonging to the set) false positive rate when performing element query, but there is no false negative (an element does not belong to the set). It belongs to the set and is misjudged as not belonging to the set) misjudgment, which has higher efficiency in query and storage.

但是传统的布鲁姆过滤器仅仅只能支持元素是否存在于集合的从属查询。如果元素为key，那么只能支持key是否存在于集合的从属查询，而无法支持(key,value)操作。因为布鲁姆过滤器无法直接存储value，因此并不能使用传统的布鲁姆过滤器对键值对进行操作。要使得布鲁姆过滤器支持键值对的基本操作，必须要改进传统的布鲁姆过滤器，设计新的布鲁姆过滤器结构。But traditional bloom filters can only support subordinate queries of whether an element exists in a collection. If the element is key, it can only support the subordinate query of whether the key exists in the collection, but cannot support the (key, value) operation. Because Bloom filters cannot store values directly, you cannot use traditional Bloom filters to operate on key-value pairs. To make the Bloom filter support the basic operations of key-value pairs, it is necessary to improve the traditional Bloom filter and design a new Bloom filter structure.

发明内容SUMMARY OF THE INVENTION

本发明所要解决的技术问题是，针对现有技术不足，提供一种动态存储键值对的布鲁姆过滤器树结构及键值对存储方法。The technical problem to be solved by the present invention is to provide a Bloom filter tree structure and a key-value pair storage method for dynamically storing key-value pairs in view of the deficiencies of the prior art.

为解决上述技术问题，本发明所采用的技术方案是：一种动态存储键值对的布鲁姆过滤器树结构，包括完全d叉树；每一棵完全d叉树的每一个节点都是一个布鲁姆过滤器；每一棵完全d叉树的每个叶子节点表示一个值value；每一个节点的存储单元大小是该节点的父节点的存储单元大小的一半，根节点包括d×k个不相同的哈希函数，即根节点包括d个哈希组，每组中包含k个哈希函数。In order to solve the above-mentioned technical problems, the technical solution adopted in the present invention is: a Bloom filter tree structure for dynamically storing key-value pairs, including a complete d-ary tree; each node of each complete d-ary tree is A Bloom filter; each leaf node of each complete d-ary tree represents a value value; the storage unit size of each node is half of the storage unit size of the parent node of the node, and the root node includes d×k There are different hash functions, that is, the root node includes d hash groups, and each group includes k hash functions.

相应地，本发明还提供了一种布鲁姆过滤器树结构进行键值对存储的方法，其包括：Correspondingly, the present invention also provides a method for storing key-value pairs in a Bloom filter tree structure, comprising:

插入操作：当需要插入一个键值对(key,value)时，首先查看value是否已经插入到布鲁姆过滤器中，如果未插入，则需要进行后续的增加新value操作；如果value已经存在于布鲁姆过滤器树中，则直接进行键值对插入，根据value查找该value对应的叶节点，得到此叶节点位置编码，确定一条从根节点到此叶节点的唯一路径，计算根节点的两组哈希函数，即对key使用k个哈希函数h_(i,1),h_(i,2),...,h_(i,k)，计算h_(i,1)(key),h_(i,2)(key),...,h_(i,k)(key)，；其中i表示选取哈希函数的组号，i＝1或2；将i＝1计算的哈希值保存在数组A中，将i＝2计算的哈希值保存在数组B中；接着，根据叶节点的第一位编码，在根节点选取A、B数组中的一组值进行插入，即对A或B数组进行右移零位操作，然后根据叶节点的第一位编码，得到下一个需要进行插入操作的布鲁姆过滤器，再根据叶节点的第二位编码，选取A、B数组中的一组值进行右移一位操作，得到插入位置，进行插入；继续进行上述操作，对每一层布鲁姆过滤器插入key，直到插到value对应的叶节点为止；Insertion operation: When a key-value pair (key, value) needs to be inserted, first check whether the value has been inserted into the Bloom filter. If it has not been inserted, a subsequent operation of adding a new value is required; if the value already exists in the Bloom filter In the Bloom filter tree, the key-value pair is inserted directly, the leaf node corresponding to the value is searched according to the value, the position code of the leaf node is obtained, a unique path from the root node to the leaf node is determined, and the root node is calculated. Two sets of hash functions, that is, use k hash functions h _(i,1) ,h _(i,2) ,...,h _(i,k) for the key, and calculate h _(i,1) (key) ,h _(i,2) (key),...,h _(i,k) (key),; where i represents the group number of the selected hash function, i=1 or 2; The hash value is stored in the array A, and the hash value calculated by i=2 is stored in the array B; then, according to the first bit code of the leaf node, a set of values in the A and B arrays are selected at the root node for insertion, That is, perform a right-shift zero-bit operation on the A or B array, and then obtain the next Bloom filter that needs to be inserted according to the first bit code of the leaf node, and then select A, A group of values in the B array is shifted to the right by one bit to obtain the insertion position and inserted; continue the above operation, and insert the key into each layer of Bloom filter until the leaf node corresponding to the value is inserted;

查询操作：需要查询一个键值key对应的value时，首先，计算根节点的两组哈希函数h_(i,1)(key),h_(i,2)(key),...,h_(i,k)(key)(i＝1或2)，将它们的值分别保存在A、B两组数组中；在根节点分别对A、B两数组值对应的位置单元进行查询操作，查到第一位编码值；根据得到的编码值，转到下一个节点继续查询，此时，对A、B两数组进行右移操作，在对应的位置单元进行查询，得到一个编码值；继续进行上述操作，直到查找到叶节点为止，最后得到一个完整的编码值，根据这个叶节点编码进行解码操作得到对应的value；Query operation: When you need to query the value corresponding to a key value key, first, calculate the two sets of hash functions h _(i,1) (key),h _(i,2) (key),...,h of the root node. _(i,k) (key) (i=1 or 2), save their values in the two arrays of A and B respectively; at the root node, query the location units corresponding to the values of the two arrays of A and B, respectively, Find the first code value; according to the obtained code value, go to the next node to continue the query, at this time, perform a right-shift operation on the two arrays A and B, and query at the corresponding position unit to obtain a code value; continue Carry out the above operations until the leaf node is found, and finally obtain a complete encoded value, and perform the decoding operation according to the encoding of the leaf node to obtain the corresponding value;

增加新value操作：要添加一个新的value到布鲁姆过滤器中，分以下两种情况：如果原来的布鲁姆过滤器是一棵未满二叉树，当需要添加新的value时，直接在布鲁姆过滤器的末尾添加新的叶节点，这个叶节点表示新添加的value；如果原来的布鲁姆过滤器是一棵满二叉树，则在根节点的上方添加一个新布鲁姆过滤器，这个新添加的布鲁姆过滤器的大小是原布鲁姆过滤器树的根节点的2倍，此时，这个新的布鲁姆过滤器就变成了新的布鲁姆过滤器树的根节点，原来的布鲁姆过滤器就变成了这个新根节点的左子树，再创建一棵比原布鲁姆过滤器树高少一层的满二叉树作为新的根节点的右子树，把原根节点上置1的位置全部左移一位，插入到新的根节点中，此时新的根节点的两组H₃哈希函数比原来的两组H₃哈希函数多一层，即在基函数中多选取一行，继续在布鲁姆过滤器的末尾添加新的叶节点。Add new value operation: To add a new value to the Bloom filter, there are two cases: if the original Bloom filter is a binary tree that is not full, when a new value needs to be added, directly in Add a new leaf node at the end of the Bloom filter, which represents the newly added value; if the original Bloom filter is a full binary tree, add a new Bloom filter above the root node , the size of the newly added bloom filter is twice the root node of the original bloom filter tree, at this point, the new bloom filter becomes a new bloom filter tree The root node of the original Bloom filter becomes the left subtree of the new root node, and then a full binary tree that is one level less than the original Bloom filter tree is created as the right subtree of the new root node. Subtree, move all the positions set to 1 on the original root node to the left by one place, and insert them into the new root node. At this time, the two sets of _H3 hash functions of the new root node are more than the original two sets of _H3 hash functions. One more layer, that is, select one more row in the basis function, and continue to add new leaf nodes at the end of the Bloom filter.

所述插入操作中，在根节点选取待插入的数组的方法为：如果编码值为0，则选取A数组，如果编码值为1，则选取B数组。In the inserting operation, the method for selecting the array to be inserted at the root node is as follows: if the encoded value is 0, select the A array, and if the encoded value is 1, select the B array.

所述插入操作中，根据叶节点的第一位编码，得到下一个需要进行插入操作的布鲁姆过滤器的方法为：如果编码值为0，则操作左节点，如果编码值为1，则操作右节点。In the insertion operation, according to the first code of the leaf node, the method for obtaining the next Bloom filter that needs to be inserted into the operation is: if the code value is 0, the left node is operated; if the code value is 1, then Operate the right node.

与现有技术相比，本发明所具有的有益效果为：本发明在数据库交互查询、高速网络中资源定位、计算机网络监控等产生大量数据、需要进行键值对查询的应用领域，可以大大减少集合查询的时间，降低资源消耗，处理动态到达的数据，适应网络环境。Compared with the prior art, the present invention has the beneficial effects as follows: the present invention can greatly reduce the application fields in which a large amount of data is generated and the key-value pair query needs to be performed, such as database interactive query, resource location in high-speed network, computer network monitoring, etc. Set query time, reduce resource consumption, process dynamically arriving data, and adapt to the network environment.

附图说明Description of drawings

图1是布鲁姆树的结构示意图，它将布鲁姆过滤器与二叉树的结构相结合，每一个节点都是一个布鲁姆过滤器。根节点有两组H₃哈希函数

每组有k个哈希函数。一个叶节点就表示一个value，根据叶节点的编码可以得到一条由根节点到叶节点的唯一路径。根据叶节点编码，如果编码值为0，则对第一组哈希函数进行移位操作；如果编码值为1，则对第二组哈希函数进行移位操作。图1中

就是

得来的，

就是

得来的，而

是

得来的，

是

得来的。Figure 1 is a schematic diagram of the structure of a Bloom tree, which combines a Bloom filter with the structure of a binary tree, and each node is a Bloom filter. The root node has two sets of _H3 hash functions

Each group has k hash functions. A leaf node represents a value, and a unique path from the root node to the leaf node can be obtained according to the encoding of the leaf node. According to the leaf node encoding, if the encoded value is 0, the first group of hash functions is shifted; if the encoded value is 1, the second group of hash functions is shifted. In Figure 1

that is

come,

that is

come, and

Yes

come,

Yes

come.

图2是在未满二叉树中增加新的value操作示意图。当一个未存在于叶节点的value需要插入时，如果二叉树不是满二叉树，则可以直接在树的末尾添加叶节点，表示插入的value。如图中编码为11的叶节点就是新添加的value。Figure 2 is a schematic diagram of the operation of adding a new value to an underfilled binary tree. When a value that does not exist in a leaf node needs to be inserted, if the binary tree is not a full binary tree, a leaf node can be added directly at the end of the tree to represent the inserted value. The leaf node coded as 11 in the figure is the newly added value.

图3是在满二叉树中增加新的value操作示意图。当一个未存在于叶节点的value需要插入时，如果二叉树是满二叉树，则不可以在原树上添加叶节点，此时需要往上扩展一层，如图3中新添加的level 3。原二叉树就变成了level 3的左子树，再构建一棵比原二叉树少一层的满二叉树作为level 3的右子树。此时，level 3也存在两组H₃哈希函数

其中

是由

得到，将原二叉树根节点中所有置向左移一位，插入到level 3中；而

则是从原来的基矩阵中比

多选取一行即可。此时构建了一棵新的未满二叉树，可以直接在树的末尾添加叶节点，表示插入的value。如图中编码为100的叶节点就是新添加的value。Figure 3 is a schematic diagram of the operation of adding a new value to a full binary tree. When a value that does not exist in a leaf node needs to be inserted, if the binary tree is a full binary tree, a leaf node cannot be added to the original tree. At this time, it is necessary to extend one level up, as shown in the newly added level 3 in Figure 3. The original binary tree becomes the left subtree of level 3, and a full binary tree with one level less than the original binary tree is constructed as the right subtree of level 3. At this time, there are also two sets of H3 hash functions at level ₃

in

By

Get, move all the positions in the root node of the original binary tree one bit to the left, and insert it into level 3; and

is compared from the original basis matrix

Just select more than one row. At this point, a new under-full binary tree is constructed, and a leaf node can be added directly at the end of the tree to represent the inserted value. The leaf node coded as 100 in the figure is the newly added value.

图4是实施环境数据详细信息，包括数据来源，数据时间，包大小，本实施例与最近几年研究成果进行了对比，其中包括：Bloom Tree是2014年INFOCOM中的论文“Bloom tree:A search tree based on bloom filters for multiple-set membership testing”提出的一种将树和布鲁姆过滤器结构结合起来进行键值对查询的结构，该算法在每一个布鲁姆过滤器节点上都使用了d组哈希函数，从根节点开始计算，直到找到value对应的叶节点为止，该算法只能对静态数据进行处理；COMB是2012年IEEE/ACM Transactions onNetworking中的论文“Fast dynamic multiple-set membership testing usingcombinatorial bloom filters”提出的一种基于布鲁姆过滤器的键值对查询算法，该算法将value值进行编码后，对已知的value组数选取合适的布鲁姆过滤器数目和需要插值的布鲁姆过滤器数目，查询时则需要对所有布鲁姆过滤器进行查询，得到相应的编码。Figure 4 is the detailed information of the implementation environment data, including data source, data time, and packet size. The present embodiment is compared with the research results in recent years, including: Bloom Tree is a paper in INFOCOM in 2014 "Bloom tree: A search tree based on bloom filters for multiple-set membership testing", a structure that combines tree and bloom filter structures for key-value pair queries, the algorithm uses d on each bloom filter node The group hash function is calculated from the root node until the leaf node corresponding to the value is found. This algorithm can only process static data; COMB is the paper "Fast dynamic multiple-set membership testing" in IEEE/ACM Transactions on Networking in 2012 A key-value pair query algorithm based on Bloom filter proposed by usingcombinatorial bloom filters", the algorithm encodes the value value, selects the appropriate number of Bloom filters for the known number of value groups and the number of bloom filters that need to be interpolated. The number of Bloom filters. When querying, you need to query all Bloom filters to get the corresponding codes.

图5(a)～图5(f)是固定大小的布鲁姆过滤器树m＝2²⁰bit，在不同哈希函数个数下，使用三种不同算法SBFT，Bloom Tree和COMB，处理1024组数据，在对数据的平均处理时间上进行比较。图5(a)为使用数据集MAWI 1进行仿真实验得到的实验结果示意图，图5(b)为使用数据集MAWI 2进行仿真实验得到的实验结果示意图，图5(c)为使用数据集MAWI 3进行仿真实验得到的实验结果示意图，图5(d)为使用数据集ClarkNet-HTTP进行仿真实验得到的实验结果示意图，图5(e)为使用数据集UMass进行仿真实验得到的实验结果示意图，图5(f)为使用数据集TKN进行仿真实验得到的实验结果示意图。图5(a)～图5(f)中所有数据结果显示，针对每种数据集，我们所提出的算法SBFT平均处理数据耗时最少。与Bloom Tree相比，我们的算法只需要在根节点选取d组哈希函数计算，然后进行移位操作，而BloomTree则需要每个节点都选取d组哈希函数计算，耗时较大；与COMB相比也类似，COMB对每一个布鲁姆过滤器都是需要选取多个哈希函数，耗时大，但COMB需要操作的布鲁姆过滤器数目比Bloom Tree要少，因此，耗时会比Bloom Tree小。Figures 5(a) to 5(f) are Bloom filter trees of fixed size m=2 ²⁰ bits. Under different numbers of hash functions, three different algorithms SBFT, Bloom Tree and COMB are used to process 1024 Group data, compared on the average processing time of the data. Figure 5(a) is a schematic diagram of the experimental results obtained by using the data set MAWI 1 for simulation experiments, Figure 5(b) is a schematic diagram of the experimental results obtained by using the data set MAWI 2 for simulation experiments, and Figure 5(c) is using the data set MAWI. 3. The schematic diagram of the experimental results obtained by the simulation experiment. Figure 5(d) is a schematic diagram of the experimental results obtained by the simulation experiment using the data set ClarkNet-HTTP, and Figure 5(e) is a schematic diagram of the experimental results obtained by the simulation experiment using the data set UMass. Figure 5(f) is a schematic diagram of the experimental results obtained by using the data set TKN for simulation experiments. All the data results in Fig. 5(a)-Fig. 5(f) show that for each dataset, our proposed algorithm SBFT takes the least time to process the data on average. Compared with Bloom Tree, our algorithm only needs to select d groups of hash functions at the root node for calculation, and then perform the shift operation, while BloomTree requires each node to select d groups of hash functions for calculation, which is time-consuming; and COMB is also similar. COMB needs to select multiple hash functions for each Bloom filter, which takes a lot of time. However, the number of Bloom filters that COMB needs to operate is less than that of Bloom Tree, so it takes a long time. Will be smaller than Bloom Tree.

图6是使用三种算法处理静态数据时内存消耗状态示意图。数据结果显示，三种算法消耗的内存是一样的，也就是说，我们的算法在处理数据耗时最少的情况下也不会占用多余的内存。Figure 6 is a schematic diagram of the memory consumption state when three algorithms are used to process static data. The data results show that the memory consumption of the three algorithms is the same, that is to say, our algorithm does not consume excess memory when processing the data with the least amount of time.

图7(a)～图7(f)是数据包不断增加时，三种算法处理数据包耗时情况示意图。每组使用k＝3个哈希函数。由于Bloom Tree和COMB两种算法是在已知数据量情况下进行操作的，因此当数据包还未插入时，就需要把结构搭建好。而我们的算法则可以对未知数据量进行操作，随着数据包的到来处理数据，这是动态处理数据的过程。图7(a)为使用数据集MAWI1进行仿真实验得到的实验结果示意图，图7(b)为使用数据集MAWI 2进行仿真实验得到的实验结果示意图，图7(c)为使用数据集MAWI 3进行仿真实验得到的实验结果示意图，图7(d)为使用数据集ClarkNet-HTTP进行仿真实验得到的实验结果示意图，图7(e)为使用数据集UMass进行仿真实验得到的实验结果示意图，图7(f)为使用数据集TKN进行仿真实验得到的实验结果示意图。图7中所有数据结果显示，针对每种数据集，我们的算法处理数据包是一个动态的过程，并且处理数据耗时比Bloom Tree和COMB都要少，而Bloom Tree和COMB处理数据包则是一个静态过程，并且耗时较多。Figures 7(a) to 7(f) are schematic diagrams of the time-consuming situation of the three algorithms for processing data packets when the data packets increase continuously. Each group uses k=3 hash functions. Since the two algorithms of Bloom Tree and COMB operate under the condition of known data volume, the structure needs to be set up when the data packet has not been inserted. Our algorithm, on the other hand, can operate on an unknown amount of data and process data as packets arrive, which is a process of dynamically processing data. Figure 7(a) is a schematic diagram of the experimental results obtained by using the data set MAWI1 for simulation experiments, Figure 7(b) is a schematic diagram of the experimental results obtained by using the data set MAWI 2 for simulation experiments, and Figure 7(c) is a schematic diagram of the experimental results using the data set MAWI 3 Figure 7(d) is a schematic diagram of the experimental results obtained by the simulation experiment using the data set ClarkNet-HTTP, and Figure 7(e) is a schematic diagram of the experimental results obtained by using the data set UMass to carry out the simulation experiment. 7(f) is a schematic diagram of the experimental results obtained from the simulation experiment using the dataset TKN. All data results in Figure 7 show that for each data set, our algorithm processes data packets is a dynamic process, and processing data takes less time than Bloom Tree and COMB, while Bloom Tree and COMB process data packets by It is a static process and takes more time.

图8(a)～图8(f)是数据包不断增加时，三种算法处理数据包消耗内存情况示意图。每组使用k＝3个哈希函数。图8(a)为使用数据集MAWI 1进行仿真实验得到的实验结果示意图，图8(b)为使用数据集MAWI 2进行仿真实验得到的实验结果示意图，图8(c)为使用数据集MAWI 3进行仿真实验得到的实验结果示意图，图8(d)为使用数据集ClarkNet-HTTP进行仿真实验得到的实验结果示意图，图8(e)为使用数据集UMass进行仿真实验得到的实验结果示意图，图8(f)为使用数据集TKN进行仿真实验得到的实验结果示意图。图8(a)～图8(f)中所有数据结果显示，针对每种数据集，我们的算法处理数据包是一个动态的过程，我们的算法是边处理数据边搭建结构的过程，而Bloom Tree和COMB处理数据包则是一个静态过程，在处理数据之前就已经把结构搭建好了，这也说明它们的结构只能处理静态数据。Figures 8(a) to 8(f) are schematic diagrams of the memory consumption of the three algorithms for processing data packets when the data packets continue to increase. Each group uses k=3 hash functions. Figure 8(a) is a schematic diagram of the experimental results obtained by using the data set MAWI 1 for simulation experiments, Figure 8(b) is a schematic diagram of the experimental results obtained by using the data set MAWI 2 for simulation experiments, and Figure 8(c) is using the data set MAWI. 3. The schematic diagram of the experimental results obtained by the simulation experiment. Figure 8(d) is a schematic diagram of the experimental results obtained by the simulation experiment using the data set ClarkNet-HTTP, and Figure 8(e) is a schematic diagram of the experimental results obtained by the simulation experiment using the data set UMass. Figure 8(f) is a schematic diagram of the experimental results obtained by using the data set TKN for simulation experiments. All the data results in Figures 8(a) to 8(f) show that for each data set, our algorithm is a dynamic process to process data packets. Our algorithm is a process of building structures while processing data, while Bloom Tree and COMB processing data packets is a static process, and the structure has been built before processing data, which also means that their structure can only process static data.

具体实施方式Detailed ways

本实施中处理静态数据时，选取的内存大小m＝2²⁰bit，根节点占用了95325bit，选取了18行8列的H₃哈希函数基矩阵，在根节点上提取了基矩阵的前16行，每组选取k＝3个哈希函数，处理组数g＝1024组的数据，此时树高h＝10。图1是静态布鲁姆树结构示意图，可以由此图分析键值对插入，查询过程。When processing static data in this implementation, the selected memory size is m=2 ²⁰ bits, the root node occupies ₉₅₃₂₅ bits, the H3 hash function base matrix with 18 rows and 8 columns is selected, and the first 16 bits of the base matrix are extracted from the root node. row, k=3 hash functions are selected for each group, and the data of the group number g=1024 groups are processed, and the tree height h=10 at this time. Figure 1 is a schematic diagram of a static Bloom tree structure, from which the key-value pair insertion and query process can be analyzed.

插入操作Insert(key,value)过程：根据value查找该value对应的叶节点，得到此叶节点位置编码，就可以确定一条从根节点到此叶节点的唯一路径。计算根节点的两组哈希函数，对key使用k个哈希函数h_(i,1),h_(i,2),...,h_(i,k)(i表示选取哈希函数的组号)计算h_(i,1)(key),h_(i,2)(key),...,h_(i,k)(key)，将它们的值分别保存在两个数组A、B中。接着，根据叶节点的第一位编码，在根节点选取A、B数组中的一组值进行插入(如果编码值为0，则选取A数组，如果编码值为1，则选取B数组)，即相当于对A、B数组进行了右移零位操作(A>>0或B>>0)。然后根据叶节点的第一位编码，得到下一个需要进行插入操作的布鲁姆过滤器(如果编码值为0，则操作左节点，如果编码值为1，则操作右节点)。再根据叶节点的第二位编码，选取A、B数组中的一组值进行右移一位操作(A>>1或B>>1)，得到插入位置。继续进行类似的操作，对每一层布鲁姆过滤器插入key，直到插到value对应的叶节点为止。如图1，假设插入的键值对中value对应的编码为01，计算根节点的两组哈希函数，将它们的值分别保存在两个数组A、B中；由于编码第一位为0，则选择数组A在根节点右移0位A>>0插入根节点中，下一步到达左节点；由于编码第二位为1，则选择数组B在此节点右移一位B>>1插入节点中，下一步到达右节点；最后直接对数组B右移两位B>>2插入节点，完成插入操作。Insert (key, value) process: Find the leaf node corresponding to the value according to the value, get the position code of the leaf node, and then determine a unique path from the root node to the leaf node. Calculate the two sets of hash functions of the root node, and use k hash functions h _(i,1) ,h _(i,2) ,...,h _(i,k) for the key (i represents the selected hash function group number) to calculate h _(i,1) (key),h _(i,2) (key),...,h _(i,k) (key), and store their values in two arrays A, in B. Then, according to the first code of the leaf node, select a set of values in the A and B arrays at the root node for insertion (if the code value is 0, select the A array, if the code value is 1, select the B array), That is, it is equivalent to right-shifting the zero-bit operation on the A and B arrays (A>>0 or B>>0). Then, according to the first bit code of the leaf node, the next Bloom filter that needs to be inserted is obtained (if the code value is 0, the left node is operated, and if the code value is 1, the right node is operated). Then, according to the second bit code of the leaf node, select a set of values in the A and B arrays to perform a right-shift operation (A>>1 or B>>1) to obtain the insertion position. Continue to perform similar operations, inserting the key into each layer of Bloom filters until the leaf node corresponding to the value is inserted. As shown in Figure 1, assuming that the code corresponding to the value in the inserted key-value pair is 01, calculate the two sets of hash functions of the root node, and store their values in the two arrays A and B respectively; since the first bit of the code is 0 , then select array A and move 0 bits to the right of the root node, A>>0, and insert it into the root node, and then reach the left node in the next step; since the second bit of the code is 1, then select the array B and shift one bit to the right of this node B>>1 Insert into the node, the next step is to reach the right node; finally, directly move the array B to the right by two bits B>>2 to insert the node to complete the insertion operation.

查询Query(key)过程：首先，计算根节点的两组哈希函数h_(i,1)(key),h_(i,2)(key),...,h_(i,k)(key)(i＝1或2)，其中i表示选取哈希函数的组号，将i＝1计算的哈希值保存在数组A中，将i＝2计算的哈希值保存在数组B中。在根节点分别对A、B两数组值对应的位置单元进行查询操作，如果对应位置的值都为1，即为满足，查到第一位编码值(如果A数组满足，则编码为0，如果B数组满足，则编码为1，如果都不满足，则该key值不存在，结束查询)。根据得到的编码值，转到下一个节点继续查询，此时，需要对A、B两数组进行右移操作(A>>1，B>>1)，在对应的位置单元进行查询，得到一个编码值。继续进行类似操作，直到查找到叶节点为止。最后得到一个完整的编码值，根据这个叶节点编码就可以得到对应的value。由图1举例说明：假设查询key，首先计算根节点的两组哈希函数，将它们的值分别保存在A、B两组数组中；将A、B两数组的值分别在根节点中查询，A数组中的值在此节点中对应位置单元都为1，则记录下第一位编码为0，下一个操作节点是左节点；将A、B两数组右移一位A>>1，B>>1，分别在此节点中进行查询，B数组中的值在此节点中对应位置单元都为1，则记录下第二位编码为1，下一个操作节点是右节点；最后一步，直接对B数组右移两位B>>2进行查询，即最后的验证，这一步不用记录编码值，验证通过则查询到的编码为01，在表中找到对应的value即可，完成查询。Query (key) process: First, calculate the two sets of hash functions h _(i,1) (key),h _(i,2) (key),...,h _(i,k) (key) of the root node ) (i=1 or 2), where i represents the group number of the selected hash function, the hash value calculated by i=1 is stored in array A, and the hash value calculated by i=2 is stored in array B. At the root node, query the position units corresponding to the two array values of A and B respectively. If the values of the corresponding positions are all 1, it is satisfied, and the first code value is found (if the A array is satisfied, the code is 0, If the B array is satisfied, the code is 1. If it is not satisfied, the key value does not exist, and the query ends). According to the obtained code value, go to the next node to continue the query. At this time, the two arrays A and B need to be shifted to the right (A>>1, B>>1), and the corresponding position unit is queried to get a encoded value. Continue similar operations until a leaf node is found. Finally, a complete encoded value is obtained, and the corresponding value can be obtained according to the encoding of this leaf node. Figure 1 illustrates as an example: Assuming to query the key, first calculate the two sets of hash functions of the root node, and store their values in the two arrays A and B respectively; query the values of the two arrays A and B in the root node respectively. , the values in the A array are all 1 in the corresponding position unit in this node, then the first bit code is recorded as 0, and the next operation node is the left node; move the A and B arrays to the right by one bit A>>1, B>>1, query in this node respectively, the value in the B array is 1 in the corresponding position unit in this node, then record the second bit code as 1, and the next operation node is the right node; the last step, Directly move the B array to the right by two bits B>>2 to query, that is, the final verification. This step does not need to record the code value. If the verification passes, the query code is 01, and the corresponding value can be found in the table to complete the query.

本实施例中处理动态数据时，增加新value操作分两种情况：如果原来的布鲁姆树是一棵未满二叉树，当需要添加新的value时，可以直接在树的末尾添加新的叶节点，这个叶节点就表示新添加的value，如图2编码为11的新的叶子节点；如果原来的布鲁姆树是一棵满二叉树，此时就需要在根节点的上方添加一个新布鲁姆过滤器，这个布鲁姆过滤器的大小是原根节点的2倍，此时，这个新的布鲁姆过滤器就是新的根节点，原来的布鲁姆树就变成了这个根节点的左子树，再创建一棵比原布鲁姆树高少一层的满二叉树作为新的根节点的右子树，把原根节点上置1的位置全部左移一位，插入到新的根节点中，此时根节点的两组H₃哈希函数要比原来的两组H₃哈希函数多一层，即在基函数中多选取一行，这时，就可以继续在树的末尾添加新的叶节点了，如图3中的编码为100的新的叶节点。When processing dynamic data in this embodiment, the operation of adding a new value can be divided into two cases: if the original Bloom tree is a binary tree that is not full, when a new value needs to be added, a new leaf can be added directly at the end of the tree node, this leaf node represents the newly added value, as shown in Figure 2, the new leaf node coded as 11; if the original Bloom tree is a full binary tree, then a new cloth needs to be added above the root node. Bloom filter, the size of this Bloom filter is twice the size of the original root node, at this time, the new Bloom filter is the new root node, and the original Bloom tree becomes this root The left subtree of the node, and then create a full binary tree that is one level less than the original Bloom tree as the right subtree of the new root node. In the new root node, the two sets of _H3 hash functions of the root node are one layer more than the original two sets of _H3 hash functions, that is, one more row is selected in the base function, and at this time, the tree can be continued. A new leaf node is added at the end of , such as the new leaf node with the code 100 in Figure 3.

Claims

1. A method for storing key-value pairs in a Bloom filter tree structure, the Bloom filter tree structure includes a complete d-ary tree, and each node of each complete d-ary tree is a Bloom Filter; each leaf node of each complete d-ary tree represents a value value; the storage unit size of each node is half of the storage unit size of the parent node of the node, and the root node includes d×k different The hash function, that is, the root node includes d hash groups, and each group includes k hash functions; it is characterized in that, it includes:

Insertion operation: When a key-value pair (key, value) needs to be inserted, first check whether the value has been inserted into the Bloom filter. If it has not been inserted, a subsequent operation of adding a new value is required; if the value already exists in the Bloom filter In the Bloom filter tree, the key-value pair is inserted directly, the leaf node corresponding to the value is searched according to the value, the position code of the leaf node is obtained, a unique path from the root node to the leaf node is determined, and the root node is calculated. Two sets of hash functions, that is, k hash functions h _i,1 ,hi _,2 ,...,hi _,k are used for the key to calculate h _i,1 key,hi _,2 key,... ,h _i,k key, where i represents the group number of the selected hash function, i=1 or 2; the hash value calculated by i=1 is stored in array A, and the hash value calculated by i=2 is stored in In the array B; then, according to the first bit code of the leaf node, select a set of values in the A and B arrays at the root node for insertion, that is, perform a right-shift zero-bit operation on the A or B array, and then according to the first bit of the leaf node. One-bit encoding to get the next Bloom filter that needs to be inserted, and then according to the second-bit encoding of the leaf node, select a set of values in the A and B arrays to perform a right-shift operation to obtain the insertion position, and perform Insert; continue the above operation, insert the key into each layer of Bloom filter until the leaf node corresponding to the value is inserted;

Query operation: When you need to query the value corresponding to a key value key, first, calculate the two sets of hash functions of the root node h _i,1 key,h _i,2 key,...,hi _,k key, i=1 Or 2, save their values in the two arrays of A and B respectively; at the root node, query the position units corresponding to the values of the two arrays A and B, and find the first code value; according to the obtained code value , go to the next node to continue the query, at this time, perform the right-shift operation on the two arrays A and B, and query the corresponding position unit to get a coded value; continue the above operation until the leaf node is found, and finally get A complete encoded value, decoded according to the leaf node code to obtain the corresponding value;

Add new value operation: To add a new value to the Bloom filter, there are two cases: if the original Bloom filter is a binary tree that is not full, when a new value needs to be added, directly in Add a new leaf node at the end of the Bloom filter, which represents the newly added value; if the original Bloom filter is a full binary tree, add a new Bloom filter above the root node , the size of the newly added bloom filter is twice the root node of the original bloom filter tree, at this point, the new bloom filter becomes a new bloom filter tree The root node of the original Bloom filter becomes the left subtree of the new root node, and then a full binary tree that is one level less than the original Bloom filter tree is created as the right subtree of the new root node. Subtree, move all the positions set to 1 on the original root node to the left by one place, and insert them into the new root node. At this time, the two sets of _H3 hash functions of the new root node are more than the original two sets of _H3 hash functions. One more layer, that is, select one more row in the basis function, and continue to add new leaf nodes at the end of the Bloom filter.

2. method according to claim 1, is characterized in that, in described inserting operation, the method that selects the array to be inserted at root node is: if the coding value is 0, then selects the A array, if the coding value is 1, Then select B array.

3. method according to claim 1, is characterized in that, in described insert operation, according to the first code of leaf node, the method that obtains the Bloom filter that needs to carry out insert operation next is: if coded value If it is 0, operate the left node, if the encoded value is 1, operate the right node.