CN116932850A

CN116932850A - Data indexing method, device, electronic equipment and storage medium

Info

Publication number: CN116932850A
Application number: CN202210372194.2A
Authority: CN
Inventors: 彭志成; 欧阳金华; 冯世恺
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2023-10-24
Anticipated expiration: 2042-04-08
Also published as: CN116932850B

Abstract

The application provides a data indexing method, a device, an electronic device and a storage medium, wherein the embodiment of the application can be applied to multimedia data indexing scenes such as video, voice and the like, and the method comprises the following steps: according to the inverted chain length of each index fragment, splicing inverted chain data with the inverted chain length not greater than a preset threshold value into a first index block, and splicing inverted chain data with the inverted chain length greater than the preset threshold value into a second index block; storing the first index block and the second index block in a storage block; establishing a hash table, wherein the hash table comprises a plurality of key values and code values corresponding to the key values, the key values respectively correspond to different index fragments, and the code values comprise index modes corresponding to the index fragments; and carrying out data indexing of the fragments to be indexed in the storage block according to the key values of the fragments to be indexed and the corresponding indexing modes. The technical scheme of the embodiment of the application rapidly acquires the inverted chain data of the fragments to be indexed.

Description

Data indexing method, device, electronic equipment and storage medium

技术领域Technical field

本申请涉及信息检索技术领域，具体而言，涉及一种数据索引方法、装置、电子设备及存储介质。The present application relates to the field of information retrieval technology, specifically, to a data indexing method, device, electronic equipment and storage medium.

背景技术Background technique

在搜索引擎中，倒排索引是搜索引擎中数据索引的常用方式，目前的倒排索引方法大多是一条拉链使用一种查找方式，在索引存储方面为了节省内存，会进行压缩，如对于postinglist(倒排链)的存储设计有无压缩的连续存储、带变长编码的压缩存储、带PForDelta(倒排索引压缩算法)的压缩存储等，同时采用分块的方式来拆分存储超长的倒排链，块之间使用skiplist(跳表)来进行查找加速，在查找的时候都是用统一的二分查找或者顺序查找来寻找给定的term(索引分片)的文档集合。In search engines, inverted index is a common method for data indexing in search engines. Most of the current inverted index methods use one search method for one zipper. In order to save memory in index storage, compression is performed, such as for postinglist( The storage design of the inverted chain includes uncompressed continuous storage, compressed storage with variable length encoding, compressed storage with PForDelta (inverted index compression algorithm), etc., and at the same time, the block method is used to split and store the extremely long inverted chain. Chain arrangement, use skiplist (skip list) between blocks to speed up search. When searching, a unified binary search or sequential search is used to find the document collection of a given term (index fragmentation).

上述这种单一倒排索引存储方式由于在前期进行压缩，后续查找时的在线解压缩耗时会增加查找的时延，且二分索引方式在某些情况下索引时延长，无法满足快速数据索引运算得到term文档集合的需求。The above-mentioned single inverted index storage method is compressed in the early stage, and the online decompression time during subsequent searches will increase the search delay. In addition, the binary index method prolongs the indexing time in some cases and cannot meet the needs of fast data index operations. Get the requirements for term document collection.

发明内容Contents of the invention

为解决上述技术问题，本申请的实施例提供了一种数据索引方法及装置、电子设备、计算机可读存储介质。In order to solve the above technical problems, embodiments of the present application provide a data indexing method and device, electronic equipment, and computer-readable storage media.

根据本申请实施例的一个方面，提供了一种数据索引方法，包括：根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于所述预设阈值的倒排链数据拼接为第二索引块；将所述第一索引块和所述第二索引块存储在存储块中；建立哈希表，所述哈希表包括多个关键值和各关键值对应的编码值，所述多个关键值分别对应不同的索引分片，所述编码值包括对应索引分片的索引方式，各索引分片的索引方式与对应索引分片的倒排链数据所在的索引块相关；根据待索引分片的关键值和对应的索引方式在所述存储块中进行所述待索引分片的数据索引。According to one aspect of the embodiment of the present application, a data indexing method is provided, including: according to the inverted chain length of each index fragment, splicing inverted chain data whose inverted chain length is not greater than a preset threshold into a first index block, and splice the inverted chain data whose inverted chain length is greater than the preset threshold into a second index block; store the first index block and the second index block in a storage block; establish a hash table , the hash table includes a plurality of key values and a coding value corresponding to each key value, the plurality of key values respectively correspond to different index fragments, the coding value includes an indexing method corresponding to the index fragment, and each index fragment The indexing method of a slice is related to the index block where the inverted chain data of the corresponding index fragment is located; the data of the fragment to be indexed is indexed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing mode.

根据本申请实施例的一个方面，提供了一种数据索引装置，包括：索引块获取模块，配置为根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于所述预设阈值的倒排链数据拼接为第二索引块；存储模块，配置为将所述第一索引块和所述第二索引块存储在存储块中；哈希表建立模块，配置为建立哈希表，所述哈希表包括多个关键值和各关键值对应的编码值，所述多个关键值分别对应不同的索引分片，所述编码值包括对应索引分片的索引方式，各索引分片的索引方式与对应索引分片的倒排链数据所在的索引块相关；索引模块，配置为根据待索引分片的关键值和对应的索引方式在所述存储块中进行所述待索引分片的数据索引。According to one aspect of the embodiment of the present application, a data indexing device is provided, including: an index block acquisition module configured to, according to the length of the inversion chain of each index fragment, retrieve inversions whose length is not greater than a preset threshold. chain data is spliced into a first index block, and inverted chain data with an inverted chain length greater than the preset threshold is spliced into a second index block; a storage module is configured to combine the first index block and the second index block The index block is stored in the storage block; the hash table creation module is configured to create a hash table. The hash table includes a plurality of key values and a coding value corresponding to each key value. The plurality of key values respectively correspond to different Index fragmentation, the encoding value includes the indexing mode of the corresponding index fragmentation, and the indexing mode of each index fragmentation is related to the index block where the inverted chain data of the corresponding index fragmentation is located; the index module is configured to be based on the fragmentation to be indexed The key value and the corresponding indexing method are used to index the data of the fragment to be indexed in the storage block.

在一实施例中，索引块获取模块包括：In one embodiment, the index block acquisition module includes:

对齐单元，配置为将倒排链长度大于预设阈值的各索引分片的倒排链数据进行对齐处理；An alignment unit configured to align the inverted chain data of each index fragment whose inverted chain length is greater than a preset threshold;

第二索引块获取单元，配置为将进行对齐处理后的各索引分片的倒排链数据进行拼接，得到所述第二索引块。The second index block acquisition unit is configured to splice the aligned inverted chain data of each index fragment to obtain the second index block.

在一实施例中，存储块的数量为多个，存储模块包括：In one embodiment, the number of storage blocks is multiple, and the storage module includes:

第一存储单元，配置为将所述第一索引块的数据存储在第一存储块中；A first storage unit configured to store the data of the first index block in the first storage block;

第二存储单元，配置为按预设字节对齐的方式将所述第二索引块的数据存储在第二存储块中。The second storage unit is configured to store the data of the second index block in the second storage block in a preset byte alignment manner.

在一实施例中，所述哈希表的编码值还包括对应索引分片的倒排链数据在所述第一存储块或所述第二存储块中的存储位置；索引模块包括：In one embodiment, the coded value of the hash table also includes the storage location of the inverted chain data corresponding to the index fragment in the first storage block or the second storage block; the index module includes:

第一索引数据获取单元，配置为基于所述哈希表确认所述待索引分片对应的第一关键值以及所述第一关键值对应的索引方式和存储位置；A first index data acquisition unit configured to confirm, based on the hash table, the first key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the first key value;

第一索引单元，配置为若所述第一关键值对应第一索引方式，则根据所述第一索引方式和对应存储位置在所述第一存储块中索引所述待索引分片的倒排链数据；A first indexing unit configured to, if the first key value corresponds to a first indexing method, index the inversion of the to-be-indexed fragment in the first storage block according to the first indexing method and the corresponding storage location. chain data;

第二索引单元，配置为若所述第一关键值对应第二索引方式，则根据所述第二索引方式和对应存储位置在所述第二存储块中索引所述待索引分片的倒排链数据。The second indexing unit is configured to, if the first key value corresponds to the second indexing method, index the inversion of the to-be-indexed fragment in the second storage block according to the second indexing method and the corresponding storage location. chain data.

在一实施例中，存储模块包括：In one embodiment, the storage module includes:

第三存储单元，配置为按预设字节对齐的方式将所述第一索引块和所述第二索引块存储在所述存储块中，其中，所述存储块的长度等于或大于所述第一索引块与所述第二索引块之间的长度和，且所述存储块中所述第一索引块的数据存储在所述第二索引块的数据之后。A third storage unit configured to store the first index block and the second index block in the storage block in a preset byte alignment manner, wherein the length of the storage block is equal to or greater than the The sum of the lengths between the first index block and the second index block, and the data of the first index block in the storage block is stored after the data of the second index block.

在一实施例中，所述哈希表的编码值还包括对应索引分片的倒排链数据在所述存储块中的存储位置；索引模块包括：In one embodiment, the coded value of the hash table also includes the storage location of the inverted chain data corresponding to the index fragment in the storage block; the index module includes:

第二索引数据获取单元，配置为基于所述哈希表确认所述待索引分片对应的第二关键值以及所述第二关键值对应的索引方式和存储位置；The second index data acquisition unit is configured to confirm the second key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the second key value based on the hash table;

第三索引单元，配置为若所述第二关键值对应第三索引方式，则根据所述第三索引方式和对应存储位置在所述存储块中索引所述待索引分片的倒排链数据，所述第三索引方式为针对所述第一索引块中的倒排链数据进行索引而设定；The third indexing unit is configured to index the inverted chain data of the fragment to be indexed in the storage block according to the third indexing method and the corresponding storage location if the second key value corresponds to the third indexing method. , the third indexing method is set for indexing the inverted chain data in the first index block;

第四索引单元，配置为若所述第二关键值对应第四索引方式，则根据所述第四索引方式和对应存储位置在所述存储块中索引所述待索引分片的倒排链数据，所述第四索引方式为针对所述第二索引块中的倒排链数据进行索引而设定。The fourth indexing unit is configured to index the inverted chain data of the fragment to be indexed in the storage block according to the fourth indexing method and the corresponding storage location if the second key value corresponds to the fourth indexing method. , the fourth indexing method is set for indexing the inverted chain data in the second index block.

在一实施例中，哈希表建立模块包括：In one embodiment, the hash table creation module includes:

初始的哈希表建立单元，配置为预设初始的哈希表，所述初始的哈希表包括多个初始关键值和各初始关键值对应的初始编码值；The initial hash table creation unit is configured to preset an initial hash table, where the initial hash table includes a plurality of initial key values and an initial coding value corresponding to each initial key value;

关键值获取单元，配置为将各索引分片的哈希值分别覆盖所述初始关键值，得到多个关键值；The key value acquisition unit is configured to overwrite the initial key value with the hash value of each index fragment, and obtain multiple key values;

索引方式分配单元，配置为根据各倒排链数据所在的索引块为对应索引分片分配索引方式；The index mode allocation unit is configured to allocate an index mode to the corresponding index fragment according to the index block where each inverted chain data is located;

编码值获取单元，配置为将各索引分片的倒排链数据在所述存储块中的位置和对应的索引方式覆盖于对应的初始编码值，得到所述哈希表。The coded value acquisition unit is configured to overwrite the position of the inverted chain data of each index fragment in the storage block and the corresponding index mode with the corresponding initial coded value to obtain the hash table.

在一实施例中，编码值获取单元包括：In one embodiment, the coded value acquisition unit includes:

区域划分板块，配置为将各初始编码值进行区域划分，得到各初始编码值对应的索引方式区域、数据长度区域以及偏移区域；The area division section is configured to divide each initial encoding value into areas to obtain the index mode area, data length area and offset area corresponding to each initial encoding value;

索引方式区域覆盖板块，配置为将各索引分片对应的索引方式信息覆盖于对应索引方式区域中；The index mode area coverage section is configured to cover the index mode information corresponding to each index shard in the corresponding index mode area;

数据长度区域覆盖板块，配置为将各索引分片的倒排链长度信息覆盖于对应的数据长度区域；The data length area coverage section is configured to cover the inverted chain length information of each index fragment in the corresponding data length area;

偏移区域覆盖板块，配置为将各索引分片的倒排链数据在所述存储块中的起始位置信息覆盖于对应的偏移区域，以得到所述哈希表。The offset area covering block is configured to cover the starting position information of the inverted chain data of each index fragment in the storage block with the corresponding offset area to obtain the hash table.

在一实施例中，所述索引方式包括针对所述第一索引块中的倒排链数据进行索引而设定的二分索引方式，以及针对所述第二索引块中的倒排链数据进行索引而设定的指令集索引方式；该数据索引装置还包括：In one embodiment, the indexing method includes a binary indexing method set for indexing the inverted chain data in the first index block, and indexing for the inverted chain data in the second index block. The set instruction set indexing method; the data indexing device also includes:

索引速率获取模块，配置为对各索引分片的倒排链分别基于所述二分索引方式和所述指令集索引方式进行索引，相应得到各索引分片的第一索引速率和第二索引速率；An index rate acquisition module configured to index the inverted chain of each index fragment based on the binary indexing method and the instruction set indexing method, and obtain the first index rate and the second index rate of each index fragment accordingly;

预设阈值获取模块，配置为根据各索引分片的倒排链长度，按照预设长度顺序比较各索引分片对应的第一索引速率和第二索引速率之间的数值大小，并根据得到的比较结果来确定所述预设阈值。The preset threshold acquisition module is configured to compare the numerical size between the first index rate and the second index rate corresponding to each index fragment in the preset length order according to the inverted chain length of each index fragment, and calculate the value according to the obtained The results are compared to determine the preset threshold.

根据本申请实施例的一个方面，提供了一种电子设备，包括一个或多个处理器；存储装置，用于存储一个或多个计算机程序，当所述一个或多个计算机程序被所述一个或多个处理器执行时，使得所述电子设备实现如上所述的数据索引方法。According to an aspect of an embodiment of the present application, an electronic device is provided, including one or more processors; a storage device configured to store one or more computer programs. When the one or more computer programs are processed by the Or when executed by multiple processors, the electronic device implements the data indexing method as described above.

根据本申请实施例的一个方面，提供了一种计算机可读存储介质，其上存储有计算机可读指令，当所述计算机可读指令被计算机的处理器执行时，使计算机执行如上所述的数据索引方法。According to an aspect of an embodiment of the present application, a computer-readable storage medium is provided, with computer-readable instructions stored thereon. When the computer-readable instructions are executed by a processor of the computer, the computer is caused to perform the above-mentioned steps. Data indexing methods.

根据本申请实施例的一个方面，提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述各种可选实施例中提供的数据索引方法。According to an aspect of an embodiment of the present application, a computer program product or a computer program is provided, the computer program product or the computer program including computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the data indexing method provided in the above various optional embodiments.

根据本申请实施例的一个方面，提供了一种计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现如上所述的模板生成方法中的步骤。According to one aspect of an embodiment of the present application, a computer program product is provided, including a computer program that implements the steps in the template generation method as described above when executed by a processor.

在本申请的实施例所提供的技术方案中，基于倒排链长度信息将倒排链数据分为不同的索引块，且为不同索引块分配不同的索引方式，以利用哈希表确定待索引分片的索引方式，从而实现对待索引分片对应倒排链数据的快速索引。In the technical solution provided by the embodiments of the present application, the inverted chain data is divided into different index blocks based on the inverted chain length information, and different indexing methods are assigned to different index blocks to use a hash table to determine the index to be indexed The indexing method of shards enables fast indexing of inverted chain data corresponding to the index shards.

应当理解的是，以上的一般描述和后文的细节描述仅是示例性和解释性的，并不能限制本申请。It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.

附图说明Description of the drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本申请的实施例，并与说明书一起用于解释本申请的原理。显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术者来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。在附图中：The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without exerting creative efforts. In the attached picture:

图1是本申请涉及的一种实施环境的示意图；Figure 1 is a schematic diagram of an implementation environment involved in this application;

图2是本申请的一示例性实施例示出的数据索引方法的流程图；Figure 2 is a flow chart of a data indexing method according to an exemplary embodiment of the present application;

图3是本申请的一示例性实施例示出的各索引分片的倒排链长度分布柱形图；Figure 3 is a histogram of inverted chain length distribution of each index shard shown in an exemplary embodiment of the present application;

图4是本申请的一示例性实施例示出的倒排链数据分布示意图；Figure 4 is a schematic diagram of inverted chain data distribution according to an exemplary embodiment of the present application;

图5是本申请的一示例性实施例示出的第一索引块与第二索引块的结构示意图；Figure 5 is a schematic structural diagram of a first index block and a second index block according to an exemplary embodiment of the present application;

图6是本申请的一示例性实施例示出的哈希表的结构示意图；Figure 6 is a schematic structural diagram of a hash table according to an exemplary embodiment of the present application;

图7是本申请的一示例性实施例示出的编码值的区域结构示意图；Figure 7 is a schematic diagram of the regional structure of coded values according to an exemplary embodiment of the present application;

图8是图2所示实施例中的步骤S210在一示例性实施例中的流程图；Figure 8 is a flow chart of step S210 in the embodiment shown in Figure 2 in an exemplary embodiment;

图9是本申请的一示例性实施例示出的补齐处理操作示意图；Figure 9 is a schematic diagram of the completion processing operation according to an exemplary embodiment of the present application;

图10是图2所示实施例中的步骤S230在一示例性实施例中的流程图；Figure 10 is a flow chart of step S230 in the embodiment shown in Figure 2 in an exemplary embodiment;

图11是图2所示实施例中的步骤S270在一示例性实施例中的流程图；Figure 11 is a flow chart of step S270 in the embodiment shown in Figure 2 in an exemplary embodiment;

图12是图2所示实施例中的步骤S270在另一示例性实施例中的流程图；Figure 12 is a flow chart of step S270 in the embodiment shown in Figure 2 in another exemplary embodiment;

图13是本申请的一示例性实施例示出的存储块结构示意图；Figure 13 is a schematic diagram of a memory block structure according to an exemplary embodiment of the present application;

图14是图2所示实施例中的步骤S250在另一示例性实施例中的流程图；Figure 14 is a flow chart of step S250 in the embodiment shown in Figure 2 in another exemplary embodiment;

图15是图14所示实施例中的步骤S1470在另一示例性实施例中的流程图；Figure 15 is a flowchart of step S1470 in the embodiment shown in Figure 14 in another exemplary embodiment;

图16是是本申请的另一示例性实施例示出的数据索引方法的流程图；Figure 16 is a flow chart of a data indexing method according to another exemplary embodiment of the present application;

图17是本申请的一示例性实施例示出的数据索引装置的结构示意图；Figure 17 is a schematic structural diagram of a data indexing device according to an exemplary embodiment of the present application;

图18示出了适于用来实现本申请实施例的电子设备的计算机系统的结构示意图。FIG. 18 shows a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.

具体实施方式Detailed ways

这里将详细地对示例性实施例执行说明，其示例表示在附图中。下面的描述涉及附图时，除非另有表示，不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.

附图中所示的方框图仅仅是功能实体，不一定必须与物理上独立的实体相对应。即，可以采用软件形式来实现这些功能实体，或在一个或多个硬件模块或集成电路中实现这些功能实体，或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software form, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor devices and/or microcontroller devices. entity.

附图中所示的流程图仅是示例性说明，不是必须包括所有的内容和操作/步骤，也不是必须按所描述的顺序执行。例如，有的操作/步骤还可以分解，而有的操作/步骤可以合并或部分合并，因此实际执行的顺序有可能根据实际情况改变。The flowcharts shown in the drawings are only illustrative, and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be decomposed, and some operations/steps can be merged or partially merged, so the actual order of execution may change according to the actual situation.

还需要说明的是：在本申请中提及的“多个”是指两个或者两个以上。“和/或”描述关联对象的关联关系，表示可以存在三种关系，例如，A和/或B可以表示：单独存在A，同时存在A和B，单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。It should also be noted that the “multiple” mentioned in this application refers to two or more. "And/or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and/or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship.

人工智能(Artificial Intelligence，AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.

人工智能技术是一门综合学科，涉及领域广泛，既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

机器学习(Machine Learning，ML)是一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心，是使计算机具有智能的根本途径，其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.

本申请实施例提出的数据索引方法及装置、电子设备、存储介质涉及人工智能技术、机器学习等技术，以下将对这些实施例进行详细说明。The data indexing methods and devices, electronic devices, and storage media proposed in the embodiments of this application involve artificial intelligence technology, machine learning, and other technologies. These embodiments will be described in detail below.

首先请参阅图1，图1是本申请涉及的一种实施环境的示意图。该实施环境包括终端100和服务器端200，终端100和服务器端200之间通过有线或者无线网络进行通信。First, please refer to Figure 1, which is a schematic diagram of an implementation environment involved in this application. The implementation environment includes a terminal 100 and a server 200. The terminal 100 and the server 200 communicate through a wired or wireless network.

应该理解，图1中的终端备100和服务器200的数目仅仅是是示意性的。根据实际需要，可以具有任意数目的终端100和服务器200。It should be understood that the number of terminal devices 100 and servers 200 in Figure 1 is only illustrative. According to actual needs, there can be any number of terminals 100 and servers 200 .

在本申请的一些实施例中，数据索引方法可以由服务器200执行，相应地，数据索引装置配置于服务器200中。终端100用于接收多个索引分片以及对应索引分片的倒排链，一个索引分片对应有一个倒排链，该倒排链中的倒排链数据为对应索引分片相关的索引数据，且不同的索引分片得到的索引数据不同，体现在倒排链数据上即为各索引分片的倒排链长度不同，然后终端100将各索引分片和倒排链发送至服务器端200。其中，服务器端200对索引分片和倒排链进行处理，先根据各索引分片的倒排链长度，将倒排链数据拼接在不同的索引块中，并索引块进行存储，然后为不同索引块中的倒排链数据设置不同的索引方式，以通过哈希表对确定待索引分片的索引方式在存储的数据中进行数据索引，然后服务器端200还将数据索引结果发送至终端100，可通过终端100自带的显示模块可视化待索引分片的索引结果。In some embodiments of the present application, the data indexing method may be executed by the server 200 , and accordingly, the data indexing device is configured in the server 200 . The terminal 100 is used to receive multiple index fragments and inverted chains corresponding to the index fragments. One index fragment corresponds to one inverted chain, and the inverted link data in the inverted chain is index data related to the corresponding index fragments. , and the index data obtained by different index fragments is different, which is reflected in the inverted chain data, that is, the inverted chain lengths of each index fragment are different, and then the terminal 100 sends each index fragment and the inverted chain to the server 200 . Among them, the server side 200 processes the index fragments and the inverted chain, first splices the inverted chain data into different index blocks according to the length of the inverted chain of each index fragment, and stores the index blocks, and then provides different data for different index fragments. Different indexing methods are set for the inverted chain data in the index block to index the data in the stored data using the hash table to determine the indexing method of the fragments to be indexed, and then the server 200 also sends the data indexing results to the terminal 100 , the indexing results of the fragments to be indexed can be visualized through the display module provided by the terminal 100 .

示例性的，终端100在接收到多个索引分片及其对应的倒排链后，将索引分片及其对应的倒排链发送至服务器端200，服务器端200根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于预设阈值的倒排链数据拼接为第二索引块；将第一索引块和第二索引块存储在存储块中；建立哈希表，哈希表包括多个关键值和各关键值对应的编码值，多个关键值分别对应不同的索引分片，编码值包括对应索引分片的索引方式，各索引分片的索引方式与对应索引分片的倒排链数据所在的索引块相关；然后服务器端200在需要对待索引分片的倒排链数据进行索引时，根据待索引分片的关键值和对应的索引方式在存储块中进行待索引分片的数据索引，以得到该待索引分片的倒排链数据。Exemplarily, after receiving multiple index fragments and their corresponding inverted links, the terminal 100 sends the index fragments and their corresponding inverted links to the server 200. The chain length is to splice the inverted chain data whose length is not greater than the preset threshold into the first index block, and the inverted chain data whose length is greater than the preset threshold is spliced into the second index block; The first index block and the second index block are stored in the storage block; a hash table is established. The hash table includes multiple key values and the encoding values corresponding to each key value. The multiple key values correspond to different index fragments and the encoding values. Including the indexing method of the corresponding index fragment, the indexing method of each index fragment is related to the index block where the inverted chain data of the corresponding index fragment is located; then the server 200 needs to index the inverted chain data of the pending index fragment. , index the data of the fragment to be indexed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method, to obtain the inverted chain data of the fragment to be indexed.

其中，终端100包括但不限于手机、电脑、智能语音交互设备、智能家电、车载终端、飞行器等，如可以是智能手机、平板、笔记本电脑、计算机等任意能够实现图片可视化的电子设备，本处不进行限制。本发明实施例可应用于数据索引领域，包括但不限于如视频搜索、文本搜索、音乐搜索等各种场景。服务器端200可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或者分布式系统，其中多个服务器可组成一区块链，而服务器为区块链上的节点，服务器端200还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network，内容分发网络)以及大数据和人工智能平台等基础云计算服务的云服务器，本处也不对此进行限制。Among them, the terminal 100 includes but is not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, etc., such as smart phones, tablets, laptops, computers and other electronic devices that can realize image visualization. This document No restrictions are imposed. Embodiments of the present invention can be applied to the field of data indexing, including but not limited to various scenarios such as video search, text search, and music search. The server 200 can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. Multiple servers can form a blockchain, and the server is a node on the blockchain. The server 200 It can also provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network, content distribution network), as well as big data and artificial intelligence Cloud servers for basic cloud computing services such as intelligent platforms are not subject to any restrictions here.

图2是根据一示例性实施例示出的一种数据索引方法的流程图，该数据索引方法可应用于图1所示的实施环境，并由该实施环境中的服务器端200具体执行，应该理解的是，该方法也可以是用于其他的示例性实施环境，并由其它实施环境中的设备具体执行，本实施例不对该方法所适用的实施环境进行限制。Figure 2 is a flow chart of a data indexing method according to an exemplary embodiment. The data indexing method can be applied to the implementation environment shown in Figure 1 and is specifically executed by the server 200 in the implementation environment. It should be understood that It should be noted that this method can also be used in other exemplary implementation environments and specifically executed by devices in other implementation environments. This embodiment does not limit the implementation environments to which this method is applicable.

如图2所示，在一示例性实施例中，该方法可以包括步骤S210至步骤S270，详细介绍如下：As shown in Figure 2, in an exemplary embodiment, the method may include steps S210 to S270, which are described in detail as follows:

步骤S210：根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于预设阈值的倒排链数据拼接为第二索引块。Step S210: According to the inverted chain length of each index fragment, splice the inverted chain data whose inverted chain length is not greater than the preset threshold into the first index block, and combine the inverted chain data whose inverted chain length is greater than the preset threshold. The data is spliced into the second index block.

一个索引分片对应有一个倒排链，该倒排链中的倒排链数据包括符合索引分片条件的文档与相关信息，不同的索引分片对应文档与相关信息不同，因此不同的索引分片其倒排链长度也可能不一样。An index shard corresponds to an inverted chain. The inverted chain data in the inverted chain includes documents and related information that meet the index sharding conditions. Different index shards correspond to different documents and related information, so different index shards correspond to different documents and related information. The length of the inverted chain of the film may also be different.

本实施例中，可根据索引分片(即term)的倒排链长度将各索引分片对应的倒排链数据拼接在不同的索引块中以供后续进行待索引分片的倒排链数据索引。In this embodiment, the inverted chain data corresponding to each index fragment can be spliced into different index blocks according to the inverted chain length of the index fragment (ie, term) for subsequent inverted chain data to be indexed. index.

首先可预设阈值，该预设阈值可根据不同的倒排链长度在通过不同的方法进行倒排链索引时的速率来确认，如现通常使用二分索引方式来进行待索引分片的数据索引，但在倒排链长度过长时，该二分索引方式速度慢，则可选取其他索引方式进行索引速率的比较，如Intel SIMD(IntelSingle Instruction Multiple Data英特尔单指令多数据技术)中的各指令集、SISD(Single Instruction Single Data，单指令单数据技术)中的各指令集等，此处不对其他索引方式进行具体限制。First, a threshold can be preset. The preset threshold can be confirmed based on the speed of inverted chain indexing using different methods for different inverted chain lengths. For example, binary indexing is usually used to index the data of the shards to be indexed. , but when the inverted chain length is too long, the binary indexing method is slow, and other indexing methods can be selected to compare the indexing speed, such as each instruction set in Intel SIMD (IntelSingle Instruction Multiple DataIntel Single Instruction Multiple Data Technology) , each instruction set in SISD (Single Instruction Single Data, single instruction single data technology), etc. There are no specific restrictions on other indexing methods here.

然后，可通过不同的索引方法对不同长度的倒排链数据进行索引，根据不同倒排链长度的倒排链数据在不同索引方式下的索引速率确定预设阈值，倒排链长度大于该预设阈值的索引分片在通过索引方式A进行索引时的速率大于通过索引方式B进行索引时的速率，倒排链长度不大于该预设阈值的索引分片在通过索引方式A进行索引时的速率小于通过索引方式B进行索引时的速率，然后根据该预设阈值可将多个索引分片分别拼接成不同的索引块。Then, inverted chain data of different lengths can be indexed through different indexing methods, and a preset threshold is determined based on the indexing rate of inverted chain data of different inverted chain lengths in different indexing methods. The inverted chain length is greater than the preset threshold. The rate of index fragments with a set threshold when indexed through index method A is greater than the rate when indexed through index method B. The index fragment whose inverted chain length is not greater than the preset threshold has a faster rate when indexed through index method A. The rate is lower than the rate when indexing through index mode B, and then multiple index fragments can be spliced into different index blocks according to the preset threshold.

当然，不同的索引器中的预设阈值不同，其与索引器的运行的硬件环境、编译器、索引方式相关。Of course, the preset thresholds in different indexers are different, which are related to the hardware environment, compiler, and indexing method in which the indexer runs.

如表1为某一索引平台中的P1索引库中各索引分片(term)的长度分布，而图3为该P1索引中各索引分片的倒排链长度分布图，从表1与图3可知，97％的倒排链长度都集中在长度区间【1，720】范围内，然后通过C++标准库的二分索引方式以及基于SIMD加速的二分索引方式分别对P1索引库中各倒排链数据进行索引，得出：在倒排链长度小于720的时候，采用C++标准库的二分查找的查找速度比基于SIMD的二分索引方式(即指令集索引方式，该指令集索引方式可通过SIMD指令集中的各指令对二分索引方式进行加速，如AVX2加速指令实现对二分查找的加速)。For example, Table 1 shows the length distribution of each index fragment (term) in the P1 index library in a certain index platform, and Figure 3 shows the inverted chain length distribution diagram of each index fragment in the P1 index. From Table 1 and Figure 3 It can be seen that 97% of the inverted chain lengths are concentrated in the length interval [1,720], and then the binary index method of the C++ standard library and the binary index method based on SIMD acceleration are used to index each inverted chain in the P1 index library. The data is indexed and it is concluded that when the length of the inverted chain is less than 720, the search speed of the binary search using the C++ standard library is faster than the binary index method based on SIMD (that is, the instruction set index method. The instruction set index method can be used through the SIMD instruction The centralized instructions accelerate the binary index method, such as AVX2 acceleration instructions to accelerate binary search).

表1Table 1

在倒排链长度大于720的时候，采用基于SIMD加速的二分索引方式能获得最优的性能，即该P1索引库中进行索引时，可设预设阈值为720，以将该P1索引库中的索引分片划分为两个不同的索引块，由此可知，二分索引方式在倒排链长度较小的时候索引速率较快，但在倒排链长度大的时候，其索引速率不能满足快速数据索引的需求，而此时仍然使用二分索引方式，会造成时延增大。When the length of the inverted chain is greater than 720, the binary index method based on SIMD acceleration can obtain the optimal performance. That is, when indexing in the P1 index library, the preset threshold can be set to 720 to reduce the index in the P1 index library. The index fragments are divided into two different index blocks. It can be seen that the binary index method has a faster indexing rate when the inverted chain length is small, but when the inverted chain length is large, its indexing rate cannot meet the fast Data indexing requirements, and the binary index method is still used at this time, which will cause increased latency.

将P1索引库中的所有倒排链数据按如图4所示的方式进行分布，当然，图中只显示了部分索引分片以其倒排链数据，还存在倒排链长度相同的索引分片，此处为简化未展示倒排链长度相同的倒排链数据，图4中，term代表索引分片，docid为该索引分片相关文档的编号，docid组成对应term的倒排链数据。All inverted chain data in the P1 index library are distributed as shown in Figure 4. Of course, the figure only shows some index fragments with their inverted chain data, and there are also index fragments with the same inverted chain length. Fragment, here for simplicity, the inverted chain data with the same inverted chain length is not shown. In Figure 4, term represents the index fragment, docid is the number of the document related to the index fragment, and docid constitutes the inverted chain data corresponding to the term.

然后将图4中倒排链长度小于和等于720的term对应的倒排链数据进行拼接，得到如图5中所示的第一索引块，同时将倒排链长度大于720的term对应的倒排链数据进行拼接，得到如图5中所示的第二索引块，且在后续对索引块中的数据进行索引时，使用速率较高的对应索引方式，如第一索引块中的倒排链数据使用二分索引方式，第二索引块中的倒排链数据使用SIMD中的指令集索引方式。Then, the inverted chain data corresponding to the terms whose inverted chain length is less than or equal to 720 in Figure 4 are spliced to obtain the first index block as shown in Figure 5. At the same time, the inverted chain data corresponding to the term whose inverted chain length is greater than 720 is obtained. The row chain data is spliced to obtain the second index block as shown in Figure 5, and when subsequently indexing the data in the index block, use the corresponding indexing method with a higher speed, such as inversion in the first index block The chain data uses the binary index method, and the inverted chain data in the second index block uses the instruction set index method in SIMD.

当然，由于本处倒排链长度大于720的term使用基于SIMD的二分索引，因此，在将倒排链长度大于720的term对应的倒排链数据进行拼接之前，还将每个term对应的倒排链数据都凑整为1024的倍数，以满足SIMD指令的需求，提高SIMD指令的索引速度。Of course, since the terms whose inverted chain length is greater than 720 here use binary index based on SIMD, before splicing the inverted chain data corresponding to the terms whose inverted chain length is greater than 720, the inverted chain data corresponding to each term will also be spliced. The chaining data is rounded to multiples of 1024 to meet the needs of SIMD instructions and improve the indexing speed of SIMD instructions.

在将倒排链数据进行拼接时，各term对应倒排链数据的拼接顺序可以不确定，如图4中的term1对应的倒排链数据可以拼接在term3对应的倒排链数据之后，但各term对应的倒排链数据中各docid的顺序必须固定，如图4中的term3，其倒排链数据中的docid编号顺序必须固定，即在第一索引块中以docid6、docid7、docid8、docid9的顺序进行拼接。When splicing the inverted chain data, the splicing order of the inverted chain data corresponding to each term may be uncertain. In Figure 4, the inverted chain data corresponding to term1 can be spliced after the inverted chain data corresponding to term3, but each The order of docids in the inverted chain data corresponding to the term must be fixed. For term3 in Figure 4, the order of docid numbers in the inverted chain data must be fixed, that is, docid6, docid7, docid8, and docid9 in the first index block. splicing in order.

步骤S230：将第一索引块和第二索引块存储在存储块中。Step S230: Store the first index block and the second index block in the storage block.

本实施例中，第一索引块和第二索引块可存储在一个存储块中，也可以分别存储，但在进行索引块中的数据存储时需保证存储方式与第一索引块或第二索引块中预设的索引方式对应。In this embodiment, the first index block and the second index block can be stored in one storage block, or they can be stored separately. However, when storing data in the index block, it is necessary to ensure that the storage method is consistent with the first index block or the second index block. Corresponds to the preset indexing method in the block.

如针对第一索引块对应预设二分索引方式，第二索引块预设指令集索引方式时，由于SIMD的指令集索引方式在4字节时更未高效，因此为了保证对索引方式的正常使用，则在存储第二索引块时，需要按4字节对齐的方式进行存储，因此无论第一索引块与第二索引块是否存储在同一存储块，均需保证第二索引块以4字节对齐进行存储。For example, when the first index block corresponds to the preset binary index method and the second index block presets the instruction set index method, since the SIMD instruction set index method is not efficient at 4 bytes, in order to ensure the normal use of the index method , then when storing the second index block, it needs to be stored in a 4-byte aligned manner. Therefore, regardless of whether the first index block and the second index block are stored in the same storage block, it is necessary to ensure that the second index block is aligned with 4 bytes. Aligned for storage.

当然，以上仅为示例性的举出一索引方式，还可以是其他索引方式，不同的索引方式，其对应索引块的存储方式可不同，如还可使用8字节对齐、16字节对齐等方式进行存储，以保证后续基于该索引方式进行数据索引的准确性与高效性，此处不进行具体限制。Of course, the above is only an exemplary indexing method, and other indexing methods can also be used. Different indexing methods can have different storage methods for corresponding index blocks. For example, 8-byte alignment, 16-byte alignment, etc. can also be used. storage method to ensure the accuracy and efficiency of subsequent data indexing based on this indexing method. There are no specific restrictions here.

步骤S250：建立哈希表，哈希表包括多个关键值和各关键值对应的编码值，所述编码值包括对应索引分片的索引方式。Step S250: Create a hash table. The hash table includes a plurality of key values and a coded value corresponding to each key value. The coded value includes an indexing method corresponding to the index fragment.

本实施例中，通过哈希表可对存储块中的倒排链数据进行索引，如图6是示出的一哈希表的结构图，其包括多个关键值(key)和各关键值对应的编码值(value)，一个关键值对应一个索引分片，该关键值可以为索引分片对应的哈希值，还可以为对应索引分片的名称(如图6所示的各关键值即为索引分片的名称)，此处不进行具体限制，编码值用结构体数据表征，如uint64_t(64个比特的整值，8字节)、uint32_t(32个比特的整值，4字节)等结构体数据，如图6中即通过uint64_t来表示各编码值。In this embodiment, the inverted chain data in the storage block can be indexed through a hash table. Figure 6 is a structural diagram of a hash table, which includes multiple key values (keys) and each key value. Corresponding encoding value (value), a key value corresponds to an index shard, the key value can be the hash value corresponding to the index shard, or it can also be the name of the corresponding index shard (each key value as shown in Figure 6 That is the name of the index fragment), there are no specific restrictions here, the encoding value is represented by structural data, such as uint64_t (64-bit integer value, 8 bytes), uint32_t (32-bit integer value, 4 words) section) and other structure data, as shown in Figure 6, each encoding value is represented by uint64_t.

本实施例中的编码值需实现对其对应关键值(即索引分片)的倒排链数据的索引，其包括对应索引分片的索引方式以及在存储块中的存储位置，本实施例中可将编码值划分为3个区域，如图7中所示，索引方式区域、数据长度区域以及偏移区域，该索引方式区域为对应索引分片的索引方式，其与步骤S210中各索引块中设定的索引方式相同，数据长度区域以及偏移区域确定对应倒排链数据的存储位置，具体来说，偏移区域则为对应索引分片的倒排链数据在储存快中的偏移offset((offset方法返回或设置匹配元素相对于文档的偏移(位置)返回第一个匹配元素的偏移坐标)，数据长度区域则为对应索引分片中的倒排链数据在存储块中的长度，即从offset位置往后索引的长度，以保证对应索引分片的倒排链数据全部查找完成。The encoded value in this embodiment needs to implement the indexing of the inverted chain data corresponding to the key value (ie, index fragment), which includes the indexing method of the corresponding index fragment and the storage location in the storage block. In this embodiment The encoding value can be divided into three areas, as shown in Figure 7, the index mode area, the data length area and the offset area. This index mode area is the index mode corresponding to the index fragmentation, which is the same as each index block in step S210. The index method set in is the same. The data length area and offset area determine the storage location of the corresponding inverted chain data. Specifically, the offset area is the offset of the inverted chain data corresponding to the index fragment in the storage fast. offset((The offset method returns or sets the offset (position) of the matching element relative to the document and returns the offset coordinate of the first matching element). The data length area is the inverted chain data in the corresponding index fragment in the storage block. The length, that is, the length of the index from the offset position to ensure that all inverted chain data corresponding to the index fragments are searched.

步骤S270：根据待索引分片的关键值和对应的索引方式在存储块中进行待索引分片的数据索引。Step S270: Index the data of the fragment to be indexed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method.

本实施例中，在得到哈希表和存储块后，若需对待索引分片进行索引，则根据待索引分片在哈希表中确认其对应的关键值，以及关键值对应的编码值，然后根据编码值中的索引方式、偏移位置、数据长度即可索引到该待索引分片的倒排链数据。In this embodiment, after obtaining the hash table and storage block, if it is necessary to index the fragment to be indexed, the corresponding key value and the encoding value corresponding to the key value are confirmed in the hash table according to the fragment to be indexed, Then according to the indexing method, offset position, and data length in the encoding value, the inverted chain data of the fragment to be indexed can be indexed.

目前的索引中，经常存在多term共同索引的情况，而多term的求交操作非常高频，用户的每一次搜索都几乎会经历，而如何快速地完成求交运算得到包含多term的文档集合是搜索引擎所面临的共同的难题，本实施例中提出的数据索引方法还可用于倒排检索过程中的多term求交操作，即通过哈希表和存储块即可快速倒排索引到各索引分片的文档集合(倒排链数据)，然后通过跳表进行多term文档集合的交互，以快速得到索引结果。In current indexes, multiple terms are often co-indexed, and the intersection operation of multiple terms is very frequent. Users will experience it almost every time they search. How to quickly complete the intersection operation to obtain a document collection containing multiple terms? is a common problem faced by search engines. The data indexing method proposed in this embodiment can also be used for multi-term intersection operations in the inverted retrieval process, that is, through hash tables and storage blocks, inverted indexes can be quickly Index the fragmented document collection (inverted chain data), and then interact with the multi-term document collection through skip tables to quickly obtain the index results.

本实施例中通过倒排链长度分布信息，将不同长度的倒排链数据匹配不同的索引方式，并且结合不同索引平台的运行环境、索引方式的种类，确定倒排链长度的最优预设阈值，同时也为各索引分片的倒排链数据选择更高效的索引方式；同时基于索引方式，还设置不同的存储方式，以将索引方式与存储方式进行更优匹配，在这种存储和查找算法的最佳匹配下，能获得更优的性能，进而大大减少搜索链路上的latency(延迟)，给用户提供更好的搜索。In this embodiment, inverted chain length distribution information is used to match inverted chain data of different lengths to different indexing methods, and the optimal preset of inverted chain length is determined based on the operating environments of different indexing platforms and the types of indexing methods. Threshold, and also select a more efficient indexing method for the inverted chain data of each index shard; at the same time, based on the indexing method, different storage methods are also set to better match the indexing method and the storage method. In this storage and Under the best match of the search algorithm, better performance can be obtained, thereby greatly reducing the latency on the search link and providing users with better searches.

图8是图2所示实施例中步骤S210在一示例性实施例中的流程图。如图8所示，在一示例性实施例中，步骤S210根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于预设阈值的倒排链数据拼接为第二索引块的过程可以包括步骤S810至步骤S830，详细介绍如下：FIG. 8 is a flowchart of step S210 in the embodiment shown in FIG. 2 in an exemplary embodiment. As shown in Figure 8, in an exemplary embodiment, step S210 splices the inverted chain data whose length is not greater than the preset threshold into a first index block according to the inverted chain length of each index fragment, and The process of splicing inverted chain data with an inverted chain length greater than a preset threshold into a second index block may include steps S810 to S830, which are described in detail as follows:

步骤S810：将倒排链长度大于预设阈值的各索引分片的倒排链数据进行对齐处理。Step S810: Align the inverted chain data of each index fragment whose inverted chain length is greater than the preset threshold.

本实施例中针对需要进行特殊索引方式倒排链数据，在进行拼接得到索引块时，先进行对齐处理。In this embodiment, for inverted chain data that requires a special indexing method, alignment processing is first performed when splicing to obtain the index block.

如第二索引块中的倒排链数据使用非二分索引方式进行索引时的效率更高，则根据预设的索引方式需求，先进行对齐处理，如在通过指令集索引方式进行查找时，由于SIMD按1024个int32_t分块，为满足指令集索引方式的需求，则采用1024这个数值来对倒排链数据进行补齐，当倒排链长度最后一块不足1024个int32_t的时候，使用0x7FFFFFFF(int型的最大值)来补齐尾部的缺失(如图9所示)，在补齐后，由于倒排链数据为1024补齐，在通过指令集索引时，即可满足int32_的需求，同时，由于是1024的整数倍，可以大大提高索引的速度。If the inverted chain data in the second index block is indexed more efficiently using the non-binary index method, alignment processing is performed first according to the preset index method requirements. For example, when searching through the instruction set index method, due to SIMD is divided into blocks of 1024 int32_t. In order to meet the requirements of the instruction set index method, the value 1024 is used to complete the inverted chain data. When the length of the last block of the inverted chain is less than 1024 int32_t, 0x7FFFFFFF(int type) to fill in the missing tail (as shown in Figure 9). After filling in, since the inverted chain data is 1024, it can meet the needs of int32_ when indexing through the instruction set. At the same time , since it is an integer multiple of 1024, it can greatly improve the indexing speed.

当然，此处的0x7FFFFFFF仅为示例行的，由于0x7FFFFFFF为整型的最大值可进行快速补齐，当在其他情况下，还可使用其他的位数进行补齐，此处进行具体限制；同时，按照1024进行补齐也是示例性的，即针对本实施例提出的指令集索引方式，在其他索引方式中，还可根据索引方式的需求，通过其他对齐方式进行倒排链数据的补齐。Of course, the 0x7FFFFFFFF here is only for the example line. Since 0x7FFFFFFFF is the maximum value of an integer, it can be quickly filled. In other cases, other digits can also be used for filling, and there are specific restrictions here; at the same time , the completion according to 1024 is also exemplary, that is, for the instruction set indexing method proposed in this embodiment, in other indexing methods, the inverted chain data can also be completed through other alignment methods according to the requirements of the indexing method.

步骤S830：将进行对齐处理后的各索引分片的倒排链数据进行拼接，得到第二索引块。Step S830: Splice the aligned inverted chain data of each index fragment to obtain a second index block.

本实施例中，在将索引分片进行对齐处理后，将各索引分片的倒排链数据进行拼接，其拼接过程任可参考步骤S210，即两个索引分片对应的倒排链数据之间的顺序可以改变，但单个索引分片对应的倒排链数据内部docid之间的顺序不能改变。In this embodiment, after the index fragments are aligned, the inverted chain data of each index fragment is spliced. For the splicing process, please refer to step S210, that is, the inverted chain data corresponding to the two index fragments is spliced. The order between docids can be changed, but the order between docids within the inverted chain data corresponding to a single index shard cannot be changed.

本实施例中根据不同索引方式的需求，将对应索引方式下的倒排链数据先进行对齐处理再进行拼接，在满足索引方式需求的基础上，还能提高后续通过该方式进行索引时的速度与准确率。In this embodiment, according to the needs of different indexing methods, the inverted chain data in the corresponding indexing method is first aligned and then spliced. On the basis of meeting the needs of the indexing method, it can also improve the speed of subsequent indexing through this method. and accuracy.

图10是图2所示实施例中步骤S230在一示例性实施例中的流程图。如图10所示，在一示例性实施例中，存储块的数量为多个，步骤S230将第一索引块和第二索引块存储在存储块中的过程可以包括步骤S1010至步骤S1030，详细介绍如下：FIG. 10 is a flowchart of step S230 in the embodiment shown in FIG. 2 in an exemplary embodiment. As shown in Figure 10, in an exemplary embodiment, the number of storage blocks is multiple. The process of storing the first index block and the second index block in the storage block in step S230 may include steps S1010 to step S1030. Details The introduction is as follows:

步骤S1010：将第一索引块的数据存储在第一存储块中。Step S1010: Store the data of the first index block in the first storage block.

本实施例中，第一索引块中的倒排链长度较短，此时使用二分索引方式效率较高，二分索引方式无需特殊的存储方式，因此可以直接将第一索引块中的数据直接存储在一存储块中，该第一存储块的内存长度可以与第一索引块中的内存长度相同或是略大。In this embodiment, the length of the inverted chain in the first index block is short. In this case, the binary index method is more efficient. The binary index method does not require a special storage method, so the data in the first index block can be directly stored. In a storage block, the memory length of the first storage block may be the same as or slightly larger than the memory length of the first index block.

步骤S1030：按预设字节对齐的方式将第二索引块的数据存储在第二存储块中。Step S1030: Store the data of the second index block in the second storage block in a preset byte alignment manner.

第二索引块中的倒排链长度较长，此时使用二分索引方式进行索引时延较长，因此可设定其他索引方式对第二索引块中的倒排链数据进行索引，而不同的索引方式其需求的存储方式不同，如针对上述提出的指令集索引方式，其在索引时按照4字节进行处理，因此在对第二索引块的数据进行存储时，也按4字节对齐的方式将第二索引块的数据存储在第二存储块中。The length of the inverted chain in the second index block is longer. In this case, the indexing delay using the binary index method is longer. Therefore, other index methods can be set to index the inverted chain data in the second index block. Different The indexing method requires different storage methods. For example, for the instruction set indexing method proposed above, it is processed according to 4 bytes during indexing. Therefore, when the data of the second index block is stored, it is also aligned according to 4 bytes. The method stores the data of the second index block in the second storage block.

当然，以上仅为示例性了，对于指令集索引方式，也可按照其他字节对齐的方式进行存储，如8字节、16字节等等，其均能实现后续的索引功能，但使用4字节时，其性能会更好；且对于不同的索引方式，根据其处理数据的方式，可设置不同的字节对齐方式进行存储，以提高后续数据索引时的效率，此处不对字节对齐的具体方式进行限定。Of course, the above is just an example. For the instruction set indexing method, it can also be stored in other byte-aligned methods, such as 8 bytes, 16 bytes, etc., which can realize subsequent indexing functions, but using 4 Bytes, its performance will be better; and for different indexing methods, according to the way they process data, different byte alignments can be set for storage to improve the efficiency of subsequent data indexing. There is no byte alignment here. be limited in specific ways.

本实施例中，将不同的索引块存储在不同的存储块中，同时，根据索引块对应的索引方式不同，设置不同的数据存储方式，通过存储方式与索引方式的匹配，能得到更优的数据索引性能，大大减少数据索引的延迟。In this embodiment, different index blocks are stored in different storage blocks. At the same time, different data storage methods are set according to the different indexing methods corresponding to the index blocks. By matching the storage method and the indexing method, a better data storage method can be obtained. Data indexing performance, greatly reducing data indexing delay.

图11是图2所示实施例中步骤S270在一示例性实施例中的流程图。如图11所示，在一示例性实施例中，该方法中存储块的数量为多个，且其应用于图10所示的存储方式之后，步骤S270根据待索引分片的关键值和对应的索引方式在存储块中进行待索引分片的数据索引的过程可以包括步骤S1110至步骤S1150，详细介绍如下：FIG. 11 is a flowchart of step S270 in the embodiment shown in FIG. 2 in an exemplary embodiment. As shown in Figure 11, in an exemplary embodiment, the number of storage blocks in this method is multiple, and after it is applied to the storage method shown in Figure 10, step S270 is based on the key value of the fragment to be indexed and the corresponding The process of indexing the data of the fragments to be indexed in the storage block using the indexing method may include steps S1110 to S1150. The details are as follows:

步骤S1110：基于哈希表确认待索引分片对应的第一关键值以及第一关键值对应的索引方式和存储位置。Step S1110: Confirm the first key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the first key value based on the hash table.

本实施例中，在需索引待索引分片的倒排链数据时，可通过待索引分片在哈希表中确认其对应的第一关键值，同时确定第一关键值对应的编码值，如图2所示，根据编码值中的索引方式区域确定待索引分片的第一索引方式，根据编码值中的数据长度区域以及偏移区域确定待索引分片的倒排链数据数在存储块中的存储位置，且对应的索引方式可指示待索引分片的倒排链数据所在的存储块；如待索引分片的倒排链数据位于第一索引块中，则第一索引方式对应第一索引块中的索引方式，且指示该待索引分片的倒排链数据存储在第一存储块中，若待索引分片的倒排链数据位于第二索引块中，则第一索引方式对应第二索引块中的索引方式，且指示该待索引分片的倒排链数据存储在第二存储块中。In this embodiment, when the inverted chain data of the fragment to be indexed needs to be indexed, its corresponding first key value can be confirmed in the hash table through the fragment to be indexed, and the encoding value corresponding to the first key value can be determined at the same time. As shown in Figure 2, the first index mode of the fragment to be indexed is determined according to the index mode area in the coded value, and the number of inverted chain data of the fragment to be indexed is determined according to the data length area and offset area in the coded value. The storage location in the block, and the corresponding indexing method can indicate the storage block where the inverted chain data of the fragment to be indexed is located; if the inverted chain data of the fragment to be indexed is located in the first index block, the first indexing method corresponds to The indexing mode in the first index block, and indicates that the inverted chain data of the fragment to be indexed is stored in the first storage block. If the inverted chain data of the fragment to be indexed is located in the second index block, then the first index The mode corresponds to the indexing mode in the second index block, and indicates that the inverted chain data of the fragment to be indexed is stored in the second storage block.

步骤S1130：若第一关键值对应第一索引方式，则根据第一索引方式和对应存储位置在第一存储块中索引待索引分片的倒排链数据。Step S1130: If the first key value corresponds to the first indexing method, index the inverted chain data of the fragments to be indexed in the first storage block according to the first indexing method and the corresponding storage location.

本实施例中，第一索引方式为针对第一索引块中的倒排链数据进行索引而设定，即为针对第一存储块中存储的倒排链数据进行索引而设定，因此，若第一关键值中的索引方式区域对应第一索引方式，则证明该待索引分片位于第一索引块与第一存储块中，则可根据该第一索引方式和第一关键值中指示的存储位置在第一存储块中索引待索引分片的倒排链数据。In this embodiment, the first indexing method is set for indexing the inverted chain data in the first index block, that is, it is set for indexing the inverted chain data stored in the first storage block. Therefore, if The index mode area in the first key value corresponds to the first index mode, which proves that the fragment to be indexed is located in the first index block and the first storage block. Then, according to the first index mode and the first key value indicated in The storage location indexes the inverted chain data of the shard to be indexed in the first storage block.

步骤S1150：若第一关键值对应第二索引方式，则根据第二索引方式和对应存储位置在第二存储块中索引待索引分片的倒排链数据。Step S1150: If the first key value corresponds to the second indexing method, index the inverted chain data of the fragments to be indexed in the second storage block according to the second indexing method and the corresponding storage location.

本实施例中，第二索引方式为针对第二索引块中的倒排链数据进行索引而设定，即为针对第二存储块中存储的倒排链数据进行索引而设定，因此，若第一关键值中的索引方式区域对应第二索引方式，则证明该待索引分片位于第二索引块与第二存储块中，则可根据该第二索引方式和第一关键值中指示的存储位置在第二存储块中索引待索引分片的倒排链数据。In this embodiment, the second indexing method is set for indexing the inverted chain data in the second index block, that is, it is set for indexing the inverted chain data stored in the second storage block. Therefore, if The index mode area in the first key value corresponds to the second index mode, which proves that the fragment to be indexed is located in the second index block and the second storage block. Then, according to the second index mode and the index indicated in the first key value, The storage location indexes the inverted chain data of the shard to be indexed in the second storage block.

本实施例中提出在将不同的索引块存储在不同的存储块中时，可根据待索引分片在哈希表中的关键值和编码值确认待索引分片数据索引的索引方式以及存储位置，以此，对于不同的待索引分片，通过不同的索引方式进行索引，大大提高索引效率。This embodiment proposes that when different index blocks are stored in different storage blocks, the indexing method and storage location of the data index for the fragment to be indexed can be confirmed based on the key value and coding value of the fragment to be indexed in the hash table. , In this way, different indexing methods are used to index different fragments to be indexed, which greatly improves the indexing efficiency.

图12是图2所示实施例中步骤S270在另一示例性实施例中的流程图。如图12所示，在一示例性实施例中，第一索引块和第二索引块存储在一个存储块中，步骤S270根据待索引分片的关键值和对应的索引方式在存储块中进行待索引分片的数据索引的过程可以包括步骤S1210至步骤S1250，详细介绍如下：FIG. 12 is a flowchart of step S270 in the embodiment shown in FIG. 2 in another exemplary embodiment. As shown in Figure 12, in an exemplary embodiment, the first index block and the second index block are stored in a storage block, and step S270 is performed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method. The process of indexing the data of the fragments to be indexed may include steps S1210 to S1250, which are described in detail as follows:

步骤S1210：基于哈希表确认待索引分片对应的第二关键值以及第二关键值对应的索引方式和存储位置。Step S1210: Confirm the second key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the second key value based on the hash table.

本实施例中，第一索引块和第二索引块存储在一个存储块中，存储块的长度等于或大于第一索引块与第二索引块之间的长度和，且可参考图10中步骤S1030，第二索引块需按预设字节对齐的方式进行存储。In this embodiment, the first index block and the second index block are stored in one storage block, and the length of the storage block is equal to or greater than the sum of the lengths between the first index block and the second index block. Refer to the steps in Figure 10 S1030, the second index block needs to be stored in a preset byte alignment manner.

当为了保证存储空间不被浪费时，则可将存储块的长度设为第一索引块与第二索引块，此时，第一索引块与第二索引块中的数据恰好可以存储在存储块中，且为了保证第二索引块中的数据按照预设字节进行存储，则需先将第二索引块中的数据先进行存储，再将第一索引块中的数据进行存储，如图13所示的一实施例中存储块的结构意图。In order to ensure that the storage space is not wasted, the length of the storage block can be set to the first index block and the second index block. At this time, the data in the first index block and the second index block can be stored in the storage block. , and in order to ensure that the data in the second index block is stored according to the preset bytes, the data in the second index block needs to be stored first, and then the data in the first index block is stored, as shown in Figure 13 The structural diagram of the memory block in one embodiment is shown.

当然，该存储块的长度大小也可大于第一索引块与第二索引块之间的长度和，此时，不强迫第一索引块的数据存储在第二索引块的数据之后，只需满足第二索引块能按预设字节进行完整存储即可，保证第二索引块中数据以预设字节进行存储，以与第二索引块中的索引方式匹配，能在后续对第二索引块中的数据进行索引时提高数据索引的速率。Of course, the length of the storage block can also be greater than the sum of the lengths between the first index block and the second index block. In this case, the data of the first index block is not forced to be stored after the data of the second index block. It only needs to satisfy The second index block can be completely stored in preset bytes, ensuring that the data in the second index block is stored in preset bytes to match the indexing method in the second index block, and the second index can be subsequently updated. Increases the rate at which data is indexed when data in a block is indexed.

本实施例中，在需索引待索引分片的倒排链数据时，可通过待索引分片在哈希表中确认其对应的第二关键值，同时确定第二关键值对应的编码值，具体步骤可参考图11中的步骤S1110。In this embodiment, when it is necessary to index the inverted chain data of the fragment to be indexed, the corresponding second key value can be confirmed in the hash table through the fragment to be indexed, and the encoding value corresponding to the second key value can be determined at the same time. For specific steps, please refer to step S1110 in Figure 11.

步骤S1230：若第二关键值对应第三索引方式，则根据第三索引方式和对应存储位置在存储块中索引待索引分片的倒排链数据。Step S1230: If the second key value corresponds to the third indexing method, index the inverted chain data of the fragments to be indexed in the storage block according to the third indexing method and the corresponding storage location.

第三索引方式为针对第一索引块中的倒排链数据进行索引而设定，本实施例中，第三索引方式为针对第一索引块中的倒排链数据进行索引而设定，因此，若第二关键值中的索引方式区域对应第三索引方式，则证明该待索引分片位于第一索引块中，则可根据该第三索引方式和第二关键值中指示的存储位置在存储块中索引待索引分片的倒排链数据。The third indexing method is set for indexing the inverted chain data in the first index block. In this embodiment, the third indexing method is set for indexing the inverted chain data in the first index block. Therefore, , if the indexing mode area in the second key value corresponds to the third indexing mode, it proves that the fragment to be indexed is located in the first index block, and the storage location indicated in the third indexing mode and the second key value can be found in Index the inverted chain data of the shards to be indexed in the storage block.

步骤S1250：若第二关键值对应第四索引方式，则根据第四索引方式和对应存储位置在存储块中索引待索引分片的倒排链数据。Step S1250: If the second key value corresponds to the fourth indexing method, index the inverted chain data of the fragments to be indexed in the storage block according to the fourth indexing method and the corresponding storage location.

第四索引方式为针对第二索引块中的倒排链数据进行索引而设定，本实施例中，第四索引方式为针对第二索引块中的倒排链数据进行索引而设定，因此，若第二关键值中的索引方式区域对应第四索引方式，则证明该待索引分片位于第二索引块中，则可根据该第四索引方式和第二关键值中指示的存储位置在存储块中索引待索引分片的倒排链数据。The fourth indexing method is set for indexing the inverted chain data in the second index block. In this embodiment, the fourth indexing method is set for indexing the inverted chain data in the second index block. Therefore, , if the index mode area in the second key value corresponds to the fourth index mode, it proves that the fragment to be indexed is located in the second index block, and then it can be based on the fourth index mode and the storage location indicated in the second key value. Index the inverted chain data of the shards to be indexed in the storage block.

本实施例中提出在将不同的索引块存储在同一存储块中时，需根据索引块中对应的索引方式，保证索引方式与存储方式的匹配性，以提高索引效率，同时还可根据待索引分片在哈希表中的关键值和编码值确认待索引分片数据索引的索引方式以及存储位置，以此，对于不同的待索引分片，通过不同的索引方式进行索引，大大提高索引效率。In this embodiment, it is proposed that when different index blocks are stored in the same storage block, it is necessary to ensure the matching of the indexing method and the storage method according to the corresponding indexing method in the index block, so as to improve the indexing efficiency. The key value and coding value of the shard in the hash table confirm the indexing method and storage location of the shard data to be indexed. In this way, different shards to be indexed are indexed through different indexing methods, which greatly improves the indexing efficiency. .

图14是图2所示实施例中步骤S250在一示例性实施例中的流程图。如图14所示，在一示例性实施例中，步骤S270建立哈希表的过程可以包括步骤S1410至步骤S1470，详细介绍如下：FIG. 14 is a flowchart of step S250 in the embodiment shown in FIG. 2 in an exemplary embodiment. As shown in Figure 14, in an exemplary embodiment, the process of establishing the hash table in step S270 may include steps S1410 to step S1470, which are described in detail as follows:

步骤S1410：预设初始的哈希表。Step S1410: Default initial hash table.

本实施例中，首先创建初始的哈希表(hashmap)，该初始的哈希表中包括多个初始关键值和各初始关键值对应的初始编码值，初始关键值用于后续填充对应索引分片的数据，初始的编码值用于后续填充对应索引分片的索引方式以及存储位置。In this embodiment, an initial hash table (hashmap) is first created. The initial hash table includes multiple initial key values and initial encoding values corresponding to each initial key value. The initial key values are used to subsequently fill in the corresponding index points. The data of the slice, the initial encoding value is used to subsequently fill in the index method and storage location of the corresponding index slice.

步骤S1430：将各索引分片的哈希值分别覆盖初始关键值，得到多个关键值。Step S1430: Overwrite the initial key value with the hash value of each index fragment to obtain multiple key values.

本实施例中，一个初始关键值可对应一个索引分片，然后将索引分片的哈希值填充于对应初始关键值中，以得到可用于指示该索引分片的关键值。In this embodiment, an initial key value may correspond to an index fragment, and then the hash value of the index fragment is filled in the corresponding initial key value to obtain a key value that can be used to indicate the index fragment.

步骤S1450：根据各倒排链数据所在的索引块为对应索引分片分配索引方式。Step S1450: Allocate an index mode to the corresponding index fragment according to the index block where each inverted chain data is located.

本实施例中，根据不同的索引块，为索引块中的索引分片分配索引方式，同一索引块中的倒排链数据索引方式相同，不同索引块中的倒排链数据索引方式不同，具体可参考图2中步骤S210中的索引方式设定。In this embodiment, the indexing method is allocated to the index fragments in the index block according to different index blocks. The indexing method of the inverted chain data in the same index block is the same, and the indexing method of the inverted chain data in different index blocks is different. Specifically Reference may be made to the index mode setting in step S210 in FIG. 2 .

步骤S1470：将各索引分片的倒排链数据在存储块中的位置和对应的索引方式覆盖于对应的初始编码值，得到哈希表。Step S1470: Overwrite the position of the inverted chain data of each index fragment in the storage block and the corresponding index method with the corresponding initial encoding value to obtain a hash table.

如上述实施例所提出，索引块存储在存储块中，即索引块中的倒排链数据也存储在存储块中，因此，各索引分片的索引即可通过对应倒排数据在存储块中的位置进行，将各索引分片的倒排链数据在存储块中的位置和对应的索引方式覆盖于对应的初始编码值，即可得到对应的编码值，如此，通过步骤S1410至步骤S1470即可得到用于数据索引的哈希表。As proposed in the above embodiment, the index block is stored in the storage block, that is, the inverted chain data in the index block is also stored in the storage block. Therefore, the index of each index shard can be stored in the storage block through the corresponding inverted data. The position of the inverted chain data of each index fragment in the storage block and the corresponding index method are overwritten with the corresponding initial encoding value to obtain the corresponding encoding value. In this way, through steps S1410 to step S1470, that is A hash table is available for data indexing.

当然，存储块的数量可以是一个或多个，当储存块的数量为多个时，索引方式还可以指示对应关键值的倒排链数据所在的存储块，具体可参考图10至图13中所描述的方案。Of course, the number of storage blocks can be one or more. When the number of storage blocks is multiple, the index method can also indicate the storage block where the inverted chain data corresponding to the key value is located. For details, please refer to Figure 10 to Figure 13 The scheme described.

本实施例中提出哈希表的构建方式，以通过上述方式构建用于数据索引的哈希表，在需进行数据索引时，只需知道待索引分片，即可通过哈希表确认对应关键值与编码值，从而确定与待索引分片对应的索引方式与存储位置进行到排练数据的索引，以此大大提高索引速率。In this embodiment, a hash table construction method is proposed to construct a hash table for data indexing through the above method. When data indexing is required, only the fragments to be indexed are known, and the corresponding key can be confirmed through the hash table. value and encoding value, thereby determining the indexing method and storage location corresponding to the fragment to be indexed to index the rehearsal data, thereby greatly improving the indexing rate.

图15是图14所示实施例中步骤S1470在一示例性实施例中的流程图。如图15所示，在一示例性实施例中，步骤S1470将各索引分片的倒排链数据在存储块中的位置和对应的索引方式覆盖于对应的初始编码值，得到哈希表的过程可以包括步骤S1510至步骤S1570，详细介绍如下：FIG. 15 is a flowchart of step S1470 in the embodiment shown in FIG. 14 in an exemplary embodiment. As shown in Figure 15, in an exemplary embodiment, step S1470 overwrites the position of the inverted chain data of each index fragment in the storage block and the corresponding index mode with the corresponding initial encoding value to obtain the hash table. The process may include steps S1510 to S1570, which are described in detail as follows:

步骤S1510：将各初始编码值进行区域划分，得到各初始编码值对应的索引方式区域、数据长度区域以及偏移区域。Step S1510: Divide each initial encoding value into regions to obtain the index mode area, data length area and offset area corresponding to each initial encoding value.

本实施例中，将初始编码值划分区域，得到如图7所示的编码值区域结构，当然，图7中所示的为编码值使用uint64_t表示的一示例性结构，在其他实施例中，还以是其他的划分方式，如使用不同的结构体数据，对区域划分的比特值大小不同等，此处不进行具体限制。In this embodiment, the initial coded value is divided into regions to obtain the coded value area structure as shown in Figure 7. Of course, what is shown in Figure 7 is an exemplary structure in which the coded value is represented by uint64_t. In other embodiments, There are also other division methods, such as using different structure data, different bit value sizes for area division, etc., and there are no specific restrictions here.

在一具体实施例中，使用uint64_t表示初始编码值，则可将初始编码值中的低40bit(比特)划为偏移区域，用来存储倒排链数据在存储块的偏移offset，由于目前单机倒排索引的大小一般很少超过1TB，因此，这个offset可设置为1TB的大小，当然，对于不同的情况下，还可设置不同大小，此处不进行限制。In a specific embodiment, using uint64_t to represent the initial encoding value, the lower 40 bits of the initial encoding value can be divided into offset areas to store the offset offset of the inverted chain data in the storage block. Since the current The size of a single-machine inverted index generally rarely exceeds 1TB. Therefore, this offset can be set to a size of 1TB. Of course, different sizes can also be set for different situations, and there is no limit here.

初始编码值中最高4位划分为索引方式区域，用来表示索引方式，如上所示，目前由于只涉及2种索引方式，即针对第一索引块的索引方式以及针对第二索引块的索引方式，则最高4位可取值为0x0001和0x0002，分别表示两种索引方式，如前者代表标准库的二分索引方式，后者代表指令集索引方式，中间的20位划分为数据长度区域，用来表示偏移量后的数据长度，如当最高4位为0x0001的时候，中间20位的数值代表此offset后int32_t(4字节)数据的个数，当最高4位为0x0002的时候，中间20位的数值代表单次SIMD指令操作数据的段数目，比如AVX2指令(SIMD中的一个指令集)单次可以操作256bit，则这20位的数值代表有多少个这样的256bit。The highest 4 bits in the initial encoding value are divided into the index mode area, which is used to indicate the index mode. As shown above, currently there are only two index modes involved, namely the index mode for the first index block and the index mode for the second index block. , then the highest 4 bits can take values of 0x0001 and 0x0002, which represent two indexing methods respectively. For example, the former represents the binary indexing method of the standard library, and the latter represents the instruction set indexing method. The middle 20 bits are divided into data length areas for Represents the data length after the offset. For example, when the highest 4 bits are 0x0001, the middle 20 bits represent the number of int32_t (4 bytes) data after the offset. When the highest 4 bits are 0x0002, the middle 20 The value of the bit represents the number of segments of data operated by a single SIMD instruction. For example, the AVX2 instruction (an instruction set in SIMD) can operate 256 bits at a time, so the 20-bit value represents how many such 256 bits there are.

步骤S1530：将各索引分片对应的索引方式信息覆盖于对应索引方式区域中。Step S1530: Overwrite the index mode information corresponding to each index fragment in the corresponding index mode area.

本实施例中，各索引分片的索引方式与其所在的索引块相关，则将各索引分片对应的索引方式信息覆盖于对应索引方式区域中。In this embodiment, the indexing mode of each index fragment is related to the index block in which it is located, and the indexing mode information corresponding to each index fragment is overlaid in the corresponding index mode area.

步骤S1550：将各索引分片的倒排链长度信息覆盖于对应的数据长度区域。Step S1550: Overwrite the inverted chain length information of each index fragment into the corresponding data length area.

数据长度区域用于表示偏移量后的数据长度，即为对应倒排链数据在存储块中的长度，因此，可将各索引分片的倒排链长度信息覆盖于对应的数据长度区域。The data length area is used to represent the data length after the offset, which is the length of the corresponding inverted chain data in the storage block. Therefore, the inverted chain length information of each index fragment can be overwritten in the corresponding data length area.

步骤S1570：将各索引分片的倒排链数据在存储块中的起始位置信息覆盖于对应的偏移区域，以得到哈希表。Step S1570: Overwrite the starting position information of the inverted chain data of each index fragment in the storage block with the corresponding offset area to obtain a hash table.

偏移区域用于指示倒排链数据在存储块的偏移offset，即可对应倒排链数据在存储块中的存储起始位置，将将各索引分片的倒排链数据在存储块中的起始位置信息覆盖于对应的偏移区域，以此，得到各索引分片对应的编码值，从而得到用于数据索引的哈希表。The offset area is used to indicate the offset offset of the inverted chain data in the storage block, which corresponds to the starting position of the inverted chain data in the storage block. The inverted chain data of each index fragment will be stored in the storage block. The starting position information is covered in the corresponding offset area, thereby obtaining the coding value corresponding to each index fragment, thereby obtaining the hash table used for data indexing.

本实施例中，通过划分3个区域，并在3个区域中填充相应数据，以得到可用于数据索引的编码值，通过编码值中的索引方式区域可得到对应索引分片的索引方式，通过数据长度区域以及偏移区域可得到对应倒排链数据在存储块中的起始位置以及倒排链数据长度，以此可指示倒排链数据在存储块中的位置，通过以上方法得到编码值，可用于后续对不同的索引分片，进行不同索引方式的数据索引，提高索引效率。In this embodiment, by dividing three areas and filling the corresponding data in the three areas, a coded value that can be used for data indexing is obtained. The indexing method of the corresponding index fragment can be obtained through the index mode area in the coded value. The data length area and offset area can obtain the starting position of the corresponding inverted chain data in the storage block and the length of the inverted chain data, which can indicate the position of the inverted chain data in the storage block, and the encoded value can be obtained through the above method , which can be used to subsequently shard different indexes and index data in different indexing methods to improve indexing efficiency.

图16是根据另一示例性实施例示出的一种数据索引方法的流程图。如图16所示，在一示例性实施例中，该方法实施于图2中的步骤S210之前，可以包括步骤S1610至步骤S1630，详细介绍如下：Figure 16 is a flow chart of a data indexing method according to another exemplary embodiment. As shown in Figure 16, in an exemplary embodiment, the method is implemented before step S210 in Figure 2, and may include steps S1610 to S1630. The details are as follows:

步骤S1610：对各索引分片的倒排链分别基于二分索引方式和指令集索引方式进行索引，相应得到各索引分片的第一索引速率和第二索引速率。Step S1610: Index the inverted chain of each index fragment based on the binary index method and the instruction set index method respectively, and obtain the first index rate and the second index rate of each index fragment accordingly.

指令集索引方式为基于SIMD加速的二分索引方式。本实施例中提出一种预设阈值的确定方式，同时，根据索引块中的倒排链长度，先确认第一索引块中倒排链数据进行索引为二分索引方式，第二索引块中倒排链数据进行索引为指令集索引方式。The instruction set indexing method is a binary indexing method based on SIMD acceleration. This embodiment proposes a method for determining the preset threshold. At the same time, according to the length of the inverted chain in the index block, it is first confirmed that the inverted chain data in the first index block is indexed in the binary index method, and the inverted chain data in the second index block is indexed in binary index mode. The chain data is indexed using the instruction set index method.

此时，则可通过二分索引方式和指令集索引方式分别对各索引分片的倒排链数据进行索引，可相应得到各索引分片在二分索引方式下的第一索引速率与在指令集索引方式下的第二索引速率。At this time, the inverted chain data of each index shard can be indexed respectively through the binary index method and the instruction set index method, and the first index rate of each index shard in the binary index method and the instruction set index can be obtained accordingly. Second index rate in mode.

步骤S1630：根据各索引分片的倒排链长度，按照预设长度顺序比较各索引分片对应的第一索引速率和第二索引速率之间的数值大小，并根据得到的比较结果来确定预设阈值。Step S1630: Based on the inverted chain length of each index fragment, compare the numerical values between the first index rate and the second index rate corresponding to each index fragment in the preset length sequence, and determine the preset rate based on the obtained comparison result. Set threshold.

本实施例中，各索引分片的倒排链长度，按由短到长的顺序比较各索引分片对应的第一索引速率和第二索引速率之间的数值大小，将首个第一索引速率小于第二索引速率的索引分片对应的倒排链长度作为预设阈值；或，In this embodiment, the inverted chain length of each index fragment is compared with the numerical value between the first index rate and the second index rate corresponding to each index fragment in order from short to long, and the first first index is The length of the inverted chain corresponding to the index fragment whose rate is less than the second index rate is used as the preset threshold; or,

根据各索引分片的倒排链长度，按由长到短的顺序比较各索引分片对应的第一索引速率和第二索引速率之间的数值大小，将首个第一个第一索引速率大于第二索引速率的索引分片对应的倒排链长度作为预设阈值。According to the inverted chain length of each index fragment, compare the numerical values between the first index rate and the second index rate corresponding to each index fragment in order from long to short, and compare the first first index rate The length of the inverted chain corresponding to the index fragment greater than the second index rate is used as the preset threshold.

在确定预设阈值后，得到的第一索引块中可设定二分索引方式作为第一索引块中倒排链数据的索引方式，第二索引块中可设定指令集索引方式作为第二索引块中倒排链数据的索引方式，以此，对于不同的索引块，为其设定索引速率更快的索引方式。After determining the preset threshold, the binary indexing method can be set in the obtained first index block as the indexing method of the inverted chain data in the first index block, and the instruction set indexing method can be set in the second index block as the second index. The indexing method of the inverted chain data in the block. In this way, for different index blocks, set an indexing method with a faster indexing rate.

本实施例中，通过不同索引方式对倒排链数据进行索引，得到两种索引方式下的倒排链长度阈值，以各倒排链长度阈值作为预设阈值，大于该预设阈值的倒排链数据在某一索引方式下的索引速率大于另一索引方式，而不大于该预设阈值的倒排链数据在另一索引方式下的索引速率大大于某一索引方式，从而可通过该预设阈值确定不同的索引块，以及不同索引块下更优的索引方式，从而实现待索引分片的快速索引。In this embodiment, the inverted chain data is indexed through different indexing methods to obtain inverted chain length thresholds under two indexing methods. Each inverted chain length threshold is used as a preset threshold. The indexing rate of chain data in a certain indexing method is greater than that of another indexing method, and the indexing rate of inverted chain data that is not greater than the preset threshold in another indexing method is greater than that of a certain indexing method, so that the indexing rate can be passed through this preset threshold. Set thresholds to determine different index blocks and better indexing methods under different index blocks, so as to achieve fast indexing of the fragments to be indexed.

图17是根据一示例性实施例示出的一种数据索引装置的结构示意图。如图17所示，在一示例性实施例中，该数据索引装置包括：Figure 17 is a schematic structural diagram of a data indexing device according to an exemplary embodiment. As shown in Figure 17, in an exemplary embodiment, the data indexing device includes:

索引块获取模块1710，配置为根据各索引分片的倒排链长度，将倒排链长度不大于预设阈值的倒排链数据拼接为第一索引块，并将倒排链长度大于预设阈值的倒排链数据拼接为第二索引块；The index block acquisition module 1710 is configured to splice the inverted chain data whose length is not greater than the preset threshold into the first index block according to the inverted chain length of each index fragment, and the inverted chain length is greater than the preset threshold. The threshold inverted chain data is spliced into the second index block;

存储模块1730，配置为将第一索引块和第二索引块存储在存储块中；Storage module 1730, configured to store the first index block and the second index block in the storage block;

哈希表建立模块1750，配置为建立哈希表，哈希表包括多个关键值和各关键值对应的编码值，多个关键值分别对应不同的索引分片，编码值包括对应索引分片的索引方式，各索引分片的索引方式与对应索引分片的倒排链数据所在的索引块相关；The hash table creation module 1750 is configured to create a hash table. The hash table includes multiple key values and coding values corresponding to each key value. The multiple key values correspond to different index shards, and the coding values include corresponding index shards. The indexing method of each index fragment is related to the index block where the inverted chain data of the corresponding index fragment is located;

索引模块1770，配置为根据待索引分片的关键值和对应的索引方式在存储块中进行待索引分片的数据索引。The index module 1770 is configured to index the data of the fragment to be indexed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method.

本实施例中，通过上述的数据索引装置可高效进行索引分片的倒排索引，提高多term索引的速率。In this embodiment, the above-mentioned data indexing device can efficiently perform inverted indexing of index fragments and improve the speed of multi-term indexing.

第二索引块获取单元，配置为将进行对齐处理后的各索引分片的倒排链数据进行拼接，得到第二索引块。The second index block acquisition unit is configured to splice the aligned inverted chain data of each index fragment to obtain the second index block.

第一存储单元，配置为将第一索引块的数据存储在第一存储块中；A first storage unit configured to store data of the first index block in the first storage block;

第二存储单元，配置为按预设字节对齐的方式将第二索引块的数据存储在第二存储块中。The second storage unit is configured to store the data of the second index block in the second storage block in a preset byte alignment manner.

在一实施例中，哈希表的编码值还包括对应索引分片的倒排链数据在第一存储块或第二存储块中的存储位置；索引模块包括：In one embodiment, the coded value of the hash table also includes the storage location of the inverted chain data corresponding to the index fragment in the first storage block or the second storage block; the index module includes:

第一索引数据获取单元，配置为基于哈希表确认待索引分片对应的第一关键值以及第一关键值对应的索引方式和存储位置；The first index data acquisition unit is configured to confirm the first key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the first key value based on the hash table;

第一索引单元，配置为若第一关键值对应第一索引方式，则根据第一索引方式和对应存储位置在第一存储块中索引待索引分片的倒排链数据；The first indexing unit is configured to, if the first key value corresponds to the first indexing method, index the inverted chain data of the fragments to be indexed in the first storage block according to the first indexing method and the corresponding storage location;

第二索引单元，配置为若第一关键值对应第二索引方式，则根据第二索引方式和对应存储位置在第二存储块中索引待索引分片的倒排链数据。The second indexing unit is configured to, if the first key value corresponds to the second indexing mode, index the inverted chain data of the fragments to be indexed in the second storage block according to the second indexing mode and the corresponding storage location.

第三存储单元，配置为按预设字节对齐的方式将第一索引块和第二索引块存储在存储块中，其中，存储块的长度等于或大于第一索引块与第二索引块之间的长度和，且存储块中第一索引块的数据存储在第二索引块的数据之后。The third storage unit is configured to store the first index block and the second index block in the storage block in a preset byte alignment manner, wherein the length of the storage block is equal to or greater than the length of the first index block and the second index block. and the data of the first index block in the storage block is stored after the data of the second index block.

在一实施例中，哈希表的编码值还包括对应索引分片的倒排链数据在存储块中的存储位置；索引模块包括：In one embodiment, the coded value of the hash table also includes the storage location of the inverted chain data corresponding to the index fragment in the storage block; the index module includes:

第二索引数据获取单元，配置为基于哈希表确认待索引分片对应的第二关键值以及第二关键值对应的索引方式和存储位置；The second index data acquisition unit is configured to confirm the second key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the second key value based on the hash table;

第三索引单元，配置为若第二关键值对应第三索引方式，则根据第三索引方式和对应存储位置在存储块中索引待索引分片的倒排链数据，第三索引方式为针对第一索引块中的倒排链数据进行索引而设定；The third indexing unit is configured to index the inverted chain data of the fragments to be indexed in the storage block according to the third indexing method and the corresponding storage location if the second key value corresponds to the third indexing method. The third indexing method is for the third indexing method. The inverted chain data in an index block is indexed and set;

第四索引单元，配置为若第二关键值对应第四索引方式，则根据第四索引方式和对应存储位置在存储块中索引待索引分片的倒排链数据，第四索引方式为针对第二索引块中的倒排链数据进行索引而设定。The fourth indexing unit is configured to index the inverted chain data of the fragments to be indexed in the storage block according to the fourth indexing method and the corresponding storage location if the second key value corresponds to the fourth indexing method. The fourth indexing method is for the fourth indexing method. The inverted chain data in the second index block is indexed and set.

初始的哈希表建立单元，配置为预设初始的哈希表，初始的哈希表包括多个初始关键值和各初始关键值对应的初始编码值；The initial hash table creation unit is configured to preset an initial hash table. The initial hash table includes multiple initial key values and initial coding values corresponding to each initial key value;

关键值获取单元，配置为将各索引分片的哈希值分别覆盖初始关键值，得到多个关键值；The key value acquisition unit is configured to overwrite the initial key value with the hash value of each index fragment to obtain multiple key values;

编码值获取单元，配置为将各索引分片的倒排链数据在存储块中的位置和对应的索引方式覆盖于对应的初始编码值，得到哈希表。The coded value acquisition unit is configured to overwrite the position of the inverted chain data of each index fragment in the storage block and the corresponding index method with the corresponding initial coded value to obtain a hash table.

偏移区域覆盖板块，配置为将各索引分片的倒排链数据在存储块中的起始位置信息覆盖于对应的偏移区域，以得到哈希表。The offset area coverage block is configured to cover the starting position information of the inverted chain data of each index fragment in the storage block with the corresponding offset area to obtain a hash table.

在一实施例中，索引方式包括针对第一索引块中的倒排链数据进行索引而设定的二分索引方式，以及针对第二索引块中的倒排链数据进行索引而设定的指令集索引方式；该数据索引装置还包括：In one embodiment, the indexing method includes a binary indexing method set for indexing the inverted chain data in the first index block, and an instruction set set for indexing the inverted chain data in the second index block. Indexing method; the data indexing device also includes:

索引速率获取模块，配置为对各索引分片的倒排链分别基于二分索引方式和指令集索引方式进行索引，相应得到各索引分片的第一索引速率和第二索引速率；The index rate acquisition module is configured to index the inverted chain of each index fragment based on the binary index method and the instruction set index method respectively, and accordingly obtain the first index rate and the second index rate of each index fragment;

预设阈值获取模块，配置为根据各索引分片的倒排链长度，按照预设长度顺序比较各索引分片对应的第一索引速率和第二索引速率之间的数值大小，并根据得到的比较结果来确定预设阈值。The preset threshold acquisition module is configured to compare the numerical size between the first index rate and the second index rate corresponding to each index fragment in the preset length order according to the inverted chain length of each index fragment, and calculate the value according to the obtained Compare the results to determine the preset threshold.

需要说明的是，上述实施例所提供的数据索引装置与上述实施例所提供的数据索引方法属于同一构思，其中各个模块和单元执行操作的具体方式已经在方法实施例中进行了详细描述，此处不再赘述。It should be noted that the data indexing device provided by the above embodiments and the data indexing method provided by the above embodiments belong to the same concept, and the specific manner in which each module and unit performs operations has been described in detail in the method embodiments. No further details will be given.

本申请的实施例还提供了一种电子设备，包括：一个或多个处理器；存储装置，用于存储一个或多个程序，当一个或多个程序被一个或多个处理器执行时，使得电子设备实现上述各个实施例中提供的数据索引方法。Embodiments of the present application also provide an electronic device, including: one or more processors; a storage device configured to store one or more programs. When one or more programs are executed by one or more processors, The electronic device is caused to implement the data indexing method provided in the above embodiments.

需要说明的是，图18示出的电子设备的计算机系统1800仅是一个示例，不应对本申请实施例的功能和使用范围带来任何限制。It should be noted that the computer system 1800 of the electronic device shown in FIG. 18 is only an example, and should not impose any restrictions on the functions and scope of use of the embodiments of the present application.

如图18所示，计算机系统1800包括中央处理单元(Central Processing Unit，CPU)1801，其可以根据存储在只读存储器(Read-Only Memory，ROM)1802中的程序或者从存储部分1808加载到随机访问存储器(Random Access Memory，RAM)1803中的程序而执行各种适当的动作和处理，例如执行上述实施例中的方法。在RAM 1803中，还存储有系统操作所需的各种程序和数据。CPU 1801、ROM 1802以及RAM 1803通过总线1804彼此相连。输入/输出(Input/Output，I/O)接口1805也连接至总线1804。As shown in Figure 18, the computer system 1800 includes a central processing unit (Central Processing Unit, CPU) 1801, which can be loaded into a random computer according to a program stored in a read-only memory (Read-Only Memory, ROM) 1802 or from a storage part 1808. Access the program in the memory (Random Access Memory, RAM) 1803 to perform various appropriate actions and processes, such as performing the methods in the above embodiments. In RAM 1803, various programs and data required for system operation are also stored. The CPU 1801, ROM 1802, and RAM 1803 are connected to each other through a bus 1804. An input/output (I/O) interface 1805 is also connected to bus 1804.

以下部件连接至I/O接口1805：包括键盘、鼠标等的输入部分1806；包括诸如阴极射线管(Cathode Ray Tube，CRT)、液晶显示器(Liquid Crystal Display，LCD)等以及扬声器等的输出部分1807；包括硬盘等的存储部分1808；以及包括诸如LAN(Local AreaNetwork，局域网)卡、调制解调器等的网络接口卡的通信部分1809。通信部分1809经由诸如因特网的网络执行通信处理。驱动器1810也根据需要连接至I/O接口1805。可拆卸介质1811，诸如磁盘、光盘、磁光盘、半导体存储器等等，根据需要安装在驱动器1810上，以便于从其上读出的计算机程序根据需要被安装入存储部分1808。The following components are connected to the I/O interface 1805: an input part 1806 including a keyboard, a mouse, etc.; an output part 1807 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc. ; a storage section 1808 including a hard disk, etc.; and a communication section 1809 including a network interface card such as a LAN (Local Area Network) card, a modem, etc. The communication section 1809 performs communication processing via a network such as the Internet. Driver 1810 is also connected to I/O interface 1805 as needed. Removable media 1811, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, etc., are installed on the drive 1810 as needed, so that a computer program read therefrom is installed into the storage portion 1808 as needed.

特别地，根据本申请的实施例，上文参考流程图描述的过程可以被实现为计算机软件程序。例如，本申请的实施例包括一种计算机程序产品，其包括承载在计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的计算机程序。在这样的实施例中，该计算机程序可以通过通信部分1809从网络上被下载和安装，和/或从可拆卸介质1811被安装。在该计算机程序被中央处理单元(CPU)1801执行时，执行本申请的系统中限定的各种功能。In particular, according to embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present application include a computer program product including a computer program carried on a computer-readable medium, the computer program including a computer program for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from the network via communications portion 1809, and/or installed from removable media 1811. When the computer program is executed by the central processing unit (CPU) 1801, various functions defined in the system of the present application are executed.

需要说明的是，本申请实施例所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory，EPROM)、闪存、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory，CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请中，计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的计算机程序。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的计算机程序可以用任何适当的介质传输，包括但不限于：无线、有线等等，或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination thereof. More specific examples of computer readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read only memory (ROM), removable Erasable Programmable Read Only Memory (EPROM), flash memory, optical fiber, portable compact disk read-only memory (Compact Disc Read-Only Memory, CD-ROM), optical storage device, magnetic storage device, or any of the above suitable The combination. As used herein, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which a computer-readable computer program is carried. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device . Computer programs embodied on computer-readable media may be transmitted using any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.

附图中的流程图和框图，图示了按照本申请各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。其中，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图或流程图中的每个方框、以及框图或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. Each block in the flow chart or block diagram may represent a module, program segment, or part of the code. The above-mentioned module, program segment, or part of the code includes one or more executable components for implementing the specified logical function. instruction. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It will also be noted that each block in the block diagram or flowchart illustration, and combinations of blocks in the block diagram or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or operations, or may be implemented by special purpose hardware-based systems that perform the specified functions or operations. Achieved by a combination of specialized hardware and computer instructions.

描述于本申请实施例中所涉及到的单元可以通过软件的方式实现，也可以通过硬件的方式来实现，所描述的单元也可以设置在处理器中。其中，这些单元的名称在某种情况下并不构成对该单元本身的限定。The units involved in the embodiments of this application can be implemented in software or hardware, and the described units can also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.

本申请的另一方面还提供了一种计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现如前的数据索引方法。该计算机可读存储介质可以是上述实施例中描述的电子设备中所包含的，也可以是单独存在，而未装配入该电子设备中。Another aspect of the present application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the above data indexing method is implemented. The computer-readable storage medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device.

本申请的另一方面还提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述各个实施例中提供的数据索引方法。Another aspect of the present application also provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data indexing method provided in the above embodiments.

上述内容，仅为本申请的较佳示例性实施例，并非用于限制本申请的实施方案，本领域普通技术人员根据本申请的主要构思和精神，可以十分方便地进行相应的变通或修改，故本申请的保护范围应以权利要求书所要求的保护范围为准。The above content is only a preferred exemplary embodiment of the present application and is not intended to limit the implementation of the present application. Those of ordinary skill in the art can easily make corresponding modifications or modifications based on the main concept and spirit of the present application. Therefore, the protection scope of this application should be subject to the protection scope required by the claims.

Claims

1. A data indexing method, characterized by including:

According to the inverted chain length of each index fragment, the inverted chain data whose inverted chain length is not greater than the preset threshold is spliced into the first index block, and the inverted chain data whose inverted chain length is greater than the preset threshold is spliced into the first index block. Spliced into the second index block;

store the first index block and the second index block in a storage block;

Create a hash table. The hash table includes a plurality of key values and a coding value corresponding to each key value. The plurality of key values respectively correspond to different index fragments. The coding value includes an indexing method corresponding to the index fragment. , the indexing method of each index fragment is related to the index block where the inverted chain data of the corresponding index fragment is located;

Data indexing of the fragment to be indexed is performed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method.

2. The method according to claim 1, characterized in that, according to the inverted chain length of each index fragment, inverted chain data whose inverted chain length is not greater than a preset threshold is spliced into the first index block, And splicing the inverted chain data whose inverted chain length is greater than the preset threshold into a second index block, including:

Align the inverted chain data of each index fragment whose inverted chain length is greater than the preset threshold;

The aligned inverted chain data of each index fragment is spliced to obtain the second index block.

3. The method of claim 1, wherein the number of storage blocks is multiple, and storing the first index block and the second index block in the storage block includes:

Store the data of the first index block in the first storage block;

The data of the second index block is stored in the second storage block in a preset byte alignment manner.

4. The method according to claim 3, characterized in that the encoded value of the hash table further includes the inverted chain data corresponding to the index fragment in the first storage block or the second storage block. Storage location; the data indexing of the fragment to be indexed in the storage block based on the key value of the fragment to be indexed and the corresponding indexing method includes:

Confirm the first key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the first key value based on the hash table;

If the first key value corresponds to the first indexing method, index the inverted chain data of the fragment to be indexed in the first storage block according to the first indexing method and the corresponding storage location;

If the first key value corresponds to the second indexing method, the inverted chain data of the fragment to be indexed is indexed in the second storage block according to the second indexing method and the corresponding storage location.

5. The method of claim 1, wherein storing the first index block and the second index block in a storage block includes:

The first index block and the second index block are stored in the storage block in a preset byte alignment manner, wherein the length of the storage block is equal to or greater than the length of the first index block and the The length sum between the second index blocks, and the data of the first index block in the storage block is stored after the data of the second index block.

6. The method according to claim 5, characterized in that the coded value of the hash table further includes the storage location of the inverted chain data corresponding to the index fragment in the storage block; The key value of the slice and the corresponding indexing method are used to index the data of the slice to be indexed in the storage block, including:

Confirm the second key value corresponding to the fragment to be indexed and the indexing method and storage location corresponding to the second key value based on the hash table;

If the second key value corresponds to the third indexing method, the inverted chain data of the fragment to be indexed is indexed in the storage block according to the third indexing method and the corresponding storage location, and the third indexing method Set for indexing the inverted chain data in the first index block;

If the second key value corresponds to the fourth indexing method, the inverted chain data of the fragment to be indexed is indexed in the storage block according to the fourth indexing method and the corresponding storage location, and the fourth indexing method Set for indexing the inverted chain data in the second index block.

7. The method according to claim 1, characterized in that said establishing a hash table includes:

Preset an initial hash table, which includes a plurality of initial key values and an initial code value corresponding to each initial key value;

Cover the initial key value with the hash value of each index shard respectively to obtain multiple key values;

Allocate an index method to the corresponding index shard according to the index block where each inverted chain data is located;

The position of the inverted chain data of each index fragment in the storage block and the corresponding index mode are overwritten with the corresponding initial encoding value to obtain the hash table.

8. The method according to claim 7, characterized in that the position of the inverted chain data of each index fragment in the storage block and the corresponding index mode are overwritten with the corresponding initial encoding value to obtain the The above hash table includes:

Divide each initial coding value into regions to obtain the index mode area, data length area and offset area corresponding to each initial coding value;

Cover the index mode information corresponding to each index fragment in the corresponding index mode area;

Cover the inverted chain length information of each index fragment in the corresponding data length area;

The starting position information of the inverted chain data of each index fragment in the storage block is overwritten in the corresponding offset area to obtain the hash table.

9. The method of claim 1, wherein the indexing method includes a binary indexing method set for indexing the inverted chain data in the first index block, and a binary indexing method for the second index. The instruction set indexing method set by indexing the inverted chain data in the block; according to the inverted chain length of each index fragment, the inverted chain data whose length is not greater than the preset threshold is spliced into the first An index block, and before splicing the inverted chain data whose inverted chain length is greater than the preset threshold into a second index block, the method further includes:

Index the inverted chain of each index fragment based on the binary indexing method and the instruction set indexing method respectively, and accordingly obtain the first index rate and the second index rate of each index fragment;

According to the inverted chain length of each index fragment, compare the numerical values between the first index rate and the second index rate corresponding to each index fragment in the preset length sequence, and determine the preset value based on the obtained comparison result. threshold.

10. A data indexing device, characterized by comprising:

The index block acquisition module is configured to splice the inverted chain data whose length is not greater than a preset threshold into the first index block according to the inverted chain length of each index fragment, and to splice the inverted chain data whose length is greater than the preset threshold into the first index block. Set the threshold inverted chain data to be spliced into the second index block;

A storage module configured to store the first index block and the second index block in a storage block;

A hash table creation module configured to create a hash table. The hash table includes a plurality of key values and a coding value corresponding to each key value. The plurality of key values respectively correspond to different index fragments. The coding value Including the indexing method of the corresponding index shard, the indexing method of each index shard is related to the index block where the inverted chain data of the corresponding index shard is located;

An index module is configured to perform data indexing on the fragment to be indexed in the storage block according to the key value of the fragment to be indexed and the corresponding indexing method.

11. An electronic device, characterized in that it includes:

one or more processors;

Storage device, used to store one or more computer programs, when the one or more computer programs are executed by the one or more processors, so that the electronic device implements any one of claims 1-9 method described in the item.

12. A computer-readable storage medium, characterized in that computer-readable instructions are stored thereon. When the computer-readable instructions are executed by a processor of a computer, the computer is caused to execute any one of claims 1-9. method described in the item.

13. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the method according to any one of claims 1 to 9.