WO2008009995A2 - System and method for indexing stored electronic data using a b-tree - Google Patents
- Publication number: WO2008009995A2 (PCT/GB2007/050424)
- Authority: WIPO (PCT)
- Prior art keywords: block, file, data, storing, computer system
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
Definitions
- Each entry in the lexicon will contain four pointers referencing two different chains in the postings file.
- If pointers are set to -1 (NILL) they are invalid. If they are set to -2 then there is no postings chain, but there is one document that contains the term, and that document's ID is stored in LastDocID.
- This optimisation is used to save disk space by not creating a postings block for the term until the term has occurred in at least two documents. This way if the term is random junk that will never occur again we haven't used up a whole 512 byte postings block to record it. Random junk can account for up to 50% of the entire lexicon. Unfortunately we can't just discard these terms as the rarer a term is, the more important it can be in a query.
- a two level cache of recently used B-tree blocks will be maintained to prevent a sudden run of new blocks from forcing the really commonly used blocks out of the cache as the new blocks will only be able to evict blocks from the warm cache.
- the lexicon will use a bitmap free list to track free blocks. This will require 32MB of space for every 1TB of file size (assuming 4KB blocks), which means the free list will easily fit in memory as the lexicon is only expected to grow to 1GB at most. To improve the efficiency of searching for a free block the text index will keep track of the first free block so that the search can start from the first known free block rather than from the beginning. This scheme is also used in the object file store, where the average number of comparisons when searching for a free block was 18, discounting situations with no free blocks, in which only one comparison is required.
- While running, the free list bitmap will be held entirely in memory.
- When shutting down the text index will store the free list bitmap in some of the free blocks and will update the FreeListHead field in the header to point to the first of these.
- Each of these blocks will start with a 40 bit pointer to the next block followed by the bitmap data.
- the text index will be able to calculate the length of the valid bitmap data by checking the file size of the lexicon.
- Figure 6 shows the format of each of the 512 byte blocks in the postings file.
- Elias delta coding: this is a variable length integer coding scheme that allows integers with lower values to consume fewer bits (the integer 1 requires only 1 bit of storage).
- the encoding consists of three parts: a unary length, a binary length, and a binary value.
- the unary length specifies the length of the binary length which in turn specifies the length of the binary value.
- a full description of the encoding scheme is outside the scope of this document and can be easily found on the internet or in reference books.
- the Document IDs are converted to differences before encoding to reduce the value of the integer being stored and thus increase the effectiveness of the Elias coding. This scheme means that if a term appears in every document it only takes 1 bit of storage per document.
- the non-textual property postings which do not need the property ID or position information will be stored in the main text postings format.
- the API has been modified so that the text index must be informed when each document ID is complete, even if no data has been given to the text index for that document. This allows the text index to know exactly which document IDs are outstanding.
- If the text index receives data for a document early (for example, if it has completed up to document 10 and receives data for document 12 while document 11 is not yet complete), it will try to cache the terms for document 12 in memory until document 11 is complete. If it runs out of memory it will add a reference to document 11 to any term occurring in document 12, but it will mark the reference as invalid in the 'valid bitmap' in the block. If document 11 is later found to really contain that term, the reference will be marked as valid.
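By way of illustration, the bitmap free list with a tracked first-free-block pointer described above might be sketched as follows. This is a minimal sketch, not the patent's implementation; the class and field names are assumptions introduced for clarity.

```python
class FreeListBitmap:
    """Sketch of a bitmap free list that remembers the first known free
    block, so a search can start there instead of at block zero."""

    def __init__(self, num_blocks: int):
        self.bits = [False] * num_blocks  # False = free, True = allocated
        self.first_free = 0               # index of the first known free block

    def allocate(self) -> int:
        # scan from the first known free block rather than the beginning
        for i in range(self.first_free, len(self.bits)):
            if not self.bits[i]:
                self.bits[i] = True
                self.first_free = i + 1   # next search starts after this block
                return i
        raise MemoryError("no free blocks")

    def free(self, i: int) -> None:
        self.bits[i] = False
        # keep first_free no higher than the lowest free block
        if i < self.first_free:
            self.first_free = i
```

A freed block below the tracked pointer pulls the pointer back down, so the next allocation reuses the earliest free block, as in the scheme described above.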
Abstract
A computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
Description
SYSTEM AND METHOD FOR INDEXING STORED ELECTRONIC DATA USING A B-TREE
The present invention relates to a system for storing electronic data and also to a system for creating an index for electronic data.
Known systems for storing electronic data, for example the systems used to store data on a hard disc of a computer, create files on the storage medium where there is a run of contiguous free space. They do this to minimise fragmentation of the data to avoid excessive movements of the read head back and forth across the disc so that read performance is maximised. However writing speed can be compromised by this system, particularly for writing to lots of small files at once because the write head must move backwards and forwards between the files.
US 2003/0070056 describes a storage structure for a random access medium having equal sized sequentially numbered logical sectors with separate levels for data, format and location and pointers. This is particularly advantageous for mapping data of different formats and the resulting structure is optimised for high performance random access to the data but is not optimised for speed of writing.
According to a first aspect of the present invention there is provided a computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
A corresponding method is also provided.
The lexicon contains an entry for each unique term or word and the lexicon is implemented as a B-tree, written to disc on shutdown of a computer system and read back into memory on startup of the computer. Each entry in the lexicon points to a chain of document numbers that contain that term. The document numbers are known as postings and are stored in a postings file. The postings file is divided into 512 byte blocks linked together to form the individual chains for each term in the lexicon. Each node in the postings file stores the difference between successive document numbers for a term, and this difference is stored in coded form, for example using Elias delta coding. Such encoding results in an index consuming an average of 6 bits per document term.
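The scheme just described, converting ascending document numbers to gaps and Elias delta coding each gap, might be sketched as follows. The function names are illustrative, not from the patent.

```python
def elias_delta_encode(n: int) -> str:
    """Elias delta code for a positive integer, returned as a bit string."""
    binary = bin(n)[2:]                # e.g. '101' for 5
    length_bits = bin(len(binary))[2:] # binary representation of bit length
    # unary prefix (zeros) giving the length of length_bits, then
    # length_bits itself, then the value with its leading 1 bit dropped
    return "0" * (len(length_bits) - 1) + length_bits + binary[1:]

def encode_postings(doc_ids: list[int]) -> str:
    """Convert ascending document IDs to gaps, then delta-code each gap."""
    bits, previous = "", 0
    for doc_id in doc_ids:
        bits += elias_delta_encode(doc_id - previous)
        previous = doc_id
    return bits
```

Because the integer 1 encodes as a single bit, a term appearing in every document (all gaps equal to 1) consumes one bit per document, consistent with the figure quoted above.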
The indexing system of the invention is also able to cope with indexing data other than text, such as file name, URL, comment field, and also IP addresses and file types without requiring large amounts of memory even though some such terms are often unique. The invention can also handle phrase queries and breaks a URL into its constituent parts. In addition it will allow random access searches, particularly for multi-term query processing.
According to a second aspect of the present invention there is provided a computer system for storing electronic data representing a file, the system comprising: a storage medium divided into equal sized memory blocks; means for writing the data into at least one data block sequentially such that the first free data block is filled with first received data and subsequently received data is written sequentially to subsequent free data blocks; means for storing a block list pointer to identify the or each data block in which the data for the file is stored, as respective offsets from the beginning of the storage medium; means for storing index pointers to identify the or each block list in at least one index block; and means for tracking the address of each free data block.
A corresponding method is also provided.
This system of writing the data sequentially substantially speeds up the process of writing data and allows extremely large volumes of data to be written quickly. This is particularly useful for archiving purposes or for saving network data traffic in real time. The reading speed of data stored in this sequential manner is not as high as it would be with normal systems because the sequential storage tends to lead to very fragmented files. However, for applications in which it is most important to save a large amount of data at a fast rate, particularly where the data may not need to be accessed frequently, the system of the invention is advantageous.
When text is stored in electronic form it is desirable to provide an index so that word searches of text documents can be performed by a user. Known such indexing systems tend to be slow, require a large amount of memory and be relatively inflexible. They usually require all the data to be indexed to be available at once and they create an index in a batch operation which can take a long time. Hence they are usually incapable of indexing the data in real time as it becomes available.
It is desirable to provide a fast, flexible indexing system which can also handle indexation of data other than text without overburdening the memory requirements.
The invention may be applied to a computer system for indexing a fragment of text from a document, the system comprising: means for dividing the text into individual terms; means for stemming the individual terms using a predetermined algorithm; means for comparing each stemmed term with the terms already stored in the lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
A corresponding method is also envisaged and computer programs and computer apparatus corresponding to all aspects of the invention are also provided.
A further aspect of the invention provides a method and a computer system combining the first and second aspects of the invention.
For a better understanding of the present invention, and to show how the same may be carried into effect, reference will now be made to the accompanying drawings, in which:
Figure 1 is a schematic illustration of the format of block types which may be used in one embodiment of the first aspect of the invention;
Figures 2 to 4 are flow diagrams illustrating operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4) according to the second aspect of the present invention;
Figure 5 illustrates a B-tree node in the lexicon for an embodiment of the second aspect of the invention.
Figure 6 illustrates a postings node for an embodiment of the second aspect of the invention.
In the object file store system according to the first aspect of the invention, the memory 1 is divided up into equal sized blocks 2 and in this example the fields within the blocks are in little endian format and all addresses are 64 bit file offsets.
A single header block 3 is provided comprising an 8 byte sequence identifying the file type, a pointer to the first block list that identifies the free list bitmap block, a pointer to the root node of the index, a dirty flag D and a node size.
A plurality of index blocks 4 is provided as a standard B-tree with all files identified by a single 64 bit integer. The number of blocks grows as more entries are added to the index. An index node comprises a field specifying the number of index entries in the b-tree node, an array of index entries (first entry index zero), an array of 64 bit pointers to other B-tree index nodes, and padding to round the block size up to the block size defined in the file header.
An index entry comprises an 8 byte identifier for the file, a block list address, i.e. a pointer to the first block list for the file, and a field specifying the length of the file.
A plurality of block list nodes are provided, and they are filled in backwards to allow each to share a disk block with a data block when space allows. Each block list node 5 comprises a length field defined by the number of block pointers stored in the block list, a pointer to a data block and a pointer to the next block list. The block lists can contain many pointers to data blocks - as many as will fit before the block is full.
There are a plurality of data blocks or block nodes 6 comprising simply data and padding.
A block list 5 and data block 6 can also be combined as a hybrid block. For example when a file requires only one block list it can be placed in the last data block if space allows so that small files need only one block for all their storage. If there is not enough room however then the block list will be stored in its own block.
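The write path of the file store described above might be sketched as follows: data always lands in the next free blocks in order, and a per-file block list records where it went. This is a toy sketch, not the on-disk format; the class name and in-memory representation are assumptions, and a real store would also record the file length (as the index entry does) so padding can be trimmed on read.

```python
BLOCK_SIZE = 512  # illustrative; the real store defines block size in its header

class SequentialStore:
    """Toy sketch of sequential block writing with per-file block lists."""

    def __init__(self):
        self.blocks: list[bytes] = []                # the medium, block by block
        self.block_lists: dict[int, list[int]] = {}  # file id -> block offsets

    def write(self, file_id: int, data: bytes) -> None:
        offsets = self.block_lists.setdefault(file_id, [])
        for start in range(0, len(data), BLOCK_SIZE):
            chunk = data[start:start + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\x00")
            offsets.append(len(self.blocks))  # the first free block is the next one
            self.blocks.append(chunk)

    def read(self, file_id: int) -> bytes:
        return b"".join(self.blocks[i] for i in self.block_lists[file_id])
```

Interleaving writes to two files shows the trade-off the description makes explicit: writing is a pure append, but each file's blocks end up scattered, so reads must follow the block list across the medium.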
There are also free list blocks 7. Free list bitmap data is only written to disk when the file store is closed. If the file store is not cleanly closed then the free list bitmap will be recreated by scanning the file.
The text index according to the second aspect of the invention uses 40 bit block numbers for the file offsets. Using 512 byte blocks this increases the maximum file size by 512 times but does not increase the overhead in the offset calculations too much. The lexicon may be compressed using known techniques.
An internal index for each node is provided allowing a binary search within the node.
The lexicon structure is implemented as a B-tree structure to allow it to overflow to disc if it grows too large. Preferably the caching strategy is arranged to ensure that junk entries fall out to disk to prevent excessive slowdown.
Each lexicon entry points to two different postings chains, one at a document level for the main text and one at word level for the properties. This allows the size of the index to be kept manageable.
The postings nodes are doubly linked to allow queries to return results in both ascending and descending order, and contain embedded skip-lists to allow for relatively quick random access. Although not true random access as for example using a full B-tree, it is more efficient from a processing point of view. Each postings chain may have additional markers showing where each day begins allowing for easy rejection of documents outside of a date range being queried.
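The doubly linked postings chain might be sketched as follows; the names are illustrative, and the embedded skip pointer is shown as a field only (a real chain would populate it every few nodes to accelerate random access).

```python
class PostingsNode:
    """Sketch of a doubly linked postings node; 'skip' may point several
    nodes ahead, forming the embedded skip list described above."""
    def __init__(self, doc_id: int):
        self.doc_id = doc_id
        self.prev = self.next = self.skip = None

def append(tail, doc_id):
    """Link a new node after the current tail; returns the new tail."""
    node = PostingsNode(doc_id)
    if tail is not None:
        tail.next, node.prev = node, tail
    return node

def walk(node, forward=True):
    """Yield document IDs in ascending (forward) or descending order,
    which is what the double linking makes possible."""
    while node is not None:
        yield node.doc_id
        node = node.next if forward else node.prev
```

Walking from the head yields ascending document IDs, and walking from the tail yields descending ones, matching the query behaviour described above.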
The text index can use plugins to support different stemming algorithms, including an option to disable stemming or to support multiple stemming algorithms in the same index.
Figures 2 to 4 illustrate the operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4).
Figure 2 illustrates the process for document body text processing. When the text index receives a fragment of a document body text, it will split the text fragment into individual terms, including reassembling terms that are spread across fragment boundaries and those that are spread across multiple fragments. The individual terms will then be stemmed using the currently active stemming algorithm and looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be stored in the postings chain for the term.
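The figure 2 pipeline (split, stem, lexicon lookup, postings update) might be sketched as follows. This is a minimal illustration: the lexicon is a plain dictionary rather than a B-tree, a lowercasing function stands in for the pluggable stemming algorithm, and reassembly of terms across fragment boundaries is omitted.

```python
import re

def index_body_fragment(text, lexicon, doc_id, stem=lambda t: t.lower()):
    """Split a body-text fragment into terms, stem each one, and record
    the document ID in the term's postings chain (figure 2, sketched)."""
    for term in re.findall(r"[A-Za-z0-9]+", text):
        postings = lexicon.setdefault(stem(term), [])  # add term if new
        if not postings or postings[-1] != doc_id:     # one entry per document
            postings.append(doc_id)
```

After indexing a fragment for document 7, each stemmed term in the fragment maps to a postings list containing 7; a later fragment for document 8 appends 8 to the lists of terms it shares.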
Figure 3 illustrates the process for textual document property processing. When the text index receives a textual document property, it will split the text into individual terms. The individual terms will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID, property ID, and the position of the term in the property will be stored in the postings chain for the term.
The textual document properties store position information to allow queries to match phrases, so if we index 'user@address.com' as {'user', 1}, {'address', 2}, {'com', 3} then a query for 'user@address.com' will match, but a query for 'address@user.com' will not.
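The position-based phrase matching just described might be sketched as follows; the representation of a posting as a (document ID, property ID, position) tuple is an illustrative assumption, not the patent's encoding.

```python
def phrase_match(postings, terms, doc_id, prop_id):
    """Sketch: check that the terms occur at consecutive positions within
    one property of one document. `postings` maps a term to a set of
    (doc_id, prop_id, position) tuples, as in figure 3's scheme."""
    for (d, p, pos) in postings.get(terms[0], set()):
        if d == doc_id and p == prop_id and all(
            (doc_id, prop_id, pos + k) in postings.get(t, set())
            for k, t in enumerate(terms[1:], start=1)
        ):
            return True
    return False
```

With 'user@address.com' indexed at positions 1, 2, 3, the in-order phrase matches but the reordered one does not, reproducing the example above.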
The property ID is also stored to differentiate between the different types of properties; this allows the user to restrict their search to a particular property type such as the subject of an e-mail or a URL. Unlike the processing for non-textual properties, the property type is not appended to the terms, as this would dramatically increase the size of the lexicon, roughly multiplying it by the number of property types. Instead, when querying, terms from other property types will be skipped over; the extra processing involved in ignoring hits should be negligible compared with the extra processing that would be required if the lexicon grew to accommodate unique terms per property type.
Figure 4 illustrates the process for other document property processing. When the text index receives a non-textual property, it will append the property ID to the term, making the term unique for the particular property type. The term will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be added to the postings chain for the term.
The non-textual document properties have the property type appended to the term (property value) to allow the user to restrict a query to a particular property type, such as an IP address or file type. Unlike the textual properties, this should not dramatically increase the size of the lexicon and will allow for very efficient searching.
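A minimal sketch of this handling: the property ID is appended to the term so that each (value, property-type) pair becomes a unique lexicon key. The specific property IDs and the separator byte are assumptions for illustration only.

```python
PROP_IP = 4        # hypothetical property ID for "IP address"
PROP_FILETYPE = 9  # hypothetical property ID for "file type"

def index_property(value, prop_id, doc_id, lexicon):
    # Append the property ID, making the term unique per property type.
    key = f"{value}\x00{prop_id}"
    lexicon.setdefault(key, []).append(doc_id)

lexicon = {}
index_property("10.0.0.1", PROP_IP, doc_id=3, lexicon=lexicon)
index_property("pdf", PROP_FILETYPE, doc_id=3, lexicon=lexicon)

# A query restricted to IP addresses only looks up keys with that property ID:
print(lexicon[f"10.0.0.1\x00{PROP_IP}"])   # [3]
```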
The lexicon and the postings are stored as two separate files, both to simplify management of space within the files and because the postings require much smaller blocks than the lexicon in order to avoid consuming too much space for terms that have only a few entries. If the index were implemented as a single file, we would have problems with fragmentation as the larger blocks were reused as postings blocks and then freed again in chunks too small to be reused for the lexicon.
The structure of each node in the lexicon file is illustrated in Figure 5. The nodes are preceded in the file by a header block.
There are some terms which will not be added to the lexicon, as they have the potential to dramatically increase its size while providing little or no benefit to the end user.
The text index will not index any number greater than 9,999 that is found in a document's main text or textual properties. If larger numbers were permitted, it would be possible to send a stream of ever-increasing numbers in a document, forcing the text index lexicon to grow without bound. By restricting numbers to a maximum of four digits, the user is still able to search on smaller numbers such as a year.
The text index will not index any term from a document's main text or textual properties that consists of both numbers and letters. Instead, the term will be divided into component parts which consist only of letters or only of numbers. For example, the term 'user1234' would be indexed as the two terms 'user' and '1234'.
When indexing a fragment of text from a document body, if the text index finds binary data (byte values over 127 or below 32) then it will stop indexing the document's body text, ignoring any further fragments. This is done to prevent binary data that is accidentally passed to the index from polluting the lexicon with junk entries. This should only occur if the monitoring system has mis-identified or failed to properly decode a document, and therefore should not impact the accuracy of the index.
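The three exclusion rules above can be combined into one tokenizer sketch: numbers above 9,999 are dropped, mixed letter/digit terms are split into letter-only and digit-only runs, and any byte outside the stated range aborts indexing of the body text. The function name and return conventions are illustrative, not from the specification.

```python
import re

def extract_terms(fragment):
    # Binary data (byte values over 127 or below 32) stops body-text indexing.
    if any(b < 32 or b > 127 for b in fragment.encode("latin-1", "replace")):
        return None
    terms = []
    # Split mixed terms into letter-only and digit-only runs.
    for run in re.findall(r"[A-Za-z]+|[0-9]+", fragment):
        if run.isdigit() and int(run) > 9999:
            continue          # unbounded numbers would grow the lexicon
        terms.append(run.lower())
    return terms

print(extract_terms("user1234 in 2006"))   # ['user', '1234', 'in', '2006']
print(extract_terms("port 123456"))        # ['port'] - number too large
print(extract_terms("abc\x01def"))         # None - binary data detected
```

Note that, as in the specification, the byte-range test is strict: even a newline (byte value 10) would count as binary data, so a real tokenizer would exempt whitespace.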
Each entry in the lexicon will contain four pointers referencing two different chains in the postings file. The first two point to the head and tail of the postings chain for document body text and the second two point to the head and tail of the postings chain for document properties. This is done to allow the format of postings for the main text and the properties to be different, allowing the properties to have a word level index and the main text to have a document level index.
If the pointers are set to -1 (NIL) they are invalid. If they are set to -2 then there is no postings chain, but there is one document that contains the term, and that document's ID is stored in LastDocID. This optimisation saves disk space by not creating a postings block for the term until the term has occurred in at least two documents. This way, if the term is random junk that will never occur again, we have not used up a whole 512 byte postings block to record it. Random junk can account for up to 50% of the entire lexicon. Unfortunately we cannot simply discard these terms, because the rarer a term is, the more important it can be in a query.
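The single-document optimisation above can be sketched as follows: a postings block is only allocated once a term has been seen in a second document. The LastDocID field name mirrors the description; the class and the use of a Python list in place of a 512-byte block are illustrative.

```python
NIL = -1          # pointer invalid
SINGLE_DOC = -2   # no postings chain yet; the one doc ID lives in LastDocID

class LexiconEntry:
    def __init__(self):
        self.head = NIL
        self.last_doc_id = None
        self.chain = None                 # stands in for a 512-byte postings block

    def add(self, doc_id):
        if self.head == NIL:              # first occurrence: no block allocated
            self.head = SINGLE_DOC
            self.last_doc_id = doc_id
        elif self.head == SINGLE_DOC:     # second occurrence: allocate the chain
            self.chain = [self.last_doc_id, doc_id]
            self.head = 0                 # would be the block's file offset
        else:
            self.chain.append(doc_id)

e = LexiconEntry()
e.add(5)
print(e.head, e.chain)    # -2 None: possibly-junk term, no block spent on it
e.add(9)
print(e.head, e.chain)    # 0 [5, 9]: real chain created on the second occurrence
```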
A two-level cache of recently used B-tree blocks will be maintained. This prevents a sudden run of new blocks from forcing the most commonly used blocks out of the cache, as the new blocks are only able to evict blocks from the warm cache.
The lexicon will use a bitmap free list to track free blocks. This requires 32MB of space for every 1TB of file size (assuming 4KB blocks), which means the free list will easily fit in memory as the lexicon is only expected to grow to 1GB at most. To improve the efficiency of searching for a free block, the text index will keep track of the first free block so that the search can start from the first known free block rather than from the beginning. This scheme is also used in the object file store, where the average number of comparisons when searching for a free block was 18, discounting situations where there were no free blocks, in which only one comparison is required.
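A minimal sketch of this scheme: one bit per block, plus a hint recording the first known free block so searches do not restart from block zero. The API is an assumption for illustration; for clarity the bitmap here spends a whole byte per block rather than a bit.

```python
class FreeList:
    def __init__(self, n_blocks):
        self.bits = bytearray(n_blocks)   # 0 = free, 1 = in use (one byte per
        self.first_free = 0               # block here; on disk it is one bit)

    def allocate(self):
        # Start scanning from the first known free block, not the beginning.
        i = self.first_free
        while i < len(self.bits) and self.bits[i]:
            i += 1
        if i == len(self.bits):
            raise RuntimeError("no free blocks")
        self.bits[i] = 1
        self.first_free = i + 1           # next search starts past this block
        return i

    def free(self, i):
        self.bits[i] = 0
        self.first_free = min(self.first_free, i)

fl = FreeList(8)
print([fl.allocate() for _ in range(3)])  # [0, 1, 2]
fl.free(1)
print(fl.allocate())                      # 1 - found via the first-free hint
```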
While running, the free list bitmap will be held entirely in memory. When shutting down, the text index will store the free list bitmap in some of the free blocks and will update the FreeListHead field in the header to point to the first of these. Each of these blocks will start with a 40-bit pointer to the next block, followed by the bitmap data. The text index will be able to calculate the length of the valid bitmap data by checking the file size of the lexicon.
Figure 6 shows the format of each of the 512-byte blocks in the postings file.
All of the integers in the data area of the postings node will be encoded using Elias delta coding. This is a variable-length integer coding scheme that allows integers with lower values to consume fewer bits (the integer 1 requires only 1 bit of storage). The encoding consists of three parts: a unary length, a binary length, and a binary value. The unary length specifies the length of the binary length, which in turn specifies the length of the binary value. A full description of the encoding scheme is outside the scope of this document and can easily be found on the internet or in reference books.
The document IDs are converted to differences before encoding to reduce the value of the integer being stored and thus increase the effectiveness of the Elias coding. This scheme means that if a term appears in every document, it only takes 1 bit of storage per document.
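The encoder and the difference transform can be sketched as below. Bits are kept as strings for clarity; a real index packs them into the block's data area. The function names are illustrative.

```python
def delta_encode(x):                       # Elias delta code of x >= 1
    n = x.bit_length() - 1                 # floor(log2 x)
    l = (n + 1).bit_length()
    out = "0" * (l - 1)                    # unary length of the binary length
    out += format(n + 1, "b")              # binary length (length of x in bits)
    out += format(x, "b")[1:]              # binary value, leading 1 implied
    return out

def encode_doc_ids(doc_ids):
    prev, bits = 0, ""
    for d in doc_ids:
        bits += delta_encode(d - prev)     # store differences, not raw IDs
        prev = d
    return bits

print(delta_encode(1))                     # '1' - a gap of 1 costs a single bit
print(delta_encode(9))                     # '00100001'
# A term in every one of 5 consecutive documents: all gaps are 1, 5 bits total.
print(encode_doc_ids([1, 2, 3, 4, 5]))     # '11111'
```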
For the textual property postings, which also store a property ID and position, these integers will likewise use the Elias delta encoding. The property IDs will be rearranged so that the most common properties, such as filename and URL, have the lowest values. The positions will already be small values.
The non-textual property postings, which do not need the property ID or position information, will be stored in the main text postings format.
The document ID differences will be restarted in each block so that it is not necessary to decode the whole chain to read a value, allowing random access via skip lists.
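The per-block restart can be sketched as follows: each block stores its first document ID absolutely, so a block reached via a skip list can be decoded in isolation. The block size of four entries and the function names are illustrative.

```python
def build_blocks(doc_ids, per_block=4):
    blocks = []
    for i in range(0, len(doc_ids), per_block):
        chunk = doc_ids[i:i + per_block]
        # First entry absolute; the rest as differences restarted in this block.
        blocks.append([chunk[0]] + [b - a for a, b in zip(chunk, chunk[1:])])
    return blocks

def decode_block(block):
    out, cur = [], 0
    for gap in block:
        cur += gap
        out.append(cur)
    return out

blocks = build_blocks([3, 7, 8, 20, 21, 40, 41, 45, 60])
print(blocks[1])               # [21, 19, 1, 4] - decodable in isolation
print(decode_block(blocks[1])) # [21, 40, 41, 45] - no earlier blocks needed
```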
To allow documents to be indexed slightly out of order, the API has been modified so that the text index must be informed when each document ID is complete, even if no data has been given to the text index for that document. This allows the text index to know exactly which document IDs are outstanding.
If the text index receives data for a document early, for example if it has completed up to document 10 and receives data for document 12 while document 11 is not yet complete, it will try to cache the terms for document 12 in memory until document 11 is complete. If it runs out of memory it will add a reference to document 11 to any term occurring in document 12, but it will mark the reference as invalid in the 'valid bitmap' in the block. If document 11 is later found to really contain that term, the reference will be marked as valid.
As most of the re-ordering will be completed in memory there should only be a few invalid entries wasting space in the postings file.
When a document ID needs to be removed from the text index during monitoring, the text index will add the document ID to a special postings chain that lists deleted documents. This is done because it is not possible to know which terms occurred in the document, and therefore real deletion would require reading the postings chains for all terms, which would be extremely time-consuming.
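The lazy deletion above can be sketched in a few lines: deleted document IDs go on their own chain, and query results are filtered against it instead of rewriting every term's postings chain. All names are illustrative.

```python
def query(lexicon, term, deleted_chain):
    # The special deleted-documents postings chain is consulted at query time.
    deleted = set(deleted_chain)
    return [d for d in lexicon.get(term, []) if d not in deleted]

lexicon = {"invoice": [2, 5, 9, 11]}
deleted_chain = [5, 11]        # docs removed without touching 'invoice' at all
print(query(lexicon, "invoice", deleted_chain))   # [2, 9]
```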
Claims
1. A computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
2. A computer system according to claim 1 wherein each lexicon entry comprises a first pointer to a first, document level, postings file, and a second pointer to a second, word level, postings file.
3. A computer system according to claim 1 or 2 comprising means for embedding skip-lists in said postings nodes.
4. A computer system according to claim 1, 2 or 3 wherein each postings file comprises a marker to indicate where each day begins.
5. A computer system for indexing a fragment of text from a document, the system comprising: means for dividing the text into individual terms; means for stemming the individual terms using a predetermined algorithm; means for comparing each stemmed term with the terms already stored in the lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
6. A computer system according to claim 5 adapted for use in indexing textual document properties, additionally comprising: means for additionally storing a property identification and positional information for the term, associated with the document pointer.
7. A computer system according to claim 5 or 6 adapted for use in indexing non- textual document properties further comprising: means for storing a property identification appended to the term.
8. A computer system according to any one of claims 1 to 7 comprising means for identifying predetermined types of term and excluding them from being stored in the lexicon database.
9. A computer system according to claim 8 wherein said predetermined types of term comprise numbers greater than four digits long.
10. A computer system according to claim 8 or 9 wherein said predetermined types of term comprise any continuous combination of letters or numbers.
11. A computer system according to claim 10 comprising means for dividing a combination of letters and numbers in such a way that each of the resultant component parts comprise only letters or only numbers.
12. A computer system according to claim 8 wherein said predetermined types of term comprise binary data for which byte values are over 127 or below 32.
13. A method for indexing stored electronic data comprising: storing an entry corresponding to each unique term to be indexed, as an entry in a lexicon implemented as a B-tree wherein each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
14. A method according to claim 13 wherein each lexicon entry comprises a first pointer to a first, document level, postings file, and a second pointer to a second, word level, postings file.
15. A method according to claim 13 or 14 wherein each lexicon entry comprises skip-lists.
16. A method according to claim 13, 14 or 15 wherein each postings file comprises a marker to indicate where each day begins.
17. A method according to any one of claims 13 to 16 for indexing a fragment of text from a document, the method comprising: dividing the text into individual terms; stemming the individual terms using a predetermined algorithm; comparing each stemmed term with the terms already stored in a lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
18. A method according to claim 17 adapted for use in indexing textual document properties, additionally comprising: means for additionally storing a property identification and positional information for the term, associated with the document pointer.
19. A method according to claim 13 or 14 adapted for use in indexing non- textual document properties further comprising: means for storing a property identification appended to the term.
20. A method according to any one of claims 13 to 19 comprising means for identifying predetermined types of term and excluding them from being stored in the lexicon database.
21. A method according to claim 20 wherein said predetermined types of term comprise numbers greater than four digits long.
22. A method according to claim 20 or 21 wherein said predetermined types of term comprise any continuous combination of letters or numbers.
23. A method according to claim 22 comprising means for dividing a combination of letters and numbers in such a way that each of the resultant component parts comprise only letters or only numbers.
24. A computer system for storing electronic data representing a file, the system comprising: a storage medium divided into equal sized memory blocks; means for writing the data into at least one data block sequentially such that the first free data block is filled with first received data and subsequently received data is written sequentially to subsequent free data blocks; means for storing a block list pointer to identify the or each data block in which the data for the file is stored; means for storing index pointers to identify the or each block list in at least one index block; and means for tracking the address of each free data block.
25. A computer system according to claim 24 wherein each block list pointer is stored as an offset from the beginning of the storage medium.
26. A computer system according to claim 24 or 25 wherein the block list pointers are stored in at least one block list block.
27. A computer system according to claim 24 or 25 wherein the block list pointers are stored in the last data block into which corresponding data is stored, if space allows.
28. A computer system according to any one of claims 24 to 27 comprising means for storing data representing the address of the or each index block in a header block.
29. A computer system according to any one of claims 24 to 28 comprising means for storing free list bitmap data in a free list bitmap block.
30. A computer system according to any one of claims 24 to 29 wherein fields within the blocks are stored in little endian format.
31. A computer system according to any one of claims 24 to 30 wherein the header block comprises data identifying the file type, a pointer to the first block list that identifies the free list bitmap block and a pointer to the root node of the index.
32. A computer system according to any one of claims 24 to 31 comprising a plurality of index blocks implemented as a B-tree.
33. A computer system according to any one of claims 24 to 32 wherein an index entry for a file, as stored in the index block, comprises data identifying the file, a pointer to the first block list for the file and a field specifying the length of the file.
34. A computer system according to any one of claims 24 to 33 wherein the means for storing is adapted to enter the block lists backwards into one or more memory blocks.
35. A method for storing electronic data comprising: dividing a storage medium into equal sized blocks; writing the data to be stored to the storage medium into at least one data block sequentially such that the first free data block is filled first and subsequent free data blocks are filled subsequently; storing a pointer to a block list to identify the or each data block in which the data for the file is stored; storing index pointers to identify the or each block list in at least one index block; tracking the address of each free data block.
36. A method according to claim 35 comprising storing each block list pointer as an offset from the beginning of the storage medium.
37. A method according to claim 35 or 36 wherein the fields within the blocks are stored in little endian format.
38. A method according to claim 35 or claim 37 comprising storing in a header block data identifying the file type, a pointer to the first block list that identifies the free list bitmap block and a pointer to the root node of the index.
39. A method according to any one of claims 35 to 38 wherein a plurality of index blocks are implemented as a B-tree.
40. A method according to any one of claims 35 to 39 wherein an index entry for a file, as stored in the index block, comprises data identifying the file, a pointer to the first block list for the file and a field specifying the length of the file.
41. A method according to any one of claims 35 to 40 wherein the means for storing block list pointers is adapted to enter them backwards.
42. A method for storing electronic data comprising storing electronic data on a storage medium in accordance with the method steps of any one of claims 35 to 41 and indexing the stored electronic data in accordance with the method steps of any one of claims 13 to 23.
43. A computer system adapted for performing the method of claim 42.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0614336.6 | 2006-07-19 | ||
| GB0614336A GB2440175A (en) | 2006-07-19 | 2006-07-19 | System for determining and storing indexing data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2008009995A2 true WO2008009995A2 (en) | 2008-01-24 |
| WO2008009995A3 WO2008009995A3 (en) | 2008-05-22 |
Family
ID=36998338
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2007/050424 WO2008009995A2 (en) | 2006-07-19 | 2007-07-19 | System and method for indexing stored electronic data using a b-tree |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2440175A (en) |
| WO (1) | WO2008009995A2 (en) |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5313604A (en) * | 1990-11-13 | 1994-05-17 | Hewlett-Packard Company | Method for locating compressed data in a computed memory back up device including steps of refining estimater location |
| CA2244626A1 (en) * | 1998-07-31 | 2000-01-31 | Kom Inc. | A hardware and software system |
| US6636942B2 (en) * | 2001-10-05 | 2003-10-21 | International Business Machines Corporation | Storage structure for storing formatted data on a random access medium |
-
2006
- 2006-07-19 GB GB0614336A patent/GB2440175A/en not_active Withdrawn
-
2007
- 2007-07-19 WO PCT/GB2007/050424 patent/WO2008009995A2/en active Application Filing
Non-Patent Citations (4)
| Title |
|---|
| BAEZA-YATES R ET AL: "MODERN INFORMATION RETRIEVAL, Chapter 8: Indexing and Searching" MODERN INFORMATION RETRIEVAL, HARLOW : ADDISON-WESLEY, GB, 1999, pages 191-228, XP002457291 ISBN: 0-201-39829-X * |
| GALLI R: "Journal File Systems in Linux" UPGRADE, [Online] vol. 2, no. 6, December 2001 (2001-12), XP002473445 Retrieved from the Internet: URL:http://www.upgrade-cepis.org/issues/2001/6/up2-6Galli.pdf> [retrieved on 2008-03-18] * |
| SANTOS FLORIDO J I: "Journal File Systems" LINUX GAZETTE, [Online] no. 55, July 2000 (2000-07), XP002473444 Retrieved from the Internet: URL:http://ldp.dvo.ru/LDP/LGNET/issue55/index.html> [retrieved on 2008-03-18] * |
| TANENBAUM A S: "Chapter 6.3: File System Implementation" MODERN OPERATING SYSTEMS, 21 February 2001 (2001-02-21), pages 399-414, XP002473443 Upper Saddle River, New Jersey, USA * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101692252A (en) * | 2009-08-31 | 2010-04-07 | 上海宝信软件股份有限公司 | Method for distributing and reclaiming idle blocks of file |
| US9824105B2 (en) | 2012-04-30 | 2017-11-21 | Hewlett Packard Enterprise Development Lp | Adaptive probabilistic indexing with skip lists |
| CN111859033A (en) * | 2020-07-22 | 2020-10-30 | 北京金山云网络技术有限公司 | IP library query method and device and IP library compression method and device |
| CN111859033B (en) * | 2020-07-22 | 2023-10-27 | 北京金山云网络技术有限公司 | IP library query method and device and IP library compression method and device |
| CN114816277A (en) * | 2022-06-30 | 2022-07-29 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing sequence of file data blocks |
| CN114816277B (en) * | 2022-06-30 | 2022-11-11 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing sequence of file data blocks |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2440175A (en) | 2008-01-23 |
| WO2008009995A3 (en) | 2008-05-22 |
| GB0614336D0 (en) | 2006-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP5996088B2 (en) | Cryptographic hash database | |
| CN100498740C (en) | Data cache processing method, system and data cache device | |
| US8959089B2 (en) | Data processing apparatus and method of processing data | |
| US8423519B2 (en) | Data reduction indexing | |
| US8255434B2 (en) | Method and apparatus for storing data with reduced redundancy using data clusters | |
| US8175875B1 (en) | Efficient indexing of documents with similar content | |
| EP2633413B1 (en) | Low ram space, high-throughput persistent key-value store using secondary memory | |
| AU2009257851B2 (en) | Search index format optimizations | |
| CN107368436B (en) | Flash memory cold and hot data separated storage method combined with address mapping table | |
| US20080133565A1 (en) | Device and method for constructing inverted indexes | |
| US6654868B2 (en) | Information storage and retrieval system | |
| EP1866774A1 (en) | Method for storing data with reduced redundancy using data clusters | |
| US20080082554A1 (en) | Systems and methods for providing a dynamic document index | |
| US20080306949A1 (en) | Inverted index processing | |
| US20080033909A1 (en) | Indexing | |
| WO2008009995A2 (en) | System and method for indexing stored electronic data using a b-tree | |
| US7627609B1 (en) | Index processing using transformed values | |
| US7698325B1 (en) | Index processing for legacy systems | |
| US20110004630A1 (en) | Method for reliable and efficient filesystem metadata conversion | |
| US8156126B2 (en) | Method for the allocation of data on physical media by a file system that eliminates duplicate data | |
| US7752211B1 (en) | Adaptive index processing | |
| KR100810666B1 (en) | Data index method for devices using flash memory as storage device | |
| CN102479213A (en) | Data buffering method and device | |
| Boucetta et al. | LHFTL: A novel SSD performance optimisation approach using the LH index | |
| WO2025034717A1 (en) | Methods and systems for metadata reduction in paged direct mapping key-value stores |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |
|
| NENP | Non-entry into the national phase in: |
Ref country code: DE |
|
| NENP | Non-entry into the national phase in: |
Ref country code: RU |
|
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: COMMUNICATION NOT DELIVERED. NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112 EPC (EPO FORM 1205A DATED 07.04.2009) |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |