WO2008009995A2 - System and method for indexing stored electronic data using a b-tree - Google Patents
- Publication number: WO2008009995A2 (PCT/GB2007/050424)
- Authority: WIPO (PCT)
- Prior art keywords: block, file, data, storing, computer system
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
Definitions
- Each entry in the lexicon will contain four pointers referencing two different chains in the postings file.
- If pointers are set to -1 (NILL) they are invalid. If they are set to -2 then there is no postings chain, but there is one document that contains the term, and that document's ID is stored in LastDocID.
- This optimisation is used to save disk space by not creating a postings block for the term until the term has occurred in at least two documents. This way if the term is random junk that will never occur again we haven't used up a whole 512 byte postings block to record it. Random junk can account for up to 50% of the entire lexicon. Unfortunately we can't just discard these terms as the rarer a term is, the more important it can be in a query.
- a two level cache of recently used B-tree blocks will be maintained to prevent a sudden run of new blocks from forcing the really commonly used blocks out of the cache as the new blocks will only be able to evict blocks from the warm cache.
- the lexicon will use a bitmap free list to track free blocks. This will require 32MB of space for every 1TB of file size (assuming 4KB blocks), which means the free list will easily fit in memory as the lexicon is only expected to grow to 1GB at most. To improve the efficiency of searching for a free block the text index will keep track of the first free block so that the search can start from the first known free block rather than from the beginning. This scheme is also used in the object file store, where the average number of comparisons when searching for a free block was 18, discounting situations with no free blocks, in which only one comparison is required.
- While running, the free list bitmap will be held entirely in memory.
- When shutting down the text index will store the free list bitmap in some of the free blocks and will update the FreeListHead field in the header to point to the first of these.
- Each of these blocks will start with a 40 bit pointer to the next block followed by the bitmap data.
- the text index will be able to calculate the length of the valid bitmap data by checking the file size of the lexicon.
- Figure 6 shows the format of each of the 512 byte blocks in the postings file.
- Elias delta coding: this is a variable length integer coding scheme that allows integers with lower values to consume fewer bits (the integer 1 requires only 1 bit of storage).
- the encoding consists of three parts: a unary length, a binary length, and a binary value.
- the unary length specifies the length of the binary length which in turn specifies the length of the binary value.
- a full description of the encoding scheme is outside the scope of this document and can be easily found on the internet or in reference books.
- the Document IDs are converted to differences before encoding to reduce the value of the integer being stored and thus increase the effectiveness of the Elias coding. This scheme means that if a term appears in every document it only takes 1 bit of storage per document.
- the non-textual property postings which do not need the property ID or position information will be stored in the main text postings format.
- the API has been modified so that the text index must be informed when each document ID is complete, even if no data has been given to the text index for that document. This allows the text index to know exactly which document IDs are outstanding.
- If the text index receives data for a document early (for example, if it has completed up to document 10 and receives data for document 12 while document 11 is not yet complete), it will try to cache the terms for document 12 in memory until document 11 is complete. If it runs out of memory it will add a reference to document 11 to any term occurring in document 12, but it will mark the reference as invalid in the 'valid bitmap' in the block. If document 11 is later found to really contain that term, the reference will be marked as valid.
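By way of illustration, the bitmap free list with a tracked first-free-block pointer described above might be sketched as follows. This is a minimal sketch, not the patent's implementation; the class and field names are assumptions introduced for clarity.

```python
class FreeListBitmap:
    """Sketch of a bitmap free list that remembers the first known free
    block, so a search can start there instead of at block zero."""

    def __init__(self, num_blocks: int):
        self.bits = [False] * num_blocks  # False = free, True = allocated
        self.first_free = 0               # index of the first known free block

    def allocate(self) -> int:
        # scan from the first known free block rather than the beginning
        for i in range(self.first_free, len(self.bits)):
            if not self.bits[i]:
                self.bits[i] = True
                self.first_free = i + 1   # next search starts after this block
                return i
        raise MemoryError("no free blocks")

    def free(self, i: int) -> None:
        self.bits[i] = False
        # keep first_free no higher than the lowest free block
        if i < self.first_free:
            self.first_free = i
```

A freed block below the tracked pointer pulls the pointer back down, so the next allocation reuses the earliest free block, as in the scheme described above.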
Abstract
A computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
Description
SYSTEM AND METHOD FOR INDEXING STORED ELECTRONIC DATA USING A B-TREE
The present invention relates to a system for storing electronic data and also to a system for creating an index for electronic data.
Known systems for storing electronic data, for example the systems used to store data on a hard disc of a computer, create files on the storage medium where there is a run of contiguous free space. They do this to minimise fragmentation of the data to avoid excessive movements of the read head back and forth across the disc so that read performance is maximised. However writing speed can be compromised by this system, particularly for writing to lots of small files at once because the write head must move backwards and forwards between the files.
US 2003/0070056 describes a storage structure for a random access medium having equal sized sequentially numbered logical sectors with separate levels for data, format and location and pointers. This is particularly advantageous for mapping data of different formats and the resulting structure is optimised for high performance random access to the data but is not optimised for speed of writing.
According to a first aspect of the present invention there is provided a computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
A corresponding method is also provided.
The lexicon contains an entry for each unique term or word and the lexicon is implemented as a B-tree, written to disc on shutdown of a computer system and read back into memory on startup of the computer. Each entry in the lexicon points to a chain of document numbers that contain that term. The document numbers are known as postings and are stored in a postings file. The postings file is divided into 512 byte blocks linked together to form the individual chains for each term in the lexicon. Each node in the postings file stores the difference between successive document numbers for a term, and this difference is stored in coded form, for example using Elias delta coding. Such encoding results in an index consuming an average of 6 bits per document term.
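The scheme just described, converting ascending document numbers to gaps and Elias delta coding each gap, might be sketched as follows. The function names are illustrative, not from the patent.

```python
def elias_delta_encode(n: int) -> str:
    """Elias delta code for a positive integer, returned as a bit string."""
    binary = bin(n)[2:]                # e.g. '101' for 5
    length_bits = bin(len(binary))[2:] # binary representation of bit length
    # unary prefix (zeros) giving the length of length_bits, then
    # length_bits itself, then the value with its leading 1 bit dropped
    return "0" * (len(length_bits) - 1) + length_bits + binary[1:]

def encode_postings(doc_ids: list[int]) -> str:
    """Convert ascending document IDs to gaps, then delta-code each gap."""
    bits, previous = "", 0
    for doc_id in doc_ids:
        bits += elias_delta_encode(doc_id - previous)
        previous = doc_id
    return bits
```

Because the integer 1 encodes as a single bit, a term appearing in every document (all gaps equal to 1) consumes one bit per document, consistent with the figure quoted above.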
The indexing system of the invention is also able to cope with indexing data other than text, such as file name, URL, comment field, and also IP addresses and file types without requiring large amounts of memory even though some such terms are often unique. The invention can also handle phrase queries and breaks a URL into its constituent parts. In addition it will allow random access searches, particularly for multi-term query processing.
According to a second aspect of the present invention there is provided a computer system for storing electronic data representing a file, the system comprising: a storage medium divided into equal sized memory blocks; means for writing the data into at least one data block sequentially such that the first free data block is filled with first received data and subsequently received data is written sequentially to subsequent free data blocks; means for storing a block list pointer to identify the or each data block in which the data for the file is stored, as respective offsets from the beginning of the storage medium; means for storing index pointers to identify the or each block list in at least one index block; and means for tracking the address of each free data block.
A corresponding method is also provided.
This system of writing the data sequentially substantially speeds up the process of writing data and allows extremely large volumes of data to be written quickly. This is particularly useful for archiving purposes or for saving network data traffic in real time. The reading speed of data stored in this sequential manner is not as high as it would be with normal systems because the sequential storage tends to lead to very fragmented files. However, for applications in which it is most important to save a large amount of data at a fast rate, particularly where the data may not need to be accessed frequently, the system of the invention is advantageous.
When text is stored in electronic form it is desirable to provide an index so that word searches of text documents can be performed by a user. Known such indexing systems tend to be slow, require a large amount of memory and be relatively inflexible. They usually require all the data to be indexed to be available at once and they create an index in a batch operation which can take a long time. Hence they are usually incapable of indexing the data in real time as it becomes available.
It is desirable to provide a fast, flexible indexing system which can also handle indexation of data other than text without overburdening the memory requirements.
The invention may be applied to a computer system for indexing a fragment of text from a document, the system comprising: means for dividing the text into individual terms; means for stemming the individual terms using a predetermined algorithm; means for comparing each stemmed term with the terms already stored in the lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
A corresponding method is also envisaged and computer programs and computer apparatus corresponding to all aspects of the invention are also provided.
A further aspect of the invention provides a method and a computer system combining the first and second aspects of the invention.
For a better understanding of the present invention, and to show how the same may be carried into effect, reference will now be made to the accompanying drawings, in which:
Figure 1 is a schematic illustration of the format of block types which may be used in one embodiment of the first aspect of the invention;
Figures 2 to 4 are flow diagrams illustrating operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4) according to the second aspect of the present invention;
Figure 5 illustrates a B-tree node in the lexicon for an embodiment of the second aspect of the invention.
Figure 6 illustrates a postings node for an embodiment of the second aspect of the invention.
In the object file store system according to the first aspect of the invention, the memory 1 is divided up into equal sized blocks 2 and in this example the fields within the blocks are in little endian format and all addresses are 64 bit file offsets.
A single header block 3 is provided comprising an 8 byte sequence identifying the file type, a pointer to the first block list that identifies the free list bitmap block, a pointer to the root node of the index, a dirty flag D and a node size.
A plurality of index blocks 4 is provided as a standard B-tree with all files identified by a single 64 bit integer. The number of blocks grows as more entries are added to the index. An index node comprises a field specifying the number of index entries in the b-tree node, an array of index entries (first entry index zero), an array of 64 bit pointers to other B-tree index nodes, and padding to round the block size up to the block size defined in the file header.
An index entry comprises an 8 byte identifier for the file, a block list address, i.e. a pointer to the first block list for the file, and a field specifying the length of the file.
A plurality of block list nodes are provided, and they are filled in backwards to allow each to share a disk block with a data block when space allows. Each block list node 5 comprises a length field defined by the number of block pointers stored in the block list, a pointer to a data block and a pointer to the next block list. The block lists can contain many pointers to data blocks - as many as will fit before the block is full.
There are a plurality of data blocks or block nodes 6 comprising simply data and padding.
A block list 5 and data block 6 can also be combined as a hybrid block. For example when a file requires only one block list it can be placed in the last data block if space allows so that small files need only one block for all their storage. If there is not enough room however then the block list will be stored in its own block.
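The write path of the file store described above might be sketched as follows: data always lands in the next free blocks in order, and a per-file block list records where it went. This is a toy sketch, not the on-disk format; the class name and in-memory representation are assumptions, and a real store would also record the file length (as the index entry does) so padding can be trimmed on read.

```python
BLOCK_SIZE = 512  # illustrative; the real store defines block size in its header

class SequentialStore:
    """Toy sketch of sequential block writing with per-file block lists."""

    def __init__(self):
        self.blocks: list[bytes] = []                # the medium, block by block
        self.block_lists: dict[int, list[int]] = {}  # file id -> block offsets

    def write(self, file_id: int, data: bytes) -> None:
        offsets = self.block_lists.setdefault(file_id, [])
        for start in range(0, len(data), BLOCK_SIZE):
            chunk = data[start:start + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\x00")
            offsets.append(len(self.blocks))  # the first free block is the next one
            self.blocks.append(chunk)

    def read(self, file_id: int) -> bytes:
        return b"".join(self.blocks[i] for i in self.block_lists[file_id])
```

Interleaving writes to two files shows the trade-off the description makes explicit: writing is a pure append, but each file's blocks end up scattered, so reads must follow the block list across the medium.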
There are also free list blocks 7. Free list bitmap data is only written to disk when the file store is closed. If the file store is not cleanly closed then the free list bitmap will be recreated by scanning the file.
The text index according to the second aspect of the invention uses 40 bit block numbers for the file offsets. Using 512 byte blocks this increases the maximum file size by 512 times but does not increase the overhead in the offset calculations too much. The lexicon may be compressed using known techniques.
An internal index for each node is provided allowing a binary search within the node.
The lexicon structure is implemented as a B-tree structure to allow it to overflow to disc if it grows too large. Preferably the caching strategy is arranged to ensure that junk entries fall out to disk to prevent excessive slowdown.
Each lexicon entry points to two different postings chains, one at a document level for the main text and one at word level for the properties. This allows the size of the index to be kept manageable.
The postings nodes are doubly linked to allow queries to return results in both ascending and descending order, and contain embedded skip-lists to allow for relatively quick random access. Although not true random access as for example using a full B-tree, it is more efficient from a processing point of view. Each postings chain may have additional markers showing where each day begins allowing for easy rejection of documents outside of a date range being queried.
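The doubly linked postings chain might be sketched as follows; the names are illustrative, and the embedded skip pointer is shown as a field only (a real chain would populate it every few nodes to accelerate random access).

```python
class PostingsNode:
    """Sketch of a doubly linked postings node; 'skip' may point several
    nodes ahead, forming the embedded skip list described above."""
    def __init__(self, doc_id: int):
        self.doc_id = doc_id
        self.prev = self.next = self.skip = None

def append(tail, doc_id):
    """Link a new node after the current tail; returns the new tail."""
    node = PostingsNode(doc_id)
    if tail is not None:
        tail.next, node.prev = node, tail
    return node

def walk(node, forward=True):
    """Yield document IDs in ascending (forward) or descending order,
    which is what the double linking makes possible."""
    while node is not None:
        yield node.doc_id
        node = node.next if forward else node.prev
```

Walking from the head yields ascending document IDs, and walking from the tail yields descending ones, matching the query behaviour described above.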
The text index can use plugins to support different stemming algorithms, including an option to disable stemming or to support multiple stemming algorithms in the same index.
Figures 2 to 4 illustrate the operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4).
Figure 2 illustrates the process for document body text processing. When the text index receives a fragment of a document body text, it will split the text fragment into individual terms, including reassembling terms that are spread across fragment boundaries and those that are spread across multiple fragments. The individual terms will then be stemmed using the currently active stemming algorithm and looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be stored in the postings chain for the term.
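The figure 2 pipeline (split, stem, lexicon lookup, postings update) might be sketched as follows. This is a minimal illustration: the lexicon is a plain dictionary rather than a B-tree, a lowercasing function stands in for the pluggable stemming algorithm, and reassembly of terms across fragment boundaries is omitted.

```python
import re

def index_body_fragment(text, lexicon, doc_id, stem=lambda t: t.lower()):
    """Split a body-text fragment into terms, stem each one, and record
    the document ID in the term's postings chain (figure 2, sketched)."""
    for term in re.findall(r"[A-Za-z0-9]+", text):
        postings = lexicon.setdefault(stem(term), [])  # add term if new
        if not postings or postings[-1] != doc_id:     # one entry per document
            postings.append(doc_id)
```

After indexing a fragment for document 7, each stemmed term in the fragment maps to a postings list containing 7; a later fragment for document 8 appends 8 to the lists of terms it shares.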
Figure 3 illustrates the process for textual document property processing. When the text index receives a textual document property, it will split the text into individual terms. The individual terms will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID, property ID, and the position of the term in the property will be stored in the postings chain for the term.
The textual document properties store position information to allow queries to match phrases, so if we index 'user@address.com' as {'user', 1}, {'address', 2}, {'com', 3} then a query for 'user@address.com' will match, but a query for 'address@user.com' will not.
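The position-based phrase matching just described might be sketched as follows; the representation of a posting as a (document ID, property ID, position) tuple is an illustrative assumption, not the patent's encoding.

```python
def phrase_match(postings, terms, doc_id, prop_id):
    """Sketch: check that the terms occur at consecutive positions within
    one property of one document. `postings` maps a term to a set of
    (doc_id, prop_id, position) tuples, as in figure 3's scheme."""
    for (d, p, pos) in postings.get(terms[0], set()):
        if d == doc_id and p == prop_id and all(
            (doc_id, prop_id, pos + k) in postings.get(t, set())
            for k, t in enumerate(terms[1:], start=1)
        ):
            return True
    return False
```

With 'user@address.com' indexed at positions 1, 2, 3, the in-order phrase matches but the reordered one does not, reproducing the example above.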
The property ID is also stored to differentiate between the different types of properties; this allows the user to restrict their search to a particular property type such as the subject of an e-mail or a URL. Unlike the processing for non-textual properties, the property type is not appended to the terms, as this would dramatically increase the size of the lexicon, roughly multiplying it by the number of property types. Instead, when querying, terms from other property types will be skipped over; the extra processing involved in ignoring hits should be negligible compared with the extra processing that would be required if the lexicon grew to accommodate unique terms per property type.
Figure 4 illustrates the process for other document property processing. When the text index receives a non-textual property, it will append the property ID to the term, making the term unique for the particular property type. The term will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be added to the postings chain for the term.
The non-textual document properties have the property type appended to the term (property value) to allow the user to restrict a query to a particular property type, such as an IP address or file type. Unlike the textual properties, this should not dramatically increase the size of the lexicon and will allow for very efficient searching.
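A minimal sketch of this handling: the property ID is appended to the term so that each (value, property-type) pair becomes a unique lexicon key. The specific property IDs and the separator byte are assumptions for illustration only.

```python
PROP_IP = 4        # hypothetical property ID for "IP address"
PROP_FILETYPE = 9  # hypothetical property ID for "file type"

def index_property(value, prop_id, doc_id, lexicon):
    # Append the property ID, making the term unique per property type.
    key = f"{value}\x00{prop_id}"
    lexicon.setdefault(key, []).append(doc_id)

lexicon = {}
index_property("10.0.0.1", PROP_IP, doc_id=3, lexicon=lexicon)
index_property("pdf", PROP_FILETYPE, doc_id=3, lexicon=lexicon)

# A query restricted to IP addresses only looks up keys with that property ID:
print(lexicon[f"10.0.0.1\x00{PROP_IP}"])   # [3]
```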
The lexicon and the postings are stored as two separate files, both to simplify management of space within the files and because the postings require much smaller blocks than the lexicon in order to avoid consuming too much space for terms that have only a few entries. If the index were implemented as a single file, we would have problems with fragmentation as the larger blocks were reused as postings blocks and then freed again in chunks too small to be reused for the lexicon.
The structure of each node in the lexicon file is illustrated in Figure 5. The nodes are preceded in the file by a header block.
There are some terms which will not be added to the lexicon, as they have the potential to dramatically increase its size while providing little or no benefit to the end user.
The text index will not index any number greater than 9,999 that is found in a document's main text or textual properties. If larger numbers were permitted, it would be possible to send a stream of ever-increasing numbers in a document, forcing the text index lexicon to grow without bound. By restricting numbers to a maximum of four digits, the user is still able to search on smaller numbers such as a year.
The text index will not index any term from a document's main text or textual properties that consists of both numbers and letters. Instead, the term will be divided into component parts which consist only of letters or only of numbers. For example, the term 'user1234' would be indexed as the two terms 'user' and '1234'.
When indexing a fragment of text from a document body, if the text index finds binary data (byte values over 127 or below 32) then it will stop indexing the document's body text, ignoring any further fragments. This is done to prevent binary data that is accidentally passed to the index from polluting the lexicon with junk entries. This should only occur if the monitoring system has mis-identified or failed to properly decode a document, and therefore should not impact the accuracy of the index.
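The three exclusion rules above can be combined into one tokenizer sketch: numbers above 9,999 are dropped, mixed letter/digit terms are split into letter-only and digit-only runs, and any byte outside the stated range aborts indexing of the body text. The function name and return conventions are illustrative, not from the specification.

```python
import re

def extract_terms(fragment):
    # Binary data (byte values over 127 or below 32) stops body-text indexing.
    if any(b < 32 or b > 127 for b in fragment.encode("latin-1", "replace")):
        return None
    terms = []
    # Split mixed terms into letter-only and digit-only runs.
    for run in re.findall(r"[A-Za-z]+|[0-9]+", fragment):
        if run.isdigit() and int(run) > 9999:
            continue          # unbounded numbers would grow the lexicon
        terms.append(run.lower())
    return terms

print(extract_terms("user1234 in 2006"))   # ['user', '1234', 'in', '2006']
print(extract_terms("port 123456"))        # ['port'] - number too large
print(extract_terms("abc\x01def"))         # None - binary data detected
```

Note that, as in the specification, the byte-range test is strict: even a newline (byte value 10) would count as binary data, so a real tokenizer would exempt whitespace.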
Each entry in the lexicon will contain four pointers referencing two different chains in the postings file. The first two point to the head and tail of the postings chain for document body text and the second two point to the head and tail of the postings chain for document properties. This is done to allow the format of postings for the main text and the properties to be different, allowing the properties to have a word level index and the main text to have a document level index.
If the pointers are set to -1 (NIL) they are invalid. If they are set to -2 then there is no postings chain, but there is one document that contains the term, and that document's ID is stored in LastDocID. This optimisation saves disk space by not creating a postings block for the term until the term has occurred in at least two documents. This way, if the term is random junk that will never occur again, we have not used up a whole 512 byte postings block to record it. Random junk can account for up to 50% of the entire lexicon. Unfortunately we cannot simply discard these terms, because the rarer a term is, the more important it can be in a query.
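The single-document optimisation above can be sketched as follows: a postings block is only allocated once a term has been seen in a second document. The LastDocID field name mirrors the description; the class and the use of a Python list in place of a 512-byte block are illustrative.

```python
NIL = -1          # pointer invalid
SINGLE_DOC = -2   # no postings chain yet; the one doc ID lives in LastDocID

class LexiconEntry:
    def __init__(self):
        self.head = NIL
        self.last_doc_id = None
        self.chain = None                 # stands in for a 512-byte postings block

    def add(self, doc_id):
        if self.head == NIL:              # first occurrence: no block allocated
            self.head = SINGLE_DOC
            self.last_doc_id = doc_id
        elif self.head == SINGLE_DOC:     # second occurrence: allocate the chain
            self.chain = [self.last_doc_id, doc_id]
            self.head = 0                 # would be the block's file offset
        else:
            self.chain.append(doc_id)

e = LexiconEntry()
e.add(5)
print(e.head, e.chain)    # -2 None: possibly-junk term, no block spent on it
e.add(9)
print(e.head, e.chain)    # 0 [5, 9]: real chain created on the second occurrence
```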
A two-level cache of recently used B-tree blocks will be maintained. This prevents a sudden run of new blocks from forcing the most commonly used blocks out of the cache, as the new blocks are only able to evict blocks from the warm cache.
The lexicon will use a bitmap free list to track free blocks. This requires 32MB of space for every 1TB of file size (assuming 4KB blocks), which means the free list will easily fit in memory as the lexicon is only expected to grow to 1GB at most. To improve the efficiency of searching for a free block, the text index will keep track of the first free block so that the search can start from the first known free block rather than from the beginning. This scheme is also used in the object file store, where the average number of comparisons when searching for a free block was 18, discounting situations where there were no free blocks, in which only one comparison is required.
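A minimal sketch of this scheme: one bit per block, plus a hint recording the first known free block so searches do not restart from block zero. The API is an assumption for illustration; for clarity the bitmap here spends a whole byte per block rather than a bit.

```python
class FreeList:
    def __init__(self, n_blocks):
        self.bits = bytearray(n_blocks)   # 0 = free, 1 = in use (one byte per
        self.first_free = 0               # block here; on disk it is one bit)

    def allocate(self):
        # Start scanning from the first known free block, not the beginning.
        i = self.first_free
        while i < len(self.bits) and self.bits[i]:
            i += 1
        if i == len(self.bits):
            raise RuntimeError("no free blocks")
        self.bits[i] = 1
        self.first_free = i + 1           # next search starts past this block
        return i

    def free(self, i):
        self.bits[i] = 0
        self.first_free = min(self.first_free, i)

fl = FreeList(8)
print([fl.allocate() for _ in range(3)])  # [0, 1, 2]
fl.free(1)
print(fl.allocate())                      # 1 - found via the first-free hint
```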
While running, the free list bitmap will be held entirely in memory. When shutting down, the text index will store the free list bitmap in some of the free blocks and will update the FreeListHead field in the header to point to the first of these. Each of these blocks will start with a 40-bit pointer to the next block, followed by the bitmap data. The text index will be able to calculate the length of the valid bitmap data by checking the file size of the lexicon.
Figure 6 shows the format of each of the 512-byte blocks in the postings file.
All of the integers in the data area of the postings node will be encoded using Elias delta coding. This is a variable-length integer coding scheme that allows integers with lower values to consume fewer bits (the integer 1 requires only 1 bit of storage). The encoding consists of three parts: a unary length, a binary length, and a binary value. The unary length specifies the length of the binary length, which in turn specifies the length of the binary value. A full description of the encoding scheme is outside the scope of this document and can easily be found on the internet or in reference books.
The document IDs are converted to differences before encoding to reduce the value of the integer being stored and thus increase the effectiveness of the Elias coding. This scheme means that if a term appears in every document, it only takes 1 bit of storage per document.
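The encoder and the difference transform can be sketched as below. Bits are kept as strings for clarity; a real index packs them into the block's data area. The function names are illustrative.

```python
def delta_encode(x):                       # Elias delta code of x >= 1
    n = x.bit_length() - 1                 # floor(log2 x)
    l = (n + 1).bit_length()
    out = "0" * (l - 1)                    # unary length of the binary length
    out += format(n + 1, "b")              # binary length (length of x in bits)
    out += format(x, "b")[1:]              # binary value, leading 1 implied
    return out

def encode_doc_ids(doc_ids):
    prev, bits = 0, ""
    for d in doc_ids:
        bits += delta_encode(d - prev)     # store differences, not raw IDs
        prev = d
    return bits

print(delta_encode(1))                     # '1' - a gap of 1 costs a single bit
print(delta_encode(9))                     # '00100001'
# A term in every one of 5 consecutive documents: all gaps are 1, 5 bits total.
print(encode_doc_ids([1, 2, 3, 4, 5]))     # '11111'
```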
For the textual property postings, which also store a property ID and position, these integers will likewise use the Elias delta encoding. The property IDs will be rearranged so that the most common properties, such as filename and URL, have the lowest values. The positions will already be small values.
The non-textual property postings, which do not need the property ID or position information, will be stored in the main text postings format.
The document ID differences will be restarted in each block so that it is not necessary to decode the whole chain to read a value, allowing random access via skip lists.
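The per-block restart can be sketched as follows: each block stores its first document ID absolutely, so a block reached via a skip list can be decoded in isolation. The block size of four entries and the function names are illustrative.

```python
def build_blocks(doc_ids, per_block=4):
    blocks = []
    for i in range(0, len(doc_ids), per_block):
        chunk = doc_ids[i:i + per_block]
        # First entry absolute; the rest as differences restarted in this block.
        blocks.append([chunk[0]] + [b - a for a, b in zip(chunk, chunk[1:])])
    return blocks

def decode_block(block):
    out, cur = [], 0
    for gap in block:
        cur += gap
        out.append(cur)
    return out

blocks = build_blocks([3, 7, 8, 20, 21, 40, 41, 45, 60])
print(blocks[1])               # [21, 19, 1, 4] - decodable in isolation
print(decode_block(blocks[1])) # [21, 40, 41, 45] - no earlier blocks needed
```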
To allow documents to be indexed slightly out of order, the API has been modified so that the text index must be informed when each document ID is complete, even if no data has been given to the text index for that document. This allows the text index to know exactly which document IDs are outstanding.
If the text index receives data for a document early, for example if it has completed up to document 10 and receives data for document 12 while document 11 is not yet complete, it will try to cache the terms for document 12 in memory until document 11 is complete. If it runs out of memory it will add a reference to document 11 to any term occurring in document 12, but it will mark the reference as invalid in the 'valid bitmap' in the block. If document 11 is later found to really contain that term, the reference will be marked as valid.
As most of the re-ordering will be completed in memory there should only be a few invalid entries wasting space in the postings file.
When a document ID needs to be removed from the text index during monitoring, the text index will add the document ID to a special postings chain that lists deleted documents. This is done because it is not possible to know which terms occurred in the document, and therefore real deletion would require reading the postings chains for all terms, which would be extremely time-consuming.
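The lazy deletion above can be sketched in a few lines: deleted document IDs go on their own chain, and query results are filtered against it instead of rewriting every term's postings chain. All names are illustrative.

```python
def query(lexicon, term, deleted_chain):
    # The special deleted-documents postings chain is consulted at query time.
    deleted = set(deleted_chain)
    return [d for d in lexicon.get(term, []) if d not in deleted]

lexicon = {"invoice": [2, 5, 9, 11]}
deleted_chain = [5, 11]        # docs removed without touching 'invoice' at all
print(query(lexicon, "invoice", deleted_chain))   # [2, 9]
```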
Claims
1. A computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
2. A computer system according to claim 1 wherein each lexicon entry comprises a first pointer to a first, document level, postings file, and a second pointer to a second, word level, postings file.
3. A computer system according to claim 1 or 2 comprising means for embedding skip-lists in said postings nodes.
4. A computer system according to claim 1, 2 or 3 wherein each postings file comprises a marker to indicate where each day begins.
5. A computer system for indexing a fragment of text from a document, the system comprising: means for dividing the text into individual terms; means for stemming the individual terms using a predetermined algorithm; means for comparing each stemmed term with the terms already stored in the lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
6. A computer system according to claim 5 adapted for use in indexing textual document properties, additionally comprising: means for additionally storing a property identification and positional information for the term, associated with the document pointer.
7. A computer system according to claim 5 or 6 adapted for use in indexing non- textual document properties further comprising: means for storing a property identification appended to the term.
8. A computer system according to any one of claims 1 to 7 comprising means for identifying predetermined types of term and excluding them from being stored in the lexicon database.
9. A computer system according to claim 8 wherein said predetermined types of term comprise numbers greater than four digits long.
10. A computer system according to claim 8 or 9 wherein said predetermined types of term comprise any continuous combination of letters or numbers.
11. A computer system according to claim 10 comprising means for dividing a combination of letters and numbers in such a way that each of the resultant component parts comprise only letters or only numbers.
12. A computer system according to claim 8 wherein said predetermined types of term comprise binary data for which byte values are over 127 or below 32.
13. A method for indexing stored electronic data comprising: storing an entry corresponding to each unique term to be indexed, as an entry in a lexicon implemented as a B-tree wherein each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
14. A method according to claim 13 wherein each lexicon entry comprises a first pointer to a first, document level, postings file, and a second pointer to a second, word level, postings file.
15. A method according to claim 13 or 14 wherein each lexicon entry comprises skip-lists.
16. A method according to claim 13, 14 or 15 wherein each postings file comprises a marker to indicate where each day begins.
17. A method according to any one of claims 13 to 16 for indexing a fragment of text from a document, the method comprising: dividing the text into individual terms; stemming the individual terms using a predetermined algorithm; comparing each stemmed term with the terms already stored in a lexicon database; and if the term is already stored in the lexicon database then storing a pointer to the document in the postings file; if the term is not already stored in the lexicon database then storing it and also storing a pointer to the document in the postings file.
18. A method according to claim 17 adapted for use in indexing textual document properties, additionally comprising: means for additionally storing a property identification and positional information for the term, associated with the document pointer.
19. A method according to claim 13 or 14 adapted for use in indexing non- textual document properties further comprising: means for storing a property identification appended to the term.
20. A method according to any one of claims 13 to 19 comprising means for identifying predetermined types of term and excluding them from being stored in the lexicon database.
21. A method according to claim 20 wherein said predetermined types of term comprise numbers greater than four digits long.
22. A method according to claim 20 or 21 wherein said predetermined types of term comprise any continuous combination of letters or numbers.
23. A method according to claim 22 comprising means for dividing a combination of letters and numbers in such a way that each of the resultant component parts comprise only letters or only numbers.
24. A computer system for storing electronic data representing a file, the system comprising: a storage medium divided into equal sized memory blocks; means for writing the data into at least one data block sequentially such that the first free data block is filled with first received data and subsequently received data is written sequentially to subsequent free data blocks; means for storing a block list pointer to identify the or each data block in which the data for the file is stored; means for storing index pointers to identify the or each block list in at least one index block; and means for tracking the address of each free data block.
25. A computer system according to claim 24 wherein each block list pointer is stored as an offset from the beginning of the storage medium.
26. A computer system according to claim 24 or 25 wherein the block list pointers are stored in at least one block list block.
27. A computer system according to claim 24 or 25 wherein the block list pointers are stored in the last data block into which corresponding data is stored, if space allows.
28. A computer system according to any one of claims 24 to 27 comprising means for storing data representing the address of the or each index block in a header block.
29. A computer system according to any one of claims 24 to 28 comprising means for storing free list bitmap data in a free list bitmap block.
30. A computer system according to any one of claims 24 to 29 wherein fields within the blocks are stored in little endian format.
31. A computer system according to any one of claims 24 to 30 wherein the header block comprises data identifying the file type, a pointer to the first block list that identifies the free list bitmap block and a pointer to the root node of the index.
32. A computer system according to any one of claims 24 to 31 comprising a plurality of index blocks implemented as a B-tree.
33. A computer system according to any one of claims 24 to 32 wherein an index entry for a file, as stored in the index block, comprises data identifying the file, a pointer to the first block list for the file and a field specifying the length of the file.
34. A computer system according to any one of claims 24 to 33 wherein the means for storing is adapted to enter the block lists backwards into one or more memory blocks.
35. A method for storing electronic data comprising: dividing a storage medium into equal sized blocks; writing the data to be stored to the storage medium into at least one data block sequentially such that the first free data block is filled first and subsequent free data blocks are filled subsequently; storing a pointer to a block list to identify the or each data block in which the data for the file is stored; storing index pointers to identify the or each block list in at least one index block; tracking the address of each free data block.
36. A method according to claim 35 comprising storing each block list pointer as an offset from the beginning of the storage medium.
37. A method according to claim 35 or 36 wherein the fields within the blocks are stored in little endian format.
38. A method according to claim 35 or claim 37 comprising storing in a header block data identifying the file type, a pointer to the first block list that identifies the free list bitmap block and a pointer to the root node of the index.
39. A method according to any one of claims 35 to 38 wherein a plurality of index blocks are implemented as a B-tree.
40. A method according to any one of claims 35 to 39 wherein an index entry for a file, as stored in the index block, comprises data identifying the file, a pointer to the first block list for the file and a field specifying the length of the file.
41. A method according to any one of claims 35 to 40 wherein the means for storing block list pointers is adapted to enter them backwards.
42. A method for storing electronic data comprising storing electronic data on a storage medium in accordance with the method steps of any one of claims 35 to 41 and indexing the stored electronic data in accordance with the method steps of any one of claims 13 to 23.
43. A computer system adapted for performing the method of claim 42.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0614336.6 | 2006-07-19 | ||
| GB0614336A GB2440175A (en) | 2006-07-19 | 2006-07-19 | System for determining and storing indexing data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2008009995A2 true WO2008009995A2 (en) | 2008-01-24 |
| WO2008009995A3 WO2008009995A3 (en) | 2008-05-22 |
Family
ID=36998338
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2007/050424 WO2008009995A2 (en) | 2006-07-19 | 2007-07-19 | System and method for indexing stored electronic data using a b-tree |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2440175A (en) |
| WO (1) | WO2008009995A2 (en) |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5313604A (en) * | 1990-11-13 | 1994-05-17 | Hewlett-Packard Company | Method for locating compressed data in a computed memory back up device including steps of refining estimater location |
| CA2244626A1 (en) * | 1998-07-31 | 2000-01-31 | Kom Inc. | A hardware and software system |
| US6636942B2 (en) * | 2001-10-05 | 2003-10-21 | International Business Machines Corporation | Storage structure for storing formatted data on a random access medium |
-
2006
- 2006-07-19 GB GB0614336A patent/GB2440175A/en not_active Withdrawn
-
2007
- 2007-07-19 WO PCT/GB2007/050424 patent/WO2008009995A2/en active Application Filing
Non-Patent Citations (4)
| Title |
|---|
| BAEZA-YATES R ET AL: "MODERN INFORMATION RETRIEVAL, Chapter 8: Indexing and Searching" MODERN INFORMATION RETRIEVAL, HARLOW : ADDISON-WESLEY, GB, 1999, pages 191-228, XP002457291 ISBN: 0-201-39829-X * |
| GALLI R: "Journal File Systems in Linux" UPGRADE, [Online] vol. 2, no. 6, December 2001 (2001-12), XP002473445 Retrieved from the Internet: URL:http://www.upgrade-cepis.org/issues/2001/6/up2-6Galli.pdf> [retrieved on 2008-03-18] * |
| SANTOS FLORIDO J I: "Journal File Systems" LINUX GAZETTE, [Online] no. 55, July 2000 (2000-07), XP002473444 Retrieved from the Internet: URL:http://ldp.dvo.ru/LDP/LGNET/issue55/index.html> [retrieved on 2008-03-18] * |
| TANENBAUM A S: "Chapter 6.3: File System Implementation" MODERN OPERATING SYSTEMS, 21 February 2001 (2001-02-21), pages 399-414, XP002473443 Upper Saddle River, New Jersey, USA * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101692252A (en) * | 2009-08-31 | 2010-04-07 | 上海宝信软件股份有限公司 | Method for distributing and reclaiming idle blocks of file |
| US9824105B2 (en) | 2012-04-30 | 2017-11-21 | Hewlett Packard Enterprise Development Lp | Adaptive probabilistic indexing with skip lists |
| CN111859033A (en) * | 2020-07-22 | 2020-10-30 | 北京金山云网络技术有限公司 | IP library query method and device and IP library compression method and device |
| CN111859033B (en) * | 2020-07-22 | 2023-10-27 | 北京金山云网络技术有限公司 | IP library query method and device and IP library compression method and device |
| CN114816277A (en) * | 2022-06-30 | 2022-07-29 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing sequence of file data blocks |
| CN114816277B (en) * | 2022-06-30 | 2022-11-11 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing sequence of file data blocks |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2440175A (en) | 2008-01-23 |
| WO2008009995A3 (en) | 2008-05-22 |
| GB0614336D0 (en) | 2006-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP5996088B2 (en) | Cryptographic hash database | |
| CN100498740C (en) | Data cache processing method, system and data cache device | |
| US8959089B2 (en) | Data processing apparatus and method of processing data | |
| US8423519B2 (en) | Data reduction indexing | |
| US8255434B2 (en) | Method and apparatus for storing data with reduced redundancy using data clusters | |
| US8175875B1 (en) | Efficient indexing of documents with similar content | |
| EP2633413B1 (en) | Low ram space, high-throughput persistent key-value store using secondary memory | |
| AU2009257851B2 (en) | Search index format optimizations | |
| CN107368436B (en) | Flash memory cold and hot data separated storage method combined with address mapping table | |
| US20080133565A1 (en) | Device and method for constructing inverted indexes | |
| US6654868B2 (en) | Information storage and retrieval system | |
| EP1866774A1 (en) | Method for storing data with reduced redundancy using data clusters | |
| US20080082554A1 (en) | Systems and methods for providing a dynamic document index | |
| US20080306949A1 (en) | Inverted index processing | |
| US20080033909A1 (en) | Indexing | |
| WO2008009995A2 (en) | System and method for indexing stored electronic data using a b-tree | |
| US7627609B1 (en) | Index processing using transformed values | |
| US7698325B1 (en) | Index processing for legacy systems | |
| US20110004630A1 (en) | Method for reliable and efficient filesystem metadata conversion | |
| US8156126B2 (en) | Method for the allocation of data on physical media by a file system that eliminates duplicate data | |
| US7752211B1 (en) | Adaptive index processing | |
| KR100810666B1 (en) | Data index method for devices using flash memory as storage device | |
| CN102479213A (en) | Data buffering method and device | |
| Boucetta et al. | LHFTL: A novel SSD performance optimisation approach using the LH index | |
| WO2025034717A1 (en) | Methods and systems for metadata reduction in paged direct mapping key-value stores |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |
|
| NENP | Non-entry into the national phase in: |
Ref country code: DE |
|
| NENP | Non-entry into the national phase in: |
Ref country code: RU |
|
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: COMMUNICATION NOT DELIVERED. NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112 EPC (EPO FORM 1205A DATED 07.04.2009) |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |