WO2008009995A2 - Système - Google Patents
- Publication number
- WO2008009995A2 (PCT/GB2007/050424)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- block
- file
- data
- storing
- computer system
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
Definitions
- The present invention relates to a system for storing electronic data and to a system for creating an index for electronic data.
- US 2003/0070056 describes a storage structure for a random access medium having equal sized sequentially numbered logical sectors with separate levels for data, format and location and pointers. This is particularly advantageous for mapping data of different formats and the resulting structure is optimised for high performance random access to the data but is not optimised for speed of writing.
- A computer system for indexing stored electronic data comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, wherein the lexicon database is implemented as a B-tree and each entry comprises a pointer to at least one postings file; wherein the postings file comprises a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
- A corresponding method is also provided.
- The lexicon contains an entry for each unique term or word and is implemented as a B-tree, written to disc on shutdown of the computer system and read back into memory on startup.
- Each entry in the lexicon points to a chain of document numbers that contain that term.
- The document numbers are known as postings and are stored in a postings file.
- The postings file is divided into 512 byte blocks linked together to form the individual chains for each term in the lexicon.
- Each node in the postings file stores the difference between successive document numbers for a term, encoded using, for example, Elias delta coding. Such encoding results in an index consuming an average of 6 bits per document term.
- The indexing system of the invention is also able to index data other than text, such as file names, URLs, comment fields, IP addresses and file types, without requiring large amounts of memory, even though such terms are often unique.
- The invention can also handle phrase queries and breaks a URL into its constituent parts. In addition it allows random access searches, particularly for multi-term query processing.
- A computer system for storing electronic data representing a file, the system comprising: a storage medium divided into equal sized memory blocks; means for writing the data into at least one data block sequentially such that the first free data block is filled with first received data and subsequently received data is written sequentially to subsequent free data blocks; means for storing a block list pointer to identify the or each data block in which the data for the file is stored, as respective offsets from the beginning of the storage medium; means for storing index pointers to identify the or each block list in at least one index block; and means for tracking the address of each free data block.
- A corresponding method is also provided.
- This system of writing the data sequentially substantially speeds up the process of writing data and allows extremely large volumes of data to be written quickly. This is particularly useful for archiving purposes or for saving network data traffic in real time.
- The reading speed of data stored in this sequential manner is lower than with conventional systems, because sequential storage tends to produce highly fragmented files.
- Nevertheless, for write-intensive applications such as archiving, the system of the invention is advantageous.
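The sequential writing scheme described above can be sketched as a minimal in-memory model. This is an illustration under stated assumptions, not the patented implementation: `SequentialStore`, the 512 byte block size, and the dict-based block lists are all names and simplifications introduced here.

```python
# Sketch of sequential block writing: incoming data fills the first free
# fixed-size block, then subsequent free blocks in order, and a per-file
# block list records which blocks hold the file.
BLOCK_SIZE = 512  # block size chosen for illustration


class SequentialStore:
    def __init__(self):
        self.blocks = []        # the storage medium, as a list of blocks
        self.free = []          # addresses of freed blocks, reused first
        self.block_lists = {}   # file id -> (block addresses, file length)

    def _next_free_block(self):
        # Reuse the first known free block if any, otherwise extend the medium.
        if self.free:
            return self.free.pop(0)
        self.blocks.append(bytearray(BLOCK_SIZE))
        return len(self.blocks) - 1

    def write_file(self, file_id, data):
        addresses = []
        for start in range(0, len(data), BLOCK_SIZE):
            addr = self._next_free_block()
            chunk = data[start:start + BLOCK_SIZE]
            self.blocks[addr][:len(chunk)] = chunk
            addresses.append(addr)
        self.block_lists[file_id] = (addresses, len(data))

    def read_file(self, file_id):
        # Reading must follow the block list, which for a fragmented file
        # means jumping between non-adjacent blocks (hence the slower reads).
        addresses, length = self.block_lists[file_id]
        joined = b"".join(bytes(self.blocks[a]) for a in addresses)
        return joined[:length]
```

Because blocks are handed out strictly in arrival order, two files written concurrently interleave their blocks, which is exactly the fragmentation the preceding paragraph notes.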
- When text is stored in electronic form it is desirable to provide an index so that word searches of text documents can be performed by a user.
- Such known indexing systems tend to be slow, require a large amount of memory and be relatively inflexible. They usually require all the data to be indexed to be available at once, and they create an index in a batch operation which can take a long time. Hence they are usually incapable of indexing the data in real time as it becomes available.
- The invention may be applied to a computer system for indexing a fragment of text from a document, the system comprising: means for dividing the text into individual terms; means for stemming the individual terms using a predetermined algorithm; means for comparing each stemmed term with the terms already stored in the lexicon database; and means for storing a pointer to the document in the postings file if the term is already stored in the lexicon database, or, if the term is not already stored, storing the term and also storing a pointer to the document in the postings file.
- A further aspect of the invention provides a method and a computer system combining the first and second aspects of the invention.
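The fragment-indexing steps above (divide, stem, compare with the lexicon, store a posting) can be sketched as follows. The plain dict stands in for the B-tree lexicon, and the suffix-stripping `stem` function is a deliberately trivial stand-in for "a predetermined algorithm"; both are assumptions made for illustration only.

```python
def stem(term):
    # Placeholder stemming (an assumption): lowercase and strip a plural 's'.
    term = term.lower()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def index_fragment(text, doc_id, lexicon):
    # Divide the text into individual terms, stem each one, and look it up
    # in the lexicon, adding a new entry if necessary; then record the
    # document pointer in that term's postings.
    for raw in text.split():
        term = stem(raw)
        postings = lexicon.setdefault(term, [])   # add new term if necessary
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)               # store pointer to document
    return lexicon
```

Because documents are indexed as their fragments arrive, the postings for each term grow incrementally, which is what allows the index to be built in real time rather than in a batch pass.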
- Figure 1 is a schematic illustration of the format of block types which may be used in one embodiment of the first aspect of the invention
- Figures 2 to 4 are flow diagrams illustrating operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4) according to the second aspect of the present invention;
- Figure 5 illustrates a B-tree node in the lexicon for an embodiment of the second aspect of the invention.
- Figure 6 illustrates a postings node for an embodiment of the second aspect of the invention.
- The memory 1 is divided into equal sized blocks 2. In this example the fields within the blocks are in little endian format and all addresses are 64 bit file offsets.
- A single header block 3 is provided, comprising an 8 byte sequence identifying the file type, a pointer to the first block list that identifies the free list bitmap block, a pointer to the root node of the index, a dirty flag D and a node size.
- A plurality of index blocks 4 is provided as a standard B-tree with all files identified by a single 64 bit integer. The number of blocks grows as more entries are added to the index.
- An index node comprises a field specifying the number of index entries in the b-tree node, an array of index entries (first entry index zero), an array of 64 bit pointers to other B-tree index nodes, and padding to round the block size up to the block size defined in the file header.
- An index entry comprises an 8 byte identifier for the file, a block list address, i.e. a pointer to the first block list for the file, and a field specifying the length of the file.
- A plurality of block list nodes is provided; they are filled in backwards to allow each to share a disk block with a data block when space allows.
- Each block list node 5 comprises a length field defined by the number of block pointers stored in the block list, a pointer to a data block and a pointer to the next block list.
- The block lists can contain many pointers to data blocks - as many as will fit before the block is full.
- A block list 5 and data block 6 can also be combined as a hybrid block. For example, when a file requires only one block list it can be placed in the last data block if space allows, so that small files need only one block for all their storage. If there is not enough room, however, the block list will be stored in its own block.
- Free list bitmap data is only written to disk when the file store is closed. If the file store is not cleanly closed then the free list bitmap will be recreated by scanning the file.
- The text index according to the second aspect of the invention uses 40 bit block numbers for the file offsets. Using 512 byte blocks, this increases the maximum file size by a factor of 512 without adding much overhead to the offset calculations.
- The lexicon may be compressed using known techniques.
- An internal index for each node is provided, allowing a binary search within the node.
- The lexicon is implemented as a B-tree structure to allow it to overflow to disc if it grows too large.
- The caching strategy is arranged to ensure that junk entries fall out to disk to prevent excessive slow down.
- Each lexicon entry points to two different postings chains, one at a document level for the main text and one at word level for the properties. This allows the size of the index to be kept manageable.
- The postings nodes are doubly linked to allow queries to return results in both ascending and descending order, and contain embedded skip-lists to allow relatively quick random access. Although this is not true random access, as would be provided by a full B-tree, it is more efficient from a processing point of view.
- Each postings chain may have additional markers showing where each day begins, allowing easy rejection of documents outside the date range being queried.
- The text index can use plugins to support different stemming algorithms, including options to disable stemming or to support multiple stemming algorithms in the same index.
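The doubly linked postings chain described above can be sketched as follows. The skip-list pointers and per-day markers are omitted for brevity, and the class and method names are illustrative inventions, not the patent's own terminology.

```python
# Sketch of doubly linked postings nodes: each node points to both its
# predecessor and successor, so a query can walk the chain in either
# ascending or descending document order.
class PostingsNode:
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.prev = None
        self.next = None


class PostingsChain:
    def __init__(self):
        self.head = None  # lowest document ID
        self.tail = None  # highest document ID

    def append(self, doc_id):
        # Document IDs arrive in ascending order, so new postings go on the tail.
        node = PostingsNode(doc_id)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def ascending(self):
        node = self.head
        while node:
            yield node.doc_id
            node = node.next

    def descending(self):
        node = self.tail
        while node:
            yield node.doc_id
            node = node.prev
```

A singly linked chain would force a descending query to buffer the whole chain and reverse it; the backward pointers trade a little storage for direct traversal in either direction.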
- Figures 2 to 4 illustrate the operations that will be performed by the text index when it receives three different kinds of input: document body text fragment (figure 2), textual document property (figure 3), and other document property (figure 4).
- Figure 2 illustrates the process for document body text processing.
- When the text index receives a fragment of document body text, it will split the fragment into individual terms, including reassembling terms that are spread across fragment boundaries and those that are spread across multiple fragments. The individual terms will then be stemmed using the currently active stemming algorithm and looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be stored in the postings chain for the term.
- Figure 3 illustrates the process for textual document property processing.
- When the text index receives a textual document property, it will split the text into individual terms. The individual terms will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID, property ID, and the position of the term in the property will be stored in the postings chain for the term.
- The textual document properties store position information to allow queries to match phrases, so if we index 'user@address.com' as {'user', 1}, {'address', 2}, {'com', 3} then a query for 'user@address.com' will match, but a query for 'address@user.com' will not.
- The property ID is also stored to differentiate between the different types of properties; this allows the user to restrict their search to a particular property type, such as the subject of an e-mail or a URL.
- The property type is not appended to the terms, as this would dramatically increase the size of the lexicon, roughly multiplying it by the number of property types. Instead, when querying, terms from other property types will be skipped over; the extra processing involved in ignoring hits should be negligible compared with the extra processing that would be required if the lexicon grew to accommodate unique terms per property type.
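The position-aware matching behind the 'user@address.com' example can be sketched as follows. Splitting on '@' and '.' is a simplification of breaking an address into constituent parts, and both function names are introduced here for illustration.

```python
# Sketch of position-aware postings for textual properties: each posting is
# (doc_id, prop_id, position), so a phrase matches only when its terms occur
# at consecutive positions within the same document and property.
def index_property(text, doc_id, prop_id, lexicon):
    parts = text.replace("@", " ").replace(".", " ").split()
    for pos, term in enumerate(parts, start=1):
        lexicon.setdefault(term, []).append((doc_id, prop_id, pos))


def phrase_match(terms, doc_id, prop_id, lexicon):
    # Positions of the first term anchor the phrase; each later term must
    # appear exactly `offset` positions further on in the same property.
    def positions(term):
        return {p for d, pr, p in lexicon.get(term, ())
                if d == doc_id and pr == prop_id}

    starts = positions(terms[0])
    for offset, term in enumerate(terms[1:], start=1):
        starts = {s for s in starts if s + offset in positions(term)}
    return bool(starts)
```

With 'user@address.com' indexed at positions 1, 2, 3, the phrase ('user', 'address', 'com') matches but the reordered ('address', 'user', 'com') does not, exactly as the example above requires.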
- Figure 4 illustrates the process for other document property processing.
- When the text index receives a non-textual property, it will append the property ID to the term, making the term unique for the particular property type. The term will then be looked up in the lexicon, adding a new term to the lexicon if necessary. Finally the document ID will be added to the postings chain for the term.
- The lexicon and the postings are stored as two separate files, both to simplify management of space within the files and because the postings require much smaller blocks than the lexicon, to avoid consuming too much space for terms that have only a few entries. If the index were implemented as one file there would be problems with fragmentation, as the larger blocks would be reused as postings blocks and then freed again, but not in large enough chunks to be used for the lexicon.
- The format of each node in the lexicon file is illustrated in figure 5. The nodes are preceded by a header block.
- The text index will not index any number greater than 9,999 that is found in a document's main text or textual properties. If other numbers were permitted then it would be possible to send a stream of ever increasing numbers in a document, forcing the text index lexicon to grow. By restricting the numbers to a maximum of four digits the user is still able to search on smaller numbers such as a year.
- The text index will not index any term from a document's main text or textual properties that consists of both numbers and letters. Instead the term will be divided into component parts which consist only of letters or only of numbers. For example the term 'user1234' would be indexed as the two terms 'user' and '1234'.
- When indexing a fragment of text from a document body, if the text index finds binary data (byte values over 127 or below 32) then it will stop indexing the document's body text, ignoring any further fragments. This is done to prevent any binary data that is accidentally passed to the index from polluting the lexicon with junk entries. This should only occur if the monitoring system has mis-identified or failed to properly decode a document, and therefore should not impact the accuracy of the index.
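The three tokenisation rules above can be sketched together. Tolerating whitespace bytes (tab, newline, carriage return) is a practical assumption added here, since the document's "below 32" test would otherwise flag ordinary line breaks as binary data; the function name and the regular expression are likewise illustrative.

```python
# Sketch of the tokenisation rules: mixed alphanumeric terms are split into
# letter-only and digit-only parts, numbers above 9,999 are dropped, and
# binary bytes abort indexing of the body text.
import re

MAX_NUMBER = 9999


def tokenize(fragment):
    data = fragment.encode("latin-1", "replace")
    # Byte values over 127 or below 32 signal binary data; whitespace
    # (tab, LF, CR) is tolerated here as a practical assumption.
    if any(b > 127 or (b < 32 and b not in (9, 10, 13)) for b in data):
        return None  # caller stops indexing this document's body text
    terms = []
    # Split mixed terms into runs of letters only or digits only,
    # e.g. 'user1234' -> 'user', '1234'.
    for part in re.findall(r"[A-Za-z]+|[0-9]+", fragment):
        if part.isdigit() and int(part) > MAX_NUMBER:
            continue  # refuse numbers longer than four digits
        terms.append(part.lower())
    return terms
```

The cap on numbers bounds the lexicon against an attacker streaming ever-larger integers, while still allowing searches on four-digit values such as years.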
- Each entry in the lexicon will contain four pointers referencing two different chains in the postings file.
- If pointers are set to -1 (NIL) they are invalid. If they are set to -2 then there is no postings chain, but there is one document that contains the term, and that document's ID is stored in LastDocID.
- This optimisation is used to save disk space by not creating a postings block for the term until the term has occurred in at least two documents. This way if the term is random junk that will never occur again we haven't used up a whole 512 byte postings block to record it. Random junk can account for up to 50% of the entire lexicon. Unfortunately we can't just discard these terms as the rarer a term is, the more important it can be in a query.
- A two level cache of recently used B-tree blocks will be maintained to prevent a sudden run of new blocks from forcing the most commonly used blocks out of the cache, as the new blocks will only be able to evict blocks from the warm cache.
- The lexicon will use a bitmap free list to track free blocks. This will require 32MB of space for every 1TB of file size (assuming 4KB blocks), which means the free list will easily fit in memory, as the lexicon is only expected to grow to 1GB at most. To improve the efficiency of searching for a free block, the text index will keep track of the first free block so that the search can start from the first known free block rather than from the beginning. This scheme is also used in the object file store, where the average number of comparisons when searching for a free block was 18, discounting situations where there were no free blocks, where only one comparison is required.
- While running, the free list bitmap will be held entirely in memory.
- When shutting down, the text index will store the free list bitmap in some of the free blocks and will update the FreeListHead field in the header to point to the first of these.
- Each of these blocks will start with a 40 bit pointer to the next block followed by the bitmap data.
- The text index will be able to calculate the length of the valid bitmap data by checking the file size of the lexicon.
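The bitmap free list with first-free tracking can be sketched as follows. The sizing arithmetic above checks out: 1TB of 4KB blocks is 2^28 blocks, and at one bit per block that is 2^28 bits, or 32MB. The class below is a name introduced for illustration, using a list of booleans in place of a packed bitmap.

```python
# Sketch of a bitmap free list: one bit per block, plus a cached index of
# the first known free block so allocation searches do not restart from
# the beginning of the bitmap every time.
class FreeList:
    def __init__(self, n_blocks):
        self.used = [False] * n_blocks
        self.first_free = 0  # index of the first known free block

    def allocate(self):
        # Scan forward from the first known free block rather than block 0.
        for i in range(self.first_free, len(self.used)):
            if not self.used[i]:
                self.used[i] = True
                self.first_free = i + 1
                return i
        raise MemoryError("no free blocks")

    def free(self, i):
        self.used[i] = False
        if i < self.first_free:
            self.first_free = i  # a freed block may now be the first free one
```

Keeping `first_free` current is what reduces the average search to the small comparison counts reported above, since long runs of allocated blocks at the start of the file are skipped entirely.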
- Figure 6 shows the format of each of the 512 byte blocks in the postings file.
- Elias delta coding: this is a variable length integer coding scheme that allows integers with lower values to consume fewer bits (the integer 1 requires only 1 bit of storage).
- The encoding consists of three parts: a unary length, a binary length, and a binary value.
- The unary length specifies the length of the binary length, which in turn specifies the length of the binary value.
- A full description of the encoding scheme is outside the scope of this document and can easily be found on the internet or in reference books.
- The document IDs are converted to differences before encoding to reduce the value of the integer being stored and thus increase the effectiveness of the Elias coding. This scheme means that if a term appears in every document it takes only 1 bit of storage per document.
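The Elias delta scheme applied to document-ID gaps can be sketched directly from the three-part description above; the bit strings are represented as Python strings of '0'/'1' characters for readability, which is an illustrative choice rather than the on-disk format.

```python
# Sketch of Elias delta coding over document-ID gaps: each ID is stored as
# its difference from the previous ID, and each gap is written as a unary
# length, a binary length, and a binary value.
def elias_delta_encode(n):
    assert n >= 1
    binary = bin(n)[2:]              # binary value, with its leading 1
    length = len(binary)             # number of bits in n
    length_bits = bin(length)[2:]    # the binary length
    # Unary length (zeros) before the binary length, then the binary value
    # with its leading 1 dropped (it is implied by the lengths).
    return "0" * (len(length_bits) - 1) + length_bits + binary[1:]


def encode_postings(doc_ids):
    out, prev = "", 0
    for doc_id in doc_ids:
        out += elias_delta_encode(doc_id - prev)  # store the gap, not the ID
        prev = doc_id
    return out
```

A term appearing in every document yields a gap of 1 each time, and `elias_delta_encode(1)` is the single bit "1", matching the 1-bit-per-document figure above.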
- The non-textual property postings, which do not need the property ID or position information, will be stored in the main text postings format.
- The API has been modified so that the text index must be informed when each document ID is complete, even if no data has been given to the text index for that document. This allows the text index to know exactly which document IDs are outstanding.
- If the text index receives data for a document early, for example if it has completed up to document 10 and receives data for document 12 while document 11 is not yet complete, it will try to cache the terms for document 12 in memory until document 11 is complete. If it runs out of memory it will add a reference to document 11 to any term occurring in document 12, but will mark the reference as invalid in the 'valid bitmap' in the block. If document 11 is later found to really contain that term, the reference will be marked as valid.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a computer system for indexing stored electronic data, comprising: a lexicon database for storing an entry corresponding to each unique term to be indexed, the lexicon database being implemented as a B-tree and each entry comprising a pointer to at least one postings file; the postings file comprising a plurality of postings nodes which are doubly linked to allow queries to return results in both ascending and descending order.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0614336.6 | 2006-07-19 | ||
GB0614336A GB2440175A (en) | 2006-07-19 | 2006-07-19 | System for determining and storing indexing data |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2008009995A2 true WO2008009995A2 (fr) | 2008-01-24 |
WO2008009995A3 WO2008009995A3 (fr) | 2008-05-22 |
Family
ID=36998338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2007/050424 WO2008009995A2 (fr) | 2006-07-19 | 2007-07-19 | Système |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2440175A (fr) |
WO (1) | WO2008009995A2 (fr) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5313604A (en) * | 1990-11-13 | 1994-05-17 | Hewlett-Packard Company | Method for locating compressed data in a computed memory back up device including steps of refining estimated location |
CA2244626A1 (fr) * | 1998-07-31 | 2000-01-31 | Kom Inc. | Systeme materiel et logiciel |
US6636942B2 (en) * | 2001-10-05 | 2003-10-21 | International Business Machines Corporation | Storage structure for storing formatted data on a random access medium |
-
2006
- 2006-07-19 GB GB0614336A patent/GB2440175A/en not_active Withdrawn
-
2007
- 2007-07-19 WO PCT/GB2007/050424 patent/WO2008009995A2/fr active Application Filing
Non-Patent Citations (4)
Title |
---|
BAEZA-YATES R ET AL: "MODERN INFORMATION RETRIEVAL, Chapter 8: Indexing and Searching" MODERN INFORMATION RETRIEVAL, HARLOW : ADDISON-WESLEY, GB, 1999, pages 191-228, XP002457291 ISBN: 0-201-39829-X * |
GALLI R: "Journal File Systems in Linux" UPGRADE, [Online] vol. 2, no. 6, December 2001 (2001-12), XP002473445 Retrieved from the Internet: URL:http://www.upgrade-cepis.org/issues/2001/6/up2-6Galli.pdf> [retrieved on 2008-03-18] * |
SANTOS FLORIDO J I: "Journal File Systems" LINUX GAZETTE, [Online] no. 55, July 2000 (2000-07), XP002473444 Retrieved from the Internet: URL:http://ldp.dvo.ru/LDP/LGNET/issue55/index.html> [retrieved on 2008-03-18] * |
TANENBAUM A S: "Chapter 6.3: File System Implementation" MODERN OPERATING SYSTEMS, 21 February 2001 (2001-02-21), pages 399-414, XP002473443 Upper Saddle River, New Jersey, USA * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101692252A (zh) * | 2009-08-31 | 2010-04-07 | 上海宝信软件股份有限公司 | Method for allocating and recycling free file blocks |
US9824105B2 (en) | 2012-04-30 | 2017-11-21 | Hewlett Packard Enterprise Development Lp | Adaptive probabilistic indexing with skip lists |
CN111859033A (zh) * | 2020-07-22 | 2020-10-30 | 北京金山云网络技术有限公司 | IP library query method and device, and IP library compression method and device |
CN111859033B (zh) * | 2020-07-22 | 2023-10-27 | 北京金山云网络技术有限公司 | IP library query method and device, and IP library compression method and device |
CN114816277A (zh) * | 2022-06-30 | 2022-07-29 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing the ordering of file data blocks |
CN114816277B (zh) * | 2022-06-30 | 2022-11-11 | 广东睿江云计算股份有限公司 | Control method and control system for guaranteeing the ordering of file data blocks |
Also Published As
Publication number | Publication date |
---|---|
GB0614336D0 (en) | 2006-08-30 |
GB2440175A (en) | 2008-01-23 |
WO2008009995A3 (fr) | 2008-05-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5996088B2 (ja) | Cryptographic hash database | |
CN100498740C (zh) | Data cache processing method, system and data cache device | |
US8959089B2 (en) | Data processing apparatus and method of processing data | |
US8423519B2 (en) | Data reduction indexing | |
US8255434B2 (en) | Method and apparatus for storing data with reduced redundancy using data clusters | |
US8175875B1 (en) | Efficient indexing of documents with similar content | |
EP2633413B1 (fr) | Stockage de clés et de valeurs permanent, à haut débit, à faible encombrement de ram et effectué à l'aide d'une mémoire secondaire | |
AU2009257851B2 (en) | Search index format optimizations | |
CN107368436B (zh) | 一种联合地址映射表的闪存冷热数据分离存储方法 | |
US20080133565A1 (en) | Device and method for constructing inverted indexes | |
US6654868B2 (en) | Information storage and retrieval system | |
EP1866774A1 (fr) | Procede de stockage de donnees avec une moindre redondance au moyen de groupements de donnees | |
US20080082554A1 (en) | Systems and methods for providing a dynamic document index | |
US20080033909A1 (en) | Indexing | |
WO2008009995A2 (fr) | Système | |
CN100498794C (zh) | Method and device for index compression | |
US7627609B1 (en) | Index processing using transformed values | |
US7698325B1 (en) | Index processing for legacy systems | |
US20110004630A1 (en) | Method for reliable and efficient filesystem metadata conversion | |
US8156126B2 (en) | Method for the allocation of data on physical media by a file system that eliminates duplicate data | |
US7752211B1 (en) | Adaptive index processing | |
KR100810666B1 (ko) | Data indexing method for devices using flash memory as a storage device | |
CN102479213A (zh) | Data buffering method and device | |
Boucetta et al. | LHFTL: A novel SSD performance optimisation approach using the LH index | |
WO2025034717A1 (fr) | Procédés et systèmes de réduction de métadonnées dans des magasins de valeurs clés de mappage direct par radiomessagerie |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |
|
NENP | Non-entry into the national phase in: |
Ref country code: DE |
|
NENP | Non-entry into the national phase in: |
Ref country code: RU |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: COMMUNICATION NOT DELIVERED. NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112 EPC (EPO FORM 1205A DATED 07.04.2009) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07789336 Country of ref document: EP Kind code of ref document: A2 |