CN101620607A - Full-text retrieval method and full-text retrieval system - Google Patents
- Publication number
- CN101620607A
- Authority
- CN
- China
- Prior art keywords
- file
- entry
- byte
- text
- full
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a full-text retrieval method comprising the steps of: receiving a retrieval expression comprising retrieval words, and performing word segmentation on the retrieval words; looking up, in a VIF file of a full-text database and according to the entries obtained from the word segmentation, the position information of each entry in a BIF file of the full-text database; looking up, in the BIF file and according to the position information, the record information corresponding to the entry; and extracting, according to the record information, the corresponding data from a BAF file of the full-text database as the retrieval result. The invention also relates to a full-text retrieval system comprising a BAF file storage module, a BIF file storage module, a VIF file storage module, a word segmentation module and a retrieval module. The invention uses inverted-file indexing in the BIF file; because each hash code is unique, the retrieved information can be found quickly, and indexing the BIF file by continuous symbols (n-grams) ensures high recall and precision during retrieval.
    Description
Technical field
      The present invention relates to information retrieval technology, and in particular to a full-text retrieval method and system.
    Background technology
      In recent years, with the development of network applications, the volume of information on networks has expanded sharply. Finding the information a user needs in massive amounts of data has become a vital problem. Traditional manual methods cannot cope with data volumes this large, so information search technology based on computer and network technology has become an important tool for solving this problem.
      In current database technology, the databases used for data retrieval are mainly relational databases. With a relational database, the data holder must first store the existing data in the database, then write an SQL query expression against particular fields, and retrieve data by means of that expression. When handling the data, a relational database must organize related data into multiple fixed-size tables, and a retrieval must join every table that the information involves through the SQL query expression. Retrieval from a relational database is slow and the accuracy of its query results is relatively poor; in particular, with data volumes above the million-record level, the retrieval response time is unsatisfactory. In multi-field retrieval, neither speed nor precision meets users' needs. In addition, relational databases are poor at retrieving unformatted data.
    Summary of the invention
      The object of the present invention is to provide a full-text retrieval method and system that let a user find the required information in massive amounts of data quickly and accurately.
      To achieve the above object, the present invention provides a full-text retrieval method comprising the following steps:
      receiving a retrieval expression comprising retrieval words, and performing word segmentation on the retrieval words;
      looking up, in the Virtual Information File (VIF) of a full-text database and according to the entries obtained from the word segmentation, the position information of each entry in the Byte Index File (BIF) of the full-text database;
      looking up, in the BIF file and according to the position information, the record information corresponding to the entry;
      extracting, according to the record information, the corresponding data from the Byte Accelerated File (BAF) of the full-text database as the retrieval result.
      Further, before the retrieval operation, the method also comprises a process of building the full-text database, which specifically comprises the following steps:
      creating a database script according to the search fields determined for the full-text database, the type of each search field, and the fast-search attribute of each search field;
      generating the BAF file according to the database script, storing all or part of each input record in a data block of the BAF file, and then saving, in the record number index, a pointer to the data block holding all or part of the record;
      generating the BIF file according to the BAF file, applying a hash function to each entry in the search fields of the BAF file that have the fast-search attribute so as to generate a hash code uniquely corresponding to the entry, and then saving in the BIF file the hash code, the entry address corresponding to the hash code, and the position information of the entry;
      generating the VIF file according to the BIF file, decomposing each entry into a plurality of n-grams, and then saving in the VIF file the hash code of the entry, the position information of the entry in the BIF file, and the offset within the entry of the first character of each n-gram.
      Further, when records are saved in the BAF file, the method also comprises: saving, in the control block of the BAF file, at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing.
      Further, when records are saved in the BAF file, the method also comprises: recording the unused data space in the free list of the BAF file.
      Further, saving the position information of the entry in the BIF file specifically comprises:
      for an entry whose field type is phrase, integer, number, date or time, saving the field and subfield of the entry;
      for an entry whose field type is text, saving the field, paragraph and sentence of the entry and the position of the entry within the sentence.
      To achieve the above object, the present invention also provides a full-text retrieval system, comprising:
      a BAF file storage module, for storing each record of the full-text database;
      a BIF file storage module, for storing the position information of the entries in the search fields of the BAF file that have the fast-search attribute;
      a VIF file storage module, for storing the position information of the entries in the BIF file and the offsets, relative to each entry, of the n-grams decomposed from that entry;
      a word segmentation module, for receiving a retrieval expression comprising retrieval words and performing word segmentation on the retrieval words;
      a retrieval module, for looking up, in the VIF file of the full-text database and according to the entries obtained from the word segmentation, the position information of each entry in the BIF file of the full-text database, then looking up in the BIF file, according to the position information, the record information corresponding to the entry, and finally extracting, according to the record information, the corresponding data from the BAF file of the full-text database as the retrieval result.
      Further, the BAF file storage module specifically comprises: data blocks for storing all or part of each record, and a record number index for storing pointers to the data blocks holding all or part of each record;
      the BIF file storage module specifically comprises: the hash code uniquely corresponding to each entry in the search fields of the BAF file that have the fast-search attribute, the entry address corresponding to the hash code, and the position information of the entry;
      the VIF file storage module specifically comprises: the hash code of each entry, the position information of the entry in the BIF file, and the offset within the entry of the first character of each n-gram.
      Further, the BAF file storage module also comprises:
      a control block, for saving at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing;
      a free list, for recording the unused data space.
      Further, the system also comprises a hash code generator, for applying a hash function to each entry in the search fields of the BAF file storage module that have the fast-search attribute, so as to generate a hash code uniquely corresponding to the entry.
      Based on the above technical solution, the database searched by the present invention is a full-text database comprising a BAF file, a BIF file and a VIF file, and inverted-file indexing is used in the BIF file. Because each hash code is unique, the retrieved information can be found quickly. Indexing the BIF file by n-grams ensures high recall and precision during retrieval.
    Description of drawings
      The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
      Fig. 1 is a flow diagram of one embodiment of the full-text retrieval method of the present invention.
      Fig. 2 is a flow diagram of the full-text database building process in another embodiment of the full-text retrieval method of the present invention.
      Fig. 3 is a structural diagram of the BAF file of the present invention.
      Fig. 4 is a diagram of the entry hashing process for the BIF file of the present invention.
      Fig. 5 is a structural diagram of one embodiment of the full-text retrieval system of the present invention.
      Fig. 6 is a structural diagram of another embodiment of the full-text retrieval system of the present invention.
    Embodiment
      The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
      The organization of the full-text database of the present invention differs from that of existing relational databases. The database usually consists of only three independent files with the suffixes .BAF, .BIF and .VIF, namely the BAF file, the BIF file and the VIF file. The BAF file is the base file and holds the original information of the database records. The BIF file, also called the inverted file, holds the entries used for fast lookup of the original information. The VIF file is an inverted index of the entries stored in the BIF file; it is mainly used for fuzzy retrieval and ensures recall and precision.
      The structure of these three files depends mainly on the information content and is independent of the storage medium of the environment in which the system runs; that is, the three database files are independent of the operating system platform. For example, three files built on a Sun Solaris platform can be moved directly to an OSF/1 Tru64 system and used there, without any format conversion such as exporting and re-importing the data.
      Next, these three files are each described in more detail.
      Fig. 3 shows the structure of the BAF file of the present invention. The BAF file consists of a control block, a record number index (directory track), a free list and data blocks, of which the data blocks and the record number index are the basic elements of the BAF file.
      The control block holds at least one of: the highest record number, the highest record number indexed so far, and the record numbers of records modified or added since the last indexing.
      The record number index consists of multiple items. Each item contains pointers to the latest version of each record, together with the date and time at which that version was created or modified, and each item can cover several records. The record number index generally holds the latest version of a record, but when an index item is formed a pointer to the previous version is also produced, the previous version in turn has a pointer to the version before it, and so on. In other words, every version of a record, from its creation through each modification, is retained; when the database is indexed, only the latest version of each record is kept and all other versions are deleted.
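      The version chain described above can be sketched as a singly linked list per record. This is only an illustration under stated assumptions: the class names `Version` and `RecordNumberIndex`, the method names, and the use of plain strings for data and timestamps are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    data: str                              # record content (hypothetical: a plain string)
    stamp: str                             # date/time this version was created or modified
    previous: Optional["Version"] = None   # pointer to the preceding version


class RecordNumberIndex:
    """Maps each record number to its latest version (illustrative only)."""

    def __init__(self) -> None:
        self.latest: Dict[int, Version] = {}

    def save(self, record_no: int, data: str, stamp: str) -> None:
        # The new version points back to whatever was the latest version before.
        self.latest[record_no] = Version(data, stamp, self.latest.get(record_no))

    def history(self, record_no: int) -> List[str]:
        # Walk the chain from the latest version back to the oldest.
        out, version = [], self.latest.get(record_no)
        while version is not None:
            out.append(version.data)
            version = version.previous
        return out

    def on_indexing(self) -> None:
        # When the database is indexed, only the latest version is kept.
        for version in self.latest.values():
            version.previous = None
```

      Saving a record twice and then calling `history` returns both versions, newest first; after `on_indexing` only the newest remains.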
      The free list records how much unused data space remains in each data block of the BAF file.
      A data block is where the data itself is stored. Each data block contiguously holds all or part of a record and may also contain free space.
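      As a rough illustration of how a free list could guide where record data is placed, the sketch below tracks the unused bytes per data block and uses a first-fit policy. The patent does not specify an allocation policy; first-fit, the class name `FreeList` and the fixed block size are assumptions made for this example.

```python
from typing import List, Optional


class FreeList:
    """Tracks unused space per data block (illustrative, first-fit assumed)."""

    def __init__(self, block_size: int, n_blocks: int) -> None:
        # free[i] is the number of unused bytes left in data block i.
        self.free: List[int] = [block_size] * n_blocks

    def allocate(self, size: int) -> Optional[int]:
        # Return the index of the first block with enough room, or None.
        for block_no, available in enumerate(self.free):
            if available >= size:
                self.free[block_no] -= size
                return block_no
        return None
```

      A record larger than any single block's free space would have to be split across blocks, which matches the statement that a data block may hold only part of a record; that splitting is not shown here.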
      The BIF file, also called the inverted file, stores the position information of the entries (terms) in the fast-search fields of the BAF file. During inverted indexing, the system scans the records of the BAF file one by one. For records not yet inverted, it extracts every entry in the fast-search fields; for records modified since the last inversion, it extracts the entries of the affected fields. Each extracted entry is passed through a hash function, which produces a unique 32-bit hash code and an n-bit entry block number. The entry block number gives the starting position of the entry in the file, i.e. the entry address; in effect it is a pointer to the entry block, and the entry block stores the hash code of the entry. See Fig. 4.
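      The hashing step can be sketched as follows. The patent does not name the hash function or say how the n-bit entry block number is derived, so both choices here are assumptions: `zlib.crc32` stands in for the 32-bit hash (it does not guarantee uniqueness), the block number is taken from the low n bits of the hash code, and entries are case-folded before hashing.

```python
import zlib
from typing import Tuple


def hash_entry(term: str, n_bits: int = 16) -> Tuple[int, int]:
    """Return (32-bit hash code, n-bit entry block number) for an entry.

    Assumptions: CRC-32 as a stand-in hash, case folding, and the block
    number taken from the low n_bits of the hash code.
    """
    code = zlib.crc32(term.upper().encode("utf-8")) & 0xFFFFFFFF
    block_no = code & ((1 << n_bits) - 1)
    return code, block_no
```

      A real implementation would need a collision strategy to honor the uniqueness the text requires; none is shown here.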
      The content of an entry's position information depends closely on the data type of the field containing the entry. For a text (TEXT) field, the position information comprises the record, field, paragraph, sentence and the position of the word within the sentence. For phrase fields, other non-text fields (for example integer, number, date or time) and string (String) fields, which have no sentence or paragraph structure, the position information comprises only the field and subfield.
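      For a text field, the position tuple could be produced along the following lines. This is a minimal sketch: splitting paragraphs on blank lines, sentences on `.`, `!` or `?`, and words on whitespace are assumptions that suit English sample text only, and the function name `text_positions` is hypothetical.

```python
import re
from typing import List, Tuple

Position = Tuple[int, int, int, int, int]  # record, field, paragraph, sentence, word


def text_positions(record_no: int, field_no: int, text: str) -> List[Tuple[str, Position]]:
    """List (entry, position) pairs for a text field, all numbers 1-based."""
    positions = []
    for p, paragraph in enumerate(text.split("\n\n"), start=1):
        # Drop empty fragments left behind by the trailing sentence terminator.
        sentences = [s for s in re.split(r"[.!?]", paragraph) if s.strip()]
        for s, sentence in enumerate(sentences, start=1):
            for w, word in enumerate(sentence.split(), start=1):
                positions.append((word.lower(), (record_no, field_no, p, s, w)))
    return positions
```

      For a phrase, integer, number, date or time field the tuple would shrink to field and subfield, as stated above.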
      For a phrase field, besides computing a 32-bit hash code for each entry, the system can also hash a whole subfield (at most 256 characters long) as a single entry to obtain a 32-bit hash code. Searching for a whole subfield is therefore as fast as searching for a single word. The structure of the BIF file for a text or phrase field is shown in the table below.
      
      The position information of each occurrence of an entry is denoted by "---"; it comprises the record number, field number, subfield or paragraph number, sentence number, word position, and so on.
      The VIF file is an index file for the fields that appear in the BIF file (for example text fields and phrase fields). When building the VIF file, the system decomposes each entry into n-grams of one or more characters and then treats these n-grams as entries to build the VIF file. Without the VIF file, retrieval would suffer from problems such as low precision, which seriously degrades the quality of the retrieved information, especially for functions such as fuzzy retrieval.
      For example, if a record in the BAF file contains "apple", then "apple" is stored in the BIF file as an entry. Decomposing it yields the following 12 combinations of 3, 2 and 1 characters:
      | Entry (Term) | Trigram | Bigram | Single character |
      | apple | APP | AP | A |
      | | PPL | PP | P |
      | | PLE | PL | P |
      | | | LE | L |
      | | | | E |
      Scanning the BIF file yields the locating information for "apple": the position of the whole entry in the BIF file, and the offset of the first character of each n-gram in the VIF, i.e. the distance from the first character of the n-gram to the first character of the entry itself. Taking the table above as an example, the offset of the trigram PPL is 2, because it starts at the second character of the entry "apple". The structure of the VIF file is shown in the table below.
      
      Each position information block comprises the hash code of a source entry and the offset of the n-gram within the source entry (the content of the block is denoted by "---"). The VIF file is used for truncation matching or pattern matching in database lookups, and for fuzzy retrieval. The VIF file generally does not contain data from number, date or time fields.
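      The decomposition in the "apple" example, and one way the offsets could support truncation matching, can be sketched as follows. The offsets are 1-based to match the PPL example. Storing the source entry itself rather than its hash code, and the function names, are simplifications assumed for readability; the prefix lookup shown only handles prefixes of up to three characters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ngrams_with_offsets(term: str, max_n: int = 3) -> List[Tuple[str, int]]:
    """Decompose a term into 3-, 2- and 1-character n-grams with 1-based offsets."""
    term = term.upper()
    grams = []
    for n in range(max_n, 0, -1):
        for start in range(len(term) - n + 1):
            grams.append((term[start:start + n], start + 1))
    return grams


def build_vif(terms: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each n-gram to the (source entry, offset) pairs in which it occurs."""
    vif = defaultdict(list)
    for term in terms:
        for gram, offset in ngrams_with_offsets(term):
            vif[gram].append((term, offset))
    return vif


def prefix_candidates(vif: Dict[str, List[Tuple[str, int]]], prefix: str) -> List[str]:
    """Right-truncation match: entries in which the n-gram occurs at offset 1."""
    return sorted({term for term, offset in vif.get(prefix.upper(), []) if offset == 1})
```

      With the entries "apple", "apply" and "maple", looking up the prefix "ap" returns only "apple" and "apply": "maple" also contains AP, but at offset 2.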
      For Chinese words, the VIF file can use a thesaurus to retrieve information indirectly beyond the given word. For example, the thesaurus can hold five kinds of terms (controlled terms, synonyms, broader terms, narrower terms and related terms), each playing a different role. A controlled term is the retrieval word obtained from the retrieval expression at search time; a synonym is a term with the same meaning as the controlled term; a broader term denotes the concept one level above the controlled term; a narrower term denotes a concept one level below the controlled term; and a related term is a term at the same conceptual level as the controlled term and associated with it.
      These thesauri can be stored in the VIF file. If a user needs to find content related to the synonyms, broader terms, narrower terms or related terms of a word, the retrieval system can query this indirect information.
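      Query expansion through such a thesaurus might look like the sketch below. The in-memory dictionary layout, the sample terms and the function name `expand` are all hypothetical; the patent only states that the thesaurus can be stored in the VIF file, not how.

```python
from typing import Dict, List, Sequence

# Hypothetical thesaurus: controlled term -> relation -> associated terms.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "car": {
        "synonym": ["automobile"],
        "broader": ["vehicle"],
        "narrower": ["sedan"],
        "related": ["truck"],
    },
}


def expand(term: str, relations: Sequence[str] = ("synonym",)) -> List[str]:
    """Return the controlled term followed by the terms for the requested relations."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded
```

      Each expanded term would then be looked up like any other retrieval word.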
      Next, the data types used as examples in the present invention are briefly described. The data of the full-text database consists of records, and records consist of fields. The fields of a record can hold several kinds of information, for example: Phrase (such as a name, title, address, keyword or telephone number, with a length of at most 255 characters); Integer (-2,147,483,647 to +2,147,483,647); Number (integers and real numbers from -1.7E+37 to 1.7E+37); Text (free text made up of paragraphs, sentences and words); Date (in the form YYYY-MM-DD, YY.MM.DD or YY-MM); Time (in the form HH:MM:SS, HH-MM-SS, HH:MM or HH-MM); String (holding binary information).
      The number of fields in a record is unlimited. Except for the Text and String data types, fields of other types can be divided into subfields. A Text field can be divided into paragraphs, paragraphs into sentences, and sentences into words. The number of subfields in a field is unlimited; for example, a personal-name field in a record can hold hundreds or thousands of names (with no upper limit), each name occupying its own subfield. The number of paragraphs in a field, of sentences in a paragraph and of words in a sentence is likewise unlimited, which makes it convenient to store long official and legal documents. The database automatically numbers all these subfields, paragraphs, sentences and words so that they can be located precisely by position when needed. The database is thus like a filing cabinet with an unlimited number of drawers: each drawer represents a record, and each drawer is bottomless, because there is no limit on how much content (fields, subfields, paragraphs, sentences, words) it can hold.
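      The record, field and subfield hierarchy with automatic numbering can be modeled roughly as below. The class names `Field` and `Record`, the attribute `kind` and the method names are hypothetical and chosen for illustration only; only non-text fields with subfields are shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    kind: str                                    # e.g. "phrase", "integer", "date"
    subfields: List[str] = field(default_factory=list)

    def add(self, value: str) -> int:
        # Append a subfield and return its automatically assigned 1-based number.
        self.subfields.append(value)
        return len(self.subfields)


@dataclass
class Record:
    number: int
    fields: Dict[str, Field] = field(default_factory=dict)

    def locate(self, name: str, subfield_no: int) -> str:
        # Look up a subfield by field name and 1-based subfield number.
        return self.fields[name].subfields[subfield_no - 1]
```

      A personal-name field can keep receiving names through `add`, each in its own numbered subfield, with no fixed upper bound.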
      Field information of the above types (for example phrase, integer, number, text, date and time) can all be inverted for fast lookup.
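      Fig. 1 is a flow diagram of one embodiment of the full-text retrieval method of the present invention. The flow comprises the steps set out in the summary: receiving a retrieval expression and segmenting its retrieval words; looking up in the VIF file the position of each resulting entry in the BIF file; looking up in the BIF file the record information corresponding to the entry; and extracting the corresponding data from the BAF file as the retrieval result.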
      Because a hash function is used in the BIF file to generate a unique hash code for each entry, the retrieved information can be located easily, and quickly even in a database of more than a million records. Tests show that, processing the same data on the same machine, the retrieval response is more than an order of magnitude (10 times) faster than a general-purpose relational database. With on the order of 1,000,000 records, a retrieval takes only a fraction of a second on present-day computers. In addition, indexing the BIF file by n-grams ensures high recall and precision during retrieval.
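      The three-file lookup chain (VIF to BIF to BAF) can be illustrated with a toy in-memory version. Everything concrete here is an assumption: the files are modeled as a dict and a list, `zlib.crc32` stands in for the hash function, word segmentation is plain whitespace splitting, and multiple retrieval words are combined with AND.

```python
import zlib
from typing import Dict, List, Tuple


def h(term: str) -> int:
    # Stand-in 32-bit hash code (CRC-32 of the lower-cased entry).
    return zlib.crc32(term.lower().encode("utf-8")) & 0xFFFFFFFF


def build(baf: Dict[int, str]) -> Tuple[List[Tuple[int, List[int]]], Dict[int, int]]:
    """Build a toy BIF (list of (hash code, record numbers)) and VIF (hash code -> BIF position)."""
    bif: List[Tuple[int, List[int]]] = []
    vif: Dict[int, int] = {}
    for record_no in sorted(baf):
        for word in baf[record_no].split():
            code = h(word)
            if code not in vif:
                vif[code] = len(bif)          # position of the entry in the BIF
                bif.append((code, []))
            bif[vif[code]][1].append(record_no)
    return bif, vif


def search(expression: str, baf, bif, vif) -> List[str]:
    hits = None
    for term in expression.split():           # step 1: word segmentation (whitespace)
        position = vif.get(h(term))           # step 2: VIF gives the position in the BIF
        records = set(bif[position][1]) if position is not None else set()  # step 3: BIF lookup
        hits = records if hits is None else hits & records
    return [baf[r] for r in sorted(hits or [])]  # step 4: fetch data from the BAF
```

      With records `{1: "apple pie recipe", 2: "green apple", 3: "pear tart"}`, the expression "green apple" matches only record 2 under the AND assumption.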
      Fig. 2 is a flow diagram of the full-text database building process in another embodiment of the full-text retrieval method of the present invention. The flow comprises the following steps:
      Step 201: create a database script according to the search fields determined for the full-text database, the type of each search field, and the fast-search attribute of each search field;
      Step 202: generate the BAF file according to the database script, and store all or part of each input record in a data block of the BAF file;
      Step 203: save, in the record number index of the BAF file, a pointer to the data block holding all or part of the record;
      Step 204: generate the BIF file according to the BAF file, and apply a hash function to each entry in the search fields of the BAF file that have the fast-search attribute, generating a hash code uniquely corresponding to the entry;
      Step 205: save in the BIF file the hash code, the entry address corresponding to the hash code, and the position information of the entry;
      Step 206: generate the VIF file according to the BIF file, and decompose each entry into a plurality of n-grams;
      Step 207: save in the VIF file the hash code of the entry, the position information of the entry in the BIF file, and the offset within the entry of the first character of each n-gram.
      In step 203 of the above technical solution, at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing can also be saved in the control block of the BAF file. In addition, the unused data space can be recorded in the free list of the BAF file.
      In step 205, for an entry whose field type is phrase, integer, number, date or time, usually only the field and subfield of the entry are saved. For an entry whose field type is text, the field, paragraph and sentence of the entry and the position of the entry within the sentence are saved.
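      Step 201 and the field selection in step 204 can be sketched as follows. The patent does not give the script syntax, so the list-of-dicts layout, the key names and the function `indexable_entries` are hypothetical; whitespace splitting again stands in for real word segmentation.

```python
from typing import Dict, List, Tuple

# Hypothetical database script: one definition per search field.
SCRIPT = [
    {"name": "title", "type": "phrase", "fast_search": True},
    {"name": "body", "type": "text", "fast_search": True},
    {"name": "notes", "type": "text", "fast_search": False},
]


def indexable_entries(script: List[dict], record_no: int,
                      record: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Collect (entry, record number, field number) for fast-search fields only."""
    entries = []
    for field_no, spec in enumerate(script, start=1):
        if not spec["fast_search"]:
            continue                      # fields without the attribute are not inverted
        for word in record.get(spec["name"], "").split():
            entries.append((word.lower(), record_no, field_no))
    return entries
```

      The entries returned here are what would be hashed and written to the BIF file in steps 204 and 205; content in the `notes` field is stored in the BAF file but never indexed.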
      Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
      Fig. 5 shows the structure of one embodiment of the full-text retrieval system of the present invention. This system embodiment comprises: a BAF file storage module 1, a BIF file storage module 2, a VIF file storage module 3, a word segmentation module 4 and a retrieval module 5.
      The BAF file storage module 1 stores each record of the full-text database. It can specifically comprise: data blocks for storing all or part of each record, and a record number index for storing pointers to the data blocks holding all or part of each record. Optionally, the BAF file storage module 1 can also comprise: a control block, for saving at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing; and a free list, for recording the unused data space.
      The BIF file storage module 2 stores the position information of the entries in the search fields of the BAF file that have the fast-search attribute. It can specifically comprise: the hash code uniquely corresponding to each entry in the search fields of the BAF file that have the fast-search attribute, the entry address corresponding to the hash code, and the position information of the entry.
      The VIF file storage module 3 stores the position information of the entries in the BIF file and the offsets, relative to each entry, of the n-grams decomposed from that entry. It can specifically comprise: the hash code of each entry, the position information of the entry in the BIF file, and the offset within the entry of the first character of each n-gram.
      The BAF file storage module 1, the BIF file storage module 2 and the VIF file storage module 3 together form the data part of the full-text database. The word segmentation module 4 receives a retrieval expression comprising retrieval words and performs word segmentation on the retrieval words. The retrieval module 5 looks up, in the VIF file of the full-text database and according to the entries obtained from the word segmentation, the position information of each entry in the BIF file of the full-text database, then looks up in the BIF file, according to the position information, the record information corresponding to the entry, and finally extracts, according to the record information, the corresponding data from the BAF file of the full-text database as the retrieval result.
      Fig. 6 shows the structure of another embodiment of the full-text retrieval system of the present invention. Compared with the previous embodiment, this system embodiment can also comprise a hash code generator 6, for applying a hash function to each entry in the search fields of the BAF file storage module 1 that have the fast-search attribute, so as to generate a hash code uniquely corresponding to the entry.
      The system embodiments of the present invention use a hash function in the BIF file to generate a unique hash code for each entry, so the retrieved information can be located easily when searching massive amounts of data; and indexing the BIF file by n-grams ensures high recall and precision during retrieval.
      Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the specific embodiments can still be modified, or some of their technical features replaced by equivalents, without departing from the spirit of the technical solution of the invention; all such changes fall within the scope of the technical solution claimed by the invention.
    Claims (9)
1. A full-text retrieval method, comprising the following steps:
      receiving a retrieval expression comprising retrieval words, and performing word segmentation on the retrieval words;
      looking up, in the virtual information file of a full-text database and according to the entries obtained from the word segmentation, the position information of each entry in the byte index file of the full-text database;
      looking up, in the byte index file and according to the position information, the record information corresponding to the entry;
      extracting, according to the record information, the corresponding data from the byte accelerated file of the full-text database as the retrieval result.
    2. The full-text retrieval method according to claim 1, further comprising, before the retrieval operation, a process of establishing the full-text database, the process comprising the following steps:
      creating a database script according to the search fields determined for the full-text database, the type of each search field, and the fast-lookup attribute of each search field;
      generating a byte acceleration file according to the database script, storing all or part of each input record in data blocks of the byte acceleration file, and then saving, in a record number index, pointers to the data blocks that hold all or part of the record;
      generating a byte index file according to the byte acceleration file, applying a hash function to the entries in those search fields of the byte acceleration file that have the fast-lookup attribute so as to generate a hash code uniquely corresponding to each entry, and then saving the hash code, the entry address corresponding to the hash code, and the position information of the entry in the byte index file;
      generating a virtual information file according to the byte index file, decomposing each entry into a plurality of linking symbols, and then saving, in the virtual information file, the hash code of the entry, the position information of the entry in the byte index file, and the offset of the first character of each linking symbol within the entry.
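The database-building process of claim 2 can be sketched as a three-stage pipeline. Several points are assumptions, because the claim does not fix them: the database script is reduced to a list of fast-lookup field names (`fast_fields`), entries are obtained by a whitespace split, and a linking symbol is taken to be an overlapping two-character substring of the entry. All function and key names are hypothetical.

```python
import hashlib


def hash_code(entry):
    """Assumed hash scheme: first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(entry.encode("utf-8")).digest()[:8], "big")


def linking_symbols(entry, width=2):
    """Assumed decomposition: overlapping substrings paired with their offsets."""
    if len(entry) <= width:
        return [(entry, 0)]
    return [(entry[i:i + width], i) for i in range(len(entry) - width + 1)]


def build_database(records, fast_fields):
    # Byte acceleration file: data blocks plus a record number index of block pointers.
    baf = {"blocks": [], "record_index": {}}
    for record_no, record in enumerate(records, start=1):
        baf["record_index"][record_no] = len(baf["blocks"])
        baf["blocks"].append(record)

    # Byte index file: one row per distinct entry found in the fast-lookup fields.
    bif, row_of = [], {}
    for record_no, record in enumerate(records, start=1):
        for field in fast_fields:
            for entry in str(record.get(field, "")).lower().split():
                if entry not in row_of:
                    row_of[entry] = len(bif)
                    bif.append({"hash": hash_code(entry), "entry": entry, "positions": []})
                bif[row_of[entry]]["positions"].append((record_no, field))

    # Virtual information file: hash code, row in the BIF, linking-symbol offsets.
    vif = [
        {"hash": row["hash"], "bif_position": i,
         "offsets": [offset for _, offset in linking_symbols(row["entry"])]}
        for i, row in enumerate(bif)
    ]
    return baf, bif, vif
```

Each stage is derived only from the previous one, mirroring the order in the claim: records go into the byte acceleration file, entries and their hash codes into the byte index file, and the per-entry lookup data into the virtual information file.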
    3. The full-text retrieval method according to claim 2, wherein saving records in the byte acceleration file further comprises: saving, in a control block of the byte acceleration file, at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing.
    4. The full-text retrieval method according to claim 2, wherein saving records in the byte acceleration file further comprises: recording unused data space in a free-space table of the byte acceleration file.
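The control block of claim 3 and the free-space table of claim 4 can be sketched together in Python. The layout is an assumption: records are byte strings in one growing buffer, holes are reused first-fit, and deleting a record is treated as a modification. The class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlBlock:
    highest_record_no: int = 0
    highest_indexed_record_no: int = 0
    changed_since_index: set = field(default_factory=set)


@dataclass
class ByteAccelerationFile:
    control: ControlBlock = field(default_factory=ControlBlock)
    free_space: list = field(default_factory=list)     # (offset, length) of unused space
    data: bytearray = field(default_factory=bytearray)
    record_index: dict = field(default_factory=dict)   # record_no -> (offset, length)

    def _allocate(self, size):
        # First fit: reuse a hole that is large enough, otherwise grow the file.
        for i, (offset, length) in enumerate(self.free_space):
            if length >= size:
                if length == size:
                    del self.free_space[i]
                else:
                    self.free_space[i] = (offset + size, length - size)
                return offset
        offset = len(self.data)
        self.data.extend(bytes(size))
        return offset

    def add(self, payload: bytes) -> int:
        offset = self._allocate(len(payload))
        self.data[offset:offset + len(payload)] = payload
        self.control.highest_record_no += 1
        record_no = self.control.highest_record_no
        self.record_index[record_no] = (offset, len(payload))
        self.control.changed_since_index.add(record_no)
        return record_no

    def delete(self, record_no: int) -> None:
        # The freed extent goes into the free-space table for later reuse.
        self.free_space.append(self.record_index.pop(record_no))
        self.control.changed_since_index.add(record_no)  # assumption: counts as a change

    def read(self, record_no: int) -> bytes:
        offset, length = self.record_index[record_no]
        return bytes(self.data[offset:offset + length])

    def mark_indexed(self) -> None:
        self.control.highest_indexed_record_no = self.control.highest_record_no
        self.control.changed_since_index.clear()
```

Keeping the highest indexed record number and the set of changed records in the control block means an indexer only has to revisit records added or modified since the last run, and the free-space table lets new records fill holes left by deleted ones.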
    5. The full-text retrieval method according to claim 2, wherein saving the position information of the entry in the byte index file specifically comprises:
      for an entry in a field whose type is phrase, integer, numeric, date or time, saving the field and subfield of the entry;
      for an entry in a field whose type is body text, saving the field, paragraph and sentence of the entry and the position of the entry within the sentence.
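The two kinds of position information in claim 5 can be illustrated with a short Python sketch. The splitting rules are assumptions made only for this example: subfields are separated by semicolons, paragraphs by blank lines, sentences by `.`, `!` or `?`, and entries within a sentence by whitespace. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class FieldPosition(NamedTuple):
    """Position of an entry in a phrase, integer, numeric, date or time field."""
    field: str
    subfield: int


class BodyPosition(NamedTuple):
    """Position of an entry in a body-text field."""
    field: str
    paragraph: int
    sentence: int
    position: int


def field_positions(field, value, separator=";"):
    """Treat each separator-delimited part of the value as one subfield (assumption)."""
    return {part.strip(): FieldPosition(field, i)
            for i, part in enumerate(value.split(separator)) if part.strip()}


def body_positions(field, text):
    """Yield (entry, BodyPosition) for every word of a body-text field."""
    for p, paragraph in enumerate(text.split("\n\n")):
        sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
        for s, sentence in enumerate(sentences):
            for w, word in enumerate(sentence.lower().split()):
                yield word, BodyPosition(field, p, s, w)
```

Storing the sentence and the position within the sentence for body text is what would allow phrase or proximity conditions to be checked from the index alone, while the coarser field and subfield pair is sufficient for short, structured fields.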
    6. A full-text retrieval system, comprising:
      a byte acceleration file storage module, configured to store each record of a full-text database;
      a byte index file storage module, configured to store the position information of the entries in those search fields of the byte acceleration file that have the fast-lookup attribute;
      a virtual information file storage module, configured to store the position information of the entries in the byte index file and the offsets, relative to each entry, of the linking symbols decomposed from that entry;
      a word segmentation module, configured to receive a search expression that contains a search term and to perform word segmentation on the search term;
      a retrieval module, configured to look up, in the virtual information file of the full-text database and according to each entry obtained from the word segmentation, the position information of the entry in the byte index file of the full-text database, then to look up, in the byte index file and according to the position information, the record information corresponding to the entry, and finally to extract, according to the record information, the corresponding data from the byte acceleration file of the full-text database as the retrieval result.
    7. The full-text retrieval system according to claim 6, wherein the byte acceleration file storage module specifically comprises: data blocks for storing all or part of each record, and a record number index for storing pointers to the data blocks that hold all or part of each record;
      the byte index file storage module specifically comprises: a hash code uniquely corresponding to each entry in those search fields of the byte acceleration file that have the fast-lookup attribute, the entry address corresponding to the hash code, and the position information of the entry;
      the virtual information file storage module specifically comprises: the hash code of the entry, the position information of the entry in the byte index file, and the offset of the first character of each linking symbol within the entry.
    8. The full-text retrieval system according to claim 7, wherein the byte acceleration file storage module further comprises:
      a control block, configured to save at least one of the highest record number, the highest indexed record number, and the record numbers of records modified or added since the last indexing;
      a free-space table, configured to record unused data space.
    9. The full-text retrieval system according to claim 6, further comprising a hash code generator, configured to apply a hash function to the entries in those search fields of the byte acceleration file storage module that have the fast-lookup attribute, generating a hash code uniquely corresponding to each entry.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN200810126025A CN101620607A (en) | 2008-07-01 | 2008-07-01 | Full-text retrieval method and full-text retrieval system | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN101620607A (en) | 2010-01-06 |
Family
ID=41513848
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN200810126025A Pending CN101620607A (en) | 2008-07-01 | 2008-07-01 | Full-text retrieval method and full-text retrieval system | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN101620607A (en) | 
Cited By (17)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN102081649B (en) * | 2010-12-31 | 2012-08-15 | 深圳联友科技有限公司 | Method and system for searching computer files | 
| CN102081649A (en) * | 2010-12-31 | 2011-06-01 | 深圳联友科技有限公司 | Method and system for searching computer files | 
| CN103186621B (en) * | 2011-12-30 | 2016-07-06 | 北大方正集团有限公司 | A kind of catalogue generates method and apparatus | 
| CN103186621A (en) * | 2011-12-30 | 2013-07-03 | 北大方正集团有限公司 | Catalogue generation method and device | 
| CN103823799A (en) * | 2012-11-16 | 2014-05-28 | 镇江诺尼基智能技术有限公司 | New-generation industry knowledge full-text search method | 
| CN103136352A (en) * | 2013-02-27 | 2013-06-05 | 华中师范大学 | Full-text retrieval system based on two-level semantic analysis | 
| CN103136352B (en) * | 2013-02-27 | 2016-02-03 | 华中师范大学 | Text retrieval system based on double-deck semantic analysis | 
| WO2014169587A1 (en) * | 2013-04-18 | 2014-10-23 | Tao guangyi | Database storage system based on optical disk library, and method using same | 
| JP2016522921A (en) * | 2013-04-18 | 2016-08-04 | 光毅 陶 | Database storage system based on jukebox and method using the same | 
| CN103514256A (en) * | 2013-08-02 | 2014-01-15 | 西安电子工程研究所 | Rationalization proposal full-text retrieval system | 
| CN104834663A (en) * | 2015-02-02 | 2015-08-12 | 北京理工大学 | Full-text retrieval system facing optical disc library | 
| CN104834664A (en) * | 2015-02-02 | 2015-08-12 | 北京理工大学 | Optical disc juke-box oriented full text retrieval system | 
| CN108228643A (en) * | 2016-12-21 | 2018-06-29 | 北京视联动力国际信息技术有限公司 | A kind of search method and system | 
| CN110704579A (en) * | 2019-08-22 | 2020-01-17 | 中国人民解放军军事科学院评估论证研究中心 | Full-text retrieval method and system based on branch definition | 
| CN110704579B (en) * | 2019-08-22 | 2020-10-23 | 中国人民解放军军事科学院评估论证研究中心 | Full-text retrieval method and system based on branch definition | 
| CN114003685A (en) * | 2022-01-04 | 2022-02-01 | 广州奥凯信息咨询有限公司 | Word segmentation position index construction method and device, and document retrieval method and device | 
| CN114003685B (en) * | 2022-01-04 | 2022-06-07 | 广州奥凯信息咨询有限公司 | Word segmentation position index construction method and device, and document retrieval method and device | 
Similar Documents
| Publication | Title |
|---|---|
| CN101620607A (en) | Full-text retrieval method and full-text retrieval system |
| US7979268B2 (en) | String matching method and system and computer-readable recording medium storing the string matching method |
| US8117026B2 (en) | String matching method and system using phonetic symbols and computer-readable recording medium storing computer program for executing the string matching method |
| CN101840400B (en) | A multi-level classification retrieval method and system |
| US6470347B1 (en) | Method, system, program, and data structure for a dense array storing character strings |
| US5099426A (en) | Method for use of morphological information to cross reference keywords used for information retrieval |
| KR101231560B1 (en) | Method and system for discovery and modification of data clusters and synonyms |
| US8473501B2 (en) | Methods, computer systems, software and storage media for handling many data elements for search and annotation |
| CN102207948B (en) | Method for generating incident statement sentence material base |
| CN109857898A (en) | A kind of method and system of mass digital audio-frequency fingerprint storage and retrieval |
| CN1661593B (en) | Method for translating computer language and translation system |
| US20080010238A1 (en) | Index having short-term portion and long-term portion |
| CN101794307A (en) | Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea |
| CN106503040B (en) | It is applicable in the KV database and its creation method of SQL query method |
| CN105373541A (en) | Processing method and system for data operation request of database |
| CN102831224A (en) | Creating method for data index base and searching suggest generation method and device |
| CN106649286B (en) | A Method for Term Matching Based on Double Array Dictionary Tree |
| US20200278980A1 (en) | Database processing apparatus, group map file generating method, and recording medium |
| CN112783927A (en) | Database query method and system |
| US6826563B1 (en) | Supporting bitmap indexes on primary B+tree like structures |
| CN101493824A (en) | Data retrieval method and device for database |
| CN116090416B (en) | Standard writing method, system, equipment and medium based on standard knowledge graph |
| CN102207947B (en) | Direct speech material library generation method |
| CN101436203B (en) | Recording index method and apparatus |
| US7870138B2 (en) | File storage and retrieval method |
Legal Events
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| WD01 | Invention patent application deemed withdrawn after publication | Open date: 20100106 |