Embodiment
In the computer field, a data structure (also called a data model or data layout) is conventionally built according to the characteristics of the data to be organized; after the data have been organized with that data structure, they are stored in the form of files. When biomolecular data organized under different data structures are to be shared, the data structures adopted by the files carrying the biomolecular data are numerous and complex. To remove this obstacle, the present embodiment takes the data structures of various known biomolecular data files as its basis and, according to the objectively existing features of biomolecular data, builds a new data structure that is conducive to sharing biomolecular data, that is, a new computer model for organizing biomolecular data. Biomolecular data from various sources can thereby be organized and stored as biomolecular data files for researchers to share.
As the flow of the first embodiment shown in Fig. 1 indicates, this embodiment mainly comprises five steps.
First, in step 1, the significant fields in the data structure of each biomolecular data file are selected. The data structures of the biological data files referred to here have been studied and understood in advance, so the significant fields can be determined from the standpoint of biomolecular science. Because research content, research depth, biodiversity and so on differ, the data structures of different biomolecular data files also differ considerably in complexity and organization; for example, the number of fields and the meaning of the data that each field expresses are not the same. Significant fields are the fields with information value, selected on the basis of what biomolecular research has in common; only then can the subsequent steps organize biomolecular data that are conducive to further research and worth sharing. Fields without value include, for example, fields that record an individual research process, which vary from person to person and are meaningless for subsequent research, and serial-number fields assigned when data are stored into the file. Note that the present embodiment applies both to data storage structures in two-dimensional table form and to text structures, that is, files in .TXT format. For a .TXT file, a line that does not carry a keyword identifying biomolecular data is an insignificant line, that is, an insignificant field; here a keyword is a field name.
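The rule for .TXT files (a line is significant only if it carries a keyword, and a keyword is a field name) can be sketched as follows. This is only an illustrative sketch: the keyword set `EMBL_KEYWORDS` and the function name `meaningful_lines` are assumptions introduced here, not part of the embodiment.

```python
# Hypothetical keyword set for EMBL-style lines; the text only states that
# a keyword is a field name, so this exact set is an assumption.
EMBL_KEYWORDS = {"ID", "AC", "SV", "DE", "KW"}


def meaningful_lines(lines, keywords=EMBL_KEYWORDS):
    """Return the lines whose first token is a known field name."""
    kept = []
    for line in lines:
        tokens = line.split(maxsplit=1)
        # Strip a trailing ';' so that a bare "XX;" is compared as "XX".
        if tokens and tokens[0].rstrip(";") in keywords:
            kept.append(line)
    return kept


sample = [
    "AC X64011; S78972;",
    "XX;",
    "DE Listeria ivanovii sod gene for superoxide dismutase;",
]
print(meaningful_lines(sample))
```

In this sketch the separator line "XX;" is dropped because "XX" is not a field name, while the "AC" and "DE" lines are kept.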
In step 2, the significant fields determined in step 1 are organized: the fields are assembled into a field group in the order in which they are selected, until all the significant fields in the data structure of each biomolecular data file have been selected.
In step 3, the fields in the field group are arranged and combined according to the logic of the information they express, forming a new field set. When building the new data structure, the present embodiment rearranges and combines the fields of the new field set according to the features of biomolecular data. The basis for this arrangement and combination is the logical association that objectively exists in the biomolecular data themselves: if the biomolecular data expressed by certain fields are related, those fields are considered to have a logical relationship, and the data they express are objectively logically associated. These logical associations give the rearranged fields the following ordering property: a field placed earlier is the basis for the fields placed after it. This makes it possible to express more information with a limited amount of data, so that less data provide more information and more information is obtained with less storage space; an example is given later in the text. Combination refers to merging fields that are expressed differently but whose content is essentially the same. The merging is carried out according to the meaning of the data and can be implemented in several ways in practice, for example:
First, the significant fields in the data structure of the first biomolecular data file are obtained to form the field group. Then, for the significant fields obtained from the data structure of the second and each later biomolecular data file, those fields that differ from the fields already in the field group are added to the field group.
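The supplementing procedure just described amounts to an order-preserving union of field lists. A minimal sketch follows, assuming the field names have already been normalized by meaning (the text leaves that merging to the meaning of the data); the two field lists and the name `build_field_group` are illustrative assumptions.

```python
def build_field_group(structures):
    """Union of significant fields, keeping the order of first appearance."""
    group = []
    seen = set()
    for fields in structures:
        for name in fields:
            if name not in seen:
                seen.add(name)
                group.append(name)
    return group


# Illustrative, already-normalized field lists (assumed for this sketch).
genbank_fields = [
    "molecule title", "sequence length", "sequence type",
    "molecular form", "source database name", "data update time",
]
embl_fields = [
    "molecule title", "version number", "molecular form", "sequence type",
    "source database name", "molecular classification", "sequence length",
]

print(build_field_group([genbank_fields, embl_fields]))
```

Only "version number" and "molecular classification" are appended from the second list, since the other fields are already in the group.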
With this scheme of expressing more information with less data, the new data structure encompasses the data structures of existing biomolecular data files, so that when the data of such a file are shared, the data structure of the file being shared is already accommodated. The embodiment of the present invention can therefore achieve the technical effect of sharing biomolecular data efficiently, and it solves the problem that researchers cannot conveniently operate on, analyze and study biomolecular data files that are stored in different computer systems and adopt different data structures. In addition, the present embodiment is particularly suitable for organizing the data of biological data files with large data volumes. If the interpretation of data placed later in a file depends on data placed earlier, sharing the data efficiently would otherwise require either spending a large amount of time adjusting or querying the needed data, or relying on data redundancy to avoid that time cost; the present embodiment addresses this problem. The main significance of this step is that it exploits the logical properties of the data themselves to improve the efficiency of organizing data, to express more information with less data, and to improve the efficiency of sharing data.
Then, in step 4, the new field set is used to generate the new data structure of the biomolecular data file, producing a file that can be used to share the data in existing biomolecular data files.
Finally, in step 5, a new biomolecular data file having the new data structure is used to carry the data in the biomolecular data file that has been read. In the present embodiment, step 5 is implemented by the following sub-steps. First, the biomolecular data file to be shared is read into computer memory. Next, it is determined whether the data structure adopted by this file can be correctly identified. If it cannot, a message indicating that the file cannot be identified is fed back and the operation ends. Otherwise, using the correspondence between the fields of the new biomolecular data file's data structure and the fields of the data structure of the file read into memory, the data of each field in that file are filled into the data space of the corresponding field of the new biomolecular data file. Whether the data structure adopted by a biomolecular data file can be correctly identified can be determined from the file extension or from a special mark in the file content; this is not described further here.
The embodiment shown in Fig. 1 is illustrated below, taking the sharing of .TXT biomolecular data files as an example. The content of a .TXT file generally has the following form:
Line 1 text data;
Line 2 text data;
Line 3 text data;
Line N text data.
Here, the first line of text data is the header line of the biomolecular data file and describes the structure and meaning of the text data in lines 2 through N. Lines 2 through N describe the specific biomolecule information of the file.
For example, an instance of a biomolecule file adopting the GenBank data structure is saved on a computer as a .TXT file. Its first line of text data is:
LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;
The data from the second line onward are:
DEFINITION Listeria ivanovii sod gene for superoxide dismutase;
KEYWORDS sod gene; superoxide dismutase;
ACCESSION X64011 S78972;
The meaning of the first line of text data is:
Note: bp may also be written BP. It stands for base pair, so 756bp means 756 base pairs. It is the unit of length used to describe a DNA sequence.
Suppose that a biomolecule file adopting the EMBL data structure has the same data from the second line onward, and that its first line of text data is:
ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP.
The meaning of this first line of text data is:
Suppose the new data structure of the biomolecular data file is generated on the basis of the two kinds of biomolecular data files above. Following steps 1 to 4 of the embodiment shown in Fig. 1, the following new data structure is obtained:
<molecule title>, <sequence length>, <source database name>, <database access>, <data update time>, <version number>, <sequence type>, <molecular classification>, <molecule description>, <key word>, <is ring molecule>.
As can be seen, the new data structure is characterized in that its field set is the union of the field sets of the known data structures, the field names have been standardized and unified according to their biomolecular meaning, and the fields have been reordered and combined. For example, the molecular form field and the is-ring-molecule field are combined into a single field.
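One way to picture the new data structure is as a record type whose attributes follow the field order listed above. The following is a sketch only: the class name `BiomoleculeRecord`, the attribute spellings and the choice of `None` to stand for NULL are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BiomoleculeRecord:
    """Hypothetical container mirroring the eleven fields of the new structure."""
    molecule_title: Optional[str] = None
    sequence_length: Optional[int] = None
    source_database_name: Optional[str] = None
    database_access: Optional[str] = None
    data_update_time: Optional[str] = None
    version_number: Optional[str] = None
    sequence_type: Optional[str] = None
    molecular_classification: Optional[str] = None
    molecule_description: Optional[str] = None
    key_word: Optional[str] = None
    is_ring_molecule: Optional[bool] = None  # False means a linear molecule


record = BiomoleculeRecord(
    molecule_title="LISOD", sequence_length=756, is_ring_molecule=False
)
print(asdict(record))
```

Fields that a source file does not provide simply keep their default of `None`, which plays the role of NULL in this sketch.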
When, in step 5, the new biomolecular data file having the new data structure is used to carry the data in the GenBank biomolecule file that has been read, the partial content of the new biomolecular data file is:
<molecule title>LISOD;
<sequence length>756;
<source database name>BCT;
<database access>X64011 S78972; // i.e. the "ACCESSION" field
<data update time>30-JUN-1993;
<version number>NULL; // NULL means the field is empty
<sequence type>DNA;
<molecular classification>NULL;
<molecule description>Listeria ivanovii sod gene for superoxide dismutase; // i.e. the "DEFINITION" field
<key word>sod gene; superoxide dismutase; // i.e. the "KEYWORDS" field
<is ring molecule>false; // false means the molecular form is linear; if the form is circular, the field is true
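As an illustration of how the GenBank first line could yield the values shown above, the sketch below parses a LOCUS line with a regular expression. The pattern, the function name `parse_locus_line` and the dictionary keys are assumptions introduced here; the pattern tolerates both "BCT 30-JUN-1993" and the run-together "BCT30-JUN-1993".

```python
import re

# Assumed layout: LOCUS <title> <length>bp <type> <form> <source> <date>
LOCUS_RE = re.compile(
    r"^LOCUS\s+(?P<title>\S+)\s+(?P<length>\d+)\s*[Bb][Pp]\s+"
    r"(?P<seq_type>\S+)\s+(?P<form>linear|circular)\s+"
    r"(?P<source>[A-Z]{3})\s*(?P<date>\d{2}-[A-Z]{3}-\d{4})"
)


def parse_locus_line(line):
    """Map a GenBank LOCUS line to fields of the new structure, or None."""
    m = LOCUS_RE.match(line)
    if m is None:
        return None
    return {
        "molecule title": m.group("title"),
        "sequence length": int(m.group("length")),
        "source database name": m.group("source"),
        "data update time": m.group("date"),
        "sequence type": m.group("seq_type"),
        # The molecular form is folded into a single boolean field.
        "is ring molecule": m.group("form") == "circular",
    }


print(parse_locus_line("LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;"))
```

Fields that the LOCUS line does not supply, such as the version number, are simply absent from the result and would be written as NULL.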
Suppose that the EMBL biomolecule file is also saved as a .TXT file, with ID as the mark of this biomolecule file. Its first line of text data is:
ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP;
The data from the second line onward are:
XX;
AC X64011; S78972;
XX;
SV X64011.1;
XX;
DE Listeria ivanovii sod gene for superoxide dismutase;
XX;
KW sod gene; superoxide dismutase;
Note: in this file, "DE" corresponds to "DEFINITION", "SV" to "VERSION" (the version number), "AC" to "ACCESSION", and "KW" to "KEYWORDS".
When, in step 5, the new biomolecular data file having the new data structure is used to carry the data in the EMBL biomolecule file that has been read, the partial content of the new biomolecular data file is:
<molecule title>X64011;
<sequence length>756;
<source database name>STD; // indicates that the data come from the EMBL database
<database access>X64011; S78972;
<data update time>NULL;
<version number>X64011.1;
<sequence type>genomic DNA;
<molecular classification>PRO;
<molecule description>Listeria ivanovii sod gene for superoxide dismutase;
<key word>sod gene; superoxide dismutase;
<is ring molecule>false;
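The EMBL first line can be handled in the same spirit. The sketch below splits the ID line on semicolons and assigns the pieces by position, following the field assignments shown in this example (STD to the source database name, PRO to the molecular classification). The function name `parse_embl_id_line` and the positional layout are assumptions for illustration.

```python
def parse_embl_id_line(line):
    """Map an EMBL ID line to fields of the new structure, or None."""
    keyword, _, rest = line.partition(" ")
    if keyword != "ID":
        return None
    # Split on ';' and drop empty pieces and trailing '.' characters.
    parts = [p.strip(" .") for p in rest.split(";")]
    parts = [p for p in parts if p]
    if len(parts) != 7:
        return None
    title, _sv, form, seq_type, source, classification, length = parts
    digits = "".join(ch for ch in length if ch.isdigit())
    if not digits:
        return None
    return {
        "molecule title": title,
        "sequence length": int(digits),
        "source database name": source,
        "sequence type": seq_type,
        "molecular classification": classification,
        "is ring molecule": form == "circular",
    }


print(parse_embl_id_line("ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP;"))
```

The version number is not taken from the ID line in this sketch, because the example above fills it from the separate "SV" line.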
As can be seen from the above, the embodiment of the present invention builds a new data model structure based on the characteristics of biomolecular data. This structure encompasses the data structures adopted by the biomolecular data files that researchers already use. Information that is essentially identical in biological meaning across the different biomolecular data files used by different researchers can thus be unified: through the mapping between the fields of a biomolecule file adopting a traditional data structure and the fields of the biomolecule file adopting the new data structure, the data in biological data files from two different sources and with different data structures are filled into the data spaces of the corresponding fields in the biomolecular data file adopting the new data structure. For example, the 756bp in the biological data file adopting the GenBank data structure and the 756BP in the biological data file adopting the EMBL data structure are both filled into the data space of the <sequence length> field in the biological data file based on the new data structure. The data "30-JUN-1993", whose biological meaning appears only in the GenBank file, are filled directly into the data space of the <data update time> field, whereas the EMBL file has no data with that meaning, so for it the data space of the <data update time> field is filled with "NULL" (empty). This data filling requires no manual intervention by the user of the biological data file. The method provided by the embodiment of the present invention therefore makes it convenient for researchers to share biomolecular data and operate on them in a unified way, which makes efficient analysis and study of the data possible in the next step.
In the embodiment shown in Fig. 1, step 5 is implemented as shown in Fig. 2.
Step 21: read the biomolecular data file into computer memory.
Step 22: determine whether the data structure adopted by this biomolecular data file can be correctly identified.
Suppose the file read in is a GenBank biomolecular data file (.TXT file) whose partial data are:
LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;
DEFINITION Listeria ivanovii sod gene for superoxide dismutase;
ACCESSION X64011 S78972;
VERSION X64011.1 GI:44010;
KEYWORDS sod gene; superoxide dismutase;
The first line is the description line of the file content, and the lines from the second onward describe the specific biomolecule information. By checking whether the first line of the file carries the "LOCUS" mark, it can be confirmed whether the file is a GenBank format file, that is, a file adopting the GenBank format data structure.
If the file read in is an EMBL biomolecular data file (.TXT file), its partial data are:
ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP;
XX;
AC X64011; S78972;
XX;
SV X64011.1;
XX;
DE Listeria ivanovii sod gene for superoxide dismutase;
XX;
KW sod gene; superoxide dismutase;
The first line is the description line of the file content, and the lines from the second onward describe the specific biomolecule information. XX is a meaningless blank line. By checking whether the first line of the file carries the "ID" mark, followed by an "SV" mark, it can be confirmed whether the file adopts the EMBL format data structure. Here "SV" is a character mark unique to EMBL files.
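The two identification rules of step 22 can be sketched as a small function. The text does not say whether the "SV" mark must appear on the ID line itself or on a later line, so this sketch accepts either; that choice, and the name `detect_format`, are assumptions.

```python
def detect_format(lines):
    """Return "GenBank", "EMBL", or None when the structure is not recognized."""
    if not lines:
        return None
    first_tokens = lines[0].split()
    if not first_tokens:
        return None
    if first_tokens[0] == "LOCUS":
        return "GenBank"
    if first_tokens[0] == "ID":
        # Accept an SV mark either on the ID line or on a later line.
        sv_in_first = any(tok.startswith("SV") for tok in first_tokens[1:])
        sv_later = any(line.split()[:1] == ["SV"] for line in lines[1:])
        if sv_in_first or sv_later:
            return "EMBL"
    return None


genbank = ["LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;"]
embl = ["ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP;", "XX;"]
print(detect_format(genbank), detect_format(embl))
```

A return value of `None` corresponds to the case handled in step 23, where the file cannot be identified.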
Step 23: if the current file cannot be identified, send feedback that it cannot be identified, notify the computer system that information about a new biomolecular data file needs to be added, and then end the operation. In other words, the present embodiment uses a pattern database that stores information about biomolecular data files. When a biomolecular data file with an unknown data structure is encountered, feedback is given that it cannot be identified automatically; once the data structure adopted by that file has been identified, the relevant information about the file is added to the pattern database, for example the name of the biomolecular data file, the name of the data structure, and the fields that the data structure uses. If the current file can be correctly identified, proceed to step 24.
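A pattern database of the kind described in step 23 might be sketched as below. The class `PatternDatabase`, its methods and the idea of keying each entry on the first-line keyword are assumptions made for illustration; the text only lists the kinds of information stored.

```python
class PatternDatabase:
    """Hypothetical in-memory store of known biomolecular file structures."""

    def __init__(self):
        self._entries = {}

    def register(self, structure_name, first_keyword, fields):
        """Add the information about a newly identified data structure."""
        self._entries[structure_name] = {
            "first_keyword": first_keyword,
            "fields": list(fields),
        }

    def identify(self, first_line):
        """Return the structure name for a file's first line, or None."""
        tokens = first_line.split()
        if not tokens:
            return None
        for name, entry in self._entries.items():
            if tokens[0] == entry["first_keyword"]:
                return name
        return None


db = PatternDatabase()
db.register("GenBank", "LOCUS", ["LOCUS", "DEFINITION", "ACCESSION"])
embl_first_line = "ID X64011; SV 1; linear; genomic DNA; STD; PRO; 756BP;"
before = db.identify(embl_first_line)   # unknown structure at this point
db.register("EMBL", "ID", ["ID", "AC", "SV", "DE", "KW"])
after = db.identify(embl_first_line)
print(before, after)
```

The unknown file is reported as unidentified first and becomes identifiable once its structure has been registered, which mirrors the feedback-then-add flow of step 23.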
Step 24: using the correspondence between the fields of the new biomolecular data file's data structure and the fields of the data structure of the other biomolecular data file read into computer memory, fill the data of each field in the file read in into the data space of the corresponding field of the new biomolecular data file. This completes the mapping of data from the biomolecular data file adopting a traditional data structure to the biomolecular data file adopting the new data structure.
The specific field mapping operation is explained below, taking a file adopting the GenBank format data structure as an example. Suppose the partial data of the file are:
LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;
DEFINITION Listeria ivanovii sod gene for superoxide dismutase;
ACCESSION X64011 S78972;
VERSION X64011.1 GI:44010;
KEYWORDS sod gene; superoxide dismutase;
The field correspondence, data mapping relation and mapping operation between the fields of the GenBank format file and the biomolecular data file adopting the new data structure provided by the embodiment of the present invention are as follows:
<molecule title> = LOCUS. Explanation: LOCUS is the file mark of the GenBank data structure in the NCBI database and is also the keyword that precedes the name; it corresponds to <molecule title> in the data model. In the first line of the file, what follows LOCUS is the name of the current biomolecule. When the LOCUS mark is encountered, the content that follows it is parsed as the content of the <molecule title> field in the new data structure, and "LISOD" is filled into the data space of the <molecule title> field of the new biomolecular data file.
<molecule description> = DEFINITION. Explanation: when the keyword DEFINITION is encountered, the content that follows it is parsed as the content of the <molecule description> field in the new data structure, and "Listeria ivanovii sod gene for superoxide dismutase" is filled into the data space of the <molecule description> field of the new biomolecular data file.
<version number> = VERSION. Explanation: when the keyword VERSION is encountered, the content that follows it is parsed as the content of the <version number> field in the new data structure, and "X64011.1 GI:44010" is filled into the data space of the <version number> field of the new biomolecular data file.
<database access> = ACCESSION. Explanation: when the keyword ACCESSION is encountered, the content that follows it is parsed as the content of the <database access> field in the new data structure, and "X64011 S78972" is filled into the data space of the <database access> field of the new biomolecular data file.
<key word> = KEYWORDS. Explanation: when the keyword KEYWORDS is encountered, the content that follows it is parsed as the content of the <key word> field in the new data structure, and "sod gene; superoxide dismutase" is filled into the data space of the <key word> field of the new biomolecular data file.
The data mapping relations and mapping operations of the remaining fields follow by analogy.
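The keyword-to-field correspondence above can be sketched as a lookup table plus a line-by-line pass over the file. The table name `GENBANK_TO_NEW` and the function `map_genbank_lines` are illustrative assumptions; the mapping itself follows the correspondences listed above.

```python
# Correspondence between GenBank keywords and fields of the new structure.
GENBANK_TO_NEW = {
    "DEFINITION": "molecule description",
    "ACCESSION": "database access",
    "VERSION": "version number",
    "KEYWORDS": "key word",
}


def map_genbank_lines(lines):
    """Fill a dict keyed by new-structure field names from GenBank lines."""
    record = {}
    for line in lines:
        keyword, _, content = line.partition(" ")
        if keyword == "LOCUS":
            parts = content.split()
            if parts:
                # The first token after LOCUS is the molecule name.
                record["molecule title"] = parts[0]
        elif keyword in GENBANK_TO_NEW:
            record[GENBANK_TO_NEW[keyword]] = content.strip().rstrip(";").strip()
    return record


genbank_lines = [
    "LOCUS LISOD 756bp DNA linear BCT 30-JUN-1993;",
    "DEFINITION Listeria ivanovii sod gene for superoxide dismutase;",
    "ACCESSION X64011 S78972;",
    "VERSION X64011.1 GI:44010;",
    "KEYWORDS sod gene; superoxide dismutase;",
]
print(map_genbank_lines(genbank_lines))
```

Supporting another source format in this sketch would mean supplying a different lookup table, while the traversal stays the same.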
In implementing the embodiment of the present invention, the biomolecular data file read into computer memory is traversed line by line from beginning to end, regular expressions are used to identify the data of the fields of the file's data structure, and, according to the field correspondence, the data are mapped from the biomolecular data file read in to the new biomolecular data file.
Regular expression matching is a general-purpose character matching method in computing. For example, to match "ACCESSION", the regular expression is simply ACCESSION; with this regular expression, the ACCESSION character field and its position can be identified, and from there the position of the field's actual data can be found.
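As a small illustration using Python's standard `re` module, the sketch below matches a keyword at the start of a line and reports where the field's data begin. The helper name `find_field_data` and the stripping of the trailing semicolon are assumptions for this example.

```python
import re


def find_field_data(keyword, line):
    """Return (start position of the data, data text), or None if no match."""
    # re.escape guards against keywords containing regex metacharacters.
    match = re.match(re.escape(keyword) + r"\s+", line)
    if match is None:
        return None
    start = match.end()
    return start, line[start:].rstrip(";")


print(find_field_data("ACCESSION", "ACCESSION X64011 S78972;"))
```

Here the match covers the keyword and the whitespace after it, so the end of the match is the position at which the field's data start.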
For files adopting other format data structures, for example files adopting the EMBL data structure, the data mapping operation is the same as that described above for files adopting the GenBank format data structure and is not repeated here.
According to the embodiment shown in Fig. 1, to access the sequence length in the two files above, it is only necessary to read the value of the <sequence length> field of the biomolecular data file adopting the new data structure. In this way, all biomolecular data that have been mapped into biomolecular data files adopting the new data structure can be read through the corresponding fields, which makes unified operation convenient.
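Once records from both sources are held in the new structure, a single accessor suffices regardless of the original format. In this sketch the records are plain dictionaries and the helper `field_values` is a hypothetical name.

```python
def field_values(records, field):
    """Read the same field from every record held in the new structure."""
    return [record.get(field) for record in records]


# Records as they might look after mapping (values taken from the example).
genbank_record = {"molecule title": "LISOD", "sequence length": 756}
embl_record = {"molecule title": "X64011", "sequence length": 756}

print(field_values([genbank_record, embl_record], "sequence length"))
```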
The most notable characteristic of the embodiment shown in Fig. 1 is that every time the method is carried out, the new data structure of the biomolecular data file must be generated before the new biomolecular data file having that structure can be used to carry the data in the biomolecular data file that has been read. To simplify the execution of the embodiment shown in Fig. 1 and improve its efficiency, a model database is set up in advance to store the new data structure generated the first time. The new data structure then need not be generated on every run, so steps 1 to 4 need not be executed every time, which improves the execution efficiency of the present embodiment. When a new biomolecular data file that can be correctly identified is encountered, the significant fields in its data structure are used to supplement the new data structure. In addition, the model database can also store the data structures adopted by the biomolecular data files that can be correctly identified, so that the data in the model database can be used to determine whether the data structure adopted by a biomolecular data file can be correctly identified. For a newly added biomolecular data file that can be correctly identified, the data structure it adopts is also added to the model database, which increases the number of biomolecular data files that can be identified with the model database.
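The role of the model database can be sketched as a small class that keeps the new data structure between runs and supplements it when a newly identified structure is added. The class `ModelDatabase`, its attribute names and the in-memory storage are assumptions for illustration only.

```python
class ModelDatabase:
    """Hypothetical store for the new structure and the known source structures."""

    def __init__(self):
        self.new_structure = []      # ordered field names of the new structure
        self.known_structures = {}   # structure name -> its significant fields

    def add_structure(self, name, fields):
        """Store a source structure and supplement the new structure."""
        self.known_structures[name] = list(fields)
        for field in fields:
            if field not in self.new_structure:
                self.new_structure.append(field)

    def is_recognized(self, name):
        return name in self.known_structures


model_db = ModelDatabase()
model_db.add_structure("GenBank", ["molecule title", "sequence length", "data update time"])
model_db.add_structure("EMBL", ["molecule title", "version number", "sequence length"])
print(model_db.new_structure)
```

Because the stored structure is reused, only the supplementing step runs when a further recognizable file type is added, rather than steps 1 to 4 in full.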
The embodiments of the method for efficiently sharing biomolecular data provided by the present invention have been described in detail above. Specific examples have been used herein to set forth the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.