
CN112553361A - Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data - Google Patents

Info

Publication number
CN112553361A
CN112553361A CN202011310367.5A CN202011310367A CN112553361A CN 112553361 A CN112553361 A CN 112553361A CN 202011310367 A CN202011310367 A CN 202011310367A CN 112553361 A CN112553361 A CN 112553361A
Authority
CN
China
Prior art keywords
snp
sequence
sequencing data
identifying
rad
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011310367.5A
Other languages
Chinese (zh)
Inventor
吴新义
李汉美
刘庭付
吴晓花
汪颖
汪宝根
鲁忠富
李国景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Academy of Agricultural Sciences
Original Assignee
Zhejiang Academy of Agricultural Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Academy of Agricultural Sciences
Priority to CN202011310367.5A
Publication of CN112553361A

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Botany (AREA)
  • Mycology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for identifying broad bean SNPs using simplified genome sequencing data, comprising the following steps: step one, extracting sample genomic DNA; step two, constructing a RAD library by digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing; step three, generating raw sequences, performing sequence quality control, clustering the high-quality sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags for SNP calling, and correcting the SNP genotypes; step four, developing KASP markers and genotyping the SNPs. Because the method mines SNPs from genome sequencing data, it can identify SNP mutations not only in expressed gene regions but also in non-coding regions such as intragenic (intron) and intergenic regions, so the SNPs come from a richer range of sources.

Description

Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data
Technical Field
The invention relates to the technical field of biological detection, in particular to a method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data.
Background
Broad beans are rich in protein and cellulose and are easy to digest and absorb. The dry seeds can be used as grain and feed or processed into snack foods, and the fresh seeds can be eaten as a vegetable. The broad bean root system fixes nitrogen biologically, which makes broad bean an important rotation and soil-improving crop in the restructuring of crop production.
In recent years, with the rapid development of second-generation sequencing technologies and the sharp fall in sequencing costs, high-throughput sequencing has been widely applied to marker development, gene mapping and related work in crops with large, complex genomes such as wheat. Single nucleotide polymorphism (SNP) markers, that is, DNA sequence polymorphisms caused by single-nucleotide variation at the genome level, have become ideal new-generation molecular markers because they are widely distributed across the genome, dense, stable and suitable for large-scale screening. In broad bean, however, the number of publicly reported SNP markers is still limited.
Simplified genome sequencing, such as RAD-Seq (restriction site-associated DNA sequencing), uses restriction enzymes to fragment genomic DNA and applies high-throughput sequencing to specific fragments, reducing genome complexity while yielding sequence data that represent the whole genome of the target species. Because it requires only moderate sequencing depth, is inexpensive and does not depend on a reference genome, it is now widely used for marker development, genetic map construction, target gene mapping and similar work in many non-model species.
Broad bean is a diploid crop (2n = 2x = 12) with a genome of about 13 Gb, 25 times larger than that of the legume alfalfa, which makes it one of the largest genomes among leguminous crops. This very large genome seriously hinders genomic resource research such as whole-genome sequencing and marker development, so work such as using molecular markers to achieve genetic gain has progressed slowly. The prior art therefore needs to be improved.
Disclosure of Invention
The invention provides a method for identifying broad bean SNPs using simplified genome sequencing data. It aims to solve the technical problem that the very large broad bean genome seriously hinders genomic resource research such as genome sequencing and marker development, so that work such as using molecular markers to achieve genetic gain progresses slowly.
To solve this technical problem, the invention provides the following technical solution:
A method for identifying broad bean SNPs using simplified genome sequencing data, comprising the following steps:
step one, taking young leaves of broad bean seedlings as the sample, grinding them in liquid nitrogen, and extracting the sample genomic DNA;
step two, constructing a RAD library: digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing;
step three, generating raw sequences, performing sequence quality control, removing sequences shorter than 85 bp to obtain screened sequences, clustering the screened sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags for SNP calling, and correcting the SNP genotypes;
step four, selecting a sequence of 100 bp total length that covers the SNP site and designing KASP primers, so that each KASP marker comprises two forward primers that distinguish the SNP alleles and one universal reverse primer, and performing SNP genotyping to detect the SNP signals.
Further, the young broad bean leaves in step one are taken from seedlings that have grown for 1 week.
Further, in step one, after the sample genomic DNA is extracted, the method further comprises:
checking the quality of the extracted sample genomic DNA by agarose gel electrophoresis, and measuring its concentration with a NanoDrop2000 ultramicro spectrophotometer.
Further, in step two, after the genomic DNA is digested with EcoRI, an 'A' is added to the 3' end of the digested fragments and an MID adapter is ligated; single-end sequencing is performed on the Illumina HiSeq2000 platform.
Further, in step three, the raw sequences are generated by the Illumina base-calling software CASAVA v1.8.2, and sequence quality control is performed with Trimmomatic under default parameters.
Further, in step three, the screened sequences are clustered by sequence similarity with ustacks to generate RAD-tags, the RAD-tags are clustered with cstacks under default parameters for SNP calling, and the SNP genotypes are finally corrected with a Bayesian algorithm.
Further, in step four, the KASP primers are designed with Kraken™ software.
Further, in step four, SNP genotyping uses an IntelliQube high-throughput genotyping platform to detect the SNP signals.
Further, in step four, for SNP signal detection during SNP genotyping, the single-site reaction volume is 1.6 μL, consisting of 0.8 μL of sample DNA and 0.8 μL of a mixture of 2x Master mix and Primer mix.
Further, in step four, for SNP signal detection during SNP genotyping, the PCR amplification program is: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s; then 26 cycles of 94 °C for 20 s and 55 °C for 60 s.
The technical solution provided by the invention has at least the following beneficial effects:
The invention provides a method for identifying broad bean SNPs using simplified genome sequencing data, which differs from prior-art methods that mine SNPs from transcriptome sequencing data. The SNPs identified by the method provide a powerful genetic tool for broad bean germplasm resource identification, gene mapping and molecular breeding.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the 4 types of KASP marker amplification signals provided by an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive effort fall within the scope of the present invention.
This embodiment provides a method for identifying broad bean SNPs using simplified genome sequencing data: broad bean varieties are subjected to simplified genome sequencing by RAD-Seq, broad bean SNPs are identified genome-wide with a reference-genome-independent SNP identification technique, and the characteristics of these SNPs are analyzed.
The broad bean germplasm used in this embodiment was collected and preserved by the Agriculture and Forestry Scientific Research Institute of Lishui City; 8 germplasms (FB017, FB032, FB036, FB056, FB076, FB079, FB080, FB081) were used for simplified genome sequencing, and another 46 germplasms were used to verify SNP accuracy.
The method for identifying the broad bean SNP by using the simplified genome sequencing data comprises the following steps:
step one, DNA extraction:
Young leaves of broad bean seedlings are taken as samples and ground in liquid nitrogen, and the sample genomic DNA is extracted with a DNA extraction kit. The young leaves used in this embodiment are from seedlings that have grown for 1 week. After the sample genomic DNA is extracted, its quality is checked by agarose gel electrophoresis and its concentration is measured with a NanoDrop2000 ultramicro spectrophotometer.
Step two, library construction and sequencing:
The RAD library is constructed following the method of Baird et al. (Baird NA, Etter PD, Atwood TS, et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 2008, 3: e3376): the genomic DNA is digested with EcoRI, an 'A' is added to the 3' end of the digested fragments, and an MID (multiplex identifier) adapter is ligated; the library is enriched by PCR amplification, the target band is recovered from an agarose gel, and single-end sequencing is performed on the Illumina HiSeq2000 platform.
Step three, SNP identification:
Raw sequences are generated with the Illumina base-calling software CASAVA v1.8.2, sequence quality control is performed with Trimmomatic under default parameters, and sequences shorter than 85 bp are removed to obtain screened sequences; the screened sequences are clustered by sequence similarity to generate RAD-tags, the RAD-tags are clustered for SNP calling, and the SNP genotypes are corrected.
Specifically, the clustering, SNP calling and genotype correction proceed as follows:
The screened sequences are clustered with ustacks to generate read tags (RAD-tags). Genome-wide SNPs are then identified without a reference genome, following the method of Xu et al. (Xu P, Xu S, Wu X, et al. Population genomic analyses from low-coverage RAD-Seq data: a case study on the non-model cucurbit bottle gourd. Plant J, 2014, 77: 430-442): the RAD-tags are clustered with cstacks under default parameters, and the SNP genotypes are finally corrected with a Bayesian algorithm (Hohenlohe PA, Bassham S, Etter PD, et al. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet, 2010, 6: e1000862).
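The clustering idea can be illustrated with a toy sketch. It is not the Stacks implementation: identical reads are collapsed into stacks, stacks within a small Hamming distance of a seed are merged into one locus, and columns where the stacks of a locus disagree are reported as candidate SNPs. The function names and the `max_mismatch` threshold are assumptions made for illustration only.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def build_loci(reads, max_mismatch=2):
    """Greedy toy clustering: collapse identical reads, then merge similar stacks."""
    stacks = Counter(reads)  # identical reads form one stack
    loci = []  # each locus is a list of (sequence, depth) tuples
    # Visit the deepest stacks first so that they seed the loci.
    for seq, depth in stacks.most_common():
        for locus in loci:
            seed = locus[0][0]
            if len(seed) == len(seq) and hamming(seed, seq) <= max_mismatch:
                locus.append((seq, depth))
                break
        else:
            loci.append([(seq, depth)])
    return loci


def candidate_snps(locus):
    """Return 0-based positions at which the stacks of one locus differ."""
    seqs = [seq for seq, _ in locus]
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


# Hypothetical reads: two alleles of one locus plus an unrelated locus.
reads = ["ACGTACGT"] * 5 + ["ACGTACTT"] * 4 + ["TTTTGGGG"] * 3
loci = build_loci(reads)
for locus in loci:
    print(locus, candidate_snps(locus))
```

Real RAD-Seq clustering also has to handle sequencing errors, repeats and stack depth thresholds, which this sketch leaves out.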
Step four, KASP marker development and genotyping:
A sequence of about 100 bp total length covering the SNP site is selected, and KASP primers are designed with Kraken™ software so that each KASP marker comprises three primers: two forward primers that distinguish the SNP alleles and one universal reverse primer. SNP genotyping is then performed to detect the SNP signals. SNP genotyping was carried out in the public laboratory of the Zhejiang Academy of Agricultural Sciences, and SNP signals were detected on the IntelliQube high-throughput genotyping platform. The single-site reaction volume was 1.6 μL, consisting of 0.8 μL of sample DNA and 0.8 μL of a mixture of 2x Master mix and Primer mix. The PCR amplification program was: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s (touchdown PCR, decreasing by 0.6 °C per cycle); then 26 cycles of 94 °C for 20 s and 55 °C for 60 s. The results were analyzed with the software supplied with the IntelliQube platform.
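The touchdown program can be written out cycle by cycle. The sketch below assumes that the first touchdown cycle anneals at 61 °C and each following cycle is 0.6 °C lower, which puts the tenth cycle at 55.6 °C before the 26 cycles at 55 °C; the function name and parameters are illustrative.

```python
def touchdown_schedule(start=61.0, step=0.6, touchdown_cycles=10,
                       final=55.0, final_cycles=26):
    """Return the annealing temperature (deg C) of every PCR cycle."""
    # Touchdown phase: lower the annealing temperature by `step` each cycle.
    touchdown = [round(start - step * i, 1) for i in range(touchdown_cycles)]
    # Constant phase: remaining cycles at the final annealing temperature.
    return touchdown + [final] * final_cycles


schedule = touchdown_schedule()
print(schedule[:10])   # annealing temperatures of the touchdown cycles
print(len(schedule))   # total number of amplification cycles
```

Under this assumption the program runs 36 amplification cycles in total after the initial 15 min denaturation.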
Further, in another possible embodiment, step two is implemented as follows:
1) Experimental steps:
1. construct the library from 1 μg of starting DNA;
2. fragment the DNA to 300-500 bp by Covaris M220 sonication;
3. add an A to the 3' end and ligate index adapters (TruSeq™ Nano DNA Sample Prep Kit);
4. enrich the library by 8 cycles of PCR amplification;
5. recover the target band from a 2% agarose gel (Certified Low Range Ultra Agarose);
6. quantify with TBS380 (PicoGreen) and pool the libraries for loading according to the required data proportions;
7. perform bridge PCR amplification on the cBot to generate clusters;
8. sequence on the Illumina HiSeq platform (2 × 150 bp).
2) Bioinformatics analysis workflow:
The sequencing reads are aligned to a reference genome sequence with BWA, and reads arising from PCR duplication are then removed with Picard-tools. The sequencing depth and coverage relative to the reference genome are calculated from the alignment results. SNPs and small indels are detected with the GATK software package, and structural variants (SVs) are identified with Breakdancer-1.1.2.
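As a minimal illustration of the depth and coverage calculation, the sketch below takes aligned reads as hypothetical `(start, length)` intervals on a reference of known length. A real analysis would compute these values from the BAM file with a dedicated tool; the function name and input format are assumptions.

```python
def depth_and_coverage(alignments, ref_length):
    """Return (mean depth, covered fraction) for reads aligned to a reference.

    alignments: iterable of (start, length) with 0-based start positions.
    """
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip each alignment to the reference boundaries.
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    mean_depth = sum(depth) / ref_length   # average reads per reference base
    coverage = covered / ref_length        # fraction of bases hit at least once
    return mean_depth, coverage


mean_depth, coverage = depth_and_coverage([(0, 5), (3, 5), (12, 4)], ref_length=20)
print(mean_depth, coverage)
```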
3) Raw sequencing data description:
The raw image data obtained by Illumina sequencing are converted into sequence data by base calling, and the result is stored in FASTQ format. The FASTQ file is the most basic data file and contains both the sequences of the sequencing reads and their sequencing quality information. The FASTQ file format is as follows:
@K00169:186:HM5C2CCXX:6:1101:8136:2962 1:N:0:CTGGCATA
CCACTCATAATCCAGCAAATACTAAATCTGCTGCAGGAAAAGAAATGCGGTTGAGCTTAAATAGCCCAG
+
AFFKKFKKFFKKKKFKAFKKAAKFAFFKKFKKFFKKKKFKAFKKAAKFAFFKKFKKFFKKKKFKAFKKAA
Each read occupies 4 lines. The first line starts with "@" and gives the read name (ID); the second line is the base sequence of the read; the third line starts with "+" and may repeat the ID (the ID may be omitted, but the "+" may not); the fourth line gives the sequencing quality value of each base in the second line. To facilitate the storage and sharing of high-throughput sequencing data generated by different laboratories, the NCBI data center has built a large database, SRA (Sequence Read Archive, http://www.ncbi.nlm.nih.gov/Traces/SRA), to store shared raw sequencing data. Raw data statistics are shown in Table 1:
TABLE 1 Raw sequencing data
Sample           Raw reads    Raw bases     Q20(%)   Q30(%)
FB017-0911-1     34620356     5089192332    94.85    88.84
FB032-09016-3    35246208     5181192576    95.16    89.36
FB036-09019-1    35445202     5210444694    95.11    89.36
FB056-09031      35092390     5158581330    95.1     89.31
FB076            35898166     5277030402    95.02    89.21
FB079            36945126     5467878648    95.36    89.77
FB080            29994288     4439154624    95.05    89.26
FB081            35299488     5224324224    95.53    90.09
Sample: the sample name;
Raw reads: the number of sequencing reads in each file, counted from the raw sequence data with four lines as one unit;
Raw bases: the number of sequencing reads multiplied by the read length;
Q20, Q30: the percentage of all bases whose Phred quality value is greater than 20 and 30, respectively.
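The statistics in Table 1 can be reproduced from a FASTQ file along the following lines. This is a minimal sketch: it assumes Phred+33 quality encoding and counts a base toward Q20/Q30 when its score is at least 20 or 30 (the usual convention); the function name is illustrative.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Return (reads, bases, Q20 %, Q30 %) for a FASTQ text stream."""
    reads = bases = n_qual = q20 = q30 = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # line 3: '+' separator, optionally followed by the ID
        qual = handle.readline().strip()
        reads += 1
        bases += len(seq)
        for ch in qual:
            q = ord(ch) - phred_offset  # decode the Phred score of one base
            n_qual += 1
            q20 += q >= 20
            q30 += q >= 30
    pct20 = 100.0 * q20 / n_qual if n_qual else 0.0
    pct30 = 100.0 * q30 / n_qual if n_qual else 0.0
    return reads, bases, pct20, pct30


# Hypothetical one-read file: qualities 'I', 'I', '5', '#' decode to 40, 40, 20, 2.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
print(fastq_stats(example))  # (1, 4, 75.0, 50.0)
```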
4) Quality control of raw sequencing data:
Illumina sequencing is a second-generation sequencing technology in which a single run can generate billions of reads, so the quality of each read cannot be displayed individually. Instead, the base distribution and quality fluctuation at each sequencing cycle are summarized statistically across all reads, which gives an intuitive overall picture of the sequencing quality and library construction quality of a sample.
Raw Illumina HiSeq data contain sequencing adapter sequences, low-quality reads, sequences with a high N rate and overly short sequences, all of which seriously affect the quality of subsequent assembly. To ensure accurate downstream bioinformatics analysis, the raw sequencing data are first filtered to obtain high-quality sequencing data (clean data). The steps, in order, are: (1) remove adapter sequences from the reads and remove reads that lack an insert because of adapter self-ligation or similar causes; (2) trim low-quality bases (quality value below 20) from the 3' end of each sequence, then discard the whole sequence if the remainder still contains bases with a quality value below 10, otherwise keep it; (3) remove reads in which N accounts for more than 10% of the bases; (4) discard sequences shorter than 75 bp after adapter removal and quality trimming.
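The filtering rules after adapter removal can be sketched as a single function. This is an illustration under stated assumptions, not the pipeline actually used: adapter trimming is omitted, quality values are passed as a list of integer Phred scores, and the function and parameter names are hypothetical.

```python
def clean_read(seq, quals, trim_q=20, min_q=10, max_n_frac=0.10, min_len=75):
    """Return the trimmed (seq, quals), or None if the read is discarded."""
    # Trim bases with quality below trim_q from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < trim_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # Discard the read if any remaining base has quality below min_q.
    if any(q < min_q for q in quals):
        return None
    # Discard reads whose N content exceeds max_n_frac.
    if seq and seq.count("N") / len(seq) > max_n_frac:
        return None
    # Discard reads that are too short after trimming.
    if len(seq) < min_len:
        return None
    return seq, quals


# A read with a low-quality 3' tail is trimmed from 85 bp to 80 bp and kept.
kept = clean_read("A" * 80 + "C" * 5, [30] * 80 + [5] * 5)
print(len(kept[0]))
```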
Further, in another possible embodiment, SNP identification in step three proceeds as follows:
Taking into account data characteristics, sequencing quality and experimental factors, a Bayesian model (GATK UnifiedGenotyper) is used to calculate the probability of each possible genotype given the observed data. The genotype with the highest probability is taken as the genotype of the sequenced individual at that site, a quality value reflecting the accuracy of the genotype is reported, and a consensus sequence is obtained. Based on the consensus sequence, sites that are polymorphic relative to the reference sequence are screened and filtered.
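The principle of Bayesian genotype calling can be shown with a deliberately simple diploid model. This is not GATK's model: it assumes a single per-base error rate, independent reads, only reference and alternate alleles, and flat genotype priors unless others are supplied. All names are illustrative.

```python
import math


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of 0/0, 0/1 and 1/1 given ref/alt read counts."""
    priors = priors or {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}
    # Probability of observing an alternate base under each genotype.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    log_post = {}
    for gt, p in p_alt.items():
        # Binomial likelihood; the binomial coefficient cancels on normalization.
        log_post[gt] = (math.log(priors[gt])
                        + n_alt * math.log(p)
                        + n_ref * math.log(1.0 - p))
    top = max(log_post.values())
    weights = {gt: math.exp(v - top) for gt, v in log_post.items()}
    total = sum(weights.values())
    return {gt: w / total for gt, w in weights.items()}


def call_genotype(n_ref, n_alt, **kwargs):
    """Return the most probable genotype and a Phred-scaled quality value."""
    post = genotype_posteriors(n_ref, n_alt, **kwargs)
    best = max(post, key=post.get)
    p_wrong = max(1.0 - post[best], 1e-10)  # cap to avoid log10(0)
    return best, round(-10 * math.log10(p_wrong), 1)


for counts in [(10, 0), (5, 5), (0, 10)]:
    print(counts, call_genotype(*counts))
```

The quality value plays the role of the genotype accuracy score mentioned above: the higher it is, the less likely the chosen genotype is wrong under the model.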
The main steps are:
1. convert the SAM file to a BAM file and sort the BAM file;
2. mark PCR duplicates and remove the duplicate reads;
3. filter out aligned reads with a mapping quality below 10 and index the file;
4. perform local realignment around indels;
5. call SNPs and indels with GATK;
6. filter the variant calls to obtain high-confidence variants.
The statistical format of the SNP identification results is shown in Table 2:
TABLE 2 SNP identification statistical Format
type FB017-0911-1 FB032-09016-3 FB036-09019-1 ……
all-snp 7915 10227 9896 ……
hom 5259 6859 6676 ……
het 2656 3368 3220 ……
all-indel 302 368 385 ……
deletion 144 183 182 ……
insertion 158 185 203 ……
Hom denotes a homozygous mutation, e.g. A->T; het denotes a heterozygous mutation, e.g. A->A/T; insertion and deletion denote insertion and deletion mutations, respectively.
Genome-wide SNP mutations can be divided into 6 classes. Taking T:A>C:G as an example, this class includes both T>C and A>G. Because sequencing reads align to both the forward and reverse strands of the reference genome, a T>C mutation on the forward strand corresponds to an A>G mutation at the same position on the reverse strand, so T>C and A>G are grouped into one class.
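The strand-collapsing rule can be expressed in a few lines. In this sketch, substitutions whose reference base is a purine (A or G) are complemented so that every class is written from the pyrimidine strand, which reduces the 12 possible substitutions to 6 classes; the function name is illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mutation_class(ref: str, alt: str) -> str:
    """Map a single-base substitution to one of the 6 strand-independent classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in "AG":
        # Express the change on the opposite strand so that ref is T or C.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


print(mutation_class("T", "C"))  # T:A>C:G
print(mutation_class("A", "G"))  # T:A>C:G
print(mutation_class("G", "A"))  # C:G>T:A
```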
SNP annotation: ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected in a variety of genomes. Given the chromosome, start position, end position, reference nucleotide and variant nucleotide of a variant, ANNOVAR can annotate it. Because of its powerful annotation functions and international acceptance, ANNOVAR was used to annotate the SNP detection results. Statistics of the SNP annotation results are shown in Table 3, and statistics of the small indel annotation results are shown in Table 4:
TABLE 3 SNP annotation results
type FB017-0911-1 FB032-09016-3 FB036-09019-1 ……
UTR3 111 120 133 ……
UTR5 132 129 148 ……
downstream 190 204 287 ……
exonic 2207 3222 2976 ……
exonic;splicing 0 0 0 ……
intergenic 3735 4761 4688 ……
intronic 1227 1469 1299 ……
splicing 9 8 7 ……
upstream 241 254 290 ……
TABLE 4 results of small indel annotation
type FB017-0911-1 FB032-09016-3 FB036-09019-1 ……
UTR3 10 13 15 ……
UTR5 17 9 11 ……
downstream 13 13 20 ……
exonic 37 48 47 ……
intergenic 136 172 181 ……
intronic 60 80 71 ……
splicing 2 3 4 ……
upstream 23 26 32 ……
For detailed descriptions of the entries in the tables above, see:
http://www.openbioinformatics.org/annovar/annovar_gene.html
Sample columns: sample names.
Upstream: the variant lies in the 1 Kb region upstream of a gene.
Exonic: the variant lies in an exon; missense: non-synonymous variant; stop gain: variant that creates a stop codon in the gene; stop loss: variant that removes a stop codon from the gene; synonymous: synonymous variant.
Intronic: the variant lies in an intron.
Splicing: the variant lies at a splice site (within 2 bp of the exon/intron boundary, in the intron).
Downstream: the variant lies in the 1 Kb region downstream of a gene.
Upstream;downstream: the variant lies in the 1 Kb region upstream of one gene and the 1 Kb region downstream of another gene.
Intergenic: the variant lies in an intergenic region.
For SNP and small indel sites in the CDS region, the effect of the mutation site on protein translation is also annotated. Statistics of the effects of CDS-region mutation sites on protein translation (SNPs) are shown in Table 5:
TABLE 5 Effects of CDS-region mutation sites on protein translation (SNPs)
type FB017-0911-1 FB032-09016-3 FB036-09019-1 ……
nonsynonymous SNV 888 1292 1201 ……
stopgain SNV 23 33 24 ……
stoploss SNV 2 3 3 ……
synonymous SNV 1294 1894 1748 ……
Statistics of the effects of CDS-region mutation sites on protein translation (small indels) are shown in Table 6, and the meanings of the degenerate base codes are shown in Table 7:
TABLE 6 Effects of CDS-region mutation sites on protein translation (small indels)
type FB017-0911-1 FB032-09016-3 FB036-09019-1 ……
frameshift deletion 10 19 14 ……
frameshift insertion 15 18 19 ……
nonframeshift deletion 4 3 6 ……
nonframeshift insertion 8 8 7 ……
stopgain SNV 0 0 1 ……
TABLE 7 Meanings of degenerate (mixed) base codes
Code    Bases
V       A+C+G
D       A+T+G
B       T+C+G
H       A+T+C
W       A+T
S       C+G
K       T+G
M       A+C
Y       C+T
R       A+G
N       A+G+C+T
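Table 7 maps directly onto a lookup table, which is convenient when a heterozygous genotype has to be written as a single character in a consensus sequence. In the sketch below the four unambiguous bases are added for completeness, and the function name is illustrative.

```python
# Degenerate base codes from Table 7, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("ACT"): "H", frozenset("CGT"): "B",
    frozenset("AGT"): "D", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(*bases: str) -> str:
    """Return the single-letter code for the given set of bases."""
    return IUPAC[frozenset(b.upper() for b in bases)]


print(iupac_code("A", "T"))  # W: heterozygous A/T site
print(iupac_code("G", "A"))  # R
print(iupac_code("C", "C"))  # C: homozygous site keeps its base
```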
Further, to illustrate the feasibility of the method of the present invention for identifying broad bean SNPs using simplified genome sequencing data, the results obtained with the method were analyzed statistically as follows:
1) Sequencing data statistics:
In this example, 8 broad bean germplasms were sequenced on the Illumina HiSeq platform, yielding 35.47 Gb of data in total, comprising 245443516 reads with an average length of 144 bp. Across the 8 germplasms, the amount of sequencing data ranged from 3.83 Gb to 4.77 Gb, with a mean of 4.43 Gb; the number of reads ranged from 26415662 to 32822210, with a mean of 30680439.5; Q20 and Q30 exceeded 97.89% and 93.83%, respectively, and GC content ranged from 38.05% to 40.09%. Sequencing data statistics for the 8 germplasms are shown in Table 8:
TABLE 8 Sequencing data of the 8 germplasms
[Table 8: image BDA0002789630510000101 in the original publication]
2) Broad bean whole genome SNP identification:
In this example, 3722 genomic SNPs were identified using the reference-genome-free, Bayesian-algorithm-based method previously used for SNP identification in bottle gourd. Statistics of the SNP identification information for the 8 materials are shown in Table 9:
TABLE 9 SNP identification information
[Table 9: image BDA0002789630510000111 in the original publication]
On a single germplasm, the number of SNPs identified in FB076 was the least, 3278, and the number of SNPs identified in FB079 was the most, reaching 3578. The number of homozygous SNP mutations varied from 1579 to 2033, the number of heterozygous SNP mutations varied from 1245 to 1804 in 8 germplasm, with the exception of FB080 and FB056, which were greater than the number of heterozygous SNP mutations in most germplasm (table 9).
Of the 6 SNP mutation types, the T: A- > C: G mutation type accounts for the largest proportion (average 38.8%), followed by C: G- > T: A (average 28.0%), and the T: A- > A: T (average 7.50%) with the smallest occurrence proportion, and the statistics of SNP mutation patterns are shown in Table 10:
TABLE 10 SNP mutation patterns
Figure BDA0002789630510000112
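The grouping of SNPs into 6 mutation types can be illustrated with a short sketch. It assumes each SNP is given as a pair of bases (first allele, second allele) and folds complementary substitutions together by rewriting every change so that the first base is a pyrimidine, which produces labels in the same T:A->C:G style used above. The function names and the toy SNP list are hypothetical, and how allele direction is assigned without a reference genome is not addressed here.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(first, second):
    """Return one of six strand-collapsed classes, e.g. 'T:A->C:G'."""
    first, second = first.upper(), second.upper()
    if first not in COMPLEMENT or second not in COMPLEMENT or first == second:
        raise ValueError(f"not a valid SNP: {first}>{second}")
    if first in "AG":
        # Express the change on the opposite strand so it starts from C or T.
        first, second = COMPLEMENT[first], COMPLEMENT[second]
    return f"{first}:{COMPLEMENT[first]}->{second}:{COMPLEMENT[second]}"


def class_proportions(snps):
    """Percentage of SNPs falling into each mutation class."""
    counts = Counter(mutation_class(a, b) for a, b in snps)
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Toy data only, not taken from Table 10.
toy_snps = [("T", "C"), ("A", "G"), ("C", "T"), ("A", "T")]
props = class_proportions(toy_snps)
```

Here A>G and T>C fall into the same class because they are the same change read from opposite strands.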
3) SNP validation
To verify the validity of the SNPs, the SNPs were filtered according to the criteria of missing rate ≤ 20%, MAF ≥ 0.05 and SNP occurrence frequency ≥ 40, and 56 SNPs were then selected for KASP marker development; 31 SNPs were finally converted into KASP markers, a conversion success rate of 55.3%. The 31 pairs of KASP markers developed in this example are shown in Table 11:
Table 11. 31 pairs of KASP markers
Figure BDA0002789630510000113
Figure BDA0002789630510000121
Figure BDA0002789630510000131
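The filtering criteria applied before KASP marker development (missing rate ≤ 20%, MAF ≥ 0.05, SNP occurrence frequency ≥ 40) can be sketched as a simple predicate. The record layout, the key names `missing_rate`, `maf` and `occurrences`, and the toy candidates are assumptions made for illustration only; the actual data format used in this example is not specified.

```python
def passes_filter(snp, max_missing=0.20, min_maf=0.05, min_occurrences=40):
    """Apply the three thresholds stated in the text to one SNP record."""
    return (
        snp["missing_rate"] <= max_missing
        and snp["maf"] >= min_maf
        and snp["occurrences"] >= min_occurrences
    )


# Toy candidates only (hypothetical identifiers and values).
candidates = [
    {"id": "snp_a", "missing_rate": 0.10, "maf": 0.25, "occurrences": 52},
    {"id": "snp_b", "missing_rate": 0.30, "maf": 0.25, "occurrences": 52},  # too much missing data
    {"id": "snp_c", "missing_rate": 0.10, "maf": 0.01, "occurrences": 52},  # MAF too low
    {"id": "snp_d", "missing_rate": 0.20, "maf": 0.05, "occurrences": 40},  # exactly on the thresholds
]
selected = [snp["id"] for snp in candidates if passes_filter(snp)]
```

The comparisons are inclusive, matching the "less than or equal to" and "greater than or equal to" wording of the criteria.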
Genotyping of 46 broad bean germplasm resources with the 31 pairs of KASP markers showed that 22 pairs of markers gave successfully amplified signals: 14 pairs detected a single genotype signal, 4 pairs showed 2 genotype signals, and 4 pairs showed 3 genotype signals, as shown in FIG. 1. FIG. 1 shows the 4 types of amplification signal of the KASP markers in this example: A, failed amplification; B, single genotype; C, 2 genotypes; D, 3 genotypes.
With the rapid decrease in sequencing costs, genome-wide SNP identification by genome re-sequencing has been widely applied in a variety of crops. Because broad bean has a large genome and no reference genome is currently available, SNP mining in broad bean lags behind that in other leguminous crops such as soybean, kidney bean and cowpea.
In this example, RAD-Seq data from 8 germplasms were used to identify 3722 SNP markers. Unlike Ocaña et al. (Ocaña S, Seoane P, Bautista R et al. Large-Scale Transcriptome Analysis in Faba Bean (Vicia faba L.) under Ascochyta fabae Infection. PLoS ONE, 2015, 10(8): e013514) and Webb et al. (Webb A, Cottage A, Wood T, et al. A SNP-based consensus genetic map for synteny-based trait targeting in faba bean (Vicia faba L.) [J]. Plant Biotechnology Journal, 2016, 14: 177-185), who mined SNPs from transcriptome sequencing data, this example mines SNPs from genome sequencing data, so that SNP mutations can be identified not only in expressed gene regions but also in non-coding regions such as introns and intergenic regions, giving a richer source of SNPs.
The foregoing is directed to the preferred embodiment of the present invention and it is noted that while the preferred embodiment of the present invention has been described, numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the invention and without departing from the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (10)

1. A method for identifying broad bean SNPs using simplified genome sequencing data, characterized by comprising the following steps:
step one, taking young leaves of broad bean seedlings as a sample, grinding them in liquid nitrogen, and extracting the sample genomic DNA;
step two, constructing a RAD library: digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing;
step three, generating raw sequences, performing sequence quality control analysis, removing sequences shorter than 85 bp to obtain screened sequences, clustering the screened sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags to perform SNP calling, and correcting the SNP genotypes;
step four, selecting a sequence with a total length of 100 bp that covers the SNP locus, designing KASP primers such that each KASP marker comprises two forward primer sequences that distinguish the SNP alleles and one universal reverse primer sequence, and performing SNP genotyping for SNP signal detection.
2. The method for identifying SNP of broad beans by using simplified genome sequencing data as set forth in claim 1, wherein the young leaves of broad bean seedlings in the first step are young leaves of broad bean seedlings which grow for 1 week.
3. The method for identifying faba bean SNPs using simplified genomic sequencing data according to claim 1, wherein the step one, after extracting genomic DNA from the sample, further comprises:
the quality of the extracted sample genomic DNA was checked by agarose gel electrophoresis, and the concentration of the extracted sample genomic DNA was checked using a NanoDrop2000 ultramicro spectrophotometer.
4. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step two, after the genomic DNA is digested with EcoRI, an 'A' is added to the 3' end of the digested fragments for ligation to the MID adapter; and the single-end sequencing uses the Illumina HiSeq2000 platform.
5. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step three, the raw sequences are generated by the Illumina base calling software CASAVA v1.8.2, and the sequence quality control analysis is performed using Trimmomatic software with default parameters.
6. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step three, the screened sequences are clustered by sequence similarity using ustacks software to generate RAD-tags, the RAD-tags are clustered using cstacks software with default parameters to perform SNP calling, and finally the SNP genotypes are corrected using a Bayesian algorithm.
7. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, the KASP primers are designed using Kraken™ software.
8. The method for identifying faba bean SNPs using simplified genomic sequencing data as claimed in claim 1, wherein in step four, SNP genotyping is performed using IntelliQube high throughput genotyping detection platform for SNP signal detection.
9. The method for identifying faba bean SNPs using simplified genomic sequencing data according to claim 1, wherein in the fourth step, when SNP genotyping is performed, the single-spot reaction volume is 1.6 μ L, wherein the sample DNA is 0.8 μ L, and the mixed volume of the 2xMaster mix and the Primer mix is 0.8 μ L.
10. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, when SNP genotyping is performed for SNP signal detection, the PCR amplification program is: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s; and 26 cycles of 94 °C for 20 s and 55 °C for 60 s.
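The touchdown PCR program recited in claim 10 can be written out step by step as a rough sketch. The claim gives only the 61-55 °C range over 10 cycles; the even decrement of 0.6 °C per cycle used below is an assumption, as are the function name and the tuple layout (step name, temperature in °C, duration in seconds).

```python
def kasp_touchdown_program(start_temp=61.0, end_temp=55.0,
                           touchdown_cycles=10, final_cycles=26):
    """Expand the cycling conditions into an explicit list of steps."""
    # Assumed: the annealing temperature drops by an equal amount each cycle.
    step = (start_temp - end_temp) / touchdown_cycles
    program = [("initial denaturation", 95.0, 15 * 60)]
    for i in range(touchdown_cycles):
        program.append(("denaturation", 94.0, 20))
        program.append(("annealing/extension", round(start_temp - i * step, 1), 60))
    for _ in range(final_cycles):
        program.append(("denaturation", 94.0, 20))
        program.append(("annealing/extension", end_temp, 60))
    return program


program = kasp_touchdown_program()
total_minutes = sum(duration for _, _, duration in program) / 60
```

Under this assumption the annealing temperature starts at 61.0 °C in the first touchdown cycle and is held at 55 °C for the 26 remaining cycles; ramp times between steps are ignored in `total_minutes`.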
CN202011310367.5A 2020-11-20 2020-11-20 Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data Pending CN112553361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310367.5A CN112553361A (en) 2020-11-20 2020-11-20 Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310367.5A CN112553361A (en) 2020-11-20 2020-11-20 Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data

Publications (1)

Publication Number Publication Date
CN112553361A true CN112553361A (en) 2021-03-26

Family

ID=75044213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310367.5A Pending CN112553361A (en) 2020-11-20 2020-11-20 Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data

Country Status (1)

Country Link
CN (1) CN112553361A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060041955A1 (en) * 2004-08-23 2006-02-23 Pioneer Hi-Bred International, Inc. Marker mapping and resistance gene associations in soybean
CN101701255A (en) * 2009-11-13 2010-05-05 Chinese Academy of Inspection and Quarantine Primers and method for identification of broad bean by PCR
US20150322447A1 (en) * 2012-07-06 2015-11-12 Bayer Cropscience Nv Soybean rod1 gene sequences and uses thereof
CN106755328A (en) * 2016-11-25 2017-05-31 Institute of Crop Sciences, Chinese Academy of Agricultural Sciences A kind of construction method of broad bean SSR finger-prints
CN110139872A (en) * 2016-12-21 2019-08-16 Institute of Crop Sciences, Chinese Academy of Agricultural Sciences Plant seed character-related protein, gene, promoter and SNP and haplotype

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANNE WEBB等: "A SNP-based consensus genetic map for synteny-based trait targeting in faba bean (Vicia faba L.)", 《PLANT BIOTECHNOLOGY JOURNAL》 *
ANNE WEBB等: "A SNP-based consensus genetic map for synteny-based trait targeting in faba bean (Vicia faba L.)", 《PLANT BIOTECHNOLOGY JOURNAL》, vol. 14, no. 1, 10 April 2015 (2015-04-10), pages 177 - 185 *
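LIU Tingfu et al.: "Identification of broad bean SNPs using simplified genome sequencing data", 《Molecular Plant Breeding》 *
LIU Tingfu et al.: "Identification of broad bean SNPs using simplified genome sequencing data", 《Molecular Plant Breeding》, 13 November 2020 (2020-11-13), pages 1 - 9 *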

Similar Documents

Publication Publication Date Title
Yang et al. Target SSR-Seq: a novel SSR genotyping technology associate with perfect SSRs in genetic analysis of cucumber varieties
Lee et al. Young inversion with multiple linked QTLs under selection in a hybrid zone
US9976191B2 (en) Rice whole genome breeding chip and application thereof
CN102747138B (en) Rice whole genome SNP chip and application thereof
Ryu et al. Genotyping-by-sequencing based single nucleotide polymorphisms enabled Kompetitive Allele Specific PCR marker development in mutant Rubus genotypes
CN117144040B (en) Fresh corn genotyping chip and application thereof
CN116004898A (en) Peanut 40K liquid-phase SNP chip PeannitGBTS 40K and application thereof
Mishra et al. Analysis of SSR and SNP markers
Gaur et al. A high-density SNP-based linkage map using genotyping-by-sequencing and its utilization for improved genome assembly of chickpea (Cicer arietinum L.)
CN115992265A (en) Grouper whole genome liquid phase chip and application thereof
CN110959178B (en) Systems and methods for targeted genome editing
CN111916151B (en) Traceability detection method and application of verticillium wilt of alfalfa
CN117457075B (en) A method for identifying oil-tea camellia varieties
CN118064428A (en) MNP molecular marker combination and method for constructing DNA fingerprint of rubber tree
CN112553361A (en) Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data
KR101911307B1 (en) Method for selecting and utilizing tag-SNP for discriminating haplotype in gene unit
CN116732153A (en) A method and application for whole-genome genotype identification of very large genome species
CN113718342A (en) Construction method of high-density genetic map of recombinant inbred line population
Bello et al. Genetic Diversity Analysis of Selected Sugarcane (Saccharum spp. Hybrids) Varieties Using DArT-Seq Technology.
CN118240948B (en) Identification method and application of genetic relationship of Litopenaeus vannamei based on targeted sequencing typing
CN116622881B (en) Tobacco whole genome SNP locus combination, probe, chip and application thereof
Dong et al. The mutational dynamics of the Arabidopsis centromeres
CN117904317B (en) SNP molecular marker combination for detecting propagation traits of Nile-Lafei buffalo and application
CN119685521B (en) Liquid phase chip detection method for leymus chinensis variety identification
Li et al. An initial exploration of core collection construction and DNA fingerprinting in Elymus sibiricus L. using SNP markers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210326