Detailed Description
The technical scheme of the invention is further elaborated below by referring to the drawings in the specification and the specific embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the scope of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In the following description, reference is made to the expression "some embodiments" which describe a subset of all possible embodiments, but it should be understood that "some embodiments" may be the same subset or a different subset of all possible embodiments and may be combined with each other without conflict.
Gene sequencing refers to analyzing the base sequence of the DNA fragments in the data to be tested, i.e., the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). At present, a fluorescent labeling method is commonly used for gene sequencing: the optical system of the gene sequencer uses a laser to excite the fluorescent labels on a sequencing chip to generate fluorescence and collects the fluorescence signals, and because the four bases are combined with different fluorescent labels, four different fluorescence wave bands are produced, so that the bases can be identified.
In second generation sequencing technology, taking an Illumina sequencer as an example, fluorescent molecules with different emission wavelengths emit fluorescent signals of the corresponding wavelengths when irradiated by the laser. After laser irradiation, light of non-specific wavelengths is selectively filtered out by an optical filter, so that fluorescent signals of specific wavelengths are obtained, and the base type can be identified by analyzing these fluorescent signals. The workflow mainly comprises sample preparation, cluster generation, sequencing and data analysis.
Sample preparation: the DNA sample to be sequenced is subjected to extraction and purification, followed by DNA fragmentation and adapter ligation. In alternative examples, the DNA sample is typically cleaved into smaller DNA fragments using ultrasound or restriction enzymes. Then an adapter comprising a specific sequence for the subsequent ligation and sequencing reactions is ligated to both ends of each DNA fragment.
Cluster generation: the DNA fragments are amplified and immobilized so that each DNA fragment can later form a base cluster. In an alternative example, the DNA fragments are amplified by polymerase chain reaction (Polymerase Chain Reaction, PCR), bridge amplification or the like so that millions of replicas of each DNA fragment are formed, and the amplified DNA fragments are immobilized on a fixation plate, where each DNA fragment forms a separate cluster.
Sequencing refers to reading each base cluster on the Flowcell. During sequencing, fluorescently labeled dNTPs and sequencing primers are added; one end of each dNTP carries an azide group that blocks further polymerization while the sequenced strand extends, which ensures that only one base is added per cycle and that one sequencing read is generated accordingly, i.e., sequencing while synthesizing. In one cycle, a base is identified for each base cluster by the fluorescently labeled dNTPs: the sequencing signal responses of the different base types correspond to fluorescent signals of specific colors, and the base of each base cluster in the current cycle can be judged from the fluorescence color emitted under laser scanning. In one cycle, tens of millions of base clusters are sequenced simultaneously on a Flowcell; one fluorescent spot represents the fluorescence emitted by one base cluster, and one base cluster corresponds to one read in the fastq file. In the sequencing stage, fluorescent images of the Flowcell surface are shot by an infrared camera, the fluorescent images are subjected to image processing and fluorescent spot localization to detect base clusters, and a template is constructed from the base cluster detection results of the multiple fluorescent images corresponding to the sequencing signal responses of the different base types, giving the positions of all base cluster template spots (clusters) on the Flowcell. Fluorescence intensities are then extracted from the filtered images according to the template and corrected, and finally a score is calculated from the maximum intensity at the template spot of each base cluster to output a fastq base sequence file. Please refer to fig. 5, which shows a schematic diagram of the Flowcell (fig. 5 (a)), a fluorescence image captured for the corresponding portion of the Flowcell in one cycle (fig. 5 (b)), and a schematic diagram of the sequencing results in the fastq file (fig. 5 (c)).
The gene sequencer can also comprise an optical platform, which can comprise an operation table and a camera, with the sequencing chip arranged on the operation table. The gene sequencer uses a laser to excite the fluorescent markers on the sequencing chip to generate fluorescence and collects the fluorescent signals; because the four bases are combined with different fluorescent markers, four different fluorescent wave bands are produced, i.e., fluorescence images of the four base types. The sequencing chip is photographed by the camera, and a fluorescent image of the fluorescent signals generated on the sequencing chip is captured on a charge-coupled device (CCD); one fluorescent image contains a plurality of fluorescent spots, and each fluorescent spot represents the fluorescence emitted by one base cluster.
The imaging mode of the gene sequencer can be a four-channel imaging system or a two-channel imaging system. For a two-channel imaging system, each camera needs to be exposed twice at the same location of the sequencing chip. For a four-channel imaging system, the camera of each channel shoots once at the same position of the sample, and fluorescent images of the four base types are respectively obtained, for example, a fluorescent image of the A base type, a fluorescent image of the C base type, a fluorescent image of the G base type and a fluorescent image of the T base type. Since light of non-specific wavelengths is selectively filtered out by the optical filter after laser irradiation to obtain the fluorescent signal of a specific wavelength, each base type corresponds to a different fluorescent signal. In the same cycle (Cycle) reaction, a base cluster of a given type emits, in its corresponding channel, light far brighter than that of the other base types, and the base clusters lighting up in the different channels theoretically do not overlap.
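As an illustrative, non-limiting sketch (not part of the claimed method), the following Python snippet shows how, in a four-channel system, a provisional base could in principle be read out by comparing the extracted intensities of the four channel images at each cluster position; the array shapes and function names are assumptions introduced for illustration only.

```python
import numpy as np

# Minimal illustrative sketch: each base cluster is far brighter in the channel
# matching its base type, so a naive call compares the four extracted intensities.
BASES = np.array(["A", "C", "G", "T"])

def naive_base_call(channel_images, cluster_rows, cluster_cols):
    """channel_images: array of shape (4, H, W), one image per base channel.
    cluster_rows / cluster_cols: pixel coordinates of detected cluster centers."""
    # Intensity of every cluster in every channel, shape (4, n_clusters).
    intensities = channel_images[:, cluster_rows, cluster_cols]
    # The brightest channel gives the provisional base type for each cluster.
    return BASES[np.argmax(intensities, axis=0)]

# Example: three clusters in four 8x8 channel images.
rng = np.random.default_rng(0)
imgs = rng.random((4, 8, 8))
print(naive_base_call(imgs, np.array([1, 3, 5]), np.array([2, 4, 6])))
```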
After the fluorescent images are obtained by the gene sequencer, the collected images are subjected to gene image reconstruction, gene image registration and base recognition (basecall), so that the gene sequence is obtained.
The gene image reconstruction is used to increase the resolution and sharpness of the fluorescent image so as to reduce cross-talk effects between samples. Gene image reconstruction includes, but is not limited to, conventional operations such as deconvolution.
Gene image registration corrects the fluorescent images of the four base types so that they can be overlapped and the fluorescence brightness of the four channels can be extracted at the same position, thereby facilitating the subsequent base identification. Gene image registration includes, but is not limited to, registration of images of the same channel and global or local affine registration.
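As a minimal sketch of one very simple registration step, assuming the misalignment between two channel images is a pure translation (the affine registration mentioned above would additionally estimate rotation and scale), the following Python snippet estimates the shift by FFT phase correlation; all names are illustrative.

```python
import numpy as np

def estimate_shift(reference, moving):
    """Estimate the (dy, dx) translation of `moving` relative to `reference`
    by phase correlation; rolling `reference` by (dy, dx) approximates `moving`."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    cross_power = np.conj(f_ref) * f_mov
    cross_power /= np.abs(cross_power) + 1e-12   # keep only the phase information
    correlation = np.abs(np.fft.ifft2(cross_power))
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Wrap large indices back to negative shifts.
    h, w = reference.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx

# Example: a synthetic image shifted by (3, -2) pixels.
rng = np.random.default_rng(1)
ref = rng.random((64, 64))
mov = np.roll(ref, shift=(3, -2), axis=(0, 1))
print(estimate_shift(ref, mov))   # expected approximately (3, -2)
```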
The base recognition process judges, from the registered images, which of the A, C, G and T bases each base cluster in the image belongs to. After base recognition, the data to be detected are converted from digital images into sequence information of the four bases A, C, G and T, i.e., the DNA sequence result of the sample, for subsequent analysis and evaluation.
Data analysis: analysis and interpretation of the sequencing data are performed based on the image data and the sequence information; the sequence information is aligned with the reference genome for mutation identification.
The process of sequencing one piece of data to be tested is called one Run, and it consists of a plurality of cycles (Cycles), where one cycle corresponds to one reaction period, i.e., to the identification of one base on the sequencing chip; sequencing is performed while synthesizing (sequencing by synthesis). In one cycle, several tens of millions of base clusters are sequenced simultaneously.
One piece of test data includes a plurality of DNA fragments, and one base is added to each DNA fragment in each round of the above-mentioned sequencing, so the length of the DNA base sequence of the test data determines the number of cycles. In each cycle, the gene sequencer obtains one fluorescence image for each of the four base types A, C, G and T, so that when the data to be tested are sequenced, the gene sequencer obtains the fluorescence images of the ACGT channels over a plurality of cycles.
It should be noted that the foregoing describes a sequencing procedure using Illumina sequencing technology as an example of massively parallel sequencing (MPS): the DNA molecules to be detected are amplified by a specific amplification technology, each DNA fragment (single-stranded library molecule) is amplified to form a base cluster, and the template points of the base clusters on the sequencing chip are constructed according to the detection results of the base clusters, so that operations such as base recognition can subsequently be performed according to the template points of the base clusters, thereby improving the efficiency and accuracy of base recognition. It can be understood that the base recognition method based on fluorescently labeled dNTP gene sequencing provided in the embodiments of the present application is based on the positioning detection and base type recognition of the base clusters formed after the single-stranded library molecules are amplified on the sequencing chip, where each base cluster refers to a base signal acquisition unit; it is therefore not limited to the amplification technology adopted for the single-stranded library molecules. That is, the base recognition method based on semi-supervised learning provided in the embodiments of the present application is also applicable to the positioning detection and base type recognition of base signal acquisition units on the sequencing chip in other massively parallel sequencing technologies. For example, the base signal acquisition unit may refer to a base cluster obtained by the bridge amplification technology in Illumina sequencing technology, or to a nanosphere obtained by the rolling circle amplification technology (Rolling Circle Amplification, RCA), and the present application is not limited thereto. In the following examples, for ease of understanding, a base cluster is used as the example of the base signal acquisition unit.
Referring to fig. 6, a flowchart of a base recognition method based on semi-supervised learning according to an embodiment of the present application is provided. The base recognition method based on semi-supervised learning is applied to a gene sequencer, and comprises the following steps:
S11, obtaining fluorescent images to be detected corresponding to the base signal acquisition units of various base types on the sequencing chip, and forming input image data to be detected based on the fluorescent images to be detected.
In this embodiment, the fluorescent images to be detected include fluorescent images corresponding to a plurality of base types. The fluorescent images to be detected may be fluorescent images collected in one cycle or fluorescent images collected under multiple cycles.
The sequencing process of one gene sample to be tested is called one Run. The gene sample to be tested is broken into M base sequences to be tested, which may also be called short strands, and each base sequence to be tested comprises N bases. In one cycle, the sequencing reaction is carried out on the sequencing chip for the base clusters at the top ends of the M short strands; each base cluster being sequenced corresponds to one position on the sequencing chip, and in one cycle tens of millions of base clusters are sequenced simultaneously. N determines the number of cycles: the greater N is, the greater the number of cycles. Under the different cycles, the bases in the M base sequences to be tested are sequenced in turn. For example, if a gene sample to be tested is broken into tens of thousands of short strands, each of which is 100 bases in length, then 100 cycles of sequencing reactions are required to identify the base types; in each cycle, the base clusters at the top ends of the tens of thousands of short strands are subjected to a sequencing reaction on the sequencing chip.
During the sequencing reaction, base clusters of different types on the sequencing chip are each connected with one of the different fluorescent markers. In one cycle, the gene sequencer uses the laser to excite the fluorescence on the sequencing chip to emit fluorescent signals, and the camera of the gene sequencer captures a fluorescent image of the target position area on the sequencing chip corresponding to the shooting field of view under that cycle. In each cycle, the camera of the gene sequencer photographs once, and fluorescent images corresponding to the various base types are obtained, such as a fluorescent image of the A base type, a fluorescent image of the C base type, a fluorescent image of the G base type and a fluorescent image of the T base type. For example, if the imaging system of the gene sequencer is in the four-channel imaging mode, fluorescent images of the four base types can be obtained by photographing once within the field-of-view range of one cycle. For example, a gene sample to be tested is broken into tens of thousands of short strands; under one cycle, the camera of the gene sequencer adjusts its field of view and captures a fluorescent image of the base clusters at the top ends of the tens of thousands of short strands on the sequencing chip corresponding to that field of view, and since one base cluster corresponds to one read, there are tens of thousands of reads at this moment.
S12, taking the input image data to be detected as the input of a trained base recognition model, and outputting a base recognition result of the input image data to be detected through the trained base recognition model, wherein the trained base recognition model is a model obtained by semi-supervised learning training based on a training data set.
In this embodiment, the trained base recognition model is a model obtained by performing semi-supervised learning training based on the training data set. The training data set comprises training samples collected under multiple cycles. Each training sample comprises sample fluorescent images corresponding to multiple base types and a base type label map corresponding to the sample fluorescent images, and the training labels corresponding to each training sample further comprise a first mask map corresponding to the sample fluorescent images and a second mask map corresponding to the sample fluorescent images. The first mask map is used to mark the positions of the base signal acquisition units with a base type label in the sample fluorescent images; the second mask map is used to mark the positions of the base signal acquisition units without a base type label in the sample fluorescent images.
The sample fluorescent images corresponding to the base types in one cycle comprise a sample fluorescent image of the A base type, a sample fluorescent image of the C base type, a sample fluorescent image of the G base type and a sample fluorescent image of the T base type. The base type label map is used to identify the base type labels of the base clusters at the positions in the sample fluorescence images corresponding to the first mask map. The image sizes of the first mask map and the second mask map are the same as the size of the sample fluorescent images. For one gene sequencing run on the same sequencing chip, the first mask maps corresponding to the sample fluorescence images under the multiple cycles are the same, and the second mask maps corresponding to the sample fluorescence images under the multiple cycles are the same. For example, if 300,000 short strands are sequenced simultaneously during one gene sequencing process, where each short strand has a length of 100 bases, then the first mask maps corresponding to the sample fluorescence images collected under the 100 cycles are the same, and the second mask maps corresponding to the sample fluorescence images collected under the 100 cycles are the same.
The first mask map is used to mark the positions of the base clusters in the sample fluorescent image that already have real base type labels; for example, in the first mask map, the positions of base clusters with real base type labels are marked "1" and the other positions are 0. The second mask map is used to mark the positions of base clusters in the sample fluorescence image that do not have a real base type label; for example, in the second mask map, the positions of base clusters without a real base type label are marked "1" and the other positions are 0. When the base recognition model is trained, the data with real base type labels and the data without real base type labels in the training sample can thus be distinguished through the first mask map and the second mask map: for the data with real base type labels, training can take the real base type labels as the training targets, while for the data without real base type labels, label-free training can be performed, so that the two kinds of data are combined to train the base recognition model, realizing semi-supervised learning. The data with real base type labels usually contain features under various conditions, but some acquired data may suffer information loss, so the base clusters at some positions have no real base type label; by introducing data without real base type labels into the training set, the model can learn more information, adapt better to various conditions and improve its generalization performance.
For example, as shown in FIG. 7, in the 4×4 sample fluorescence image there are base clusters at positions (1, 3), (2, 1), (2, 3), (3, 2), (3, 4) and (4, 2) of the base cluster position map, and the other positions are background. The first mask map, the sample fluorescence image and the second mask map are all 4×4 matrix images. The first mask map is marked "1" at positions (1, 3), (2, 1), (2, 3) and (4, 2), indicating that the input sample fluorescence image has real base type labels at these positions. Base recognition is performed on the input sample fluorescent image to obtain an output image representing the base types; after the output image is processed by the first mask map, base recognition results are retained at the four positions (1, 3), (2, 1), (2, 3) and (4, 2), where, in the output image processed by the first mask map, 1 represents the A base type, 2 represents the C base type, 3 represents the G base type and 4 represents the T base type. In the second mask map, positions (3, 2) and (3, 4) are marked "1" and the other positions are marked "0"; the base clusters at the positions marked "1" have no real base type labels. The non-base-cluster positions in both the first mask map and the second mask map are 0, and the background values at non-base-cluster positions do not participate in the calculation.
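As a non-limiting sketch that reproduces the 4×4 example above in code (the concrete model-output values are hypothetical and introduced only for illustration), the following Python snippet shows how the two mask maps keep only the positions relevant to each part of the training:

```python
import numpy as np

# 1-based positions from the example above, converted to 0-based indices.
first_mask = np.zeros((4, 4), dtype=int)
second_mask = np.zeros((4, 4), dtype=int)
for r, c in [(1, 3), (2, 1), (2, 3), (4, 2)]:        # clusters with a real label
    first_mask[r - 1, c - 1] = 1
for r, c in [(3, 2), (3, 4)]:                        # clusters without a real label
    second_mask[r - 1, c - 1] = 1

# Hypothetical per-pixel base predictions: 1=A, 2=C, 3=G, 4=T, 0=background.
model_output = np.array([[0, 0, 2, 0],
                         [4, 0, 1, 0],
                         [0, 3, 0, 1],
                         [0, 2, 0, 0]])

labeled_calls = model_output * first_mask      # compared with the base type label map
unlabeled_calls = model_output * second_mask   # used for the consistency loss
print(labeled_calls)
print(unlabeled_calls)
```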
When the base recognition model is trained, training samples collected under multiple cycles are used for training and learning, so that when the base recognition model predicts base recognition results it has learned the brightness relationships of the sample images under different cycles, which improves the adaptability of the model to early or delayed reactions under different cycles. Each training sample comprises fluorescent images corresponding to a plurality of base channels, so that the base recognition model learns the brightness relationships between different base channels, which improves the adaptability of the model to brightness crosstalk between base channels. The base clusters with real base type labels in the sample fluorescent images are marked through the first mask map, and the base clusters without real base type labels are marked through the second mask map, so that semi-supervised learning can be realized when the base recognition model is trained: training on the data with real base type labels takes the real base type labels as training targets, which makes the model focus on the features of the labeled data and accelerates model convergence. Meanwhile, the base clusters without real base type labels make the model attend to the features of unlabeled data during training and learn the more diversified features of label-free data, which helps the model better understand and generalize to different conditions, so that the model can better balance the training data and the generalization requirements and the risk of overfitting is reduced. Moreover, base clusters without real base type labels can be incorporated into the training samples through the second mask map, which increases the scale of the training samples.
In some embodiments, the method further comprises:
acquiring a training data set;
acquiring training samples from the training data set as input training samples, processing the input training samples based on different data enhancement modes to obtain a plurality of groups of processed training samples corresponding to the input training samples, and forming a plurality of groups of input data corresponding to the input training samples based on the plurality of groups of processed training samples corresponding to the input training samples;
constructing an initial base recognition model, taking a plurality of groups of input data corresponding to the input training sample as the input of the base recognition model respectively to obtain base recognition data corresponding to each group of input data, and carrying out iterative training on the initial base recognition model through the training data set until a loss function converges to obtain the trained base recognition model;
wherein the loss function comprises:
a first loss function for calculating a first loss value between the base identification data corresponding to each set of adjusted input data and the base type label map corresponding to the input training sample, wherein the base identification data corresponding to each set of adjusted input data is obtained by adjusting the base identification data corresponding to each set of input data based on the first mask map corresponding to the input training sample;
and a second loss function for calculating a second loss value between the base identification data corresponding to each two groups of processed input data, wherein the base identification data corresponding to each group of processed input data is obtained by processing the base identification data corresponding to each group of input data based on the second mask map corresponding to the input training sample.
In the above embodiment, the input training samples are processed based on different data enhancement modes to obtain multiple sets of processed training samples, which augments the training samples, and the multiple sets of input data formed from these processed training samples are used as the input data of the base recognition model during training. In the iterative training process, the first mask map is used to mask out the base clusters without real base type labels in the base identification data corresponding to each set of input data, so that when the loss is calculated with the first loss function, more attention is paid to the base recognition results of the base clusters with real base type labels, the influence of the recognition results of the base clusters without real base type labels is reduced, and the training speed of the model is accelerated. Meanwhile, the second mask map is introduced to mask out the base clusters with real base type labels in the base identification data corresponding to each set of input data, so that when the loss is calculated with the second loss function, more attention is paid to the consistency loss between the recognition results of the base clusters without real base type labels in each pair of sets; this makes the model learn more features of the base clusters without real base type labels during training and learn more diversified features, helping the model better understand and generalize to different conditions, so that the model can better balance the training data and the generalization requirements and the risk of overfitting is reduced.
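As a minimal PyTorch sketch of the two loss terms described above (not the claimed implementation; the tensor shapes, names and averaging choices are assumptions introduced for illustration), the first term is a cross entropy restricted by the first mask map and the second term is a consistency MSE restricted by the second mask map:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: logits (B, C, H, W) per-pixel class scores, label_map (B, H, W)
# class indices, first_mask / second_mask (B, H, W) 0/1 maps.
def masked_ce_loss(logits, label_map, first_mask):
    """First loss: cross entropy evaluated only at clusters with a real label."""
    per_pixel = F.cross_entropy(logits, label_map, reduction="none")   # (B, H, W)
    mask = first_mask.float()
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

def masked_consistency_loss(logits_a, logits_b, second_mask):
    """Second loss: MSE between the two predicted distributions, only at
    clusters without a real label."""
    probs_a = torch.softmax(logits_a, dim=1)
    probs_b = torch.softmax(logits_b, dim=1)
    mask = second_mask.float().unsqueeze(1)                            # (B, 1, H, W)
    per_pixel = (probs_a - probs_b) ** 2 * mask
    return per_pixel.sum() / (mask.sum() * logits_a.shape[1]).clamp(min=1.0)
```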
FIG. 8 is a flowchart showing training of a base recognition model in a base recognition method based on semi-supervised learning according to an embodiment; the flow chart includes:
s81, acquiring a training data set.
The training data set comprises training samples collected under multiple cycles, wherein each training sample comprises a sample fluorescent image corresponding to multiple base types, a base type label graph corresponding to the sample fluorescent image, a first mask graph corresponding to the sample fluorescent image and a second mask graph corresponding to the sample fluorescent image. For example, four A, C, G, T base type sample fluorescence images were acquired in one cycle.
In some embodiments, a conventional base cluster position locating algorithm is used to locate the base cluster positions representing the centers of the base clusters in the sample fluorescent images of the multiple base types collected under each cycle, and a conventional base recognition algorithm is used to recognize the base types at the base cluster positions in those images, so as to obtain a base recognition result corresponding to the sample fluorescent images under each cycle. A base sequence is obtained from the base recognition results of the sample fluorescent images under the multiple cycles continuously collected from the sequencing chip, and the base sequence is compared with the standard base sequences in a known gene library to determine the base sequences successfully compared with the standard base sequences and the base sequences for which the comparison failed. A first mask map and a second mask map are then generated according to the base cluster positions, the successfully compared base sequences and the base sequences that failed the comparison. A mask map refers to a template that masks the processed image so as to control the area or process of image processing. The first mask map is used to mark the base cluster positions with real base type labels and to mask, in the processed image, the base cluster positions without real base type labels. The second mask map is used to mark the base cluster positions without real base type labels and to mask, in the processed image, the base cluster positions with real base type labels.
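As a hedged sketch of this mask-generation step (the data layout, function name and inputs are assumptions; only the logic described above is illustrated), the following Python snippet builds the two mask maps from located cluster positions and the alignment outcome of the conventional pipeline:

```python
import numpy as np

def build_masks(image_shape, cluster_positions, aligned_ok):
    """cluster_positions: list of (row, col) cluster centers on the chip.
    aligned_ok: list of booleans, True if the read built from that cluster
    was successfully compared with the standard (reference) sequence."""
    first_mask = np.zeros(image_shape, dtype=np.uint8)   # clusters with a real label
    second_mask = np.zeros(image_shape, dtype=np.uint8)  # clusters without a real label
    for (row, col), ok in zip(cluster_positions, aligned_ok):
        if ok:
            first_mask[row, col] = 1
        else:
            second_mask[row, col] = 1
    return first_mask, second_mask

# Example: three clusters, only the first aligns to the reference.
f_mask, s_mask = build_masks((5, 5), [(1, 2), (3, 1), (4, 4)], [True, False, False])
```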
Alternatively, a base sequence having a ratio of correctly recognized bases greater than or equal to a predetermined ratio is determined as a base sequence successfully compared with a standard base sequence, and a base sequence having a ratio of correctly recognized bases less than the predetermined ratio is determined as a base sequence failed to be compared with the standard base sequence. Wherein the ratio of the correctly recognized bases in one base sequence is equal to the number of correctly recognized bases in one base sequence/the total number of bases in one base sequence.
When one gene sequencing run is performed on the sequencing chip, a plurality of sample gene sequences, i.e., a plurality of sample short strands, are input at once; under each cycle the sequencing reaction is carried out on the base clusters at the top ends of the plurality of sample short strands, and sample fluorescent images of the multiple base types under each cycle are obtained by photographing. The position of the base cluster corresponding to each sample short strand on the sequencing chip is fixed across the cycles. Therefore, for one gene sequencing run, the first mask map and the second mask map of that run can be obtained according to the base cluster positions on the sequencing chip and the comparison results of the base sequences corresponding to the plurality of sample short strands, i.e., the first mask map and the second mask map corresponding to the plurality of sample short strands. The generated first mask map is used as the first mask map corresponding to the sample fluorescence images collected under the multiple cycles of that gene sequencing run, and the generated second mask map is used as the second mask map corresponding to the sample fluorescence images collected under the multiple cycles of that run.
In one gene sequencing run on the same sequencing chip, the same set of input sample short strands is sequenced, so the first mask maps corresponding to the sample fluorescent images of the multiple base types collected under each cycle are the same, and the corresponding second mask maps are also the same.
FIG. 9 is a schematic diagram of generating mask maps based on base cluster positions in one embodiment. As shown in FIG. 9, the base cluster position distribution map is the distribution map obtained when the sequencing reaction is carried out on the base clusters at the top ends of the sample short strands under one cycle, and the background positions in the base cluster position distribution map are 0. Base cluster A1 represents the base cluster of sample short strand A1, base cluster A2 represents the base cluster of sample short strand A2, and base cluster A3 represents the base cluster of sample short strand A3; the base lengths of the three sample short strands are all 10. A conventional base recognition algorithm is used to perform base recognition on the sample fluorescent images of the various base types collected for sample short strands A1, A2 and A3 under the 10 cycles, obtaining the base sequences of sample short strands A1, A2 and A3, respectively. After comparison with the standard base sequences, the base sequence corresponding to sample short strand A1 is a successfully compared base sequence, while the base sequences corresponding to sample short strands A2 and A3 fail the comparison. Therefore, in the first mask map generated according to the base cluster positions, the position of the base cluster of sample short strand A1 is marked "1" and the remaining positions are 0, and the base cluster marked "1" has a real base type label. In the second mask map generated according to the base cluster positions, the position of the base cluster of sample short strand A1 is marked 0 while the positions of the base clusters of sample short strands A2 and A3 are marked "1", and the base clusters at the positions marked "1" have no real base type labels.
Optionally, some bases in the successfully compared base sequences are misidentified by the conventional base recognition algorithm; the misidentified bases in the successfully compared base sequences are corrected according to the standard base sequences to obtain corrected base sequences, and the base type label map corresponding to the sample fluorescent images of the various base types in each cycle is determined based on the corrected base sequences and the located base cluster positions on the sequencing chip.
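As a non-limiting sketch of building one cycle's base type label map from the corrected sequences of the successfully aligned reads (the encoding 1=A, 2=C, 3=G, 4=T with 0 as background follows the earlier example; the function name and data layout are assumptions), a possible Python illustration is:

```python
import numpy as np

BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def build_label_map(image_shape, cluster_positions, corrected_reads, cycle):
    """corrected_reads[i] is the corrected base sequence of the read whose
    cluster sits at cluster_positions[i]; only aligned reads are passed in."""
    label_map = np.zeros(image_shape, dtype=np.uint8)
    for (row, col), read in zip(cluster_positions, corrected_reads):
        label_map[row, col] = BASE_CODE[read[cycle]]   # label of this cycle's base
    return label_map

# Example: two aligned clusters, labels for cycle 0.
lm = build_label_map((4, 4), [(0, 2), (1, 0)], ["ACGT", "TTAC"], cycle=0)
```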
S82, training samples are obtained from the training data set to serve as input training samples, the input training samples are processed based on different data enhancement modes, multiple groups of processed training samples corresponding to the input training samples are obtained, and multiple groups of input data corresponding to the input training samples are formed based on the multiple groups of processed training samples corresponding to the input training samples.
In some embodiments, the different data enhancements include at least one combination of: different noise is added to the input training samples, and different brightness processing is performed on the input training samples.
The image size of each set of processed training samples is the same as the image size of the input training sample. For example, adding two different random Gaussian noises to the input training sample A results in two processed training samples A1 and A2.
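As an illustrative sketch of the two data enhancement branches mentioned above (the noise level, brightness factor and tensor shape are assumptions, not values from this application), a possible Python implementation is:

```python
import torch

def augment(sample, noise_std=0.02, brightness=1.0, generator=None):
    """sample: float tensor of stacked fluorescence images, e.g. (12, H, W).
    Applies a brightness change and adds Gaussian noise; output has the same size."""
    noise = torch.randn(sample.shape, generator=generator) * noise_std
    return sample * brightness + noise

sample = torch.rand(12, 64, 64)
g1, g2 = torch.Generator().manual_seed(1), torch.Generator().manual_seed(2)
view_1 = augment(sample, noise_std=0.02, brightness=1.00, generator=g1)  # X1
view_2 = augment(sample, noise_std=0.03, brightness=0.95, generator=g2)  # X2
```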
By processing the training samples with different data enhancement modes, multiple sets of processed training samples can be obtained, which expands the scale of the training samples; adding noise and other data enhancements to the training samples improves the diversity and robustness of the training sample data, so that more features in the training samples can be learned when the model is trained, the trained base recognition model can better adapt to data of different types, and the accuracy of base recognition is improved.
S83, constructing an initial base recognition model, taking a plurality of groups of input data corresponding to the input training sample as the input of the base recognition model respectively, obtaining base recognition data corresponding to each group of input data, carrying out iterative training on the initial base recognition model through the training data set, and calculating a loss value in each iterative process based on a loss function.
In some embodiments, the base recognition model is a deep learning model based on a Unet (U-shaped Convolutional Neural Network), and mainly comprises an encoder (Encoder), intermediate connections (Skip Connections) and a decoder (Decoder).
The encoder (Encoder) includes four convolutional layers and pooling layers (MaxPooling). It is responsible for extracting feature information from the input image, gradually reducing the resolution of the input image and capturing feature information at different scales. The intermediate connections (Skip Connections) connect the feature maps of the encoder to the corresponding-layer feature maps of the decoder; these skip connections allow information to pass freely between the encoder and decoder, helping the network recover detailed information better. The decoder section restores the feature information extracted by the encoder to a prediction result with the same resolution as the input image, and is typically composed of deconvolution layers and upsampling layers. The convolution layers perform convolution operations on the input image data to extract features from it. The pooling layers downsample the output of the convolution layers, reducing the data dimension and thus the complexity and computation of the model. The deconvolution layers upsample the image passed from the encoder by deconvolution to obtain the decoded image. In addition, in order to preserve the detailed information of the original image and reduce the information loss caused by the convolution operations, skip connections are used between the encoder and decoder, which allow the intermediate feature maps of the encoding process to be directly concatenated and fused, along the channel dimension, with the feature maps of the corresponding scale in the decoding process.
FIG. 10 is a schematic diagram showing the structure of the base recognition model in one embodiment. First, the four fluorescence images under each cycle are stacked along the channel dimension, creating four-channel input data for one cycle with dimensions (4, H, W), where H and W represent the height and width of the training image, respectively. Sample fluorescence images under a plurality of cycles are input at a time; for example, with 3 cycles the input data is (12, H, W), where H is 2160 and W is 4096. In the encoding stage the encoder applies four successive convolutions and four downsamplings to the input image (12, H, W), doubling the number of channels and halving the height and width each time. In the decoding stage the decoder uses upsampling operations, and the encoder and decoder are connected by skip connections.
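The following is a compact PyTorch sketch consistent with the description above (input of 3 cycles × 4 channels = 12 channels, four downsamplings that double the channels, a decoder with skip connections). The channel widths, the 5-class per-pixel output (background plus A/C/G/T) and other details are illustrative assumptions, not the exact network of FIG. 10:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNetBaseCaller(nn.Module):
    def __init__(self, in_ch=12, base_ch=32, num_classes=5):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(5)]            # 32, 64, 128, 256, 512
        self.encoders = nn.ModuleList(
            [conv_block(in_ch, chs[0])] +
            [conv_block(chs[i], chs[i + 1]) for i in range(4)])
        self.pool = nn.MaxPool2d(2)
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2) for i in range(4)])
        self.decoders = nn.ModuleList(
            [conv_block(chs[i] * 2, chs[i]) for i in range(4)])
        self.head = nn.Conv2d(chs[0], num_classes, 1)         # per-pixel class logits

    def forward(self, x):                                     # x: (B, 12, H, W)
        skips = []
        for enc in self.encoders[:-1]:
            x = enc(x)
            skips.append(x)                                   # kept for skip connection
            x = self.pool(x)                                  # halve spatial size
        x = self.encoders[-1](x)                              # bottleneck features
        for i in reversed(range(4)):
            x = self.ups[i](x)                                # upsample
            x = torch.cat([skips[i], x], dim=1)               # skip connection (channel dim)
            x = self.decoders[i](x)
        return self.head(x)                                   # (B, num_classes, H, W)

# Small example input; H and W only need to be divisible by 16 here.
logits = UNetBaseCaller()(torch.rand(1, 12, 256, 256))
print(logits.shape)
```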
In some embodiments, in each iteration, the multiple groups of input data corresponding to the input training sample are respectively used as inputs of the base recognition model, and the base identification data corresponding to each group of input data is obtained. For the base identification data corresponding to each group of input data, the first mask map corresponding to the input training sample is used to mask the data, retaining the base identification data at the base cluster positions with real base type labels; this reduces the influence of erroneous recognition results at the base cluster positions without real base type labels on the first loss value of the training model. Then, based on the first loss function, the first loss value of the iteration is calculated from the base identification data at the base cluster positions with real base type labels and the base type label map corresponding to the input training sample of that iteration, so that the first loss value corresponding to each group of input data in the iteration can be calculated.
For the base identification data corresponding to each group of input data, the second mask map corresponding to the input training sample is likewise used to mask the data, retaining the base identification data at the base cluster positions without real base type labels; this reduces the influence of the recognition results at the base cluster positions with real base type labels on the second loss value of the training model. Then, based on the second loss function, the second loss value between the base identification data corresponding to each two groups of processed input data in the iteration is calculated, so as to obtain a plurality of second loss values in the iteration. For example, if there are two groups of input data, the second loss value between the base identification data corresponding to the two groups of processed input data is calculated directly; if there are three groups of input data, the second loss values between the base identification data corresponding to every two groups of processed input data are calculated respectively.
The loss value of the iteration is then calculated based on the first loss values corresponding to each group of input data and the plurality of second loss values in the iteration.
Optionally, the loss value $Loss$ in iterative training is calculated based on the loss function as follows:

$$Loss = \sum_{i} CE_i + w(t)\sum_{j=1}^{K} MSE_j$$

where $CE_i$ represents the first loss value corresponding to the $i$-th group of input data calculated based on said first loss function, $i$ denotes the $i$-th group of input data, $w(t)$ represents the weight corresponding to iteration round $t$, $MSE_j$ represents the $j$-th second loss value calculated based on said second loss function, and $K$ represents the total number of calculated second loss values.
Optionally, the first loss function is:
$$CE = -\sum_{c=1}^{M} y_c \log\left(p_c\right)$$

where $CE$ is the cross entropy loss function, $M$ is the number of categories, $y_c$ is the one-hot coding of the real label of the $c$-th class, and $p_c$ is the probability distribution value with which the base recognition model predicts the base cluster type as the $c$-th class.
Optionally, the second loss function is:
$$MSE = \frac{1}{N}\sum_{n=1}^{N}\left(p_n - q_n\right)^2$$

where $N$ represents the number of pixels in the base identification data corresponding to each set of input data, and $p_n$ and $q_n$ are the distributions of the base identification data corresponding to the two groups of processed input data: $p_n$ represents the probability distribution of the base class of the $n$-th pixel in the base identification data corresponding to one of the two groups of input data, and $q_n$ represents the probability distribution of the base class of the $n$-th pixel in the base identification data corresponding to the other group.
Optionally, the training of the base recognition model based on the training data set includes multiple rounds of iteration, and within one round the training is performed over a number of iterations; the weight $w(t)$ increases as the number of iteration rounds increases.
In the training process, multiple rounds of iteration are needed to complete the training of the model, and there are m iterations in one round; the size of m depends on the size of the training data set, and the larger the training data set, the larger the value of m. Within one round there are thus multiple iterations, and each iteration extracts a portion of the training samples from the training data set as the input training samples, until all training samples in the training data set have been extracted, at which point one round is completed. For example, if m is 4000 and three rounds are needed to complete training, then the first, second and third rounds each contain 4000 iterations, the value of $w(t)$ in the second round is greater than that in the first round, and the value of $w(t)$ in the third round is greater than that in the second round.
In this embodiment, in the early stage of semi-supervised training, the base recognition model mainly needs to learn from the base type label map as the training target to accelerate the learning ability of the model, so the value of $w(t)$ in the early iterations is small. In the later stage of training, the base recognition model tends to be stable and more useful information can be learned from the consistency regularization on the data without real base type labels, so the value of $w(t)$, i.e., the MSE weight, is increased. In this way the model learns more features of the base clusters without real base type labels during training and learns more diversified features, helping the model better understand and generalize to different conditions, so that the model can better balance the training data and the generalization requirements and the risk of overfitting is reduced.
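As a hedged sketch of a weight schedule that increases with the training round, a sigmoid-style ramp-up commonly used in semi-supervised training is shown below; the functional form and its constants are assumptions for illustration, not values from this application:

```python
import math

def consistency_weight(epoch, max_weight=1.0, ramp_up_epochs=30):
    """w(t): small in early rounds, approaching max_weight in later rounds."""
    t = min(epoch, ramp_up_epochs) / ramp_up_epochs
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

print([round(consistency_weight(e), 3) for e in (0, 10, 20, 30)])
```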
As shown in fig. 11, fig. 11 is a schematic diagram of the base recognition method based on semi-supervised learning in one embodiment. In each iteration, the input training sample is processed by a first data enhancement mode to obtain image data X1 and by a second data enhancement mode to obtain image data X2. The input data of the base recognition model corresponding to X1 and X2 are obtained from the X1 and X2 image data respectively, and after recognition by the base recognition model, the base recognition results Y1 and Y2 corresponding to the X1 and X2 image data are obtained. The base recognition results Y1 and Y2 are processed by the first mask map to obtain image data Z1 corresponding to Y1 and image data Z2 corresponding to Y2. Based on the first loss function, the loss between the image data Z1 and the base type label image is calculated to obtain the loss value CE_1, and the loss between the image data Z2 and the base type label image is calculated to obtain the loss value CE_2. After Y1 and Y2 are processed by the second mask map, image data U1 corresponding to Y1 and image data U2 corresponding to Y2 are obtained, and the loss value MSE between U1 and U2 is calculated based on the second loss function. The total loss value corresponding to the input training sample for this iteration is:

$$Loss = CE_1 + CE_2 + w(t)\cdot MSE$$
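As a minimal end-to-end sketch of the single-iteration flow of fig. 11 (X1/X2 → Y1/Y2 → masked losses → total loss), the following Python snippet assumes a model returning per-pixel class logits of shape (B, C, H, W), a label map of class indices (B, H, W), 0/1 mask tensors (B, H, W), and illustrative noise/brightness values; it is a sketch under these assumptions, not the claimed implementation:

```python
import torch
import torch.nn.functional as F

def train_step(model, sample, label_map, first_mask, second_mask, w_t, optimizer):
    x1 = sample + 0.02 * torch.randn_like(sample)          # first data enhancement (X1)
    x2 = 0.95 * sample + 0.03 * torch.randn_like(sample)   # second data enhancement (X2)
    y1, y2 = model(x1), model(x2)                          # base recognition results Y1, Y2

    m1 = first_mask.float()                                # keep only labeled clusters
    ce_1 = (F.cross_entropy(y1, label_map, reduction="none") * m1).sum() / m1.sum().clamp(min=1.0)
    ce_2 = (F.cross_entropy(y2, label_map, reduction="none") * m1).sum() / m1.sum().clamp(min=1.0)

    m2 = second_mask.float().unsqueeze(1)                  # keep only unlabeled clusters
    mse = (((torch.softmax(y1, 1) - torch.softmax(y2, 1)) ** 2) * m2).sum() / m2.sum().clamp(min=1.0)

    loss = ce_1 + ce_2 + w_t * mse                         # total loss of this iteration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```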
FIG. 12 is a diagram showing the calculation of the first loss value when the base recognition model is trained in the base recognition method based on semi-supervised learning according to an embodiment. Taking an input training sample comprising sample fluorescence images of the four base types as an example in FIG. 12, the center position of the base cluster in the sample fluorescence image of the A base type is located at (1, 3). The data obtained by processing the input training sample with one data enhancement mode form a group of input data of the base recognition model; after recognition by the base recognition model, the base recognition result Y1 shown in FIG. 12 is obtained, the base recognition result Y1 is processed by the first mask map B1 to obtain the image data Z1, and the loss is calculated between the image data Z1 and the base type label image D.
FIG. 13 is a diagram showing the calculation of the second loss value when the base recognition model is trained in the base recognition method based on semi-supervised learning according to an embodiment. Taking an input training sample comprising sample fluorescent images of the four base types as an example, the black spot positions in the base cluster position distribution map of the sample fluorescent images are base cluster positions and the other positions are background positions, and the first mask map and the second mask map are obtained based on this base cluster position distribution map; the center position of the base cluster in the sample fluorescent image of the A base type is located at (1, 3). The data obtained by processing the input training sample with one data enhancement mode form one group of input data of the base recognition model, and after recognition by the base recognition model the base recognition result Y1 for base A shown in FIG. 13 is obtained; the data obtained by processing the input training sample with another data enhancement mode form another group of input data of the base recognition model, and after recognition by the base recognition model the base recognition result Y2 for base A shown in FIG. 13 is obtained. The base recognition result Y1 and the base recognition result Y2 are processed by the second mask map B2 to obtain image data U1 corresponding to Y1 and image data U2 corresponding to Y2, respectively. Based on the second loss function, the loss is calculated between the U1 image data and the U2 image data.
S84, judging whether the iteration termination condition is met.
In some embodiments, the iteration termination condition includes, but is not limited to, reaching a preset number of iterations or the loss value in an iteration being less than a preset loss value. When the iteration termination condition is not met in the iteration process, the method returns to continue executing S82, continuing to acquire training samples from the training data set and train the base recognition model until the iteration termination condition is met. When the iteration termination condition is satisfied, S85 is executed.
S85, taking the base recognition model after iteration termination as a trained base recognition model.
Referring to fig. 14, an embodiment of the present application provides a base recognition device based on semi-supervised learning, including: the acquisition module 21 is used for acquiring fluorescent images to be detected corresponding to base signal acquisition units of various base types on the sequencing chip, and forming input image data to be detected based on the fluorescent images to be detected; wherein the fluorescence image to be detected comprises fluorescence images to be detected corresponding to various base types; the recognition module 22 is configured to take the input image data to be detected as input of a trained base recognition model, and output a base recognition result of the input image data to be detected through the trained base recognition model, where the trained base recognition model is a model obtained by performing semi-supervised learning training based on a training data set;
The training data set comprises training samples collected under multiple cycles. Each training sample comprises sample fluorescent images corresponding to multiple base types and a base type label map corresponding to the sample fluorescent images, and the training labels corresponding to each training sample further comprise a first mask map corresponding to the sample fluorescent images and a second mask map corresponding to the sample fluorescent images. The first mask map is used to mark the positions of the base signal acquisition units with a base type label in the sample fluorescent images; the second mask map is used to mark the positions of the base signal acquisition units without a base type label in the sample fluorescent images.
Optionally, the identification module 22 is configured to:
acquiring a training data set;
acquiring training samples from the training data set as input training samples, processing the input training samples based on different data enhancement modes to obtain a plurality of groups of processed training samples corresponding to the input training samples, and forming a plurality of groups of input data corresponding to the input training samples based on the plurality of groups of processed training samples corresponding to the input training samples;
constructing an initial base recognition model, taking a plurality of groups of input data corresponding to the input training sample as the input of the base recognition model respectively to obtain base recognition data corresponding to each group of input data, and carrying out iterative training on the initial base recognition model through the training data set until a loss function converges to obtain the trained base recognition model;
Wherein the loss function comprises:
a first loss function for calculating a first loss value between the base identification data corresponding to each set of adjusted input data and the base type label map corresponding to the input training sample, wherein the base identification data corresponding to each set of adjusted input data is obtained by adjusting the base identification data corresponding to each set of input data based on the first mask map corresponding to the input training sample;
and a second loss function for calculating a second loss value between the base identification data corresponding to each two groups of processed input data, wherein the base identification data corresponding to each group of processed input data is obtained by processing the base identification data corresponding to each group of input data based on the second mask map corresponding to the input training sample.
Optionally, the different data enhancement modes include at least one combination of the following: different noise is added to the input training samples, and different brightness processing is performed on the input training samples.
Optionally, the loss value $Loss$ in iterative training is calculated based on the loss function as follows:

$$Loss = \sum_{i} CE_i + w(t)\sum_{j=1}^{K} MSE_j$$

where $CE_i$ represents the first loss value corresponding to the $i$-th group of input data calculated based on said first loss function, $i$ denotes the $i$-th group of input data, $w(t)$ represents the weight corresponding to iteration round $t$, $MSE_j$ represents the $j$-th second loss value calculated based on said second loss function, and $K$ represents the total number of calculated second loss values.
Optionally, the training of the base recognition model based on the training data set includes multiple rounds of iteration, and within one round the training is performed over a number of iterations; the weight $w(t)$ increases as the number of iteration rounds increases.
Optionally, the first loss function is:
$$CE = -\sum_{c=1}^{M} y_c \log\left(p_c\right)$$

where $CE$ is the cross entropy loss function, $M$ is the number of categories, $y_c$ is the one-hot coding of the real label of the $c$-th class, and $p_c$ is the probability value with which the base recognition model predicts the base signal acquisition unit type as the $c$-th class.
Optionally, the second loss function is:
$$MSE = \frac{1}{N}\sum_{n=1}^{N}\left(p_n - q_n\right)^2$$

where $N$ represents the number of pixels in the base identification data corresponding to each set of input data, and $p_n$ and $q_n$ are the distributions of the base identification data corresponding to the two groups of processed input data: $p_n$ represents the probability distribution of the base class of the $n$-th pixel in the base identification data corresponding to one of the two groups of input data, and $q_n$ represents the probability distribution of the base class of the $n$-th pixel in the base identification data corresponding to the other group.
It will be appreciated by those skilled in the art that the structure of the base recognition device based on semi-supervised learning in fig. 14 does not constitute a limitation of the base recognition device based on semi-supervised learning, and the respective modules may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a controller in a computer device, or may be stored in software in a memory in the computer device, so that the controller may call and execute operations corresponding to the above modules. In other embodiments, more or fewer modules than illustrated may be included in a semi-supervised learning based base recognition apparatus.
Referring to fig. 15, in another aspect of the embodiments of the present application, there is further provided a gene sequencer 200, including a memory 3011 and a processor 3012, where the memory 3011 stores a computer program, and the computer program, when executed by the processor, causes the processor 3012 to execute the steps of the base recognition method based on semi-supervised learning provided in any of the embodiments of the present application. The gene sequencer 200 may include or be implemented as a computer device (e.g., a desktop computer, laptop computer, tablet computer, handheld computer, smart speaker, server, etc.), a mobile phone (e.g., a smart phone, wireless phone, etc.), a wearable device (e.g., a pair of smart glasses or a smart watch), or the like.
The processor 3012 is a control center that uses various interfaces and lines to connect the various portions of the overall computer device, and performs various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 3011 and invoking the data stored in the memory 3011. Optionally, the processor 3012 may include one or more processing cores; preferably, the processor 3012 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, etc., and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 3012.
The memory 3011 may be used to store software programs and modules, and the processor 3012 executes various functional applications and data processing by executing the software programs and modules stored in the memory 3011. The memory 3011 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 3011 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 3011 may also include a memory controller to provide access to the memory 3011 by the processor 3012.
In another aspect of the embodiments of the present application, there is further provided a storage medium storing a computer program, where the computer program when executed by a processor causes the processor to execute the steps of the base recognition method based on semi-supervised learning provided in any of the embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods provided in the above embodiments may be accomplished by computer programs stored on a non-transitory computer readable storage medium, which when executed, may comprise processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. The scope of the invention is to be determined by the appended claims.