
CN105989248B - Data processing method and device for multiple molecular signals - Google Patents

Data processing method and device for multiple molecular signals

Info

Publication number
CN105989248B
Authority
CN
China
Prior art keywords
molecular
cluster
molecular cluster
fluorescent
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510061908.8A
Other languages
Chinese (zh)
Other versions
CN105989248A (en
Inventor
李雷
王博
万林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Mathematics and Systems Science of CAS
Original Assignee
Academy of Mathematics and Systems Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Mathematics and Systems Science of CAS
Priority to CN201510061908.8A
Publication of CN105989248A
Application granted
Publication of CN105989248B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Investigating, Analyzing Materials By Fluorescence Or Luminescence (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a data processing method for multiple molecular signals. For any two distinct molecular clusters A and B, the method calculates the mixing coefficients C(A←B) and C(B←A) between their fluorescence signals, which measure how severely the fluorescence signals of cluster A and cluster B are mixed with each other. The mixing between different molecular clusters can then be corrected and reduced, which improves the identification accuracy of molecular recognition techniques.

Description

Data processing method and device for multiple molecular signals

Technical field

The invention relates to the field of data processing for molecular sequencing, and in particular to a data processing method and device.

Background

Illumina's sequencing-by-synthesis technology, and the sequencing platforms based on it, are widely used and are among the most successful second-generation sequencing technologies. Short single-stranded DNA molecules are first randomly immobilized on the chip surface and then replicated to form clusters of single-stranded molecules with identical sequences. In each sequencing cycle, four kinds of single nucleotides carrying reversible terminator groups with different fluorescent labels are added, so that the complementary strand of each cluster grows by exactly one base per cycle. The chip surface is then photographed at laser spectra of different frequencies, with each channel corresponding mainly to one kind of fluorescence. After imaging, the terminator groups are washed away so that the next cycle can proceed. By locating the molecular clusters, extracting the fluorescence signal of the same cluster in every cycle, and determining the base identified in each cycle from the type of fluorescence signal, the sequence contained in the cluster is read. This technology is used on platforms such as GA, HiSeq and MiSeq. For more detail on the technology and on existing data processing techniques, see Bentley et al., 2008; Li & Speed, 1999; Massingham & Goldman, 2012; Whiteford et al., 2009.

However, this technology still has a number of shortcomings. Besides spectral crosstalk and phasing, they include the following. First, because of the limited precision of the sequencer, the content of different images is shifted by anywhere from less than one pixel to tens or even hundreds of pixels, and is slightly stretched. In addition, positions where no molecular cluster emits light still show a small, non-zero, random background intensity. More troublesome still, because the molecules of the sequence fragments land on the chip at random, the resulting clusters may lie close together, in which case the signals of these nearby clusters mix and affect one another in every image (see Fig. 1A, Fig. 1B, Fig. 1C and Fig. 2). Fig. 1A is a partial view of the image of one channel in one sequencing cycle obtained with the prior art, showing molecular clusters that lie close together. Fig. 1B shows the signals of two nearby molecular clusters over some of the sequencing cycles after correction of spectral crosstalk and phasing; in this figure the second cluster mixes into the signal of the first cluster and causes the 13th base of the first cluster to be called incorrectly. Fig. 1C illustrates signal mixing between adjacent molecular clusters. Fig. 2 illustrates the localization of two nearby molecular clusters and the resulting signal mixing: when two clusters are close together, their positions as determined from the fluorescence maxima are drawn toward each other, and signal mixing occurs at the same time. As Fig. 2 shows, the coordinates determined for nearby clusters may also be biased.

No effective solution to the above problems in the related art has yet been proposed.

The relevant terms in this field are explained below:

Molecular cluster (English: cluster): a set of specific molecules in the molecular sequencing process. The set contains molecules with the same sequence, and the average distance between these molecules is smaller than the average distance between molecules of different clusters.

Sequencing: the purpose of sequencing is to identify the sequence of the molecules in a molecular cluster. The sequence of a molecule is the type of the basic molecular unit at each specific position in the molecule. For DNA sequencing, for example, the sequence is the type of each base in a specific fragment of the DNA molecule.

Fluorescence signal (English: fluorescence intensity): the intensity of the light emitted by the excited fluorescent labels of the molecules in a cluster, obtained by a predetermined measurement method; also called fluorescence intensity.

Signal mixing (no established English name): fluorescence signal that appears in the fluorescence signal of one molecular cluster but originates from the fluorescent labels of other clusters.

Channel (English: channel): when the fluorescent labels of molecular clusters in a given state are measured, each measurement mode is called a channel.

Sequencing cycle (English: cycle): when the fluorescent labels are measured in different measurement modes, the process of measuring one state is one sequencing cycle.

Spectral crosstalk (English: laser-crosstalk or spectra-crosstalk): the phenomenon in which the fluorescent label corresponding to one type of group produces a non-zero fluorescence signal in more than one channel.

Phasing (English: phasing): the phenomenon in which the fluorescent label corresponding to the group at a specific position produces a non-zero fluorescence signal in more than one sequencing cycle.

Molecular cluster localization (English: template generation): determining which coordinates in the image contain molecular clusters that meet predetermined conditions.

Summary of the invention

In view of the problems in the related art, in particular that the signals of nearby molecular clusters mix and affect one another, the invention provides a method for processing the sequencing data of multiple molecules.

The method includes:

(1) Calculating the mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that meets predetermined conditions, which measures the severity of the mixing from cluster B in the fluorescence signal of cluster A.

(2) Processing the fluorescence signals of the molecular clusters using the calculated mixing coefficients.

The significance of the invention is as follows. By calculating the mixing coefficients between the fluorescence signals of different molecular clusters, the proposed data processing method effectively measures the severity of the interference, or mixing, between those signals. During molecular sequencing, the signals of nearby clusters can then be processed, and the results used for sequence identification and for outputting information related to sequence identification, which greatly improves the identification accuracy. The prior art reduces the mixing of cluster fluorescence signals by image deblurring, but the mixing in some fluorescence signals does not match the kernel function assumed by the deblurring method, so a certain amount of mixing remains in the cluster signals and degrades the accuracy of sequence identification. The invention effectively remedies this deficiency.

The technical route of the proposed data processing method includes:

(1) Calculate the mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that meets predetermined conditions. C(A←B) measures the severity of the mixing in the fluorescence signal of cluster A that originates from cluster B. Its value is the ratio of E(A←B) to E(B←B), where E(A←B) is the part of the fluorescence signal of cluster A that comes from the fluorescent labels of the molecules in cluster B, and E(B←B) is the part of the fluorescence signal of cluster B that comes from the fluorescent labels of the molecules in cluster B itself. C(A←B) is calculated by the following formula:

$$C(A \leftarrow B) = \arg\min_{c} \bigl( f(I_A - c\,I_B) + h(c) \bigr)$$

Here h(c) is a preset monotonically non-decreasing function that limits the effect of excessively large mixing coefficients on the accuracy of sequence identification; I_A and I_B are the fluorescence signals of cluster A and cluster B in the pre-specified sequencing cycles and channels; and f measures the severity of mixing in its input fluorescence signal. In the definition of f, n is the number of sequencing cycles; for cycle j, r_j is a preset function and w_j is either a scalar calculated from the fluorescence signals of all molecular clusters in cycle j or a preset constant. Strong mixing in the input signal makes f(I) large, so the calculated mixing coefficient reduces the degree of mixing in the fluorescence signal of cluster A once the mixing correction is applied.

The argmin over c of f(I_A − c·I_B) + h(c) is evaluated by using the quantile method to find the zero of the derivative of f(I_A − c·I_B) + h(c).
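The quantile idea can be illustrated with a minimal Python sketch. It rests on assumptions that the text above does not state: f is taken to be a weighted sum of absolute values, f(I) = Σ_i w_i·|I_i|, and h(c) is taken to be zero. Under these assumptions the objective is piecewise linear in c, its derivative changes sign at a weighted median of the ratios I_A[i]/I_B[i], and that median is a minimizer. The name `mixing_coefficient` and its parameters are hypothetical.

```python
def mixing_coefficient(i_a, i_b, weights=None):
    """Return argmin over c of sum_i w_i * |i_a[i] - c * i_b[i]|.

    Assumptions (not taken from the patent text): f is a weighted L1
    norm and h(c) = 0. i_a and i_b are flat sequences holding the
    signals of clusters A and B over the chosen cycles and channels.
    """
    if weights is None:
        weights = [1.0] * len(i_a)
    # Entries with i_b == 0 do not depend on c, so they are skipped.
    # Each remaining term equals (w * |b|) * |a / b - c|.
    points = sorted(
        (a / b, w * abs(b))
        for a, b, w in zip(i_a, i_b, weights)
        if b != 0 and w > 0
    )
    if not points:
        return 0.0
    half = sum(mass for _, mass in points) / 2.0
    running = 0.0
    for ratio, mass in points:
        running += mass
        # First ratio at which the cumulative mass reaches one half:
        # the derivative of the objective changes sign here.
        if running >= half:
            return ratio
    return points[-1][0]


# Cluster B leaks 20% of its signal into A. A's own base call adds a
# large value at index 2, which the weighted median treats as an outlier.
i_b = [10.0, 8.0, 12.0, 9.0, 11.0]
i_a = [2.0, 1.6, 102.4, 1.8, 2.2]
c_ab = mixing_coefficient(i_a, i_b)
```

A non-zero penalty h(c) would shift the target from the median to another quantile of the same ratios; that variant is not shown here.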

(2) Process the fluorescence signals of the molecular clusters according to the mixing coefficients, in order to identify the sequences of the molecules in the clusters and to calculate information related to sequence identification.

Processing the cluster fluorescence signals includes correcting the signal mixing in them. The correction method includes:

The matrix I_I formed by the mixing-free fluorescence signals of the molecular clusters is calculated by the following formula:

$$C \cdot I_I = I_O$$

In the matrix I_I, the elements of each row are the fluorescence signals of one molecular cluster, and the elements of each column are the fluorescence signals of all molecular clusters in one channel of one sequencing cycle. C is the matrix formed by the mixing coefficients between the clusters. I_O is the matrix formed by the cluster fluorescence signals that need to be corrected; its rows and columns are organized in the same way as those of I_I.
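As an illustration of this correction, the following Python sketch solves C·I_I = I_O by Gaussian elimination with partial pivoting. The layout of C is an assumption: the entry in row A and column B is taken to be C(A←B), with ones on the diagonal. The function name `solve_mixing` and the sample values are hypothetical.

```python
def solve_mixing(c, i_o):
    """Solve C * I_I = I_O for I_I by Gaussian elimination.

    c   : n x n mixing matrix (list of rows)
    i_o : n x m observed signals, one row per cluster
    """
    n = len(c)
    # Augmented matrix [C | I_O], copied so the inputs stay untouched.
    aug = [list(map(float, c[r])) + list(map(float, i_o[r])) for r in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("mixing matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Two clusters, one cycle with four channels (columns).
c = [[1.0, 0.2],   # row A: C(A<-A) = 1, C(A<-B) = 0.2
     [0.1, 1.0]]   # row B: C(B<-A) = 0.1, C(B<-B) = 1
i_true = [[100.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 80.0, 0.0]]
# Observed signals generated by the model I_O = C * I_I.
i_obs = [[sum(c[r][k] * i_true[k][j] for k in range(2)) for j in range(4)]
         for r in range(2)]
i_rec = solve_mixing(c, i_obs)
```

Because only nearby clusters mix, C would be sparse in practice, and a sparse solver would be the natural choice for real data; the dense elimination here only keeps the example self-contained.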

Processing the cluster fluorescence signals further includes subsequent processing of the mixing-corrected signals in order to complete the identification of the molecular sequences.

(3) To make the mixing coefficients between clusters easier to calculate, the method processes the input data in a predetermined manner before calculating the mixing coefficients between the fluorescence signals of different clusters. The predetermined manner includes at least one of the following:

Correcting spectral crosstalk, correcting phasing, and preprocessing the raw image data to generate the cluster fluorescence signals.

When the raw image data are preprocessed to generate the cluster fluorescence signals, the method includes the following steps:

Removing background light, normalization, generating an alignment template, molecular cluster localization, and extracting the cluster fluorescence signals.

The step of generating the alignment template includes:

aligning the images of the channels affected by spectral crosstalk, and correcting the spectral crosstalk of the aligned images;

comparing the brightness of the pixels at the same position in each crosstalk-corrected image and keeping the largest value at that position, thereby generating the alignment template.
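The pixelwise maximum in the second step can be sketched in a few lines of Python. The sketch assumes that the input images are already aligned and crosstalk-corrected and that they are stored as equally sized lists of rows; the name `alignment_template` is hypothetical.

```python
def alignment_template(images):
    """Pixelwise maximum over equally sized 2-D images (lists of rows)."""
    if not images:
        raise ValueError("at least one image is required")
    return [
        [max(pixels) for pixels in zip(*rows)]   # same column, all images
        for rows in zip(*images)                 # same row, all images
    ]


channel_1 = [[0, 5, 0],
             [0, 0, 0]]
channel_2 = [[0, 1, 0],
             [7, 0, 0]]
template = alignment_template([channel_1, channel_2])
```

Every cluster is bright in at least one channel per cycle, so the maximum image shows all clusters at once, which is what makes it usable as a common reference for alignment.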

In the step of generating the alignment template, the method for aligning different images (or an image with the alignment template) includes:

selecting regions of a predetermined coordinate range and a predetermined number in the two images to be aligned, and applying a displacement to the selected regions of one of the images;

for the regions of the predetermined coordinate range in the two images, searching over integer-coordinate displacements of the region in one image, taking the displacement that gives the maximum correlation between that region and the corresponding region of the other image as the initial point, and then locating the displacement with BFGS or another algorithm for unconstrained optimization.
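The coarse stage of this alignment, the search over integer displacements, might look like the Python sketch below. It covers only the integer search that produces the initial point; the sub-pixel refinement with BFGS is not shown. Using the Pearson correlation over the overlapping pixels as the correlation measure is an assumption, and the names `correlation` and `best_integer_shift` are hypothetical.

```python
def correlation(a, b):
    """Pearson correlation of two equally long sequences (0.0 if flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def best_integer_shift(reference, moving, max_shift):
    """Find (dy, dx) so that moving[y + dy][x + dx] best matches reference[y][x]."""
    height, width = len(reference), len(reference[0])
    best = (float("-inf"), (0, 0))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            ref_vals, mov_vals = [], []
            # Compare only the pixels where the two images overlap.
            for y in range(max(0, -dy), min(height, height - dy)):
                for x in range(max(0, -dx), min(width, width - dx)):
                    ref_vals.append(reference[y][x])
                    mov_vals.append(moving[y + dy][x + dx])
            if len(ref_vals) < 2:
                continue
            score = correlation(ref_vals, mov_vals)
            if score > best[0]:
                best = (score, (dy, dx))
    return best[1]


def blank(size):
    return [[0.0] * size for _ in range(size)]


reference = blank(8)
moving = blank(8)
for (y, x), value in {(1, 1): 9.0, (3, 4): 5.0, (5, 2): 7.0}.items():
    reference[y][x] = value
    moving[y + 1][x + 2] = value  # same scene displaced by (1, 2)
shift = best_integer_shift(reference, moving, max_shift=2)
```

The returned pair would serve as the starting point of the continuous optimization, whose objective could be the negative correlation after interpolating the moving image at fractional displacements.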

The molecular cluster localization step includes:

performing a localization operation on the crosstalk-corrected images, the localization operation including:

finding the bright spots in the crosstalk-corrected image, fitting a parabola in each of two directions through the fluorescence signals of the target bright spot and of several bright spots around it, and calculating the symmetry axes of the parabolas to determine the coordinates of the target bright spot;

calculating the coordinates of the molecular cluster corresponding to each bright spot as the mean of the coordinates of the bright spots that have no neighbors, where a bright spot without neighbors is a bright spot inside a unit pixel such that, within a range of two unit pixels around that pixel, there are no bright spots of the same channel and the same sequencing cycle other than those contained in the pixel itself.
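The two localization operations can be sketched in Python as follows. Several details are assumptions made for illustration: each parabola is fitted through exactly three samples (the bright pixel and its two neighbors along one axis), "within two unit pixels" is read as a per-axis pixel distance of at most 2, and the observations of one cluster across images are assumed to be grouped already. All function names are hypothetical.

```python
import math


def parabola_axis(left, center, right):
    """Offset of the symmetry axis of the parabola through
    (-1, left), (0, center), (1, right), relative to the center sample."""
    curvature = left - 2.0 * center + right
    if curvature == 0:
        return 0.0  # the three samples are collinear; keep the pixel center
    return 0.5 * (left - right) / curvature


def subpixel_peak(image, y, x):
    """Refine the bright pixel (y, x) with one parabola per direction."""
    dy = parabola_axis(image[y - 1][x], image[y][x], image[y + 1][x])
    dx = parabola_axis(image[y][x - 1], image[y][x], image[y][x + 1])
    return y + dy, x + dx


def is_isolated(spot, image_spots, radius=2):
    """True if no spot of the same image lies in another pixel within
    `radius` pixels (per axis) of the pixel that contains `spot`."""
    py, px = math.floor(spot[0]), math.floor(spot[1])
    for other in image_spots:
        oy, ox = math.floor(other[0]), math.floor(other[1])
        if (oy, ox) == (py, px):
            continue  # same pixel (including the spot itself): not a neighbor
        if abs(oy - py) <= radius and abs(ox - px) <= radius:
            return False
    return True


def cluster_coordinate(observations):
    """Mean position over the isolated observations of one cluster.

    observations: list of (spot, image_spots) pairs, one per image
    (channel and cycle) in which the cluster shows a bright spot.
    """
    kept = [spot for spot, image_spots in observations
            if is_isolated(spot, image_spots)]
    if not kept:
        return None
    n = len(kept)
    return (sum(s[0] for s in kept) / n, sum(s[1] for s in kept) / n)


# A 3 x 3 patch sampled from a paraboloid with its maximum at (1.0, 1.25).
image = [[10 - (y - 1.0) ** 2 - (x - 1.25) ** 2 for x in range(3)]
         for y in range(3)]
peak = subpixel_peak(image, 1, 1)

cycle1 = [(10.2, 20.4), (30.0, 5.0)]   # well separated
cycle2 = [(10.4, 20.6), (11.5, 21.5)]  # a neighbor crowds the spot
cycle3 = [(10.0, 20.2)]
observations = [(cycle1[0], cycle1), (cycle2[0], cycle2), (cycle3[0], cycle3)]
position = cluster_coordinate(observations)
```

In this example the observation from `cycle2` is discarded because another spot lies one pixel away, so the cluster position is averaged from the other two observations only.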

According to another aspect of the invention, a data processing device is provided.

The device includes:

A mixing coefficient calculation module, which calculates the mixing coefficients between the fluorescence signals of different molecular clusters. The mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that meets predetermined conditions measures the severity of the mixing that cluster B causes in the fluorescence signal of cluster A.

The device may further include a processing module, which processes the cluster fluorescence signals using the mixing coefficients in order to complete the identification of the molecular sequences.

The device may further include a preprocessing module, which processes the input data in a predetermined manner before the mixing coefficients between the fluorescence signals of different clusters are calculated.

The mixing coefficient calculation module is further configured to calculate the mixing coefficient as follows: for any molecular cluster A and a molecular cluster B that meets predetermined conditions, the mixing coefficient C(A←B) is the ratio of E(A←B) to E(B←B), where E(A←B) is the part of the fluorescence signal of cluster A that originates from the fluorescent labels of cluster B, and E(B←B) is the part of the fluorescence signal of cluster B that originates from the fluorescent labels of the molecules in cluster B.

The mixing coefficient calculation module is further configured to calculate C(A←B) by the following formula:

$$C(A \leftarrow B) = \arg\min_{c} \bigl( f(I_A - c\,I_B) + h(c) \bigr)$$

Here h(c) is a preset monotonically non-decreasing function, and I_A and I_B are the fluorescence signals of cluster A and cluster B in the pre-specified sequencing cycles and channels. In the definition of f, n is the number of sequencing cycles; for cycle j (j ≥ 1), r_j is a preset function and w_j is either a scalar calculated from the fluorescence signals of all molecular clusters in cycle j or a preset constant.

The processing module may further include a correction unit, which corrects the signal mixing in the cluster fluorescence signals. The correction method includes:

The matrix I_I formed by the mixing-free fluorescence signals of the molecular clusters is calculated by the following formula:

$$C \cdot I_I = I_O$$

In the matrix I_I, the elements of each row are the fluorescence signals of one molecular cluster, and the elements of each column are the fluorescence signals of all molecular clusters in one channel of one sequencing cycle. C is the matrix formed by the mixing coefficients between the clusters. I_O is the matrix formed by the cluster fluorescence signals that need to be corrected; its rows and columns are organized in the same way as those of I_I.

The processing module may further include a downstream processing unit, which performs subsequent processing on the mixing-corrected cluster fluorescence signals in order to complete the identification of the molecular sequences.

The preprocessing module includes an image processing unit and a preprocessing unit. When the input data are images obtained by sequencing, the image processing unit processes the images to generate the cluster fluorescence signals. The preprocessing unit processes the cluster fluorescence signals so that they meet the conditions required for calculating the mixing coefficients.

The image processing unit is further configured to apply the method of the invention to the sequencing images, performing the following operations to generate the cluster fluorescence signals: removing background light, normalization, generating an alignment template, molecular cluster localization, and extracting the cluster fluorescence signals.

The image processing unit includes a correction subunit and a localization subunit:

the correction subunit corrects the spectral crosstalk of the images corresponding to the channels affected by spectral crosstalk;

the localization subunit performs the molecular cluster localization operation on the crosstalk-corrected images.

Description of drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below correspond to only some embodiments of the invention; a person of ordinary skill in the art can obtain the drawings of other embodiments from them without creative effort.

Fig. 1A is a partial view of the image of one channel in one sequencing cycle obtained with the prior art;

Fig. 1B shows the signals of two nearby molecular clusters over some of the sequencing cycles after correction of spectral crosstalk and phasing; in this figure the second cluster mixes into the signal of the first cluster and causes the 13th base of the first cluster to be called incorrectly;

Fig. 1C is a schematic diagram of signal mixing among three molecular clusters;

Fig. 2 is a schematic diagram of how nearby molecular clusters affect cluster localization;

Fig. 3 is a schematic flowchart of the data processing method according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the step sequence of the data processing method according to an embodiment of the invention;

Fig. 5 is a schematic diagram of data processing results according to an embodiment of the invention;

Fig. 6 is a schematic structural diagram of the data processing device according to an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the invention.

In the course of realizing the invention, the inventors found that some existing molecular sequencing solutions work from the cluster fluorescence signals provided by the sequencing instrument (these data are stored in files with the extension CIF or in uncompressed TXT files). Files in this format mainly contain the fluorescence signal of every molecular cluster in every channel of every sequencing cycle. Because the data provided by the instrument have already discarded the clusters that are heavily mixed because they lie too close together, current methods have no good way of handling mixed signals and instead rely on robust methods to limit the effect of the small amount of mixing that remains.

According to an embodiment of the invention, a data processing method is provided that is mainly applied to molecular sequencing. The method calculates the mixing coefficient between any molecular cluster and another molecular cluster that meets predetermined conditions and applies the calculated coefficients to the identification of molecular sequences, thereby overcoming the effect of signal mixing on the accuracy of sequence identification.

如图3所示,根据本发明实施例的数据处理方法包括:As shown in Figure 3, the data processing method according to the embodiment of the present invention includes:

步骤S2,计算任意分子簇A与符合预定条件的分子簇B的荧光信号之间的混杂系数,任意分子簇A和符合预定条件的分子簇B的混杂系数C(A←B)用于衡量分子簇A中来源于分子簇B的混杂的严重程度。发明人发现,对任一分子簇A和在A的荧光信号中具有混杂的分子簇B,在任意测序轮和频道中,分子簇B在A中的混杂与分子簇B自身的荧光信号的比值近似不变,因此,发明人在实施例中使用该比值作为混杂系数C(A←B)的值。发明人还发现,只有距离较近的分子簇会存在相互混杂的现象。因此只计算任意分子簇与和它距离不超过预定像素值的其它分子簇之间的混杂系数。同时,由于可以通过预处理,使没有混杂的分子簇的荧光信号仅在与其序列对应的频道中存在较大数值,而在其余频道中近似为0,因此使用下述公式计算混杂系数C(A←B):;Step S2, calculating the mixing coefficient between the fluorescence signals of any molecular cluster A and the molecular cluster B meeting the predetermined conditions, the mixing coefficient C (A←B) of any molecular cluster A and molecular cluster B meeting the predetermined conditions is used to measure the molecular Severity of confounding in cluster A derived from molecular cluster B. The inventors found that, for any molecular cluster A and molecular cluster B with scrambling in the fluorescence signal of A, in any sequencing round and channel, the ratio of the scrambling of molecular cluster B in A to the fluorescence signal of molecular cluster B itself Approximately unchanged, therefore, the inventors used this ratio as the value of the confounding coefficient C(A←B) in the examples. The inventors also found that only the molecular clusters that are close to each other can mix with each other. Therefore, only the confounding coefficient between any molecular cluster and other molecular clusters whose distance does not exceed a predetermined pixel value is calculated. At the same time, since the fluorescence signal of molecular clusters without confounding can only have a large value in the channel corresponding to its sequence through preprocessing, and is approximately 0 in the remaining channels, the following formula is used to calculate the confounding coefficient C(A ←B):;

$$C(A \leftarrow B) = \arg\min_{c} \bigl( f(I_A - c\,I_B) + h(c) \bigr)$$

where $h(c)$ is a preset monotone non-decreasing function; $I_A$ and $I_B$ are the fluorescence signals of molecular clusters A and B in the pre-specified sequencing rounds and channels; $n$ is the number of sequencing rounds; for sequencing round $j$, $r_j$ is a preset function that measures the severity of mixing in round $j$; $w_j$ is either a scalar computed from the fluorescence signals of all molecular clusters in round $j$ or a preset constant, and serves as the weight of round $j$ in the calculation of the mixing coefficient; and $c$ is a real number within a predetermined interval.

For fluorescence signals preprocessed in the manner of the embodiment, $r_j$ may take the following form:

where $r$ is the number of channels and $I(j,k)$ is the value of the input fluorescence signal in the $j$-th sequencing round and the $k$-th channel.

When the mixing coefficient is calculated with the above formula, $\arg\min_c \bigl(f(I_A - c\,I_B) + h(c)\bigr)$ can be obtained by using the quantile method to find the zero of the derivative of $f(I_A - c\,I_B) + h(c)$.
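
The quantile method can be illustrated with a minimal sketch. The exact form of f is not reproduced in this text, so the sketch assumes a weighted absolute-value loss, f(x) = Σ_k w_k·|x_k|, and a linear penalty h(c) = lam·c; both are assumptions, as are the function name `mixing_coefficient` and its parameters. Under these assumptions the derivative of the objective changes sign at a weighted quantile of the ratios a_k/b_k, so no iterative optimizer is needed.

```python
def mixing_coefficient(i_a, i_b, weights=None, lam=0.0, lo=0.0, hi=1.0):
    """Minimise sum_k w_k*|a_k - c*b_k| + lam*c over c in [lo, hi].

    Assumed loss and penalty (not taken from the patent text).
    """
    if weights is None:
        weights = [1.0] * len(i_a)
    # Each term w*|a - c*b| equals (w*|b|) * |a/b - c|, so the objective is a
    # weighted sum of distances to the ratios a/b. Terms with b == 0 do not
    # depend on c and are skipped.
    knots = sorted((a / b, w * abs(b))
                   for a, b, w in zip(i_a, i_b, weights) if b != 0)
    total = sum(v for _, v in knots)
    # The derivative is 2*(mass below c) - total + lam; it reaches zero once
    # the accumulated mass reaches (total - lam) / 2.
    target = (total - lam) / 2.0
    if not knots or target <= 0:
        return lo  # the penalty dominates, so the smallest allowed c is optimal
    cum = 0.0
    for ratio, v in knots:
        cum += v
        if cum >= target:
            return min(max(ratio, lo), hi)
    return min(max(knots[-1][0], lo), hi)
```

With lam = 0 this reduces to a weighted median of the ratios, which is robust to the rounds in which cluster A itself is bright in the same channel as cluster B.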

Step S3: process the fluorescence signals of the different molecular clusters according to the mixing coefficients.

In the embodiment, the inventors use the mixing coefficients to correct signal mixing in the molecular cluster fluorescence signals. The correction computes the matrix $I_I$ of mixing-free fluorescence signals of the molecular clusters from the following formula:

$$C \cdot I_I = I_O$$

where each row of $I_I$ holds the fluorescence signal of one molecular cluster and each column holds the fluorescence signals of all molecular clusters in one channel of one sequencing round; $C$ is the matrix formed by the mixing coefficients between the molecular clusters; and $I_O$ is the matrix of molecular cluster fluorescence signals to be corrected, in which, likewise, each row corresponds to one molecular cluster and each column to one channel of one sequencing round.
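
As a minimal sketch of this correction, the linear system can be solved for the mixing-free matrix by Gauss-Jordan elimination. The function name `solve_mixing` is hypothetical, and the sketch assumes a small dense matrix whose entry `c[a][b]` holds C(A←B) with ones on the diagonal. Because only nearby clusters mix, a real matrix would be very sparse and a sparse solver would be the more natural choice.

```python
def solve_mixing(c, observed):
    """Solve c @ clean = observed for clean (rows = clusters).

    c[a][b] is the assumed mixing coefficient C(a <- b); the diagonal is 1.
    """
    n = len(c)
    # Augmented matrix [c | observed]; copying keeps the inputs untouched.
    aug = [list(c[i]) + list(observed[i]) for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mixing matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]
```

For two clusters with C(1←2) = 0.2 and C(2←1) = 0.1, observed rows `[10, 1.6]` and `[1, 8]` are restored to approximately `[10, 0]` and `[0, 8]`.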

After the signal mixing in the molecular cluster fluorescence signals has been corrected with the mixing coefficients, the corrected signals can be processed further in a predetermined manner to complete sequence identification and the calculation of related information.

In addition, before the mixing coefficients between the fluorescence signals of different molecular clusters are calculated, the input data must undergo preprocessing that suits the method used to calculate the mixing coefficients and the characteristics of the input data, including:

Step S1: before the mixing coefficients between the fluorescence signals of different molecular clusters are calculated, correct the molecular cluster fluorescence signals in a predetermined manner, which includes at least one of the following:

Step S121: correct spectral crosstalk;

Step S122: correct dephasing;

Step S11: preprocess the raw image data to generate the molecular cluster fluorescence signals.

Preprocessing the raw image data to generate the molecular cluster fluorescence signals includes:

Step S111: read the raw image data and normalize it, as follows:

Calculate the fluorescence intensity scale at each position of the images in the different channels from the sequencing image data of the first k rounds, where k ≥ 1. Specifically, find the bright spots in each image, where a bright spot is a pixel selected within a single image according to a preset rule and whose fluorescence intensity exceeds that of its surrounding pixels;

Divide the image plane into multiple non-overlapping regions and, in each channel, calculate the median fluorescence intensity of the bright spots contained in each region over the first k sequencing rounds;

Remove, according to a predetermined rule, the bright spots in those regions within a predetermined range of a target region that do not satisfy the rule;

Fit a high-degree surface by least squares to the bright-spot medians computed for the remaining regions, and use the surface to compute the fluorescence intensity scale over the remaining regions of the image, where the degree of the surface is proportional to the number of regions in the image.

Divide the intensity of each pixel of the image by the fluorescence intensity scale at the corresponding position in the current sequencing channel.

Preprocessing the raw image data to generate the molecular cluster fluorescence signals further includes:

Step S112: calculate the background light of the raw image data and remove it;

Step S113: generate the alignment template. First align the channels that exhibit spectral crosstalk in a predetermined sequencing round, then correct the spectral crosstalk in the images of those channels. Compare the fluorescence signals of the pixels at the same position across the corrected images and keep the largest value at each position to generate the alignment template. To align any two images, select regions with the same coordinates in the two crosstalk-corrected images and displace the selected region of one of them. Search over integer-pixel displacements of the selected region, take the displacement with the maximum correlation as the starting point, and refine the displacement with BFGS or another algorithm for unconstrained optimization.

Step S114: locate the molecular clusters on the aligned images.

Specifically, correct the spectral crosstalk on the aligned images and then find the bright spots in the corrected images. Using the fluorescence signals of a target bright spot and of the pixels around it, fit a parabola in each of the two directions, compute the axis of symmetry of each parabola, and take the axes of symmetry as the coordinates of the target bright spot;

Compute the coordinates of the molecular cluster corresponding to each bright spot as the mean of the coordinates of its bright spots that have no neighbours. A bright spot has no neighbours if, within two pixels around the pixel that contains it, there is no other bright spot from the same channel and the same sequencing round.

Step S115: extract the molecular cluster fluorescence signals. Each image is aligned with the alignment template, the position of each molecular cluster in each image is computed, and the fluorescence signal of each molecular cluster is read at that position.

The preprocessing step S1 may further include:

Step S123: after the spectral crosstalk of the molecular cluster fluorescence signals has been corrected, correct the signals for adjacent-group interference, that is, the phenomenon in which the type of the group at the previous position of a molecular cluster interferes differently with the fluorescence signal of the group that follows it.

Specifically, after the spectral crosstalk has been corrected, for any group types a and b, take the molecular clusters whose type is a in sequencing round L and b in sequencing round L+1, and compute the mean or median of their fluorescence intensity in the channel corresponding to b in round L+1. This gives the average scale of the type-b fluorescent label when the type-a label of round L interferes with the type-b fluorescence signal of round L+1, where L ≥ 1;

For any sequencing round M, where M ≥ 2, divide the fluorescence signal of each molecular cluster in each channel of round M by the average scale of that channel's fluorescent label under interference from the type identified for the cluster in round M-1.
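
The two paragraphs above can be sketched as follows. This is an illustrative reading, not the patented implementation: group types are represented by channel indices, the median is used as the average scale, and pairs never observed fall back to a scale of 1.0. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def neighbour_scales(prev_calls, curr_calls, curr_signals):
    """Median intensity in the called channel, keyed by (previous type, current type)."""
    groups = defaultdict(list)
    for a, b, signal in zip(prev_calls, curr_calls, curr_signals):
        groups[(a, b)].append(signal[b])
    return {pair: median(values) for pair, values in groups.items()}


def correct_round(prev_calls, signals, scales):
    """Divide channel b of every cluster by the scale for (previous call, b)."""
    # Pairs that were never observed keep their value (assumed fallback of 1.0).
    return [[v / scales.get((a, b), 1.0) for b, v in enumerate(signal)]
            for a, signal in zip(prev_calls, signals)]
```

Here `prev_calls[i]` is the type identified for cluster i in the previous round, and `signals[i]` lists its intensities per channel in the current round.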

The above method applies to mixing coefficients of arbitrary character between the fluorescence signals of any two molecular clusters. By using the mixing coefficients to reduce the interference caused by signal mixing, it improves the accuracy of molecular cluster sequence identification.

According to an embodiment of the present invention, a data processing device is also provided. The device can be applied in the field of molecular identification and uses the above method to identify molecular sequences more accurately.

As shown in Fig. 6, the device includes:

A mixing coefficient calculation module D2, which calculates the mixing coefficients between the fluorescence signals of different molecular clusters. The mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that meets a predetermined condition measures the severity of the mixing that molecular cluster B causes in the fluorescence signal of molecular cluster A.

A processing module D3, which processes the molecular cluster fluorescence signals using the mixing coefficients in order to complete the identification of the molecular sequences.

The device may further include a preprocessing module D1, which processes the input data in a predetermined manner before the mixing coefficients between the fluorescence signals of different molecular clusters are calculated.

The mixing coefficient calculation module D2 is further configured to calculate the following mixing coefficient: for any molecular cluster A and a molecular cluster B that meets the predetermined condition, the mixing coefficient C(A←B) is the ratio of E(A←B) to E(B←B), where E(A←B) is the part of the fluorescence signal of molecular cluster A that originates from the fluorescent labels of molecular cluster B, and E(B←B) is the part of the fluorescence signal of molecular cluster B that originates from the fluorescent labels of the molecules in molecular cluster B itself.

The mixing coefficient calculation module D2 is further configured to calculate C(A←B) with the following formula:

$$C(A \leftarrow B) = \arg\min_{c} \bigl( f(I_A - c\,I_B) + h(c) \bigr)$$

where $h(c)$ is a preset monotone non-decreasing function; $I_A$ and $I_B$ are the fluorescence signals of molecular clusters A and B in the pre-specified sequencing rounds and channels; $n$ is the number of sequencing rounds; for sequencing round $j$, with $j \ge 1$, $r_j$ is a preset function and $w_j$ is either a scalar computed from the fluorescence signals of all molecular clusters in round $j$ or a preset constant; and $c$ is a real number within a predetermined interval.

The processing module D3 may further include a correction unit D31, which corrects the signal mixing in the molecular cluster fluorescence signals. The correction method includes:

Computing the matrix $I_I$ of mixing-corrected fluorescence signals of the different molecular clusters from the following formula:

$$C \cdot I_I = I_O$$

where each row of $I_I$ holds the fluorescence signal of one molecular cluster and each column holds the fluorescence signals of all molecular clusters in one channel of one sequencing round; $C$ is the matrix formed by the mixing coefficients between the molecular clusters; and $I_O$ is the matrix of molecular cluster fluorescence signals to be corrected, in which, likewise, each row corresponds to one molecular cluster and each column to one channel of one sequencing round.

The processing module D3 may further include a downstream processing unit D32, which performs subsequent processing on the mixing-corrected molecular cluster fluorescence signals so that the identification of the molecular sequences can be completed.

The preprocessing module D1 includes an image processing unit D11 and a preprocessing unit D12. When the input data are images obtained by sequencing, the image processing unit processes the images to generate the molecular cluster fluorescence signals. The preprocessing unit processes the molecular cluster fluorescence signals so that they meet the conditions required for calculating the mixing coefficients.

The image processing unit D11 is further configured to apply the method of the present invention to the images obtained by sequencing and perform the following operations to generate the molecular cluster fluorescence signals: background light removal, normalization, alignment template generation, molecular cluster localization, and extraction of the molecular cluster fluorescence signals.

The image processing unit D11 includes a correction subunit DC and a localization subunit D114:

The correction subunit DC corrects the spectral crosstalk in the images of the channels that exhibit spectral crosstalk;

The localization subunit D114 locates the molecular clusters on the images whose spectral crosstalk has been corrected.

The modules of the device can be implemented in different hardware, software, or combinations of the two. The device can be configured with multiple subunits that have the same function, and data processing can be accelerated by distributing tasks among these subunits for simultaneous processing. For example, the part of module D2 that calculates the individual mixing coefficients can be parallelized with OpenMP, or implemented on a GPU, FPGA, or DSP so that multiple requests for mixing coefficients can be handled at once. Data processing can also be accelerated by configuring multiple instances of the device at the same time.

For a better understanding of the present invention, a specific embodiment is described below. The embodiment applies the invention to the sequencing of DNA molecules and improves sequencing accuracy by processing the input data. Note that the headings in the following embodiment only indicate the content they introduce and do not limit the order in which the technical solution is carried out. Likewise, the steps of the embodiment represent only one feasible implementation of the technical solution; an implementation that merely reorders the steps without any substantial positive effect on the sequencing result does not go beyond the scope of the technical solution of the present invention.

Fig. 4 shows a schematic flowchart of the data processing method according to an embodiment of the present invention.

1. Data preprocessing and determining the position of each molecular cluster:

The average signal peak varies across regions differently in different channels. If this is left untreated, the spectral crosstalk matrices of different regions will be inconsistent, so correcting crosstalk with a single estimated crosstalk matrix introduces a bias that affects the result. However, because the signal peak depends on factors such as the number of molecules in a molecular cluster, the estimated average signal intensity of a sub-region has a large variance. The invention therefore estimates it from the sequencing data of the first four rounds and smooths the estimates by polynomial fitting.

The procedure of this step is as follows:

Step S111: first read in the image data, then use the data of the first four rounds to estimate the intensity scale at each position of the images in the different channels.

The estimation proceeds as follows:

S1111. Find the bright spots in each image. A pixel is regarded as a bright spot if its intensity is greater than that of all 8 surrounding pixels and exceeds the mean intensity of the image plus one quarter of the standard deviation.

S1112. Cut the whole area into small squares. In each channel, for each small square, compute the median intensity of the bright spots from the first four rounds that fall inside the square. This median is taken as the scale estimate for that square.

S1113. Remove the estimates that deviate too far from the intensity scale estimates of the surrounding squares. An estimate deviates too far if the difference between its value and the mean of its (at most 8) neighbours is greater than the difference between the largest and smallest values among those neighbours.

S1114. In each channel, fit a high-degree surface to the remaining estimates by least squares and take the value of the surface at each pixel as the estimate of the intensity scale. The degree of the surface depends on the number of squares in each image.
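
The bright-spot rule of S1111 can be sketched in a few lines. The sketch assumes the image is a list of rows of numbers, skips border pixels (which lack a full set of 8 neighbours), and uses the population standard deviation; these choices and the name `find_bright_spots` are assumptions rather than details given in the text.

```python
from statistics import mean, pstdev


def find_bright_spots(image):
    """Return (row, col) of interior pixels that satisfy the S1111 rule."""
    flat = [v for row in image for v in row]
    # Threshold: image mean plus one quarter of the standard deviation.
    threshold = mean(flat) + pstdev(flat) / 4.0
    spots = []
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            centre = image[r][c]
            neighbours = [image[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            # Strictly brighter than all 8 neighbours and above the threshold.
            if centre > threshold and centre > max(neighbours):
                spots.append((r, c))
    return spots
```

The strict comparison means that a plateau of equal pixels produces no bright spot, which avoids reporting the same spot twice.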

The background light of the input data is then estimated and subtracted, and each pixel is divided by the intensity scale of the corresponding channel.

Step S112: the background light is estimated as follows:

S1121. Divide each image into small squares. Use the k-th smallest of all intensity values in a square as the estimate of the background light of that square.

S1122. Remove the estimates that deviate too far from the background light estimates of the surrounding squares. "Deviates too far" is defined as in the intensity scale estimation.

S1123. Replace each removed estimate with the mean of the background light estimates of its neighbours.

S1124. Compute the background light of each pixel by bilinear interpolation.
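
Steps S1122 and S1123 can be sketched together. This is a minimal illustration that takes the per-square estimates as a 2-D list, reads "difference" as the absolute difference, and computes each neighbour mean from the original estimates (including any neighbouring outliers); these readings and the name `smooth_outliers` are assumptions.

```python
def smooth_outliers(grid):
    """Replace per-square estimates that deviate too far from their neighbours."""
    rows, cols = len(grid), len(grid[0])
    result = [list(row) for row in grid]
    for r in range(rows):
        for c in range(cols):
            # Up to 8 neighbours; squares on the border have fewer.
            neigh = [grid[r + dr][c + dc]
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)
                     and 0 <= r + dr < rows and 0 <= c + dc < cols]
            if not neigh:
                continue
            avg = sum(neigh) / len(neigh)
            # "Too far": farther from the neighbour mean than the neighbours' range.
            if abs(grid[r][c] - avg) > max(neigh) - min(neigh):
                result[r][c] = avg
    return result
```

The same test applies to the intensity scale estimates of S1113, where the flagged squares are dropped before the surface fit instead of being replaced.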

Next, the alignment template is generated and the images of the first five rounds are aligned:

Image alignment relies on the fact that the places that emit light in different images of the chip are all positions of molecular clusters. Aligned images are therefore correlated, and the displacement between images can be found by maximizing the correlation. However, within one round, the positions that emit light in the A and C channel images do not emit light in the G and T channels, so the two groups cannot be aligned directly. Moreover, because the same molecular cluster does not necessarily emit light in images of different channels at the same time, the correlation is weak. High-precision alignment requires this correlation to be strengthened, so the invention improves alignment accuracy by generating a template from pixel-wise maxima.

During alignment, when non-integer pixel positions are involved, the intensity is estimated by piecewise cubic interpolation applied first along the x axis and then along the y axis. Step S113, generating the alignment template and aligning the images of the first five rounds, proceeds as follows:

S1131. Align the C channel image of each round with the A channel image using step S11R. Estimate the spectral crosstalk between the A and C channels and correct it in the aligned images. Then generate the AC channel template of the corresponding sequencing round by taking the pixel-wise maximum of each pair of A and C images, that is, by comparing the intensities at the same position in the two images and keeping the larger value.

S1132. Align the template of the second round with the template of the first round, and the template of the fourth round with that of the third round. Generate template 1 from the pixel-wise maximum of the aligned first-round and second-round templates, and template 2 from the third-round and fourth-round templates. Align template 2 with template 1.

S1133. Align the G and T channel images of the first two rounds with template 2, and align all remaining images with template 1.

Step S11R: the algorithm for aligning two images is as follows:

S11R1. Align the small blocks at the centre of the two images. The alignment criterion is that the correlation between the two images is at its maximum. First search over integer-grid displacements, then take the displacement with the maximum correlation as the starting point and search for a more precise displacement with the BFGS method.

S11R2. Starting from the displacement of the small square at the centre of the two images, search for the displacements between the small squares near the four corners of the two images, again by maximizing the correlation.

S11R3. Treat the coordinate difference between the two images as an affine transformation, and use robust regression to compute the transformations along the x axis and the y axis separately, which yields the affine transformation between the two images.
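
The integer-grid search in S11R1 can be sketched as an exhaustive scan over small displacements, scoring each one by the Pearson correlation on the overlapping pixels. The function name, the `max_shift` parameter, and the choice of Pearson correlation as the correlation measure are assumptions; the sub-pixel BFGS refinement and the affine fit of S11R2 and S11R3 are not covered.

```python
def best_integer_shift(ref, img, max_shift=2):
    """Return (dy, dx) such that img[y + dy][x + dx] best matches ref[y][x]."""
    rows, cols = len(ref), len(ref[0])

    def correlation(dy, dx):
        # Pair up the pixels that overlap under this displacement.
        pairs = [(ref[y][x], img[y + dy][x + dx])
                 for y in range(rows) for x in range(cols)
                 if 0 <= y + dy < rows and 0 <= x + dx < cols]
        n = len(pairs)
        mean_a = sum(a for a, _ in pairs) / n
        mean_b = sum(b for _, b in pairs) / n
        cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
        var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
        var_b = sum((b - mean_b) ** 2 for _, b in pairs)
        if var_a == 0 or var_b == 0:
            return 0.0  # a flat overlap carries no alignment information
        return cov / (var_a * var_b) ** 0.5

    shifts = [(dy, dx)
              for dy in range(-max_shift, max_shift + 1)
              for dx in range(-max_shift, max_shift + 1)]
    return max(shifts, key=lambda s: correlation(*s))
```

The returned displacement would then serve as the starting point for a continuous optimizer that evaluates the correlation at interpolated, non-integer positions.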

Finally, the position of each molecular cluster is identified, and the intensity scale of each molecular cluster in each channel is computed.

Step S114: the molecular clusters are identified as follows:

S1141. Estimate the spectral crosstalk using step SC and correct it. To correct, treat the intensities of each pixel in the four channels as a four-dimensional vector and left-multiply it by the inverse of the estimated crosstalk matrix.

S1142. Find the bright spots in each image. A pixel is a bright spot if its intensity is greater than that of all 8 adjacent pixels and exceeds a threshold determined from the whole image. More precise coordinates are obtained from five intensity values (the centre of the bright spot and the pixels above, below, left, and right of it) by fitting a parabola in each of the two directions and computing its axis of symmetry.

S1143. Treat each pixel as a cell and place the detected bright spots into these cells. If two adjacent cells together have bright spots in at most one channel in every round, merge them. Merging means moving the bright spots of the cell with the lower total bright-spot intensity into the other cell.

S1144. Delete the cells for which the total intensity of all bright spots in the cell and its surrounding cells is too low. Delete the cells whose intensity is excessively high and shows no obvious change over the five sequencing rounds. Delete the cells whose mean bright-spot intensity is too low compared with neighbouring cells.

S1145. Regard every remaining cell that contains bright spots as a molecular cluster. Use the mean of the coordinates of those of its bright spots that lie in a different channel from the bright spots of neighbouring cells as the coordinates of the molecular cluster.
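
The sub-pixel refinement in S1142 follows from the vertex formula of a parabola through three equally spaced samples. The sketch below is illustrative: the function names are hypothetical, the image is a list of rows, and the bright spot is assumed not to lie on the image border.

```python
def parabola_vertex_offset(left, centre, right):
    """Offset of the vertex of the parabola through samples at -1, 0 and +1.

    For a strict local maximum at the centre the offset lies within half a pixel.
    """
    curvature = left - 2.0 * centre + right
    if curvature >= 0:
        return 0.0  # not a strict maximum; keep the integer position
    return 0.5 * (left - right) / curvature


def refine_spot(image, r, c):
    """Refine an integer bright-spot position using its four direct neighbours."""
    dy = parabola_vertex_offset(image[r - 1][c], image[r][c], image[r + 1][c])
    dx = parabola_vertex_offset(image[r][c - 1], image[r][c], image[r][c + 1])
    return r + dy, c + dx
```

For samples 8.4375, 9.9375, 9.4375 taken from the parabola 10 - (x - 0.25)^2, the offset is 0.25, which recovers the true position of the peak.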

Step SC: the spectral crosstalk between m channels is estimated as follows:

SC1. Normalize each channel so that the variance is the same in all channels. Treat the input as a population of m-dimensional vectors.

SC2. Starting from the unit vectors of the four channels, run k-means clustering with k = m on all input vectors. The distance used for clustering is defined as $d(x, y) = 1 - \cos\langle x, y \rangle$.

SC3. Compute the median of each class in each channel to obtain an estimated vector for each class. These vectors form the crosstalk matrix of the normalized data.

SC4. Compute the crosstalk matrix before normalization from the normalization information.
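
Steps SC1 to SC4 can be sketched compactly. The sketch makes several assumptions that the text does not specify: class centres are updated with the mean of their members, a fixed number of iterations is used, all-zero vectors are skipped because the cosine distance is undefined for them, and every channel has non-zero variance. The function names are hypothetical.

```python
from statistics import median, pstdev


def cosine_distance(x, y):
    """d(x, y) = 1 - cos<x, y>; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return 1.0 - dot / (norm_x * norm_y)


def estimate_crosstalk(signals, iterations=10):
    """Return an m x m matrix; entry [k][l] is the response of label l in channel k."""
    m = len(signals[0])
    # SC1: rescale every channel to unit standard deviation.
    scale = [pstdev([s[k] for s in signals]) for k in range(m)]
    data = [[s[k] / scale[k] for k in range(m)] for s in signals]
    data = [v for v in data if any(v)]
    # SC2: k-means with k = m, started from the channel unit vectors.
    centres = [[1.0 if k == l else 0.0 for k in range(m)] for l in range(m)]
    groups = [[] for _ in range(m)]
    for _ in range(iterations):
        groups = [[] for _ in range(m)]
        for v in data:
            label = min(range(m), key=lambda l: cosine_distance(v, centres[l]))
            groups[label].append(v)
        for l, members in enumerate(groups):
            if members:  # an empty class keeps its previous centre
                centres[l] = [sum(v[k] for v in members) / len(members)
                              for k in range(m)]
    # SC3: per-class, per-channel medians give the columns of the matrix.
    columns = [[median(v[k] for v in members) for k in range(m)]
               if members else centres[l]
               for l, members in enumerate(groups)]
    # SC4: undo the channel rescaling.
    return [[columns[l][k] * scale[k] for l in range(m)] for k in range(m)]
```

Each column is recovered only up to the typical brightness of its class, so the columns are usually rescaled (for example to a unit diagonal) before the matrix is inverted for correction.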

2. Step S115: extracting the molecular cluster fluorescence signals

The procedure of this step is as follows:

For each image read in, first remove its background light using S112, then use S11R to compute the transformation needed to align it with the template. The coordinates of each molecular cluster in the image are then computed from the affine transformation. An interpolation algorithm gives the intensity of each molecular cluster, which is then divided by the average scale of that molecular cluster in the corresponding channel. The algorithms involved have been described above or can be implemented directly from the description, so they are not repeated here.

3. Step S12, preprocessing of the molecular cluster fluorescence signals

In the CIF file each molecular cluster is described by a series of discrete numbers arranged in n rows and 4 columns; each number is the light intensity on one channel in one sequencing round. For handling spectral crosstalk and dephasing, the following probabilistic model of the i-th molecular cluster is widely accepted:

$I_i = \lambda_i P S_i M^{T} + N + \varepsilon_i$

Here $I_i$ is the light intensity recorded in the CIF file and $S_i$ is the base sequence of the molecular cluster. Like $I_i$, $S_i$ is a matrix with n rows and 4 columns; each row has exactly one element equal to 1 and three elements equal to 0, and the position of the 1 gives the base type of the cluster in the sequencing round that the row represents. P is the n×n phase matrix, whose element in row j, column l is the probability that the base at position l emits light in sequencing round j. M is the 4×4 spectral crosstalk matrix, whose element in row j, column l is the fluorescence intensity of base type l on channel j. $\varepsilon_i$ is a white-noise matrix with n rows and 4 columns that represents the measurement error.

The flow of this step is as follows:

Step S121: estimate and correct spectral crosstalk. The specific steps are:

Step S1211: estimate the crosstalk matrix using SC. Step S1212: correct the spectral crosstalk.

Step S122: estimate and correct dephasing. The specific steps are:

Step S1221: estimate the phase matrix. Using this phase matrix as the initial value, an iteratively reweighted least-squares algorithm then estimates a more accurate 4m×4m matrix that captures both phasing and spectral crosstalk. Here m denotes the number of sequencing rounds.

Step S1222: correct the fluorescence signals with the new matrix.

Step S123: correct the interference between adjacent bases. The steps are as follows:

Step S1231. Determine the base type of each molecular cluster in each sequencing round from the largest light intensity value.

Step S1232. Using the data of the first four rounds, compute, for each base type called in the preceding round, the median light intensity of each base type of the current round on its corresponding channel.

Step S1233. For the data of each round of each molecular cluster, divide the data of each channel in the current round by the corresponding median light intensity, chosen according to the base type identified in the preceding round. Then repeat the identification for the current round.
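Steps S1231 to S1233 can be sketched as below. Several details are assumptions made only for illustration: medians are pooled over all clusters, a missing or non-positive median falls back to 1.0, and the first round is left unchanged because it has no preceding base. All function names are hypothetical.

```python
from statistics import median


def call_base(row):
    """S1231: index of the channel with the largest intensity."""
    return max(range(len(row)), key=lambda k: row[k])


def neighbor_medians(clusters, rounds=4):
    """S1232: median intensity of base b on its own channel, given previous base p.

    `clusters` is a list of signals, each a list of per-round channel rows.
    Returns a dict keyed by (p, b).
    """
    samples = {}
    for signal in clusters:
        calls = [call_base(row) for row in signal]
        for j in range(1, min(rounds, len(signal))):
            key = (calls[j - 1], calls[j])
            samples.setdefault(key, []).append(signal[j][calls[j]])
    return {key: median(values) for key, values in samples.items()}


def correct_cluster(signal, medians):
    """S1233: rescale each round by the medians selected by the previous call."""
    corrected = [list(signal[0])]
    previous = call_base(signal[0])
    for row in signal[1:]:
        new_row = []
        for k, value in enumerate(row):
            divisor = medians.get((previous, k), 1.0)
            new_row.append(value / divisor if divisor > 0 else value)
        corrected.append(new_row)
        previous = call_base(new_row)  # re-identify the base of this round
    return corrected
```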

Step S12 may alternatively be replaced by:

Step S12R: use other existing methods to correct the problems in the molecular cluster fluorescence signals other than signal mixing.

4. Step S2, correcting the signal mixing between molecular clusters

This step relies on the following model:

where M is the spectral crosstalk matrix and P is the phase matrix, both defined in step S12; C is the signal mixing matrix, both of whose dimensions equal the number of molecular clusters; ξ is the three-dimensional array of observation errors; S is a three-dimensional state array of zeros and ones built from the sequences, representing the sequences of all molecular clusters; and O is the three-dimensional array of extracted light intensities. The three dimensions of these three arrays have lengths equal to the number of molecular clusters, the number of sequencing rounds, and the number of channels, respectively. The meanings of M and P are not repeated here. The element in row i, column l of C describes how the fluorescent labels of the l-th molecular cluster emit in the CIF data of the i-th molecular cluster; it is called the mixing coefficient and written C(i←l), or $c_{il}$. The model uses a product of a three-dimensional array H with a matrix A along the r-th dimension: fixing the indices of H in the other two dimensions and traversing the r-th dimension gives a vector, and left-multiplying that vector by A gives the vector at the corresponding position of the new array. This operation is compatible with matrix multiplication when applied repeatedly along the same dimension, commutative when applied along different dimensions, and invertible for an invertible A. Using the commutativity (the result does not depend on which dimension is processed first), one obtains:

where I is the data corrected for spectral crosstalk and phasing. The other phenomena can therefore be corrected first; the mixing between molecular clusters is then estimated, and the correction for mixing is completed either by solving the resulting linear system or by direct computation.
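The displayed formulas of this model are not present in the text. As a hedged reconstruction from the surrounding definitions only, write $H \times_r A$ for the operation that left-multiplies every vector along the r-th dimension of H by A. The symbol $\times_r$ and the index order (clusters, rounds, channels) are assumptions, not necessarily the patent's notation. The model and its rearrangement can then be sketched as:

```latex
\begin{aligned}
(H \times_1 A)_{i,j,k} &= \sum_{l} A_{il}\, H_{l,j,k}
  && \text{(analogously for } r = 2, 3\text{)} \\
(H \times_r A) \times_r B &= H \times_r (BA)
  && \text{(same dimension)} \\
(H \times_r A) \times_s B &= (H \times_s B) \times_r A, \quad r \neq s
  && \text{(different dimensions)} \\
(H \times_r A) \times_r A^{-1} &= H
  && \text{(invertible } A\text{)} \\
O &= S \times_1 C \times_2 P \times_3 M + \xi \\
I := O \times_2 P^{-1} \times_3 M^{-1}
  &= S \times_1 C + \xi \times_2 P^{-1} \times_3 M^{-1}
\end{aligned}
```

Flattening the round and channel dimensions into columns turns $S \times_1 C$ into the ordinary matrix product $CS$, which agrees with the system CS = I solved in step S3R1.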

When estimating the signal mixing matrix, the mixing coefficient between two molecular clusters can be determined by building an objective function that measures the signal quality of a molecular cluster and then optimizing it. This yields an estimate of the mixing matrix, and solving the model equation then removes the mixing. Specifically, the diagonal elements of the mixing matrix are first set to 1, and molecular clusters that are far apart are taken not to mix with each other (value 0). For molecular clusters that are close together, taking molecular cluster 1 and molecular cluster 2 as an example, the following two-cluster model is used:

Rearranging gives:

$I_1 = c_{12} I_2 + (1 - c_{12} c_{21})\, S_1 + (\xi_1 - c_{12}\,\xi_2)$
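The two-cluster model itself is missing from the text, but the rearranged equation constrains it. A reconstruction consistent with that equation, assuming unit diagonal mixing coefficients as stated earlier, is:

```latex
\begin{aligned}
I_1 &= S_1 + c_{12}\, S_2 + \xi_1 \\
I_2 &= c_{21}\, S_1 + S_2 + \xi_2 \\
I_1 - c_{12} I_2 &= (1 - c_{12} c_{21})\, S_1 + (\xi_1 - c_{12}\,\xi_2)
\end{aligned}
```

Subtracting $c_{12}$ times the second line from the first cancels $S_2$ and gives the third line, which is the rearranged form.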

Here $\xi_1 - c_{12}\xi_2$ has expectation 0, and $S_1$ is 0 on every channel except the one corresponding to the base type of molecular cluster 1. One can therefore identify the base type at each position of molecular cluster 1, remove the corresponding channel, and estimate $c_{12}$ from the remaining channels; the estimate is obtained by building an objective function and finding its extremum. When the mixing between molecular cluster signals is corrected, a large mixing coefficient causes additional loss of precision in the light intensity data of the four channels, so a penalty on large mixing coefficients must be introduced into the objective function. Since, on the channels other than those corresponding to the bases of molecular cluster 1, $I_1 - tI_2$ has expectation 0 when $t = c_{12}$, one can choose an objective function of the form $g_{(1,2)}(t) = f(I_1 - tI_2) + h(t)$, where h(t) is a monotonically increasing function and f combines, over the sequencing rounds j, a weight $w_j$ that measures the sequencing accuracy of round j with a function $r_j$ that measures how severely the signal of round j is mixed.

The mixing proportion is estimated with a weighted LAD (least absolute deviations) method. Assuming that, in the signal of molecular cluster 1, the channel with the largest signal in each sequencing round corresponds to the base type of cluster 1 at that position, and taking h(t) to be a linear function, one obtains the objective function:

$g_{(1,2)}(t) = f(I_1 - tI_2) + ut$

where u is a positive constant computed from the weights or from the observation error of the molecular cluster fluorescence signals, and the function f is defined as follows:

It measures the purity of the input signal. Optimizing the objective function yields an estimation algorithm for each mixing coefficient.
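The definition of f is not reproduced in the text, so the following sketch rests on an explicit assumption: f is taken to be the weighted sum, over rounds, of absolute intensities on all channels except the one called for cluster 1. Under that assumption g is convex and piecewise linear in t, so its minimum on an interval lies at an endpoint or at a breakpoint $t = I_1(j,k) / I_2(j,k)$. The function names are hypothetical.

```python
def called_channels(signal):
    """Channel with the largest signal in each round (assumed base of cluster 1)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in signal]


def objective(t, i1, i2, weights, u):
    """g(t) = f(I1 - t*I2) + u*t with the assumed weighted-LAD form of f."""
    total = u * t
    for row1, row2, w, skip in zip(i1, i2, weights, called_channels(i1)):
        total += w * sum(abs(a - t * b)
                         for k, (a, b) in enumerate(zip(row1, row2))
                         if k != skip)
    return total


def minimize_objective(i1, i2, weights, u, lo=0.0, hi=1.0):
    """Exact minimizer on [lo, hi]: test the endpoints and every breakpoint."""
    candidates = {lo, hi}
    for row1, row2, skip in zip(i1, i2, called_channels(i1)):
        for k, (a, b) in enumerate(zip(row1, row2)):
            if k != skip and b != 0:
                t = a / b
                if lo <= t <= hi:
                    candidates.add(t)
    return min(sorted(candidates),
               key=lambda t: objective(t, i1, i2, weights, u))
```

With noiseless signals in which cluster 2 leaks into cluster 1 at a ratio of 0.3, the minimizer recovers 0.3 as long as the penalty slope u is smaller than the total weight.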

The method of step S2 is as follows:

After the preliminary correction of the problems other than mutual mixing has been completed, proceed as follows. Assume that each picture to be processed contains n molecular clusters.

Step S21: perform preprocessing and compute the parameters needed for calculating the mixing coefficients. The steps are as follows:

S211. For each molecular cluster, take the three signal values in each sequencing round that are not the largest signal, compute the median of these signals, and estimate the variance from the median.

S212. For each sequencing round j, compute the weight $w_j$ from an arbitrary positive constant C, whose value does not affect the result, and the variance of round j estimated in the previous step.

S213. For the parameter ink (given in advance, in the range 0.5 to 0.8; a higher value slightly improves sequencing accuracy but increases the sequence duplication rate, and a lower value does the opposite), compute $thr = (-0.75 + 1.5\,ink)\sum_j w_j$.

Step S214: create an empty sparse matrix S. Write the index of each molecular cluster into the element, at the position of that cluster, of an array with the same size as the picture. For each molecular cluster, use this array to find all molecular clusters within a given number of pixels of it, and then estimate the mixing of those clusters into it.
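The neighbor lookup of step S214 can be sketched with a picture-sized index array. The sketch assumes that coordinates are rounded to the nearest pixel and that at most one cluster falls on any pixel; the function names and the use of -1 for empty pixels are illustrative choices.

```python
import math


def build_index_grid(coords, width, height):
    """Picture-sized array holding the cluster index at each cluster position."""
    grid = [[-1] * width for _ in range(height)]
    for index, (x, y) in enumerate(coords):
        grid[int(round(y))][int(round(x))] = index
    return grid


def neighbors_within(grid, coords, i, radius):
    """Indices of all clusters no farther than `radius` pixels from cluster i."""
    x, y = coords[i]
    height, width = len(grid), len(grid[0])
    cx, cy = int(round(x)), int(round(y))
    reach = int(math.ceil(radius)) + 1  # +1 absorbs the rounding of both points
    found = []
    for yy in range(max(0, cy - reach), min(height, cy + reach + 1)):
        for xx in range(max(0, cx - reach), min(width, cx + reach + 1)):
            j = grid[yy][xx]
            if j >= 0 and j != i:
                dx, dy = coords[j][0] - x, coords[j][1] - y
                if dx * dx + dy * dy <= radius * radius:
                    found.append(j)
    return found
```

Only the pairs returned by `neighbors_within` need an entry in the sparse mixing matrix; all other off-diagonal entries stay at 0.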

Step S22: for any molecular cluster i and any molecular cluster j whose distance from it is less than a predetermined constant, estimate the mixing coefficient C(i←j), that is, $c_{ij}$. The estimation method is as follows:

S221. If i = j, set $c_{ij}$ to 1; otherwise carry out the following steps.

S222. Define the function g from $I_i$ and $I_j$, the light intensities of molecular cluster i and molecular cluster j after the other corrections. Set the variable l to 0 and r to 1, then proceed to the next step.

S223. Compute g(0.6l + 0.4r). If its value is greater than thr, change l to 0.6l + 0.4r; otherwise change r to 0.6l + 0.4r. Then, if |l − r| > 0.001, repeat this step; otherwise proceed to the next step.

S224. Assign l to $c_{ij}$.
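The search in S222 to S224 is an asymmetric bisection on the interval [0, 1]. A minimal sketch follows; because the definition of g is not reproduced in the text, g is passed in as a caller-supplied function, and the name `estimate_mixing` is hypothetical.

```python
def estimate_mixing(g, thr, tol=0.001):
    """Narrow [l, r] around the point where g crosses thr and return l.

    g   -- callable taking a candidate mixing coefficient t
    thr -- threshold from step S213
    """
    left, right = 0.0, 1.0                 # S222: l = 0, r = 1
    while abs(left - right) > tol:         # S223: repeat while |l - r| > 0.001
        t = 0.6 * left + 0.4 * right
        if g(t) > thr:
            left = t
        else:
            right = t
    return left                            # S224: c_ij = l
```

Each iteration shrinks the interval to 60% or 40% of its length, so the loop always terminates. Because every pair (i, j) is handled independently, calls to this function can run in parallel.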

The estimation of different mixing coefficients in step S2 can be carried out in parallel, for example through GPU programming, multi-core CPUs, or FPGAs.

5. Step S3, subsequent processing

This step includes:

Step S31: after step S2 has produced the estimate of C, take either the unprocessed input molecular cluster fluorescence signals or the molecular cluster fluorescence signals O obtained in step S115, and solve CD = O to obtain the light intensities D corrected for mutual mixing.
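Solving CD = O is an ordinary linear solve with many right-hand sides. The sketch below uses dense Gauss-Jordan elimination with partial pivoting purely for illustration; the source stores C as a sparse matrix, for which a sparse solver would be the natural choice. Rows of O and D are clusters, and columns are the flattened (round, channel) pairs. The function name is hypothetical.

```python
def solve_mixing(C, O):
    """Return D such that C * D = O (dense Gauss-Jordan, partial pivoting)."""
    n = len(C)
    # Augmented matrix [C | O], one row per molecular cluster.
    aug = [[float(v) for v in C[i]] + [float(v) for v in O[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mixing matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [[v / aug[i][i] for v in aug[i][n:]] for i in range(n)]
```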

Step S32: repeat step S12 on the light intensity data corrected for the mutual mixing of molecular cluster signals, in order to correct spectral crosstalk, dephasing, and similar effects.

Step S33: for the data of each round of each molecular cluster, determine the base type at the corresponding position from the channel with the largest light intensity value. Determine the quality value from the purity of the molecular cluster signal. Output the base types and quality values.
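Step S33 can be sketched as follows. The text does not define the purity measure, so this example assumes purity is the largest intensity divided by the sum of the non-negative intensities of the round; the channel order "ACGT" and the function name are likewise assumptions.

```python
def call_bases(signal, alphabet="ACGT"):
    """Return (sequence, purities) for one cluster.

    signal   -- list of per-round rows, one intensity per channel
    alphabet -- assumed mapping from channel index to base letter
    """
    bases, purities = [], []
    for row in signal:
        best = max(range(len(row)), key=lambda k: row[k])
        positive = [max(v, 0.0) for v in row]   # ignore negative residuals
        total = sum(positive)
        purities.append(positive[best] / total if total > 0 else 0.0)
        bases.append(alphabet[best])
    return "".join(bases), purities
```

A purity close to 1 means almost all of the signal sits on the called channel; mapping purity to a calibrated quality score is outside this sketch.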

Steps S2 and S31 can be carried out in the following way:

Step S2P: divide the planar region containing the molecular cluster coordinates in a predetermined manner. For each sub-region, select all molecular clusters contained in the sub-region together with all molecular clusters whose distance from the sub-region does not exceed a predetermined value, execute step S2 and step S31 on the selected clusters, and then take the results for the clusters contained in the sub-region as the light intensities corrected for signal mixing. The operations on the sub-regions can be carried out in parallel. Steps S211 to S213 can be executed separately for each sub-region or once before step S2P.
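One possible partition for step S2P uses square tiles with a margin. The square tiling, the brute-force margin search, and the names below are assumptions chosen for brevity; the source only requires some predetermined division.

```python
def partition_with_halo(coords, tile, margin):
    """Group clusters by square tile and attach nearby clusters as a halo.

    Returns {(tx, ty): {"core": [...], "halo": [...]}}. Steps S2 and S31 run
    on core + halo; only the results for the core clusters are kept.
    """
    tiles = {}
    for index, (x, y) in enumerate(coords):
        key = (int(x // tile), int(y // tile))
        tiles.setdefault(key, {"core": [], "halo": []})["core"].append(index)
    for (tx, ty), groups in tiles.items():
        x0, y0 = tx * tile, ty * tile
        x1, y1 = x0 + tile, y0 + tile
        core = set(groups["core"])
        for index, (x, y) in enumerate(coords):
            if index in core:
                continue
            # Distance from the point to the tile rectangle (0 if inside).
            dx = max(x0 - x, 0.0, x - x1)
            dy = max(y0 - y, 0.0, y - y1)
            if dx * dx + dy * dy <= margin * margin:
                groups["halo"].append(index)
    return tiles
```

Choosing the margin at least as large as the neighbor radius of step S22 keeps every mixing partner of a core cluster inside the same sub-problem.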

Step S31 and step S32 can be replaced by:

Step S3R1: for the molecular cluster fluorescence signals I obtained in step S123, solve CS = I to obtain signals that can be used directly for base identification.

Step S32 and step S33 can be replaced by:

Step S3R2: output D and complete the sequencing with a third-party tool such as AYB (Massingham & Goldman, 2012).

The inventors ran a simulation test of the technical solution of the present invention on fluorescence signal data from molecular cluster sequencing, as shown in Figure 5. Figure 5 is a schematic diagram of data processing results according to an embodiment of the present invention. The horizontal axis is the distance to the center of the nearest molecular cluster and the vertical axis is the count. The black part (CACC improved PF reads) is the improvement in sequencing accuracy obtained with the embodiment. In each group, the left bar is the proportion of perfectly matching sequences after the data are processed by the present invention, the middle bar is the result of the same solution without correcting the mutual mixing of molecular cluster signals, and the right bar is the total number of identified molecular clusters. The mapping accuracy improves most for molecular clusters that lie 1 to 3 pixels from their nearest neighboring cluster.

The inventors have also produced software that applies the technical solution of the present invention. The software accepts sequencing image data or molecular cluster fluorescence signal data as input, corrects the signal mixing by computing the mixing coefficients, and outputs either the molecular cluster fluorescence signals corrected for signal mixing or the sequence identification results with quality values. In accordance with the technical solution, the software is divided into a preprocessing module, a mixing coefficient calculation module, and a processing module, which respectively preprocess the input data, calculate the mixing coefficients, and perform the subsequent processing of the input data according to the mixing coefficients. The preprocessing module is divided into an image processing unit and a preprocessing unit: the image processing unit handles the case in which the input data are sequencing images, and the preprocessing unit preprocesses the data so that they satisfy the conditions for calculating the mixing coefficients. The details of the software follow the steps described above and are not repeated here. One version of the software is compiled from C++ code and another is implemented as a Matlab program. The individual steps of the software are parallelized with OpenMP, which speeds up execution.

In summary, the technical solution of the present invention adaptively corrects the signal mixing between neighboring molecular clusters and thereby identifies molecular sequences more accurately. The present invention can read in raw image data or molecular cluster fluorescence signal data and output either molecular cluster fluorescence signal data corrected for signal mixing or the final molecular sequences with quality assessment. The technique can be applied directly to data produced by DNA sequencing instruments that use bridge amplification, and it can also be applied to data produced by other devices that identify the structure or sequence of multiple molecules.

The above describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection. This work was supported by the National Natural Science Foundation of China Major Research Program cultivation project 91130008.

References

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., ... & Anastasi, C. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53-59.

Li, L., & Speed, T. P. (1999). An estimate of the crosstalk matrix in four-dye fluorescence-based DNA sequencing. Electrophoresis, 20(7), 1433-1442.

Massingham, T., & Goldman, N. (2012). All Your Base: a fast and accurate probabilistic approach to base calling. Genome Biol, 13, R13.

Whiteford, N., Skelly, T., Curtis, C., Ritchie, M. E., A., Zaranek, A. W., ... & Brown, C. (2009). Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics, 25(17), 2194-2199.

Claims (14)

1. A data processing method for multiple molecular signals, characterized by comprising:

calculating a mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that satisfies a predetermined condition, wherein the molecular cluster B that satisfies the predetermined condition is a molecular cluster B whose signal is mixed into the fluorescence signal of the molecular cluster A;

processing the fluorescence signals of different molecular clusters according to the mixing coefficient;

wherein, for any said molecular cluster A and said molecular cluster B satisfying the predetermined condition, C(A←B) measures the severity of the mixing originating from the molecular cluster B in the fluorescence signal of the molecular cluster A, the mixing being the fluorescence signal of the fluorescent labels in the molecular cluster B that appears in the fluorescence signal of the molecular cluster A; a molecular cluster is a set of specific molecules that contains specific molecules having the same sequence, the average distance between these specific molecules being smaller than the average distance between molecules of different molecular clusters; for any said molecular cluster A, its fluorescence signal is data obtained in a predetermined manner that can be used to identify the sequence or a subsequence of the molecules contained in the molecular cluster A; the sequence of a molecule is the types of the basic molecular elements at one or more predetermined positions in the molecule;

wherein processing the fluorescence signals of different molecular clusters according to the mixing coefficient comprises: correcting the signal mixing in the fluorescence signals of the different molecular clusters by means of the mixing coefficient;

wherein the signal mixing means that fluorescence signals belonging to the fluorescent labels of molecules in other molecular clusters appear in the fluorescence signal of any molecular cluster;

correcting the signal mixing in the fluorescence signals of the different molecular clusters by means of the mixing coefficient comprises: calculating, by the following formula, the matrix $I_I$ composed of the fluorescence signals of the different molecular clusters corrected for signal mixing:

$C \cdot I_I = I_O$;

wherein, in the matrix $I_I$, the elements of each row correspond to the fluorescence signal of one molecular cluster and the elements of each column correspond to the fluorescence signals of all molecular clusters on one channel in one sequencing round; C is the matrix composed of the mixing coefficients between the molecular clusters; $I_O$ is the matrix composed of the fluorescence signals of the molecular clusters that require the correction, and in the matrix $I_O$ the elements of each row correspond to the fluorescence signal of one molecular cluster and the elements of each column correspond to the fluorescence signals of all molecular clusters on one channel in one sequencing round.

2. The method according to claim 1, characterized in that, for any said molecular cluster A and said molecular cluster B, the mixing coefficient C(A←B) is the ratio of E(A←B) to E(B←B), wherein E(A←B) is the fluorescence signal, within the fluorescence signal of the molecular cluster A, that originates from the fluorescent labels of the molecular cluster B, and E(B←B) is the fluorescence signal, within the fluorescence signal of the molecular cluster B, that originates from the fluorescent labels of the molecules in the molecular cluster B.

3. The method according to claim 1, characterized in that the mixing coefficient C(A←B) is calculated by the following formula:

$C(A \leftarrow B) = \arg\min_c \left( f(I_A - cI_B) + h(c) \right)$;

wherein h(c) is a preset monotone non-decreasing function; $I_A$ and $I_B$ are the fluorescence signals of the molecular cluster A and the molecular cluster B in pre-specified sequencing rounds and sequencing channels; n is the number of sequencing rounds; for sequencing round j, $r_j$ is a preset function and $w_j$ is either a scalar calculated from the fluorescence signals of all molecular clusters in the j-th sequencing round or a preset constant; c is a real number within a predetermined interval;

$r_j$ is given by a formula in which r is the number of channels and I(j,k) is the value of the input fluorescence signal in the j-th sequencing round and the k-th channel.

4. The method according to claim 3, characterized in that $\arg\min_c (f(I_A - cI_B) + h(c))$ is obtained by using a quantile method to find the zero of the derivative of $f(I_A - cI_B) + h(c)$.

5. The method according to claim 1, characterized by further comprising, before calculating the mixing coefficients between the fluorescence signals of the different molecular clusters:

processing the input data in a predetermined manner, the predetermined manner including at least one of the following:

correcting spectral crosstalk, correcting dephasing, and preprocessing raw image data to generate the fluorescence signals of the molecular clusters.

6. The method according to claim 5, characterized in that preprocessing the raw image data to generate the fluorescence signals of the molecular clusters comprises:

correcting the spectral crosstalk of the images corresponding to the channels in which spectral crosstalk is present;

performing a molecular cluster localization operation on the images corrected for spectral crosstalk,

wherein the molecular cluster localization operation means determining the molecular clusters in the images that satisfy a predetermined condition and determining the coordinates of those molecular clusters, the molecular clusters that satisfy the predetermined condition being the molecular clusters corresponding to the target bright spots in each image.

7. The method according to claim 1, characterized in that processing the fluorescence signals of different molecular clusters according to the mixing coefficient further comprises:

identifying the sequences of the molecules in the molecular clusters from the molecular cluster fluorescence signals corrected for signal mixing.

8. A data processing device for multiple molecular signals, characterized by comprising:

a mixing coefficient calculation module, configured to calculate a mixing coefficient C(A←B) between the fluorescence signals of any molecular cluster A and a molecular cluster B that satisfies a predetermined condition, wherein the molecular cluster B that satisfies the predetermined condition is a molecular cluster B whose signal is mixed into the fluorescence signal of the molecular cluster A;

a processing module, configured to process the fluorescence signals of different molecular clusters according to the mixing coefficient;

wherein, for any said molecular cluster A and said molecular cluster B satisfying the predetermined condition, C(A←B) measures the severity of the mixing originating from the molecular cluster B in the fluorescence signal of the molecular cluster A, the mixing being the fluorescence signal of the fluorescent labels in the molecular cluster B that appears in the fluorescence signal of the molecular cluster A; a molecular cluster is a set of specific molecules that contains specific molecules having the same sequence, the average distance between these specific molecules being smaller than the average distance between molecules of different molecular clusters; for any said molecular cluster A, its fluorescence signal is data obtained in a predetermined manner that can be used to identify the sequence or a subsequence of the molecules contained in the molecular cluster A; the sequence of a molecule is the types of the basic molecular elements at one or more predetermined positions in the molecule;

wherein a correction unit is configured to correct the signal mixing in the fluorescence signals of the different molecular clusters by means of the mixing coefficient,

wherein the signal mixing means that fluorescence signals belonging to the fluorescent labels of molecules in other molecular clusters appear in the fluorescence signal of any molecular cluster;

the correction unit is further configured to calculate, by the following formula, the matrix $I_I$ composed of the fluorescence signals of the different molecular clusters corrected for signal mixing:

$C \cdot I_I = I_O$;

wherein, in the matrix $I_I$, the elements of each row correspond to the fluorescence signal of one molecular cluster and the elements of each column correspond to the fluorescence signals of all molecular clusters on one channel in one sequencing round; C is the matrix composed of the mixing coefficients between the molecular clusters; $I_O$ is the matrix composed of the fluorescence signals of the molecular clusters that require the correction, and in the matrix $I_O$ the elements of each row correspond to the fluorescence signal of one molecular cluster and the elements of each column correspond to the fluorescence signals of all molecular clusters on one channel in one sequencing round.

9. The device according to claim 8, characterized in that the mixing coefficient C(A←B) is the ratio of E(A←B) to E(B←B), wherein E(A←B) is the fluorescence signal, within the fluorescence signal of the molecular cluster A, that belongs to the fluorescent labels of the molecules in the molecular cluster B, and E(B←B) is the fluorescence signal, within the fluorescence signal of the molecular cluster B, that belongs to the fluorescent labels of the molecules in the molecular cluster B.

10. The device according to claim 8, characterized in that the mixing coefficient calculation module is further configured to calculate the mixing coefficient C(A←B) by the following formula:

$C(A \leftarrow B) = \arg\min_c \left( f(I_A - cI_B) + h(c) \right)$;

wherein h(c) is a preset monotone non-decreasing function; $I_A$ and $I_B$ are the fluorescence signals of the molecular cluster A and the molecular cluster B in pre-specified sequencing rounds and sequencing channels; n is the number of sequencing rounds; for sequencing round j, $r_j$ is a preset function and $w_j$ is either a scalar calculated from the fluorescence signals of all molecular clusters in the j-th sequencing round or a preset constant; c is a real number within a predetermined interval;

$r_j$ is given by a formula in which r is the number of channels and I(j,k) is the value of the input fluorescence signal in the j-th sequencing round and the k-th channel.

11. The device according to claim 10, characterized in that $\arg\min_c (f(I_A - cI_B) + h(c))$ is obtained by using a quantile method to find the zero of the derivative of $f(I_A - cI_B) + h(c)$.

12. The device according to claim 8, characterized by further comprising:

a preprocessing module, configured to process the input data in a predetermined manner before the mixing coefficients between the fluorescence signals of the different molecular clusters are calculated, the predetermined manner including at least one of the following:

correcting spectral crosstalk, correcting dephasing, and preprocessing raw image data to generate the fluorescence signals of the molecular clusters.

13. The device according to claim 12, characterized in that the preprocessing module further comprises:

an image processing unit, configured to preprocess the raw image data to generate the fluorescence signals of the molecular clusters; the image processing unit further comprising:

a correction subunit, configured to correct the spectral crosstalk of the images corresponding to the channels in which spectral crosstalk is present;

a localization subunit, configured to perform a molecular cluster localization operation on the images corrected for spectral crosstalk,

wherein the molecular cluster localization operation means determining the molecular clusters in the images that satisfy a predetermined condition and determining the coordinates of those molecular clusters, the molecular clusters that satisfy the predetermined condition being the molecular clusters corresponding to the target bright spots in each image.

14.
The device according to claim 8, wherein the processing module further comprises: 下游处理单元,用于根据所述校正单元校正过信号混杂的分子簇荧光信号对分子簇中分子的序列进行识别。The downstream processing unit is used to identify the sequence of the molecules in the molecular cluster according to the fluorescent signal of the molecular cluster corrected by the correction unit for signal confusion.
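The correction described in claims 8 and 9 can be illustrated with a small numerical sketch. It is not the patented implementation: the function names `mixing_coefficient` and `solve_linear`, the two-cluster example values, and the use of Gauss-Jordan elimination to solve C · I_I = I_O are all assumptions made for illustration. The sketch also assumes that each diagonal entry of C equals 1, which follows from applying the ratio in claim 9 with A = B.

```python
def mixing_coefficient(e_a_from_b, e_b_from_b):
    """Ratio form of claim 9: C(A<-B) = E(A<-B) / E(B<-B)."""
    return e_a_from_b / e_b_from_b


def solve_linear(C, I_O):
    """Solve C . X = I_O for X (hypothetical helper).

    Uses Gauss-Jordan elimination with partial pivoting; the claims do
    not say how the linear system is solved, so this choice is an
    assumption.
    """
    n = len(C)
    # Augmented matrix [C | I_O], one row per molecular cluster.
    aug = [[float(v) for v in C[i]] + [float(v) for v in I_O[i]]
           for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mixing matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv
                          for v, pv in zip(aug[r], aug[col])]
    # The right-hand block now holds the corrected signals I_I.
    return [row[n:] for row in aug]


# Illustrative values only: two clusters (rows), two round/channel columns.
# Row 0 is cluster A, row 1 is cluster B.
C = [
    [1.0, mixing_coefficient(4.0, 20.0)],  # C(A<-A), C(A<-B)
    [0.1, 1.0],                            # C(B<-A), C(B<-B)
]
I_O = [
    [10.0, 4.0],   # observed signal of cluster A
    [1.0, 20.0],   # observed signal of cluster B
]
corrected = solve_linear(C, I_O)  # estimate of I_I
```

With a numerical library available, the same system could be solved with a routine such as `numpy.linalg.solve(C, I_O)`; the pure-Python version is used here only to keep the sketch free of dependencies.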

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510061908.8A CN105989248B (en) 2015-02-05 2015-02-05 Data processing method and device for multiple molecular signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510061908.8A CN105989248B (en) 2015-02-05 2015-02-05 Data processing method and device for multiple molecular signals

Publications (2)

Publication Number Publication Date
CN105989248A CN105989248A (en) 2016-10-05
CN105989248B true CN105989248B (en) 2018-11-27

Family

ID=57036285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510061908.8A Active CN105989248B (en) 2015-02-05 2015-02-05 Data processing method and device for multiple molecular signals

Country Status (1)

Country Link
CN (1) CN105989248B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020139848A1 (en) * 2018-12-28 2020-07-02 Becton, Dickinson And Company Methods for spectrally resolving fluorophores of a sample and systems for same
US11210554B2 (en) * 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
CN115035952B (en) * 2022-05-20 2023-04-18 深圳赛陆医疗科技有限公司 Base recognition method and device, electronic device, and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1161490A2 (en) * 1999-02-05 2001-12-12 University of Maryland- Baltimore LUMINESCENCE SPECTRAL PROPERTIES OF CdS NANOPARTICLES

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A System for Rapid DNA Sequencing with Fluorescent Chain-Terminating Dideoxynucleotides; James M. Prober et al.; Science; 1987-10-16; vol. 238; pp. 336-341 *
Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry; Bentley et al.; Nature; 2008-11-06; pp. 1-21 *
All Your Base: a fast and accurate probabilistic approach to base calling; Tim Massingham et al.; Massingham and Goldman Genome Biology; 2012-12-31; pp. 1-15 *
SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation; Shijian Chen et al.; Computational Biology; 2013-12-31; vol. 20, no. 11; pp. 847-860 *
Swift: Primary Data Analysis for the Illumina Solexa Sequencing Platform; Nava Whiteford et al.; Bioinformatics; 2009-06-30; pp. 1-7 *
Two-Hybrid Fluorescence Cross-Correlation Spectroscopy Detects Protein–Protein Interactions In Vivo; Nina Baudendistel et al.; ChemPhysChem; 2005-12-31; pp. 984-990 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant