CN115827221A - BAM file parallel reading method, system and medium

Info

Publication number: CN115827221A
Application number: CN202211432664.6A
Authority: CN
Inventors: 黄立磊; 康佳琪; 冯博伦; 杨仁武; 万斌; 谢金武; 王振国
Original assignee: Genetalks Bio Tech Changsha Co ltd
Current assignee: Genetalks Bio Tech Changsha Co ltd
Priority date: 2022-11-16
Filing date: 2022-11-16
Publication date: 2023-03-21

Abstract

The invention discloses a method, system and medium for reading BAM files in parallel. Compressed data blocks are read sequentially from the BAM file to be parsed and grouped according to a preset rule, producing a decompression queue made up of grouped compressed blocks. Multiple reading threads fetch data from the decompression queue in parallel, one grouped compressed block at a time. Each reading thread decompresses its grouped compressed block and parses the corresponding BAM data until the whole grouped compressed block has been parsed, producing a parsed data block. Because multiple threads handle the decompression and parsing of the grouped data blocks simultaneously, the reading efficiency of the whole BAM file is improved. Because the compressed data blocks are grouped according to the preset rule, the size of the items in the decompression queue is adjusted, the proportion of time spent on thread scheduling is reduced, and parallel efficiency is improved.

Description

A BAM file parallel reading method, system and medium

Technical Field

The invention relates to the field of bioinformatics, and in particular to a method, system and medium for reading BAM files in parallel.

Background Art

In bioinformatics, and especially in sequencing data analysis, SAM (Sequence Alignment/Map) is a commonly used data format for recording the results of aligning (mapping) short fragment sequences to a reference sequence. SAM files are typically larger than 100 GB, which makes them very inconvenient to store, so the corresponding BAM (Binary SAM) format was introduced. BAM is a compressed binary form of SAM: it retains exactly the same content as SAM while typically reducing the file size by a factor of four or more. BAM eases the storage problem of SAM to some extent, but BAM files are still large, and the efficiency of reading them remains a very important part of gene sequencing work.

At present, the industry generally uses the htslib or samtools libraries to read BAM files. The workflow is shown in FIG. 1 and includes the following steps:

Step A: read compressed data blocks from the BAM file and pass them to the decompression queue;

Step B: multiple decompression threads fetch compressed data blocks from the decompression queue in parallel, decompress them, and pass the decompressed data blocks to the decompressed-data queue;

Step C: take the decompressed data blocks from the decompressed-data queue in order and parse them.

The prior-art BAM file reading process described above has the following problems:

1. Only data decompression is processed in parallel; data reading and data parsing are both executed sequentially, which makes the process inefficient;

2. In a standard BAM file, each data block is about 65 KB. The compressed data block read each time is too small, so too much time is spent scheduling decompression threads; although the processing is parallel, the parallel efficiency is reduced;

3. Fetching data blocks from the decompressed-data queue and parsing them is done sequentially. Because many other tasks must run after the data is parsed, such as compression and computation acceleration, those downstream tasks stall at the data parsing step.

In summary, in the field of bioinformatics, where data volumes are huge, the efficiency of existing BAM file reading methods needs to be improved.
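The block size in problem 2 follows from the BGZF container that BAM uses: each block is a gzip member whose extra field carries a `BC` subfield holding the total block length minus one as a 16-bit value, so a block cannot exceed 64 KiB. The following minimal sketch walks those block boundaries. The function names are hypothetical, and it assumes that `BC` is the only extra subfield (XLEN = 6), which is the usual layout; it is not the htslib implementation.

```python
import struct
import zlib

# Fixed 18-byte BGZF block header: gzip header (12 bytes) + "BC" extra subfield (6 bytes).
HEADER = struct.Struct("<BBBBIBBHBBHH")


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block from raw bytes (helper for trying the reader out)."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib wrapper
    cdata = deflater.compress(payload) + deflater.flush()
    bsize = HEADER.size + len(cdata) + 8 - 1  # total block length minus 1
    header = HEADER.pack(31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize)
    return header + cdata + struct.pack("<II", zlib.crc32(payload), len(payload))


def iter_bgzf_blocks(stream):
    """Yield each complete BGZF block from a binary stream, in file order."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise ValueError("truncated BGZF header")
        id1, id2, cm, flg, _, _, _, xlen, si1, si2, slen, bsize = HEADER.unpack(header)
        # Assumption: the BC subfield is the only extra subfield (XLEN == 6).
        if (id1, id2, cm, flg, xlen, si1, si2, slen) != (31, 139, 8, 4, 6, 66, 67, 2):
            raise ValueError("not a BGZF block with a single BC subfield")
        rest = stream.read(bsize + 1 - HEADER.size)
        if len(rest) != bsize + 1 - HEADER.size:
            raise ValueError("truncated BGZF block")
        yield header + rest


def decompress_block(block: bytes) -> bytes:
    """Inflate one block: raw deflate data sits between the header and the 8-byte trailer."""
    return zlib.decompress(block[HEADER.size:-8], wbits=-15)
```

Because the length field is 16 bits wide, every block yielded by `iter_bgzf_blocks` is at most 65,536 bytes long, which is why a reader that schedules one block per task hands out very small units of work.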

Summary of the Invention

Technical problem to be solved by the present invention: in view of the above problems in the prior art, a method, system and medium for reading BAM files in parallel are provided. The present invention parallelizes the BAM file reading process more thoroughly and improves BAM file reading efficiency.

To solve the above technical problem, the present invention adopts the following technical solution:

A method for reading BAM files in parallel, comprising the following implementation steps:

1) sequentially read compressed data blocks from the BAM file to be parsed, group the compressed data blocks according to a preset rule, and generate a decompression queue made up of grouped compressed blocks;

2) multiple reading threads fetch data from the decompression queue in parallel, one grouped compressed block at a time;

3) each reading thread decompresses the grouped compressed block it has fetched and parses the corresponding BAM data until the data of the whole grouped compressed block has been parsed, generating the corresponding parsed data block;

4) merge the parsed data blocks to complete the reading of the BAM file.
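Steps 1) to 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical; each compressed block is treated as a standalone gzip member (which BGZF blocks are); the decompressed payload is treated as a run of length-prefixed records, in the manner of BAM alignment records; the BAM header is ignored; and no record is assumed to cross a group boundary, a case the source does not discuss.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_and_parse(group):
    """Steps 2-3: one reading thread inflates a whole group, then parses it."""
    data = b"".join(zlib.decompress(block, wbits=31) for block in group)  # 31 = gzip container
    records, offset = [], 0
    while offset < len(data):
        (size,) = struct.unpack_from("<i", data, offset)  # int32 length prefix
        offset += 4
        records.append(data[offset:offset + size])
        offset += size
    return records


def read_parallel(blocks, blocks_per_group=100, workers=4):
    # Step 1: group the sequentially read compressed blocks (count-based rule).
    groups = [blocks[i:i + blocks_per_group] for i in range(0, len(blocks), blocks_per_group)]
    # Steps 2-3: the executor's internal work queue plays the role of the decompression queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed_blocks = list(pool.map(decompress_and_parse, groups))
    # Step 4: map() returns results in submission order, so merging keeps file order.
    return [record for parsed in parsed_blocks for record in parsed]


if __name__ == "__main__":
    # Synthetic input: one length-prefixed record per compressed block.
    records = [b"read%d" % i for i in range(10)]
    blocks = [gzip.compress(struct.pack("<i", len(r)) + r) for r in records]
    assert read_parallel(blocks, blocks_per_group=3) == records
```

The point of the structure is that decompression and parsing happen inside the same worker call, so no separate sequential parsing stage remains. In CPython the parsing loop still runs under the global interpreter lock, so this sketch illustrates the shape of the pipeline rather than its speed-up.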

Optionally, the preset rule in step 1) is to set the number of compressed data blocks in each group.

Optionally, the preset rule in step 1) is to preset the size of a grouped compressed block and to fill compressed data blocks into the grouped compressed block until that size is reached.

Optionally, the preset size of the grouped compressed block is adjusted dynamically according to the performance of the processing device.

The present invention also provides a system for reading BAM files in parallel, comprising a computer device that is programmed or configured to perform the steps of the above method for reading BAM files in parallel, or whose memory stores a computer program programmed or configured to perform that method.

The present invention also provides a computer-readable storage medium storing a computer program programmed or configured to perform the above method for reading BAM files in parallel.

Compared with the prior art, the present invention has the following advantages. By placing data decompression and data parsing in the same reading thread, and having multiple threads decompress and parse the grouped data blocks simultaneously, the present invention improves the efficiency of reading the whole BAM file. In addition, by grouping the compressed data blocks that have been read according to a preset rule, the present invention adjusts the size of the items in the decompression queue to a reasonable value, reduces the proportion of time spent on thread scheduling, and improves the parallel efficiency of the reading threads.
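A rough way to see why grouping lowers the share of scheduling time, under a simplifying assumption that is not taken from the source: suppose each unit of work handed to a thread costs a fixed scheduling overhead $s$, and decompressing and parsing one compressed data block takes time $t$. With $n$ blocks per group, the fraction of time spent on scheduling is:

```latex
f(n) = \frac{s}{s + n\,t},
\qquad
f(1) = \frac{s}{s + t},
\qquad
f(100) = \frac{s}{s + 100\,t}
```

With the purely illustrative choice $s = t$, this gives $f(1) = 50\%$ and $f(100) = 1/101 \approx 1\%$. These values are not measurements; they only show that the overhead share shrinks as the group grows.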

Brief Description of the Drawings

FIG. 1 is a schematic flow chart of BAM file reading in the prior art.

FIG. 2 is a schematic flow chart of parallel BAM file reading in an embodiment of the present invention.

Detailed Description

As shown in FIG. 2, the implementation steps of the method for reading BAM files in parallel in this embodiment are steps 1) to 4) set out above.

In this embodiment, data decompression and data parsing are placed in the same reading thread, and multiple threads decompress and parse the grouped data blocks simultaneously, which improves the efficiency of reading the whole BAM file. In addition, by grouping the compressed data blocks that have been read according to a preset rule, this embodiment adjusts the size of the items in the decompression queue to a reasonable value, reduces the proportion of time spent on thread scheduling, and improves the parallel efficiency of the reading threads.

In this embodiment, the preset rule in step 1) may be to group by the number of compressed data blocks, for example by placing every 100 compressed data blocks in one group.

In this embodiment, the preset rule in step 1) may alternatively be to preset the size of a grouped compressed block and to fill compressed data blocks into it, for example with a preset grouped compressed block size of 200 MB.
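The two grouping rules above can be sketched as generators over a stream of compressed blocks. The function names are hypothetical, the defaults mirror the examples in the text (100 blocks, 200 MB), and "MB" is taken here as MiB, which is an assumption.

```python
from typing import Iterable, Iterator, List


def group_by_count(blocks: Iterable[bytes], count: int = 100) -> Iterator[List[bytes]]:
    """Count-based rule: every `count` compressed blocks form one group."""
    group: List[bytes] = []
    for block in blocks:
        group.append(block)
        if len(group) == count:
            yield group
            group = []
    if group:  # final, possibly shorter group
        yield group


def group_by_size(blocks: Iterable[bytes],
                  max_bytes: int = 200 * 1024 * 1024) -> Iterator[List[bytes]]:
    """Size-based rule: fill a group until adding the next block would exceed `max_bytes`."""
    group: List[bytes] = []
    used = 0
    for block in blocks:
        if group and used + len(block) > max_bytes:
            yield group
            group, used = [], 0
        group.append(block)  # a block larger than max_bytes still gets a group of its own
        used += len(block)
    if group:
        yield group
```

For example, seven 10-byte blocks grouped with `group_by_count(blocks, 3)` give groups of 3, 3 and 1 blocks, while `group_by_size(blocks, 25)` gives groups of 2, 2, 2 and 1 blocks. Both functions keep blocks in file order, which matters for merging the parsed results later.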

In this embodiment, the preset size of the grouped compressed block in step 1) is adjusted dynamically according to the performance of the processing device; for example, the size of the grouped compressed block may be adjusted according to CPU performance.
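The source does not say how the size is derived from device performance. One possible heuristic, shown purely as an assumption, uses the number of CPU cores as a proxy for performance and picks a group size that gives every worker several groups, clamped to a fixed range. The function name, the `groups_per_worker` parameter and the bounds are all hypothetical.

```python
import os
from typing import Optional


def choose_group_bytes(file_bytes: int,
                       workers: Optional[int] = None,
                       groups_per_worker: int = 4,
                       min_bytes: int = 1 << 20,       # 1 MiB lower bound (assumed)
                       max_bytes: int = 200 << 20) -> int:  # 200 MiB upper bound (assumed)
    """Pick a grouped-block size so each worker receives about `groups_per_worker` groups."""
    workers = workers or os.cpu_count() or 1
    target = file_bytes // (workers * groups_per_worker)
    return max(min_bytes, min(max_bytes, target))
```

With 4 workers and a 64 MiB file this yields 4 MiB groups; for very large files the result is capped at the upper bound, and for tiny files it is raised to the lower bound. A measured throughput figure could replace the core count if one is available.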

In addition, this embodiment provides a system for reading BAM files in parallel, comprising a computer device that is programmed or configured to perform the steps of the above method for reading BAM files in parallel, or whose memory stores a computer program programmed or configured to perform that method.

In addition, this embodiment provides a computer-readable storage medium storing a computer program programmed or configured to perform the above method for reading BAM files in parallel.

The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling within the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principles of the present invention shall also be regarded as falling within its scope of protection.

Claims

1. A method for reading BAM files in parallel, characterized by comprising the following implementation steps:

1) sequentially reading compressed data blocks from a BAM file to be parsed, grouping the compressed data blocks according to a preset rule, and generating a decompression queue made up of grouped compressed blocks;

2) fetching, by a plurality of reading threads, data from the decompression queue in parallel, one grouped compressed block at a time;

3) decompressing, by each reading thread, the grouped compressed block it has fetched and parsing the corresponding BAM data until the data of the whole grouped compressed block has been parsed, thereby generating a corresponding parsed data block;

4) merging the parsed data blocks to complete the reading of the BAM file.

2. The method for reading BAM files in parallel according to claim 1, wherein the preset rule in step 1) is to set the number of compressed data blocks.

3. The method for reading BAM files in parallel according to claim 1, wherein the preset rule in step 1) is to preset the size of a grouped compressed block and to fill the compressed data blocks into the grouped compressed block for grouping.

4. The method for reading BAM files in parallel according to claim 3, wherein the preset size of the grouped compressed block is adjusted dynamically according to the performance of a processing device.

5. A system for reading BAM files in parallel, comprising a computer device, characterized in that the computer device is programmed or configured to perform the steps of the method for reading BAM files in parallel according to any one of claims 1 to 4, or in that a memory of the computer device stores a computer program programmed or configured to perform the method for reading BAM files in parallel according to any one of claims 1 to 4.

6. A computer-readable storage medium storing a computer program programmed or configured to perform the method for reading BAM files in parallel according to any one of claims 1 to 4.