CN115827221A - BAM file parallel reading method, system and medium - Google Patents

BAM file parallel reading method, system and medium

Info

Publication number
CN115827221A
Authority
CN
China
Prior art keywords
reading
bam
data
bam file
compression block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211432664.6A
Other languages
Chinese (zh)
Inventor
黄立磊
康佳琪
冯博伦
杨仁武
万斌
谢金武
王振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genetalks Bio Tech Changsha Co ltd
Original Assignee
Genetalks Bio Tech Changsha Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-11-16
Filing date
2022-11-16
Publication date
2023-03-21
Application filed by Genetalks Bio Tech Changsha Co ltd
Priority to CN202211432664.6A
Publication of CN115827221A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, system and medium for reading BAM files in parallel. Compressed data blocks are read sequentially from the BAM file to be parsed and grouped according to a preset rule, producing a queue to be decompressed that consists of grouped compressed blocks. Reading threads fetch data from this queue in parallel, one grouped compressed block at a time. Each reading thread decompresses its grouped compressed block and parses the corresponding BAM data until the whole grouped compressed block has been parsed, producing a parsed data block. Because multiple threads handle the decompression and parsing of the grouped data blocks simultaneously, the reading efficiency of the whole BAM file is improved. Grouping the compressed data blocks by the preset rule adjusts the size of the items in the decompression queue, which reduces the share of time spent on thread scheduling and improves parallel efficiency.

Description

A BAM file parallel reading method, system and medium

Technical field

The invention relates to the field of bioinformatics, and in particular to a method, system and medium for reading BAM files in parallel.

Background

In bioinformatics, and especially in sequencing data analysis, the SAM (Sequence Alignment/Map) format is a commonly used data format for recording the results of aligning (mapping) short read sequences to a reference sequence. SAM files are typically larger than 100 GB, which makes them very inconvenient to store, so the corresponding BAM (Binary SAM) format was introduced. BAM is the compressed binary form of SAM: it retains exactly the same content as SAM while usually reducing the file size by a factor of four or more. BAM alleviates the storage problem of SAM to some extent, but BAM files are still large, and the efficiency of reading them remains a very important part of gene sequencing work.

At present, the industry generally uses the htslib or samtools libraries to read BAM files. The process, shown in FIG. 1, comprises the following steps:

Step A: read compressed data blocks from the BAM file and pass them to the decompression queue;

Step B: multiple decompression threads fetch compressed data blocks from the decompression queue in parallel, decompress them, and pass the decompressed data blocks to the decompressed data queue;

Step C: fetch the decompressed data blocks from the decompressed data queue sequentially and parse them.

The prior-art process for reading BAM files described above has the following problems:

1. Only data decompression is currently processed in parallel; data reading and data parsing are both executed sequentially, which makes the overall process inefficient.

2. In a standard BAM file each data block is about 65 KB. Because each compressed data block read is so small, the decompression threads spend too much time on scheduling; the processing is parallel, but its parallel efficiency is reduced.

3. Reading data blocks from the decompressed data queue and parsing them is performed sequentially. Because many other tasks must run after the data is parsed, such as compression and accelerated computation, those downstream tasks stall at the data-parsing step.

In summary, in bioinformatics, where data volumes are huge, the efficiency of existing BAM file reading methods needs to be improved.

Summary of the invention

Technical problem to be solved by the invention: in view of the above problems in the prior art, a method, system and medium for reading BAM files in parallel are provided. The invention parallelizes the BAM file reading process more fully and improves the efficiency of reading BAM files.

To solve the above technical problem, the invention adopts the following technical solution:

A BAM file parallel reading method, whose implementation steps comprise:

1) Sequentially read compressed data blocks from the BAM file to be parsed, group the compressed data blocks according to a preset rule, and generate a queue to be decompressed consisting of grouped compressed blocks;

2) Multiple reading threads fetch data from the queue to be decompressed in parallel, taking one grouped compressed block as the unit;

3) Each reading thread decompresses the grouped compressed block it fetched and parses the corresponding BAM data until the whole grouped compressed block has been parsed, generating a corresponding parsed data block;

4) Merge the parsed data blocks to complete the reading of the BAM file.
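The sequential read in step 1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the compressed data blocks are BGZF blocks as laid out in the SAM/BAM specification (a gzip member whose extra field carries a `BC` subfield with the total block size minus one). The names `iter_bgzf_blocks` and `make_bgzf_block` are hypothetical, and `make_bgzf_block` exists only to build sample input.

```python
import io
import struct
import zlib
from typing import BinaryIO, Iterator


def iter_bgzf_blocks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each compressed BGZF block of a stream, in file order."""
    while True:
        fixed = stream.read(12)  # fixed part of the gzip member header
        if not fixed:
            return  # clean end of input
        if len(fixed) < 12:
            raise ValueError("truncated BGZF header")
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", fixed)
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError("not a BGZF block")
        extra = stream.read(xlen)
        if len(extra) != xlen:
            raise ValueError("truncated BGZF extra field")
        bsize = None
        pos = 0
        while pos + 4 <= len(extra):  # walk the gzip extra subfields
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C': holds BSIZE
                (bsize,) = struct.unpack_from("<H", extra, pos + 4)
                break
            pos += 4 + slen
        if bsize is None:
            raise ValueError("BGZF 'BC' subfield missing")
        remaining = bsize + 1 - 12 - xlen  # BSIZE is total block size minus one
        body = stream.read(remaining)
        if len(body) != remaining:
            raise ValueError("truncated BGZF block")
        yield fixed + extra + body


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one small BGZF block (demo helper, hypothetical)."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE stream
    cdata = deflater.compress(payload) + deflater.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    bc = struct.pack("<BBHH", 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + bc + cdata + trailer


if __name__ == "__main__":
    sample = b"".join(make_bgzf_block(p) for p in (b"first", b"second"))
    for block in iter_bgzf_blocks(io.BytesIO(sample)):
        # Each block is a complete gzip member, so wbits=31 can inflate it.
        print(len(block), zlib.decompress(block, wbits=31))
```

Reading only the headers keeps this stage cheap, which matters because it is the one part of the method that stays sequential.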

Optionally, the preset rule in step 1) is to set the number of compressed data blocks in each group.

Optionally, the preset rule in step 1) is a preset size for the grouped compressed block, and the compressed data blocks are filled into the grouped compressed block to form groups.

Optionally, the preset size of the grouped compressed block is adjusted dynamically according to the performance of the processing device.

The invention also provides a BAM file parallel reading system comprising a computer device. The computer device is programmed or configured to perform the steps of the above BAM file parallel reading method, or its memory stores a computer program programmed or configured to perform that method.

The invention also provides a computer-readable storage medium storing a computer program programmed or configured to perform the above BAM file parallel reading method.

Compared with the prior art, the invention has the following advantages. By placing data decompression and data parsing in the same reading thread, and having multiple threads handle the decompression and parsing of the grouped data blocks simultaneously, the invention improves the efficiency of reading the whole BAM file. In addition, by grouping the compressed data blocks it reads according to a preset rule, the invention adjusts the size of the items in the decompression queue to a reasonable value, which reduces the share of time spent on thread scheduling and improves the parallel efficiency of the reading threads.

Description of drawings

FIG. 1 is a schematic flow chart of BAM file reading in the prior art.

FIG. 2 is a schematic flow chart of parallel BAM file reading in an embodiment of the invention.

Detailed description

As shown in FIG. 2, the BAM file parallel reading method of this embodiment comprises the following implementation steps:

1) Sequentially read compressed data blocks from the BAM file to be parsed, group the compressed data blocks according to a preset rule, and generate a queue to be decompressed consisting of grouped compressed blocks;

2) Multiple reading threads fetch data from the queue to be decompressed in parallel, taking one grouped compressed block as the unit;

3) Each reading thread decompresses the grouped compressed block it fetched and parses the corresponding BAM data until the whole grouped compressed block has been parsed, generating a corresponding parsed data block;

4) Merge the parsed data blocks to complete the reading of the BAM file.

This embodiment places data decompression and data parsing in the same reading thread, and multiple threads handle the decompression and parsing of the grouped data blocks simultaneously, which improves the efficiency of reading the whole BAM file. In addition, by grouping the compressed data blocks it reads according to a preset rule, this embodiment adjusts the size of the items in the decompression queue to a reasonable value, which reduces the share of time spent on thread scheduling and improves the parallel efficiency of the reading threads.
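Steps 2) to 4) might be sketched as below. This is a hedged illustration under several assumptions that are not in the source: each grouped compressed block is passed in as the concatenation of gzip-compatible members (BGZF blocks are such members), the parser is a toy that splits records carrying a 4-byte little-endian length prefix, and every group starts and ends on a record boundary. A real BAM reader would also have to skip the file header and carry over records that straddle blocks. The names `parse_records`, `decompress_and_parse` and `read_groups_parallel` are hypothetical.

```python
import gzip
import struct
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def parse_records(raw: bytes) -> List[bytes]:
    """Toy parser: split a buffer into length-prefixed records."""
    records = []
    pos = 0
    while pos < len(raw):
        (size,) = struct.unpack_from("<I", raw, pos)  # 4-byte length prefix
        pos += 4
        records.append(raw[pos:pos + size])
        pos += size
    return records


def decompress_and_parse(group: bytes) -> List[bytes]:
    """Work done by one reading thread for one grouped compressed block."""
    raw = gzip.decompress(group)  # handles concatenated gzip members
    return parse_records(raw)     # parsing happens in the same thread


def read_groups_parallel(groups: Iterable[bytes], workers: int = 4) -> List[bytes]:
    """Decompress and parse groups in parallel, then merge in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the inputs, so the merge
        # below preserves the original record order.
        parsed_groups = pool.map(decompress_and_parse, groups)
        return [record for parsed in parsed_groups for record in parsed]


if __name__ == "__main__":
    payloads = [b"read-%d" % i for i in range(7)]
    members = [gzip.compress(struct.pack("<I", len(p)) + p) for p in payloads]
    groups = [b"".join(members[i:i + 3]) for i in range(0, len(members), 3)]
    print(read_groups_parallel(groups, workers=2))
```

Keeping decompression and parsing in one function is the point of the design: no second queue sits between the two stages, so a parsed data block is ready as soon as its thread finishes.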

In this embodiment, the preset rule in step 1) may be grouping by the number of compressed data blocks, for example placing every 100 compressed data blocks in one group.
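A count-based rule of this kind can be sketched in a few lines. The function name `group_by_count` and its parameter are hypothetical; the default of 100 is taken from the example above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def group_by_count(blocks: Iterable[bytes], blocks_per_group: int = 100) -> Iterator[List[bytes]]:
    """Yield lists of at most `blocks_per_group` consecutive blocks."""
    if blocks_per_group < 1:
        raise ValueError("blocks_per_group must be at least 1")
    iterator = iter(blocks)
    while True:
        group = list(islice(iterator, blocks_per_group))
        if not group:
            return  # input exhausted
        yield group  # the last group may be shorter than the others


if __name__ == "__main__":
    sizes = [len(g) for g in group_by_count([b"x"] * 250)]
    print(sizes)
```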

In this embodiment, the preset rule in step 1) may alternatively be a preset size for the grouped compressed block, with compressed data blocks filled into the grouped compressed block to form groups, for example a preset grouped compressed block size of 200 MB.
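A size-based rule could look like the sketch below. It rests on assumptions not stated in the source: 200 MB is read as 200 × 1024 × 1024 bytes, a group is closed as soon as the next block would push it past the limit, and a single block larger than the limit gets a group of its own. The name `group_by_size` is hypothetical.

```python
from typing import Iterable, Iterator, List

DEFAULT_GROUP_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as MiB


def group_by_size(blocks: Iterable[bytes], max_group_bytes: int = DEFAULT_GROUP_BYTES) -> Iterator[List[bytes]]:
    """Fill each group with consecutive blocks up to `max_group_bytes`."""
    group: List[bytes] = []
    size = 0
    for block in blocks:
        # Close the current group if this block would overflow it.
        if group and size + len(block) > max_group_bytes:
            yield group
            group, size = [], 0
        group.append(block)
        size += len(block)
    if group:
        yield group  # flush the final, possibly partial, group


if __name__ == "__main__":
    sample = [b"x" * 4] * 5
    print([len(g) for g in group_by_size(sample, max_group_bytes=10)])
```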

In this embodiment, the preset size of the grouped compressed block in step 1) is adjusted dynamically according to the performance of the processing device; for example, the size of the grouped compressed block can be adjusted according to CPU performance.
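The source does not say how the size is derived from device performance. The sketch below assumes one hypothetical heuristic purely for illustration: use the number of CPU cores as a stand-in for performance, aim for several groups per worker thread so the load balances, and clamp the result to a fixed range. The name `choose_group_bytes`, its parameters and its default bounds are all assumptions.

```python
import os
from typing import Optional


def choose_group_bytes(
    file_bytes: int,
    workers: Optional[int] = None,
    groups_per_worker: int = 8,
    min_bytes: int = 1 << 20,      # assumed lower bound: 1 MiB
    max_bytes: int = 200 << 20,    # assumed upper bound: 200 MiB
) -> int:
    """Pick a grouped compressed block size from file size and core count."""
    if workers is None:
        workers = os.cpu_count() or 1  # cpu_count() may return None
    target = file_bytes // (workers * groups_per_worker)
    return max(min_bytes, min(max_bytes, target))


if __name__ == "__main__":
    # An 8 GiB file on an 8-thread machine.
    print(choose_group_bytes(8 * 1024**3, workers=8))
```

More cores lead to smaller groups and therefore more units of parallel work, while the lower bound keeps groups from shrinking back toward single-block granularity.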

In addition, this embodiment provides a BAM file parallel reading system comprising a computer device. The computer device is programmed or configured to perform the steps of the above BAM file parallel reading method, or its memory stores a computer program programmed or configured to perform that method.

In addition, this embodiment provides a computer-readable storage medium storing a computer program programmed or configured to perform the above BAM file parallel reading method.

The above is only a preferred embodiment of the invention. The scope of protection of the invention is not limited to the above embodiment; all technical solutions that fall under the idea of the invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principles of the invention should also be regarded as falling within its scope of protection.

Claims (6)

1. A BAM file parallel reading method, characterized by comprising the following implementation steps:
1) Sequentially reading compressed data blocks from a BAM file to be parsed, grouping the compressed data blocks according to a preset rule, and generating a queue to be decompressed consisting of grouped compressed blocks;
2) A plurality of reading threads fetching data from the queue to be decompressed in parallel, taking one grouped compressed block as the unit;
3) Each reading thread decompressing the grouped compressed block it fetched and parsing the corresponding BAM data until the data of the whole grouped compressed block has been parsed, generating a corresponding parsed data block;
4) Merging the parsed data blocks to complete the reading of the BAM file.
2. The BAM file parallel reading method according to claim 1, wherein the preset rule in step 1) is to set the number of the compressed data blocks.
3. The BAM file parallel reading method according to claim 1, wherein the preset rule in step 1) is a preset size of the grouped compressed block, and the compressed data blocks are filled into the grouped compressed block for grouping.
4. The BAM file parallel reading method according to claim 3, wherein the preset size of the grouped compressed block is dynamically adjusted according to the performance of a processing device.
5. A BAM file parallel reading system comprising a computer device, characterized in that the computer device is programmed or configured to perform the steps of the BAM file parallel reading method of any one of claims 1 to 4, or that a memory of the computer device has stored thereon a computer program programmed or configured to perform the BAM file parallel reading method of any one of claims 1 to 4.
6. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the BAM file parallel reading method of any one of claims 1 to 4.
CN202211432664.6A 2022-11-16 2022-11-16 BAM file parallel reading method, system and medium Pending CN115827221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211432664.6A CN115827221A (en) 2022-11-16 2022-11-16 BAM file parallel reading method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211432664.6A CN115827221A (en) 2022-11-16 2022-11-16 BAM file parallel reading method, system and medium

Publications (1)

Publication Number Publication Date
CN115827221A (en) 2023-03-21

Family

ID=85528379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211432664.6A Pending CN115827221A (en) 2022-11-16 2022-11-16 BAM file parallel reading method, system and medium

Country Status (1)

Country Link
CN (1) CN115827221A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050138091A1 (en) * 2003-12-22 2005-06-23 Jean-Pierre Bono Prefetching and multithreading for improved file read performance
CN102497597A (en) * 2011-12-05 2012-06-13 中国华录集团有限公司 Integrity verification method for high-definition video files
CN107169313A (en) * 2017-03-29 2017-09-15 中国科学院深圳先进技术研究院 The read method and computer-readable recording medium of DNA data files
CN111767258A (en) * 2020-06-30 2020-10-13 深圳前海微众银行股份有限公司 File compression method, device, device and storage medium applied to massive files
CN113238711A (en) * 2021-04-17 2021-08-10 西安电子科技大学 Efficient hash calculation method in electronic data evidence obtaining field
CN113626092A (en) * 2021-10-14 2021-11-09 广州匠芯创科技有限公司 Embedded system starting method and SOC chip
CN113673192A (en) * 2021-10-22 2021-11-19 南京集成电路设计服务产业创新中心有限公司 Parallel accelerated extraction method for SPEF parasitic parameters of ultra-large scale integrated circuit
CN114416666A (en) * 2022-03-28 2022-04-29 山东大学 BAM file analysis and restoration method and system under multi-core platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GANG YANG ET AL.: "The design and implementation of BAM based on event-driven technology", 2012 IEEE GLOBAL HIGH TECH CONGRESS ON ELECTRONICS, 1 November 2010 (2010-11-01) *
刘宝锺著: "《大数据分类模型和算法研究》", vol. 2020, 31 January 2020, 云南大学出版社, pages: 92 - 100 *
祝君;林庆农;徐造林;: "实时历史数据库中压缩技术的并行化研究", 计算机技术与发展, no. 07, 10 July 2010 (2010-07-10) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118100955A (en) * 2024-04-26 2024-05-28 深圳鲲云信息科技有限公司 Method for preprocessing compressed data by parallel decompression
CN118100955B (en) * 2024-04-26 2024-07-23 深圳鲲云信息科技有限公司 Method for preprocessing compressed data by parallel decompression
CN119336718A (en) * 2024-11-08 2025-01-21 南京集成电路设计服务产业创新中心有限公司 A method and system for quickly parsing compressed file content

Similar Documents

Publication Publication Date Title
CN107609350B (en) Data processing method of second-generation sequencing data analysis platform
CN115827221A (en) BAM file parallel reading method, system and medium
CN112559465B (en) Log compression method, device, electronic device and storage medium
KR20130069427A (en) Method and apparatus for compressing and decompressing genetic information using next generation sequencing(ngs)
CN106852185A (en) Parallelly compressed encoder based on dictionary
CN103150260A (en) Method and device for deleting repeating data
Aronson et al. Towards an engineering approach to file carver construction
CN104077328B (en) The operation diagnostic method and equipment of MapReduce distributed system
US9886561B2 (en) Efficient encoding and storage and retrieval of genomic data
US20170193351A1 (en) Methods and systems for vector length management
US9137336B1 (en) Data compression techniques
CN109901978A (en) A kind of Hadoop log lossless compression method and system
CN112070652A (en) Data compression method, data decompression method, readable storage medium and electronic device
CN115630343B (en) Electronic document information processing method, device and equipment
CN113360911A (en) Malicious code homologous analysis method and device, computer equipment and storage medium
CN114416666B (en) BAM file analysis and restoration method and system under multi-core platform
CN109558735A (en) A kind of rogue program sample clustering method and relevant apparatus based on machine learning
CN105264519B (en) A column database processing method and device
KR102425596B1 (en) Systems and methods for low latency hardware memory management
CN116226047A (en) A method and device for quickly reading MIR information of gzip-compressed stdf files
JP5549177B2 (en) Compression program, method and apparatus, and decompression program, method and apparatus
CN110797082A (en) Method and system for storing and reading gene sequencing data
CN114237911A (en) CUDA-based gene data processing method and device and CUDA framework
CN118964360A (en) Data storage method, device, equipment and medium of bitmap data structure
CN111370070B (en) Compression processing method for big data gene sequencing file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination