Patent Issued for Methods to compress, encrypt and retrieve genomic alignment data (USPTO 11393559): Sophia Genetics S.A.
2022 AUG 04 (NewsRx) -- By a
Patent number 11393559 is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “Next-Generation Sequencing Data Processing
“Next-generation sequencing (NGS) or massively parallel sequencing (MPS) technologies have significantly decreased the cost of DNA sequencing in the past decade. NGS has broad application in biology and dramatically changed the way of research or diagnosis methodologies. Advances in high-throughput sequencing technologies are spurring the production of a huge amount of genomic data. For example, the 1000
“Moreover, next generation sequencing data are more and more used as a tool in medical practice such as routine diagnosis, where security and privacy come as a major concern. The main threats to genomic data are (i) the disclosure of an individual’s genetic characteristics due to the leakage of his/her genomic data and (ii) the identification of an individual from his/her own genome sequence. For example, as part of a clinical trial, the genetic information of a patient, once leaked, could be linked to the disease under study (or to other diseases), which can have serious consequences such as denial of access to life insurance or to employment for the individual participant. There is therefore a need for more secure genomic data management methods that address the privacy threat models that are specific to the genomic data processing systems and workflows.
“Next Generation Sequencing Data Formats and Workflows
“Next generation sequencers typically output a series of short reads, a few hundred nucleotides sequences with the associated quality score estimates in data files such as the FASTQ files. This raw sequencing data is further analyzed in the bioinformatics pipeline by aligning the raw short reads to a reference genome, and identifying the specific variants as the differences relative to the reference genome.
“In general, geneticists prefer storing aligned, raw genomic data of the patients, in addition to their variant calls (which include each nucleotide on the DNA sequence once, hence is much more compact). Sequence alignment/map files such as the human readable SAM files and their more compact, machine-readable binary version BAM files are the de facto standards used for DNA alignment data produced by next-generation DNA sequencers (http://samtools.github.io/hts-specs/SAMv1.pdf). There are hundreds of millions of short sequencing reads (each including between 100 and 400 nucleotides) in the SAM file of a patient. Each nucleotide is present in several short reads in order to have statistically high coverage of each patient’s DNA.
“Genomic Data Compression
“There are different approaches to dealing with the compression of genomic data. Before high-throughput technologies were introduced, there existed algorithms designed for compressing genomic sequences of relatively small size (e.g., tens of megabases), for instance BioCompress (in Grumbach, S. & Tahi, F. Compression of DNA sequences, in
“More recently, various advanced compression algorithms have been proposed to further improve the compression of high-throughput DNA sequence data, such as Quip (Jones, D. C., Ruzzo, W. L., Peng, X. & Katze,
“Genomic Data Security
“Some genomic data encryption solutions have been proposed on top of some compression algorithms, such as for instance the encryption option in cramtools for the CRAM genomic data compression format (http://www.ebi.ac.uk/ena/software/cram-toolkit), but they remain straightforward applications of encryption standards and do not take into consideration the specific genomic data storage and genomic data processing threat models even if the solution uses highly secure encryption primitives (e.g., the AES encryption method). In particular, the data retrieval process may cause incidental leakage of sensitive genomic information. Once leaked, genomic information could be abused in various ways, such as for denial of employment and health insurance, blackmail or even genetic discrimination. Establishing a secure and privacy-preserving solution for genomic data storage is therefore needed in order to facilitate the trusted usage, storage and transmission of genomic data.”
Additional summary information is available. Please see the full patent to read further.
In addition to the background information obtained for this patent, NewsRx journalists also obtained the inventors’ summary information for this patent: “Some embodiments of the present disclosure are directed to methods to encode genomic data alignment information organized as a read-based alignment information data stream, comprising the steps of: transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream; encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; and encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream.
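The transposition step described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `transpose_reads`, the `(start, bases, quals)` record layout, and the boolean read-start flag (standing in for the start-marker character mentioned in the claims) are all assumptions made for illustration, and the sketch ignores insertions, deletions, and strand information that real alignment records carry.

```python
from collections import defaultdict


def transpose_reads(reads):
    """Turn read-based records into a position-based pileup.

    reads: iterable of (start, bases, quals), where start is the reference
    position of the first base, bases is a string, and quals is a list of
    integer quality scores of the same length (hypothetical layout).
    Returns {position: [(base, qual, is_read_start), ...]} in ascending
    position order.
    """
    pileup = defaultdict(list)
    for start, bases, quals in reads:
        if len(bases) != len(quals):
            raise ValueError("bases and quals must have the same length")
        for offset, (base, qual) in enumerate(zip(bases, quals)):
            # offset == 0 flags the first base of a read, a stand-in for
            # the start marker described in the claims.
            pileup[start + offset].append((base, qual, offset == 0))
    return dict(sorted(pileup.items()))


# Two overlapping toy reads covering positions 100-105.
reads = [
    (100, "ACGT", [30, 31, 32, 33]),
    (102, "GTTA", [20, 21, 22, 23]),
]
pileup = transpose_reads(reads)
```

In this toy input, positions 102 and 103 are covered by both reads, so each of those entries holds two observations, while every other position holds one.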
“In some embodiments, encoding the position-based alignment information data stream into a reference-based compressed position data stream may comprise a step of differential encoding. In a possible embodiment, differential encoding may comprise recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence. In a possible embodiment, encoding the position-based alignment information data stream into a reference-based compressed position data file may comprise a step of entropy coding.
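One way to picture differential encoding against a reference is the sketch below. It is an illustration under assumptions, not the patented encoding: the names `diff_encode` and `diff_decode`, the `(depth, diffs)` tuple, and the toy reference string are hypothetical. The idea shown is only that bases matching the reference need not be stored, so each position keeps its coverage depth plus the observations that differ.

```python
def diff_encode(pileup, reference, ref_start):
    """Keep, per position, the depth and only the bases that differ from the reference.

    pileup: {position: [base, ...]}; reference: string whose first character
    sits at position ref_start (all hypothetical layouts).
    """
    encoded = {}
    for pos, bases in pileup.items():
        ref_base = reference[pos - ref_start]
        diffs = [(i, b) for i, b in enumerate(bases) if b != ref_base]
        encoded[pos] = (len(bases), diffs)
    return encoded


def diff_decode(encoded, reference, ref_start):
    """Rebuild the full pileup from the reference plus the recorded differences."""
    pileup = {}
    for pos, (depth, diffs) in encoded.items():
        bases = [reference[pos - ref_start]] * depth
        for i, b in diffs:
            bases[i] = b
        pileup[pos] = bases
    return pileup


reference = "ACGTTA"  # toy reference covering positions 100-105
pileup = {
    100: ["A"],
    101: ["C"],
    102: ["G", "G"],
    103: ["T", "A"],  # second read disagrees with the reference here
    104: ["T"],
    105: ["A"],
}
encoded = diff_encode(pileup, reference, 100)
restored = diff_decode(encoded, reference, 100)
```

Only position 103 carries a recorded difference in this example; every other entry stores just its depth. The output of such a step is the kind of sparse data that an entropy coder could compress further.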
“In some embodiments, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position information with an order-preserving encryption scheme. In a possible embodiment, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position-based alignment information with a symmetric encryption scheme. The symmetric encryption scheme may be a stream cipher, such as the AES scheme in CTR mode.
“Some embodiments of the present disclosure are directed to methods to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, comprising the steps of: receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; retrieving from the storage unit, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2]; and decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2].
“In a possible embodiment, retrieving genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, may further comprise a step of reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data.”
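The range retrieval described in the summary can be sketched as follows, using only the Python standard library. Everything here is an assumption for illustration. `ope_encrypt` is a toy order-preserving map (a running sum of keyed positive gaps) and is not a secure order-preserving encryption scheme. `keystream_xor` uses an HMAC-SHA256 counter keystream as a stand-in for AES in CTR mode, which the standard library does not provide. The row payload format is invented. The sketch shows only the mechanism: because encrypted positions keep their order, the rows for [Pos1, Pos2] can be located by binary search and decrypted one row at a time.

```python
import bisect
import hashlib
import hmac


def _prf(key, label):
    """Keyed pseudo-random function used by both toy primitives."""
    return hmac.new(key, label, hashlib.sha256).digest()


def ope_encrypt(key, pos):
    """Toy order-preserving map: sum of keyed gaps (each >= 1) up to pos.

    Strictly increasing in pos, O(pos) time, NOT a secure OPE scheme.
    """
    return sum(1 + _prf(key, b"gap" + i.to_bytes(8, "big"))[0] for i in range(pos + 1))


def keystream_xor(key, nonce, data):
    """XOR data with an HMAC-SHA256 counter keystream (stand-in for AES-CTR)."""
    out = bytearray()
    for start in range(0, len(data), 32):
        pad = _prf(key, nonce + (start // 32).to_bytes(8, "big"))
        out.extend(b ^ p for b, p in zip(data[start:start + 32], pad))
    return bytes(out)


def build_store(pos_key, row_key, rows):
    """Encrypt each row independently; return [(enc_pos, ciphertext)] sorted by enc_pos."""
    store = []
    for pos, payload in rows.items():
        enc_pos = ope_encrypt(pos_key, pos)
        nonce = enc_pos.to_bytes(8, "big")  # unique per row because the map is injective
        store.append((enc_pos, keystream_xor(row_key, nonce, payload)))
    store.sort()
    return store


def range_query(pos_key, row_key, store, pos1, pos2):
    """Return decrypted payloads for rows with pos1 <= position <= pos2."""
    lo_ct = ope_encrypt(pos_key, pos1)
    hi_ct = ope_encrypt(pos_key, pos2)
    enc_positions = [enc_pos for enc_pos, _ in store]
    lo = bisect.bisect_left(enc_positions, lo_ct)
    hi = bisect.bisect_right(enc_positions, hi_ct)
    # Only the rows inside the range are touched; the rest stay encrypted.
    return [
        keystream_xor(row_key, enc_pos.to_bytes(8, "big"), ct)
        for enc_pos, ct in store[lo:hi]
    ]


pos_key = b"demo-position-key"  # hypothetical keys, for illustration only
row_key = b"demo-row-key"
rows = {
    100: b"A:30",
    101: b"C:31",
    102: b"G:32,G:20",
    103: b"T:33,A:21",
    105: b"A:23",
}
store = build_store(pos_key, row_key, rows)
result = range_query(pos_key, row_key, store, 101, 103)
```

In a deployment along the lines the summary describes, the binary search over encrypted positions could run on the storage side without access to the row key, with decryption left to the party holding it. The sketch places both steps in one function only for brevity.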
The claims supplied by the inventors are:
“1. A method to encode genomic data alignment information organized as a read-based alignment information data stream, comprising: Transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream, wherein a character is a start marker for each short read in the position-based alignment information data stream, the start marker followed by metadata information regarding at least a nucleotide base identified at a position with an associated quality score; Encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; and Encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream, including independently encrypting variant information for each row of a data structure in a storage that stores the compressed encrypted alignment data stream, providing privacy control of specific compressed encrypted alignment data within the stored compressed encrypted alignment data stream, the encrypting the reference-based compressed position data stream into a compressed encrypted alignment data stream comprising, first, an order-preserving encryption scheme, and second, encrypting sensitive information at each position, wherein the method results in increased storage efficiency or faster genomic data queries.
“2. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data stream comprises differential encoding.
“3. The method of claim 2, wherein differential encoding comprises recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence, and wherein only the differences for each position with respect to the reference-based compressed position data stream are recorded.
“4. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data file further comprises entropy coding.
“5. The method of claim 1, wherein the order preserving encryption scheme is configured to retrieve resulting encrypted data for each row of the data structure without decrypting a whole block data.
“6. The method of claim 1, wherein encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream comprises encrypting the position-based alignment information with a symmetric encryption scheme.
“7. The method of claim 6, wherein the symmetric encryption scheme is a stream cipher.
“8. The method of claim 7, wherein the symmetric encryption scheme is a block cipher operating in a stream cipher mode.
“9. A method to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage, comprising: Receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; Retrieving from the storage, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; Decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2], including independently decrypting variant information for each row of a data structure in the storage that stores the compressed encrypted alignment data stream, providing privacy control of specific compressed encrypted alignment data within the stored compressed encrypted alignment data stream; and Decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2], wherein the method results in increased storage efficiency or faster genomic data queries, and wherein decoding the reference-based compressed position data stream comprises retrieving a metadata information block and decoding the metadata information block in accordance with an encoding embodiment.
“10. The method of claim 9, further comprising: reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data, wherein a character is a start marker for each short read in the portion-based alignment information data stream, the start marker followed by metadata information regarding at least a nucleotide base identified at a position with an associated quality score.
“11. The method of claim 9, wherein retrieving the subset of the compressed encrypted alignment data stream for the genomic alignment range [Pos1, Pos2] comprises retrieving the symmetric encrypted data and the metadata stored in data blocks between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“12. The method of claim 11, wherein decrypting the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2] comprises symmetric decryption of the symmetric encrypted data between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“13. The method of claim 12, wherein the symmetric decryption scheme is a stream decipher.
“14. The method of claim 12, wherein the symmetric decryption scheme is a block decipher operating in a stream decipher mode.
“15. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises entropy decoding.
“16. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises differential decoding.
“17. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data file further comprises text coding algorithms, and wherein the reference-based compressed position data file is a compact binary reference-based compressed position data file.
“18. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data file further comprises variable length coding, wherein the variable length coding is configured to compress differences found in reference-based compression, and wherein the variable length coding is configured to compress differences found in mapping quality scores.
“19. The method of claim 15, wherein the entropy decoding is VLC decoding.
“20. The method of claim 9, wherein the encoding embodiment is a gunzip reverse algorithm, and wherein decoding comprises concatenating the reference-based compressed position data stream, the position-based alignment information data stream, and the metadata information block to reconstruct the genomic data alignment information.”
For the URL and more information on this patent, see: Ayday, Erman. Methods to compress, encrypt and retrieve genomic alignment data.
(Our reports deliver fact-based news of research and discoveries from around the world.)