“Methods To Compress, Encrypt And Retrieve Genomic Alignment Data” in Patent Application Approval Process (USPTO 20220344005): Sophia Genetics S.A.
2022 NOV 15 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors:
“Next-Generation Sequencing Data Processing
“Next-generation sequencing (NGS) or massively parallel sequencing (MPS) technologies have significantly decreased the cost of DNA sequencing in the past decade. NGS has broad applications in biology and has dramatically changed research and diagnostic methodologies. Advances in high-throughput sequencing technologies are spurring the production of a huge amount of genomic data. For example, the 1000
“Moreover, next generation sequencing data are increasingly used as a tool in medical practice such as routine diagnosis, where security and privacy are a major concern. The main threats to genomic data are (i) the disclosure of an individual’s genetic characteristics due to the leakage of his/her genomic data and (ii) the identification of an individual from his/her own genome sequence. For example, as part of a clinical trial, the genetic information of a patient, once leaked, could be linked to the disease under study (or to other diseases), which can have serious consequences such as denial of access to life insurance or to employment for the individual participant. There is therefore a need for more secure genomic data management methods that address the privacy threat models that are specific to genomic data processing systems and workflows.
“Next Generation Sequencing Data Formats and Workflows
“Next generation sequencers typically output a series of short reads, that is, sequences of a few hundred nucleotides with the associated quality score estimates, in data files such as FASTQ files. This raw sequencing data is further analyzed in the bioinformatics pipeline by aligning the raw short reads to a reference genome, and identifying the specific variants as the differences relative to the reference genome.
“In general, geneticists prefer storing the aligned, raw genomic data of patients in addition to their variant calls (which include each nucleotide of the DNA sequence only once and are hence much more compact). Sequence alignment/map files such as the human-readable SAM files and their more compact, machine-readable binary version, BAM files, are the de facto standards used for DNA alignment data produced by next-generation DNA sequencers (http://samtools.github.io/hts-specs/SAMv1.pdf). There are hundreds of millions of short sequencing reads (each including between 100 and 400 nucleotides) in the SAM file of a patient. Each nucleotide is present in several short reads in order to have statistically high coverage of each patient’s DNA.
“Genomic Data Compression
“There are different approaches to dealing with the compression of genomic data. Before high-throughput technologies were introduced, there existed algorithms designed for compressing genomic sequences of relatively small size (e.g., tens of megabases), for instance BioCompress (in Grumbach, S. & Tahi, F. Compression of DNA sequences. in
“More recently, various advanced compression algorithms have been proposed to further improve the compression of high-throughput DNA sequence data, such as Quip (Jones, D. C., Ruzzo, W. L., Peng, X. & Katze,
“Genomic Data Security
“Some genomic data encryption solutions have been proposed on top of some compression algorithms, such as for instance the encryption option in cramtools for the CRAM genomic data compression format (http://www.cbi.ac.uk/cna/software/cram-toolkit), but they remain straightforward applications of encryption standards and do not take into consideration the specific genomic data storage and genomic data processing threat models even if the solution uses highly secure encryption primitives (e.g., the AES encryption method). In particular, the data retrieval process may cause incidental leakage of sensitive genomic information. Once leaked, genomic information could be abused in various ways, such as for denial of employment and health insurance, blackmail or even genetic discrimination. Establishing a secure and privacy-preserving solution for genomic data storage is therefore needed in order to facilitate the trusted usage, storage and transmission of genomic data.”
There is additional background information. Please visit the full patent to read further.
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “Some embodiments of the present disclosure are directed to methods to encode genomic data alignment information organized as a read-based alignment information data stream, comprising the steps of: transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream; encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; and encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream.
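The transposition step described in the summary can be pictured with a minimal Python sketch. It is an illustration only, not the applicants' implementation: the `(read_id, start, bases)` tuple layout and the function name `transpose_reads` are assumptions made here, and real alignment records (for example, CIGAR strings and quality scores in SAM files) are ignored.

```python
from collections import defaultdict

def transpose_reads(reads):
    """Turn a read-based stream into a position-based view.

    `reads` is a list of (read_id, start_position, bases) tuples, a
    hypothetical simplification of alignment records (no CIGAR handling).
    """
    by_position = defaultdict(list)
    for read_id, start, bases in reads:
        for offset, base in enumerate(bases):
            # Each base is filed under the reference position it aligns to.
            by_position[start + offset].append((read_id, base))
    # Sort by position so later stages see a monotonically increasing stream.
    return dict(sorted(by_position.items()))

reads = [("r1", 100, "ACGT"), ("r2", 102, "GTTA")]
pileup = transpose_reads(reads)
```

In this toy input, positions 102 and 103 are each covered by both reads, which is the kind of per-position grouping the later encoding step would operate on.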
“In some embodiments, encoding the position-based alignment information data stream into a reference-based compressed position data stream may comprise a step of differential encoding. In a possible embodiment, differential encoding may comprise recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence. In a possible embodiment, encoding the position-based alignment information data stream into a reference-based compressed position data file may comprise a step of entropy coding.
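A rough Python sketch of differential encoding followed by entropy coding is given here for illustration. The per-position layout (`depth` plus a list of mismatches), the function names, and the JSON serialization are assumptions made for this sketch. zlib's DEFLATE, which combines LZ77 with Huffman coding, merely stands in for whatever entropy coder an actual embodiment would use.

```python
import json
import zlib

def differential_encode(pileup, reference):
    """Keep, for each position, only the bases that differ from the reference.

    `pileup` maps position -> list of observed bases; `reference` maps
    position -> reference base. Both layouts are hypothetical.
    """
    diffs = {}
    for pos, bases in pileup.items():
        mismatches = [(i, b) for i, b in enumerate(bases) if b != reference[pos]]
        # Depth lets a decoder rebuild the matching bases from the reference.
        diffs[pos] = {"depth": len(bases), "diff": mismatches}
    return diffs

def entropy_code(diffs):
    """Serialize and compress; zlib is a stand-in for an entropy coder."""
    return zlib.compress(json.dumps(diffs, sort_keys=True).encode("utf-8"))

reference = {100: "A", 101: "C", 102: "G"}
pileup = {100: ["A", "A"], 101: ["C", "T"], 102: ["G"]}
diffs = differential_encode(pileup, reference)
blob = entropy_code(diffs)
```

Because most aligned bases agree with the reference, the mismatch lists are mostly empty, which is what makes the differential stream compress well.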
“In some embodiments, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position information with an order-preserving encryption scheme. In a possible embodiment, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position-based alignment information with a symmetric encryption scheme. The symmetric encryption scheme may be a stream cipher, such as the AES scheme in CTR mode.
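The two encryption roles named above can be sketched in Python using only the standard library. Both functions are toy stand-ins and are not secure: `toy_ope` is a hypothetical order-preserving map built from keyed positive gaps, and `keystream_xor` derives a counter-mode keystream from HMAC-SHA256 in place of AES in CTR mode, which the Python standard library does not provide. All names, keys, and the nonce are assumptions made for illustration.

```python
import hashlib
import hmac

def toy_ope(key: bytes, position: int) -> int:
    """Toy order-preserving map: a running sum of keyed positive gaps.

    Runs in O(position) time and offers no real security; it only shows
    that larger plaintext positions map to larger encrypted positions.
    """
    total = 0
    for i in range(position + 1):
        digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        total += 1 + digest[0]  # gap in 1..256, so the map strictly increases
    return total

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a counter-mode keystream (stand-in for AES-CTR)."""
    stream = bytearray()
    for counter in range((len(data) + 31) // 32):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream.extend(block.digest())
    # zip stops at len(data), so the surplus keystream is discarded.
    return bytes(a ^ b for a, b in zip(data, stream))

ope_key, data_key, nonce = b"k" * 16, b"d" * 16, b"n" * 8
enc_positions = [toy_ope(ope_key, p) for p in (5, 6, 40)]
payload = b"compressed alignment differences for one block"
ciphertext = keystream_xor(data_key, nonce, payload)
recovered = keystream_xor(data_key, nonce, ciphertext)
```

As with any counter-mode construction, applying the same keystream twice restores the plaintext, so one function serves for both encryption and decryption in this sketch.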
“Some embodiments of the present disclosure are directed to methods to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, comprising the steps of: receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; retrieving from the storage unit, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2]; and decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2].
“In a possible embodiment, retrieving genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, may further comprise a step of reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data.
“In a possible embodiment, retrieving the subset of the compressed encrypted alignment data stream for the genomic alignment range [Pos1, Pos2] comprises retrieving the symmetric encrypted data and the metadata stored in data blocks between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“In a possible embodiment, decrypting the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2] comprises symmetric decryption of the symmetric encrypted data between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2. In a possible embodiment, the symmetric decryption scheme may be a stream decipher, such as the AES scheme in CTR mode.
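To illustrate why order preservation matters for range queries, the following Python sketch selects stored blocks by comparing encrypted positions only. It is a simplified assumption-laden example: `toy_ope` is a hypothetical strictly increasing function standing in for a real order-preserving encryption scheme, and the store is modeled as a sorted list of `(encrypted_position, ciphertext)` pairs.

```python
from bisect import bisect_left, bisect_right

def toy_ope(position: int) -> int:
    # Hypothetical stand-in: any strictly increasing keyed map would do here.
    return 7 * position + 13

def retrieve_range(store, enc_pos1, enc_pos2):
    """Return blocks whose encrypted position lies in [enc_pos1, enc_pos2].

    The storage side compares encrypted positions only; it never needs
    the plaintext genomic coordinates.
    """
    keys = [enc_pos for enc_pos, _ in store]
    lo = bisect_left(keys, enc_pos1)
    hi = bisect_right(keys, enc_pos2)
    return store[lo:hi]

# Blocks are indexed by the encrypted position of their first coordinate.
store = sorted((toy_ope(p), f"block-{p}".encode()) for p in (100, 200, 300, 400))

# The querying side encrypts the range bounds before sending the query.
hits = retrieve_range(store, toy_ope(150), toy_ope(300))
```

Since the mapping preserves order, a binary search over encrypted positions returns the same blocks that a search over plaintext positions would.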
“In a possible embodiment, decoding the position-based alignment information data stream into reference-based compressed position data stream may comprise a step of entropy decoding. In a possible embodiment, decoding the position-based alignment information data stream into reference-based compressed position data stream may comprise a step of differential decoding.”
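The decoding direction can be sketched in Python as the inverse of a simple encoder. This assumes a hypothetical encoding in which each position stores a coverage depth and a list of `[index, base]` mismatches, serialized as JSON and compressed with zlib (used here as a stand-in for an entropy coder). None of these layout choices come from the patent application.

```python
import json
import zlib

def entropy_decode(blob: bytes) -> dict:
    """Undo the stand-in entropy coding (zlib) and restore integer positions."""
    raw = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {int(pos): entry for pos, entry in raw.items()}

def differential_decode(diffs: dict, reference: dict) -> dict:
    """Rebuild per-position bases from the reference plus recorded mismatches."""
    pileup = {}
    for pos, entry in diffs.items():
        # Start from `depth` copies of the reference base, then patch mismatches.
        bases = [reference[pos]] * entry["depth"]
        for index, base in entry["diff"]:
            bases[index] = base
        pileup[pos] = bases
    return pileup

reference = {100: "A", 101: "C"}
encoded = {"100": {"depth": 2, "diff": []}, "101": {"depth": 2, "diff": [[1, "T"]]}}
blob = zlib.compress(json.dumps(encoded).encode("utf-8"))
pileup = differential_decode(entropy_decode(blob), reference)
```

Only the positions inside the requested range need to pass through these two steps, which keeps the amount of decrypted and decoded data proportional to the query.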
The claims supplied by the inventors are:
“1. A method to encode genomic data alignment information organized as a read-based alignment information data stream, comprising the steps of: Transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream; Encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; Encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream.
“2. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data stream comprises a step of differential encoding.
“3. The method of claim 2, wherein differential encoding comprises recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence.
“4. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data file further comprises a step of entropy coding.
“5. The method of claim 1, wherein encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream comprises a step of encrypting the position information with an order-preserving encryption scheme.
“6. The method of claim 1, wherein encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream comprises a step of encrypting the position-based alignment information with a symmetric encryption scheme.
“7. The method of claim 6, wherein the symmetric encryption scheme is a stream cipher.
“8. The method of claim 7, wherein the symmetric encryption scheme is a block cipher operating in a stream cipher mode.
“9. A method to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, comprising the steps of: Receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; Retrieving from the storage unit, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; Decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2]; Decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2].
“10. The method of claim 9, further comprising the step of reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data.
“11. The method of claim 9, wherein retrieving the subset of the compressed encrypted alignment data stream for the genomic alignment range [Pos1, Pos2] comprises retrieving the symmetric encrypted data and the metadata stored in data blocks between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“12. The method of claim 11, wherein decrypting the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2] comprises symmetric decryption of the symmetric encrypted data between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“13. The method of claim 12, wherein the symmetric decryption scheme is a stream decipher.
“14. The method of claim 12, wherein the symmetric decryption scheme is a block decipher operating in a stream decipher mode.
“15. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises a step of entropy decoding.
“16. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises a step of differential decoding.”
For the URL and more information on this patent application, see: AYDAY, Erman; GARCIA, Jesus; HUANG, Zhicong; HUBAUX, Jean-Pierre; LIN, Huang; MOLYNEAUX, Adam. Methods To Compress, Encrypt And Retrieve Genomic Alignment Data. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)