“Methods To Compress, Encrypt And Retrieve Genomic Alignment Data” in Patent Application Approval Process (USPTO 20190087601)
2019 APR 10 (NewsRx) -- By a
This patent application is assigned to
The following quote was obtained by the news editors from the background information supplied by the inventors: “Next-Generation Sequencing Data Processing
“Next-generation sequencing (NGS) or massively parallel sequencing (MPS) technologies have significantly decreased the cost of DNA sequencing in the past decade. NGS has broad applications in biology and has dramatically changed research and diagnostic methodologies. Advances in high-throughput sequencing technologies are spurring the production of a huge amount of genomic data. For example, the 1000
“Moreover, next generation sequencing data are increasingly used as a tool in medical practice, such as routine diagnosis, where security and privacy are a major concern. The main threats to genomic data are (i) the disclosure of an individual’s genetic characteristics due to the leakage of his/her genomic data and (ii) the identification of an individual from his/her own genome sequence. For example, as part of a clinical trial, the genetic information of a patient, once leaked, could be linked to the disease under study (or to other diseases), which can have serious consequences such as denial of access to life insurance or to employment for the individual participant. There is therefore a need for more secure genomic data management methods that address the privacy threat models that are specific to the genomic data processing systems and workflows.
“Next Generation Sequencing Data Formats and Workflows
“Next generation sequencers typically output a series of short reads, sequences of a few hundred nucleotides with associated quality score estimates, in data files such as FASTQ files. This raw sequencing data is further analyzed in the bioinformatics pipeline by aligning the raw short reads to a reference genome, and identifying the specific variants as the differences relative to the reference genome.
“In general, geneticists prefer storing aligned, raw genomic data of the patients, in addition to their variant calls (which include each nucleotide of the DNA sequence only once and are hence much more compact). Sequence alignment/map files such as the human readable SAM files and their more compact, machine-readable binary version BAM files are the de facto standards used for DNA alignment data produced by next-generation DNA sequencers (http://samtools.github.io/hts-specs/SAMv1.pdf). There are hundreds of millions of short sequencing reads (each including between 100 and 400 nucleotides) in the SAM file of a patient. Each nucleotide is present in several short reads in order to have statistically high coverage of each patient’s DNA.
“Genomic Data Compression
“There are different approaches to dealing with the compression of genomic data. Before high-throughput technologies were introduced, there existed algorithms designed for compressing genomic sequences of relatively small size (e.g., tens of megabases), for instance BioCompress (in Grumbach, S. & Tahi, F. Compression of DNA sequences, in
“More recently, various advanced compression algorithms have been proposed to further improve the compression of high-throughput DNA sequence data, such as Quip (Jones, D. C., Ruzzo, W. L., Peng, X. & Katze,
“Genomic Data Security
“Some genomic data encryption solutions have been proposed on top of some compression algorithms, such as the encryption option in cramtools for the CRAM genomic data compression format (http://www.ebi.ac.uk/ena/software/cram-toolkit), but they remain straightforward applications of encryption standards and do not take into consideration the specific genomic data storage and genomic data processing threat models, even if the solution uses highly secure encryption primitives (e.g., the AES encryption method). In particular, the data retrieval process may cause incidental leakage of sensitive genomic information. Once leaked, genomic information could be abused in various ways, such as for denial of employment and health insurance, blackmail or even genetic discrimination. Establishing a secure and privacy-preserving solution for genomic data storage is therefore needed in order to facilitate the trusted usage, storage and transmission of genomic data.
“Recent research works have thus highlighted a number of specific threats to be addressed by genomic data security and privacy-preserving technologies. For instance, public aggregated statistics in genome-wide association studies (GWAS) may lead to a potential privacy breach for participants of the study, because attackers can determine, through powerful statistical tests, whether a participant is in a case group (Homer, N. et al. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4, e1000167 (2008)). Data de-identification (removal of the personal identifiers) has also been proven insufficient for protecting genetic privacy (Erlich, Y. & Narayanan, A. Routes for breaching and protecting genetic privacy. Nat Rev Genet 15, 409-421 (2014)). Coarse-grained encryption and access control to genomic data may also lead to incidental genomic findings that doctors would prefer to avoid (Ayday, E., Raisaro, J. L., Hengartner, U., Molyneaux, A. & Hubaux, J.-P. Privacy-Preserving Processing of Raw Genomic Data. in Data Privacy Management and Autonomous Spontaneous Security, 133-147 (Springer Berlin Heidelberg, 2014)).
“Storing sequenced data on a cloud seems to be an attractive option, considering the size and the required availability of the data, so that it can be more easily shared by different parties. Accessing remote data stored with standard compression schemes requires decrypting it first, so the data owner has to trust insiders on the cloud (e.g., the cloud administrator, or high-privileged system software) to access the genomic information in the clear, and multi-party key management systems have to be carefully designed accordingly. Ayday et al. (Ayday, E., Raisaro, J. L., Hengartner, U., Molyneaux, A. & Hubaux, J.-P. Privacy-Preserving Processing of Raw Genomic Data. in Data Privacy Management and Autonomous Spontaneous Security, 133-147 (Springer Berlin Heidelberg, 2014) and WO2014/202615) proposed the use of order-preserving encryption to enable genetic data retrieval without requiring full decryption of the genomic data. While it addresses the security issues associated with genomic information privacy threats, the latter scheme requires further data overhead, which induces extra storage and processing costs and makes it impractical for certain clinical genomic applications.
“Therefore, it will be of great benefit to the future development of sequenced data analysis if a compression solution can also integrate encryption methods that are secure and suitable for preserving the privacy of genomic information in the decompression and decryption process.
“Genomic Data Access
“For clinical or research purposes, the most valuable information from human genomic data is the set of genetic variants that are identified among the three billion genome positions. However, current state-of-the-art sequencers produce millions of short reads, scattered over the whole genome and covering each position multiple times. Typically, sequenced data are taken as input for a pipeline and retrieved for downstream analyses, e.g., variant calling. Taking this usage scenario into account, it is also crucial to have a storage format that is efficient for downstream analyses. For example, it is a common practice to aggregate information on a position from all short reads that cover the position (pileup), hence it is desirable that the storage format organizes information in this manner.
“An increasing number of medical units (pharmaceutical companies or physicians) are willing to outsource the storage of genomes generated in clinical trials. As the medical unit would not own the genome, this is a good argument to convince clinical-trial participants to be sequenced and use their genomes to stratify clinical trials. Acting as a third party, a biobank storage unit could store patients’ genomic data that would be used by the medical units for clinical trials. In the meantime, the patient can also benefit from the stored genomic information by interrogating his own genomic data, together with his family doctor, for specific genetic predispositions, susceptibilities and metabolic capacities. The major challenge here is to manage the data access rights to preserve the privacy of patients’ genomic data while allowing the medical units to operate on specific parts of the genome (for which they are authorized). In WO2014/202615, Ayday et al. proposed a privacy-preserving genomic data processing system based on order-preserving encryption that is suitable to encrypt, store and facilitate the private partial retrieval of aligned genomic data files such as SAM files in a biobank. However, this system does not address the storage compression efficiency and there is therefore a need to further improve it.
“In addition to combining the compression and encryption of next-generation-sequencing (NGS) data for efficient and privacy-preserving storage, this would also require us to consider who can access the data, which part they can access and how the data can be partially retrieved. Without these precautions, there could be incidental leakage during data retrieval, even if it is stored in an encrypted form. Furthermore, the storage and access efficiency of sequenced data needs to be further optimized without compromising the security requirements. Better methods and systems to process genomic data information are needed that consistently address all these problems (security and privacy, storage, partial retrieval) to minimize the storage cost without compromising the privacy of genomic data information while optimizing the performance of downstream analysis (e.g., variant calling).”
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “Some embodiments of the present disclosure are directed to methods to encode genomic data alignment information organized as a read-based alignment information data stream, comprising the steps of: transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream; encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; and encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream.
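The transposing step named in the summary can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `pileup`, the `(start, sequence)` read representation and the sample reads are assumptions made for illustration, and only ungapped alignments (no insertions, deletions or clipping) are handled.

```python
from collections import defaultdict


def pileup(reads):
    """Transpose read-based alignments into a position-based view.

    `reads` is a list of (start_position, sequence) pairs, where
    start_position is the reference coordinate of the first base.
    Hypothetical simplification: alignments are ungapped.
    """
    columns = defaultdict(list)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            # Every base of the read lands in the column of its reference position.
            columns[start + offset].append(base)
    # Return the columns ordered by ascending reference position.
    return dict(sorted(columns.items()))


# Three overlapping short reads (made-up data).
reads = [(100, "ACGT"), (102, "GTTA"), (103, "TTAC")]
columns = pileup(reads)
for position, bases in columns.items():
    print(position, "".join(bases))
```

Each output line lists one reference position followed by the bases that the overlapping reads contribute to it, which is the per-position organization that downstream steps such as variant calling consume.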
“In some embodiments, encoding the position-based alignment information data stream into a reference-based compressed position data stream may comprise a step of differential encoding. In a possible embodiment, differential encoding may comprise recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence. In a possible embodiment, encoding the position-based alignment information data stream into a reference-based compressed position data file may comprise a step of entropy coding.
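One way to picture the differential encoding described above is the following Python sketch. It is an illustration under stated assumptions rather than the applicants' encoder: the record layout `(depth, differences)`, the function names and the sample data are hypothetical, and quality scores and other alignment fields are ignored.

```python
def differential_encode(columns, reference, ref_start):
    """Keep, for each position, only the bases that differ from the reference.

    `columns` maps a reference position to the list of bases observed there;
    `reference` is the reference sequence starting at coordinate `ref_start`.
    Each position is encoded as (depth, [(read_index, base), ...]).
    """
    encoded = {}
    for position, bases in columns.items():
        ref_base = reference[position - ref_start]
        differences = [(i, b) for i, b in enumerate(bases) if b != ref_base]
        encoded[position] = (len(bases), differences)
    return encoded


def differential_decode(encoded, reference, ref_start):
    """Rebuild the position-based columns from the recorded differences."""
    columns = {}
    for position, (depth, differences) in encoded.items():
        bases = [reference[position - ref_start]] * depth
        for i, b in differences:
            bases[i] = b
        columns[position] = bases
    return columns


# Made-up example: one read disagrees with the reference at position 103.
reference = "ACGTT"
columns = {100: ["A"], 101: ["C"], 102: ["G", "G"],
           103: ["T", "C", "T"], 104: ["T", "T"]}
encoded = differential_encode(columns, reference, ref_start=100)
decoded = differential_decode(encoded, reference, ref_start=100)
print(encoded[103])
```

Because most bases agree with the reference, most difference lists are empty, and such a sparse stream is a natural input for the entropy coding step that the summary also mentions.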
“In some embodiments, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position information with an order-preserving encryption scheme. In a possible embodiment, encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream may comprise a step of encrypting the position-based alignment information with a symmetric encryption scheme. The symmetric encryption scheme may be a stream cipher, such as the AES scheme in CTR mode.
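The two encryption layers described above can be sketched in Python as follows. This is a toy under loudly stated assumptions, not the scheme of the application: to stay within the standard library, a keystream built from HMAC-SHA256 in counter mode stands in for AES in CTR mode, and `toy_ope` is a deliberately simple, insecure stand-in for a real order-preserving encryption scheme. All names, keys and payloads are hypothetical.

```python
import hashlib
import hmac


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    """One 32-byte keystream block (stand-in for an AES-CTR block)."""
    message = nonce + counter.to_bytes(8, "big")
    return hmac.new(key, message, hashlib.sha256).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt `data` by XOR with a counter-mode keystream."""
    out = bytearray()
    for start in range(0, len(data), 32):
        block = keystream_block(key, nonce, start // 32)
        chunk = data[start:start + 32]
        out.extend(a ^ b for a, b in zip(chunk, block))
    return bytes(out)


def toy_ope(key: bytes, position: int) -> int:
    """Toy order-preserving map: sum of keyed positive gaps for 1..position.

    Strictly increasing in `position`, but NOT secure and O(position) slow.
    """
    total = 0
    for i in range(1, position + 1):
        digest = hmac.new(key, b"gap" + i.to_bytes(8, "big"), hashlib.sha256).digest()
        total += 1 + digest[0]  # each gap is between 1 and 256
    return total


stream_key = b"demo stream key (not for real use)"
ope_key = b"demo OPE key (not for real use)"
nonce = b"block-0001"

payload = b"depth=3;diffs=1:C"          # made-up compressed record
ciphertext = ctr_xor(stream_key, nonce, payload)
recovered = ctr_xor(stream_key, nonce, ciphertext)

encrypted_positions = [toy_ope(ope_key, p) for p in (100, 103, 250)]
```

The point of the split is that the encrypted positions can still be compared and sorted by a storage unit, while the record contents remain unreadable without the stream key.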
“Some embodiments of the present disclosure are directed to methods to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, comprising the steps of: receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; retrieving from the storage unit, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2]; and decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2].
“In a possible embodiment, retrieving genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, may further comprise a step of reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data.
“In a possible embodiment, retrieving the subset of the compressed encrypted alignment data stream for the genomic alignment range [Pos1, Pos2] comprises retrieving the symmetric encrypted data and the metadata stored in data blocks between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
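The range retrieval described above can be sketched in Python with a binary search over encrypted positions. The sketch assumes, for illustration only, that the storage unit keeps one opaque record per encrypted position in sorted order; `placeholder_ope` is a trivially increasing function standing in for a keyed order-preserving encryption scheme, and all names and values are hypothetical.

```python
import bisect


def placeholder_ope(position: int) -> int:
    """Placeholder for a keyed order-preserving encryption (offers no secrecy)."""
    return 7 * position + 3


def retrieve_range(index, records, enc_pos1, enc_pos2):
    """Return the stored records whose encrypted position is in [enc_pos1, enc_pos2].

    `index` is a sorted list of order-preserving-encrypted positions and
    `records[i]` is the opaque, symmetrically encrypted record for index[i].
    Only ciphertexts are compared; plaintext positions are never needed here.
    """
    lo = bisect.bisect_left(index, enc_pos1)
    hi = bisect.bisect_right(index, enc_pos2)
    return records[lo:hi]


# Storage side: encrypted positions plus opaque records (made-up data).
stored_positions = [100, 200, 300, 400]
index = [placeholder_ope(p) for p in stored_positions]
records = [b"record@100", b"record@200", b"record@300", b"record@400"]

# Client side: encrypt the query bounds [Pos1, Pos2] = [150, 320] and send them.
result = retrieve_range(index, records, placeholder_ope(150), placeholder_ope(320))
print(result)
```

Because the encryption preserves order, the bounds of the query can be located among the stored ciphertexts without decrypting anything, and only the records returned for the range need to be decrypted by the requester.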
“In a possible embodiment, decrypting the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2] comprises symmetric decryption of the symmetric encrypted data between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2. In a possible embodiment, the symmetric decryption scheme may be a stream decipher, such as the AES scheme in CTR mode.
“In a possible embodiment, decoding the position-based alignment information data stream into reference-based compressed position data stream may comprise a step of entropy decoding. In a possible embodiment, decoding the position-based alignment information data stream into reference-based compressed position data stream may comprise a step of differential decoding.”
The claims supplied by the inventors are:
“1. A method to encode genomic data alignment information organized as a read-based alignment information data stream, comprising: Transposing, with a processor, the read-based alignment information data stream into a position-based alignment information data stream; Encoding, with a processor, the position-based alignment information data stream into a reference-based compressed position data stream; Encrypting, with a processor, the reference-based compressed position data stream into a compressed encrypted alignment data stream.
“2. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data stream comprises differential encoding.
“3. The method of claim 2, wherein differential encoding comprises recording, for each position in the reference-based compressed position data stream, the alignment differences relative to the alignment reference sequence.
“4. The method of claim 1, wherein encoding the position-based alignment information data stream into a reference-based compressed position data file further comprises entropy coding.
“5. The method of claim 1, wherein encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream comprises encrypting the position information with an order-preserving encryption scheme.
“6. The method of claim 1, wherein encrypting the reference-based compression position data stream into a compressed encrypted alignment data stream comprises encrypting the position-based alignment information with a symmetric encryption scheme.
“7. The method of claim 6, wherein the symmetric encryption scheme is a stream cipher.
“8. The method of claim 7, wherein the symmetric encryption scheme is a block cipher operating in a stream cipher mode.
“9. A method to retrieve genomic data alignment information from a compressed encrypted alignment data stream, recorded on a storage unit, comprising: Receiving a genomic alignment range query [Pos1, Pos2] from a genomic data analysis system; Retrieving from the storage unit, with a processor, the subset of the compressed encrypted alignment data stream corresponding to the genomic alignment range [Pos1, Pos2] in the compressed encrypted alignment data stream; Decrypting, with a processor, the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2]; Decoding, with a processor, the reference-based compressed position data stream into a position-based alignment information data stream corresponding to the genomic alignment range [Pos1, Pos2].
“10. The method of claim 9, further comprising reverse transposing, with a processor, the position-based alignment information data stream into a read-based alignment information data
“11. The method of claim 9, wherein retrieving the subset of the compressed encrypted alignment data stream for the genomic alignment range [Pos1, Pos2] comprises retrieving the symmetric encrypted data and the metadata stored in data blocks between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2.
“12. The method of claim 11, wherein decrypting the compressed encrypted alignment data stream into a reference-based compressed position data stream corresponding to the genomic alignment range [Pos1, Pos2] comprises symmetric decryption of the symmetric encrypted data between the order-preserving encrypted position associated with Pos1 and the order-preserving encrypted position associated with Pos2;
“13. The method of claim 12, wherein the symmetric decryption scheme is a stream decipher.
“14. The method of claim 12, wherein the symmetric decryption scheme is a block decipher operating in a stream decipher mode.
“15. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises entropy decoding.
“16. The method of claim 9, wherein decoding the position-based alignment information data stream into reference-based compressed position data stream comprises differential decoding.”
For the URL and more information on this patent application, see: MOLYNEAUX, Adam; AYDAY, Erman; HUBAUX, Jean-Pierre; GARCIA, Jesus; HUANG, Zhicong; LIN, Huang. Methods To Compress, Encrypt And Retrieve Genomic Alignment Data. Filed
(Our reports deliver fact-based news of research and discoveries from around the world.)


