Block based compression and encryption of genetic information in VCF files for controlled access
Files
Gautot_33931900_2024.pdf
Open access - Adobe PDF
- 2.32 MB
Details
- Abstract
- In recent years, the development of new technologies has brought down the cost of genome sequencing by several orders of magnitude. This has ushered in a new era of genomic analysis and has significantly increased the amount of genomic data that needs to be handled. Genomic variants are of utmost importance for research because some of them influence biological functions. The most common way to work with these variants is through Variant Call Format (VCF) files, which are often impractical to work with because of their size. Although many compression algorithms have been developed for these files over the last few years, none of them offers a free, open-source solution that compresses the VCF file in its entirety while allowing random access to the data. Additionally, although genomic information is considered sensitive data, security is often not incorporated into these algorithms. We introduce the Secure VCF Compressor (SVC), which fills both of these gaps: an open-source algorithm designed for compression and encryption of VCF files, with support for random access. SVC's novel approach allows users to freely define regions of the file that will be independently compressed and encrypted. Thanks to its random access capabilities, SVC can decrypt and decompress only the small parts of its archives that are relevant to the users' queries, while keeping the rest of the file confidential. SVC combines GNU zip and GVC, an algorithm specialized in compressing genotype information, to compress the data, and then uses 256-bit AES-GCM to provide confidentiality and integrity for the data. While currently unable to compete with commercial solutions in terms of compression, SVC still consistently performs better than the de facto solution and reaches compression ratios of up to 200:1 on files from the 1000 Genomes Project. Its security features are currently unmatched by any other academically available algorithm.
The main contribution of this work is a free, open-source, random-access-capable algorithm for VCF compression and encryption that enables easier and much more secure handling, sharing and processing of sensitive genetic information.
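
The block-based idea described in the abstract can be illustrated with a minimal sketch. It is not SVC's implementation: the function names `pack_blocks` and `read_block`, the fixed number of lines per block, and the toy records are all assumptions made for illustration. The sketch uses only gzip from the Python standard library, so the GVC step and the per-block AES-GCM encryption named in the abstract are left out. It shows only how independently compressed blocks plus an offset index allow one block to be read without touching the others.

```python
import gzip


def pack_blocks(lines, block_size):
    """Compress fixed-size groups of lines independently.

    Returns the concatenated archive and an index holding one
    (offset, length) pair per block. Blocking by line count is an
    assumption of this sketch, not SVC's region definition.
    """
    archive = bytearray()
    index = []
    for start in range(0, len(lines), block_size):
        raw = "\n".join(lines[start:start + block_size]).encode("utf-8")
        blob = gzip.compress(raw)
        # A design with encryption would encrypt `blob` here, per block.
        index.append((len(archive), len(blob)))
        archive += blob
    return bytes(archive), index


def read_block(archive, index, block_no):
    """Decompress a single block; all other blocks stay untouched."""
    offset, length = index[block_no]
    raw = gzip.decompress(archive[offset:offset + length])
    return raw.decode("utf-8").split("\n")


# Toy variant records (hypothetical data, not a complete VCF file).
records = [f"chr1\t{pos}\t.\tA\tG" for pos in range(100, 110)]
archive, index = pack_blocks(records, block_size=4)

# Random access: only the second block is decompressed.
second_block = read_block(archive, index, 1)
print(second_block)
```

Because each block is a self-contained gzip stream, a query needs only the index entry for the relevant block. In a scheme with encryption, the same property would let each block be decrypted on its own while the rest of the archive stays confidential.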