Wednesday, October 13, 2021

Scalable Genomics Annotation Analysis with OpenCRAVAT in Microsoft Azure

Co-authored by Prof. Dr. Rachel Karchin, Kym Pagel, Ph.D., and Rick Kim, Ph.D.

 

In this blog post, we describe how OpenCRAVAT is integrated with Azure Data Science Virtual Machines. "OpenCRAVAT is a Python package that performs genomic variant interpretation including variant impact, annotation, and scoring. There is a web-based version of OpenCRAVAT (https://run.opencravat.org) but it can also be installed locally and is easy to integrate into bioinformatics pipelines. OpenCRAVAT has a modular architecture with a wide variety of analysis modules that can be selected and installed/run based on the needs of a given study. The modules are made available via the CRAVAT Store and are developed both by the CRAVAT team and the broader variant analysis community. OpenCRAVAT is a product of the Karchin Lab at Johns Hopkins University in collaboration with In Silico Solutions with funding provided by the National Cancer Institute’s ITCR program."

 

Professor Rachel Karchin, Institute for Computational Medicine, Johns Hopkins University: "Microsoft Azure makes large-scale integrative analysis with OpenCRAVAT easy in interactive environments like Jupyter Notebooks and RStudio."

 

Overview

 

"OpenCRAVAT is a modular Python package that is available from the PyPI repository via pip. It takes a file of genomic variants as input. The most common input format is a VCF file, but other formats are supported, including dbSNP identifiers and the 23andMe and Ancestry.com file formats. The analysis performed by OpenCRAVAT depends upon user-selected annotation and visualization options, available for download from the free OpenCRAVAT Store. In addition to the interactive user interface, OpenCRAVAT provides several output formats including text reports, Excel spreadsheets, and a SQLite database of results used by cravat_view.

 

There are more than 150 different modules in the app store. Each module can be assigned one or more tags, which include allele frequency, cancer, cardiovascular, clinical relevance, converters, evolution, functional studies, genes, interactions, literature, non-coding, reporters, variant effect prediction, variants, and visualization.

  • Converters (input formats): TSV, VCF, Ancestry.com, 23andMe, FamilyTreeDNA
  • Reporters (output formats): Text format, Excel, TSV, CSV, Annotated VCF"
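
Because one of the outputs is a SQLite database, results can also be inspected with Python's standard library. The following is a minimal sketch, not part of OpenCRAVAT itself: the helper names list_tables and count_rows are hypothetical, and the file path is whatever result database your own job produced. It reads the table names from the database schema rather than assuming a particular layout.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Count the rows in one table; the name is checked against the schema first."""
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        # Safe to interpolate: the name was validated against the schema above.
        (n,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return n
```

For example, calling list_tables on a result file shows which tables are available before any further queries are written against them.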

nn00.JPG

Assistant Research Scientist, Institute for Computational Medicine, Johns Hopkins University - Kym Pagel: ‘Compared to similar services, the Azure interface for developers is much more intuitive, which is incredibly valuable as data get larger and more complex.’

 

OpenCRAVAT in Microsoft Azure

 

It is fairly simple to get OpenCRAVAT up and running on Microsoft Azure. We recommend selecting the F2s v2 virtual machine (VM) for small jobs, and the F16s v2 VM for heavier loads that include multiple samples with whole genome sequencing. After the VM has started, ssh into it and run a few commands to install all necessary components:

  • To install OpenCRAVAT, run pip3 install open-cravat

We recommend that users pull the store modules from the Genomics Data Lake when running a VM on Azure. This dataset is a mirror of the store at https://store.opencravat.org and https://run.opencravat.org. To facilitate this, we provide a small script for downloading the relevant modules.

  • Download azcopy
  • Determine the annotation and analysis modules that you’d like to download. View all available options with oc module ls -a
  • Download the import_modules.py script, and place it in the same directory as azcopy
  • To run the script, type python3 import_modules.py module1 module2
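
The steps above can be sketched in Python as a small pre-flight check. This is only an illustration under stated assumptions: the helper names missing_prerequisites and build_import_command are hypothetical, the module names in the usage comment are examples, and nothing here reflects what import_modules.py does internally. It only checks that azcopy and the script sit in the same directory and assembles the command line shown in the last bullet.

```python
import subprocess
from pathlib import Path


def missing_prerequisites(workdir="."):
    """Return the names of required files that are absent from workdir."""
    workdir = Path(workdir)
    missing = []
    # azcopy ships as "azcopy" on Linux/macOS and "azcopy.exe" on Windows.
    if not any((workdir / name).is_file() for name in ("azcopy", "azcopy.exe")):
        missing.append("azcopy")
    if not (workdir / "import_modules.py").is_file():
        missing.append("import_modules.py")
    return missing


def build_import_command(modules):
    """Build the argument list for: python3 import_modules.py module1 module2 ..."""
    modules = list(modules)
    if not modules:
        raise ValueError("name at least one module to download")
    return ["python3", "import_modules.py", *modules]


# Example usage (not executed here); "clinvar" and "gnomad" are example module names:
# if not missing_prerequisites("."):
#     subprocess.run(build_import_command(["clinvar", "gnomad"]), cwd=".", check=True)
```

Running the command with cwd set to the shared directory keeps the script next to azcopy, as the instructions require.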

For more information, consult the genomicsnotebook guide to downloading specific databases and deploying a Data Science VM on Azure for OpenCRAVAT at https://github.com/microsoft/genomicsnotebook/blob/main/sample-notebooks/genomics-opencravat.ipynb

 

Lead Architect, Institute for Computational Medicine, Johns Hopkins University - Rick Kim: ‘Microsoft Azure has been fantastic for delivering our genomic analysis solution, OpenCRAVAT, to our clients with ease and convenience.’

 

Azure Deployment Steps

 

Step 1. Visit: https://github.com/microsoft/genomicsnotebook/blob/main/sample-notebooks/genomics-opencravat.ipynb

 

nn1.JPG

 

 

Step 2. Click ‘Deploy To Azure’

nn2.JPG

Step 3. Select the relevant parameters for VM deployment: Subscription, Resource Group, etc.

 

nn3.JPG

 

Step 4. Once the deployment is ready, use RDP to log in to the VM. Below is the desktop of the Azure Data Science VM for OpenCRAVAT. Users can find the installation instructions, OpenCRAVAT documentation, and sample datasets in the mounted folder.

 

nn4.JPG

 

Step 5. OpenCRAVAT landing page. Users need to select the ‘Reference Genome’ and the path of the input file.

 

nn5.JPG
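
The same two choices, reference genome and input file, also apply when OpenCRAVAT is driven from the command line with oc run. The sketch below assembles such a command in Python. The helper name build_oc_run_command is hypothetical, and the file name, annotator, and reporter in the usage comment are example values; check oc run -h on your installation for the options it actually supports.

```python
import subprocess


def build_oc_run_command(input_path, genome="hg38", annotators=(), reporters=()):
    """Build an argument list for `oc run` without executing it.

    -l selects the reference genome, -a the annotator modules,
    and -t the report (output) formats.
    """
    if not input_path:
        raise ValueError("an input file path is required")
    cmd = ["oc", "run", str(input_path), "-l", genome]
    if annotators:
        cmd += ["-a", *annotators]
    if reporters:
        cmd += ["-t", *reporters]
    return cmd


# Example usage (not executed here); the values are illustrative:
# subprocess.run(build_oc_run_command("sample.vcf", "hg38", ["clinvar"], ["excel"]), check=True)
```

Building the argument list separately from running it makes the command easy to log or review before a long annotation job starts.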

 

 

 

Step 6. OpenCRAVAT store for on-demand annotation modules

 

nn7.JPG

 

OpenCRAVAT Datasets on Azure Genomics Data Lake

 

Users can explore and use the OpenCRAVAT datasets from the Azure Genomics Data Lake. For further information on using Azure Storage Explorer with this dataset, please see the instructions listed under References (item 6).

 

 

nn8.JPG
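
As an alternative to the graphical explorer, the dataset can be copied with azcopy. The following is a rough sketch only: build_azcopy_copy_command is a hypothetical helper, and the source URL in the usage comment is a placeholder, not the dataset's real location, which should be taken from the Azure Open Datasets page for OpenCRAVAT.

```python
import subprocess


def build_azcopy_copy_command(source_url, destination_dir, recursive=True):
    """Build an argument list for `azcopy copy` without executing it."""
    if not source_url.startswith("https://"):
        raise ValueError("source_url should be an https:// blob storage URL")
    cmd = ["azcopy", "copy", source_url, str(destination_dir)]
    if recursive:
        # Copy the whole folder tree rather than a single blob.
        cmd.append("--recursive")
    return cmd


# Example usage (not executed here); replace the placeholder URL with the real one:
# src = "https://<account>.blob.core.windows.net/<container>/<path>"
# subprocess.run(build_azcopy_copy_command(src, "./modules"), check=True)
```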

 

References

 

  1. https://opencravat.org/about.html
  2. genomicsnotebook/genomics-opencravat.ipynb at main · microsoft/genomicsnotebook (github.com)
  3. OpenCravat - Azure Open Datasets | Microsoft Docs
  4. Microsoft Genomics
  5. https://github.com/microsoft/genomicsnotebook
  6. https://github.com/microsoft/genomicsnotebook/blob/main/docs/Genomics_Data_Lake_Azure_Storage_Explorer.pdf
Posted at https://sl.advdat.com/3DDNYn5