This post was contributed by guest blogger Aneesh Karve, CTO at Quilt Data. This post was originally published on the Quilt Genomics Blog and is republished here with permission. Quilt is a collaborative database for genomics. In this article, Quilt CTO Aneesh Karve shows how to design experiments that work anywhere in the genome. Aneesh's research interests include proteomics, machine learning, and visualization for big biology.
GPS for the genome
We can think of the human genome as a map with three coordinates: chromosome, start, and stop. For instance, (chr3, 1, 10) indicates a stretch of DNA at the very beginning of the third chromosome, ten base pairs in length. An emerging family of sequencing techniques functions as a kind of “GPS for the genome” to compute coordinates for genetic elements like protein, RNA, and DNA (Table 1). As with GPS in the real world, coordinates alone aren’t very useful. We’ll need something like Google Maps to help us identify and visualize addresses. That’s where enhancers and genome math come in. They help us to transform raw genomic coordinates into meaningful experiments.
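As a minimal sketch of the coordinate idea, the snippet below models a region as a small Python record. The `Region` class and its method names are hypothetical, and it assumes the 1-based, inclusive convention implied by the (chr3, 1, 10) example. Tools that read BED files expect 0-based, half-open coordinates instead, so a conversion helper is included.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A stretch of DNA in 1-based, inclusive coordinates (an assumption)."""
    chrom: str
    start: int
    stop: int

    def length(self) -> int:
        # Both endpoints are counted, so (1, 10) spans ten base pairs.
        return self.stop - self.start + 1

    def to_bed(self) -> tuple:
        # BED uses 0-based, half-open coordinates: shift only the start.
        return (self.chrom, self.start - 1, self.stop)


region = Region("chr3", 1, 10)
print(region.length())  # 10
print(region.to_bed())  # ('chr3', 0, 10)
```

Mixing the two conventions is a common source of off-by-one errors, so it helps to pick one and convert at the boundaries.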
Table 1: An emerging family of "GPS for the genome" techniques
|Technique|What It Locates|
|---|---|
|ChIP-seq|Proteins (for our example later, histones)|
|Hi-C|DNA (genome-to-genome interactions)|
|DNase-seq|DNA (regions that are accessible for binding)|
Google Maps: Enhancers and genome math
Suppose that you wish to use Google Maps to find all coffee shops near your house, excluding Starbucks. Taking a nerdy perspective, you might denote your search as follows:
(my_house + coffee) - starbucks
See how the notation works? The + operator denotes intersection and the – operator denotes set difference. That’s the intuition for how genome math helps us to locate interesting addresses in the genome. Let’s now examine how we can locate powerful stretches of DNA known as enhancers with the help of genome math.
Enhancers are regions of DNA that demonstrate “spooky action at a distance.” Through the marvel of DNA compaction, an enhancer can increase the expression of a gene that is millions of base pairs away. (For details on DNA compaction and the structural proteins that make it possible, see the appendix “DNA is a 3D Fractal.”)
Enhancer biology is a complex and dynamic field. We’re going to focus on a tried-and-true method of finding enhancers by isolating genomic regions that are bound to modified proteins called histones. We can detect modified histones with “GPS for proteins,” ChIP-seq from Table 1. Because of DNA’s 3D geometry and the chemical properties of modified histones, a genomic region that has mono-methylated and acetylated histones, but not tri-methylated histones, typically functions as an enhancer. We can therefore denote enhancers as follows:
(mono_methylation + acetylation) - tri_methylation
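To make the notation concrete, here is a minimal sketch of the two operators on genomic intervals. It assumes a single chromosome, 0-based half-open `(start, end)` tuples, and input lists that are sorted and non-overlapping. The function names and the toy coordinates are hypothetical, not real ChIP-seq peaks.

```python
def intersect(a, b):
    """The + operator: regions covered by both a and b."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a, b):
    """The - operator: parts of a not covered by any interval in b."""
    out = []
    for start, end in a:
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor:
                continue  # b lies entirely before the uncovered part
            if b_start >= end:
                break  # b lies entirely after this interval
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            out.append((cursor, end))
    return out


# Toy peaks (hypothetical coordinates).
mono_methylation = [(100, 500), (900, 1200)]
acetylation = [(300, 700), (1000, 1100)]
tri_methylation = [(400, 450)]

enhancers = subtract(intersect(mono_methylation, acetylation), tri_methylation)
print(enhancers)  # [(300, 400), (450, 500), (1000, 1100)]
```

In practice a dedicated tool such as bedtools handles multiple chromosomes and unsorted input; the sketch only illustrates what the operators compute.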
In the next section we’ll apply the above formula to a real-world experiment. We’ll start with ChIP-seq data from the ENCODE project, find enhancers in embryonic stem cells, and conclude with a targeted CRISPR screen that can disrupt these enhancers.
A real-world experiment
Suppose you run a ChIP-seq experiment (think “GPS for proteins”) for NANOG, an essential transcription factor in embryonic stem cells (ESCs). Your ChIP-seq finds just over 13,000 significant binding peaks for NANOG in the human genome. But not all of those 13,000 regions are important for maintaining ESCs. So which of these 13,000 regions are critical? One hypothesis: the enhancers! This leads us to a three-step approach for designing an experiment to identify the critical NANOG binding sites:
1. Find enhancers that have NANOG binding sites
2. Design a CRISPR screen to target and disrupt the NANOG enhancers
3. CRISPR out the enhancers from step 2. See which ESCs die or differentiate
Step 3 reveals which NANOG-related genes are critical to stem cell survival. Knowing which genes influence the survival of our cell culture is the foundation of modern drug discovery and therapeutics. We’ll have more to say about clinical applications of CRISPR in the next section.
In order to denote the NANOG enhancers from step 1 with genome math, we’ll need a bit of shorthand from the field of epigenomics:
- H3 – one of NANOG’s associated histone proteins
- K4 and K27 – locations of the amino acid lysine in H3
- me1, me3, and ac – denoting mono-methylation, tri-methylation, and acetylation, respectively (these are chemical modifications, or functional groups, found on lysine)
Putting it all together, we get the following expression for step 1:
(H3K4me1 + H3K27ac) - H3K4me3
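Once the expression has produced a set of enhancer intervals, step 1 still needs to keep only those that carry a NANOG binding site. The sketch below shows one way to do that filtering. It assumes `(chrom, start, end)` tuples in 0-based half-open coordinates, and the function names and toy coordinates are hypothetical rather than ENCODE data.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def nanog_bound_enhancers(enhancers, nanog_peaks):
    """Keep enhancers that overlap a NANOG peak on the same chromosome."""
    peaks_by_chrom = defaultdict(list)
    for chrom, start, end in nanog_peaks:
        peaks_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in enhancers:
        peaks = peaks_by_chrom.get(chrom, [])
        if any(overlaps((start, end), peak) for peak in peaks):
            kept.append((chrom, start, end))
    return kept


# Toy inputs (hypothetical coordinates).
enhancer_regions = [("chr1", 300, 400), ("chr1", 1000, 1100), ("chr2", 50, 80)]
nanog_peaks = [("chr1", 350, 360), ("chr2", 200, 260)]

print(nanog_bound_enhancers(enhancer_regions, nanog_peaks))
# [('chr1', 300, 400)]
```

The pairwise comparison is quadratic per chromosome, which is fine for a toy example; for roughly 13,000 peaks a sorted sweep or an interval tree would be the more usual choice.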
The following video demonstrates how anyone can find enhancers with Quilt.
Seek and destroy enhancers with CRISPR
Armed with the genome math expression for NANOG enhancers, we’re ready for step 2: design a CRISPR screen to disrupt these enhancers.
The third and final step is to conduct our CRISPR screen. We start by infecting millions of embryonic stem cells (ESCs) with a lentiviral vector, an attenuated retrovirus in the same family as HIV. By design, our lentiviruses are genetically programmed to CRISPR out the enhancers we targeted in step 2. The result is a heterogeneous population of stem cells, usually housed in a single flask. Through a bit of stochastic magic and Poisson statistics, each sub-population has, on average, one distinct enhancer disrupted.
As our ESCs die or differentiate over time, we periodically use next-generation sequencing to measure the relative proportion of guide RNAs (gRNAs) across the population. Recall that guide RNAs are the targeting mechanism for CRISPR. Therefore, if a gRNA drops or disappears over time, we infer that the enhancer it targets is a “pillar of function” for our stem cells. Remove this pillar and the ESC dies.
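The depletion readout described above can be sketched as a log2 fold change in each gRNA's share of sequencing reads between an early and a late time point. This is a simplified illustration, not the analysis used for any particular screen. The function name, the pseudocount, the depletion threshold of -1, and the read counts are all assumptions.

```python
import math


def log2_fold_changes(early, late, pseudocount=1.0):
    """Compare each gRNA's share of reads between two time points."""
    grnas = list(early)
    # A pseudocount keeps gRNAs with zero late reads from producing log(0).
    early_total = sum(early[g] + pseudocount for g in grnas)
    late_total = sum(late.get(g, 0) + pseudocount for g in grnas)

    changes = {}
    for g in grnas:
        early_frac = (early[g] + pseudocount) / early_total
        late_frac = (late.get(g, 0) + pseudocount) / late_total
        changes[g] = math.log2(late_frac / early_frac)
    return changes


# Toy read counts per gRNA (hypothetical).
early_counts = {"g1": 100, "g2": 100, "g3": 100}
late_counts = {"g1": 10, "g2": 150, "g3": 140}

changes = log2_fold_changes(early_counts, late_counts)
# Assumed cutoff: a gRNA counts as depleted if its share at least halves.
depleted = [g for g, lfc in changes.items() if lfc < -1.0]
print(depleted)  # ['g1']
```

Real screens usually include replicates and several gRNAs per target, and use a statistical model rather than a fixed cutoff.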
If you’re interested in designing your own CRISPR screens for enhancers, check out the Appendix.
Conclusion: Shedding light on dark matter
Precise knowledge of which stretches of the genome are pillars for stem cells, or metastasized tumor cells, or Alzheimer’s-affected neurons, or [your cell line of interest], is the foundation of precision medicine. We can apply this knowledge to create targeted disease therapies with minimal side effects on healthy cells and maximal effect on unhealthy cells.
Until recently, the human genome has been full of dark matter: enhancers, lncRNAs, repetitive elements, repressors, insulators, and more. We know that this matter exists, but traditional approaches to study its function have been prohibitively difficult. CRISPR, in combination with the techniques from Table 1, provides us with powerful GPS-like techniques to explore dark matter in the genome. There are countless unknown regions yet to explore. I hope that this brief guide can help you do just that.
Good luck, and always keep going.
DNase Hypersensitive Sites and Intergenic CRISPRs
Yours truly generated the DHS data set by starting with all of the DHS sites from 125 different human cell types from the ENCODE Project. DHS sites are the most inclusive markers of regulatory regions in the genome, including enhancers, promoters, insulators, and more. I then identified valid gRNA sequences without off-target effects for the 2+ million DHS sites. See below for further details.
DNA is a 3D Fractal
People commonly think of DNA as a linear polymer of A, T, G, and C nucleotides arranged in a double helix. People are wrong. In reality, cellular DNA is a complex three-dimensional globule of chromatin. Chromatin is a combination of coiled DNA and structural proteins called histones. Chromatin folds into secondary structures (loops) and tertiary structures (globules) to achieve exquisite compaction (on the order of 700 terabytes per gram) and form the X-shaped chromosomes we all know and love.
To understand compaction, suppose you have a piece of string that’s 10 meters long. You twist and roll the string into a tight ball. Now the two ends of the string, instead of being 10 meters apart, are mere millimeters apart. Similarly, cells compact DNA in a way that brings linearly distant regions close together.
gRNA Selection and Filtering for Off-target Effects
We first generated a multi-FASTA file of the hg19 genome using bedtools getfasta. These regions and their reverse complements were parsed for SpCas9 PAM sites (NGG) and then filtered based on two main criteria: no TTTTT allowed (this is a polymerase terminator), and no off-target effects for the identified 23-mer gRNA. Off-target determination was established with Bowtie2 using the parameters first described in Kearns et al.:
bowtie2 -f -x HG19_GENOME --local -k 10 --very-sensitive-local -L 9 -N 1 -U GRNA_23MERS -S GRNA_HITS.sam
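The PAM scan and TTTTT filter that precede the alignment step can be sketched in a few lines of Python. This is an illustration only, not the pipeline that produced the data set. It assumes each 23-mer is a 20-nt protospacer followed by a 3-nt NGG PAM, applies the TTTTT filter to the whole 23-mer, and skips windows containing ambiguous bases. The function names and test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_grnas(seq):
    """Yield (strand, 23-mer) pairs that end in an NGG PAM and lack TTTTT."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - 22):
            site = s[i:i + 23]
            if site[21:] != "GG":
                continue  # positions 22-23 must be GG; position 21 is the N
            if "TTTTT" in site:
                continue  # terminator-like run of Ts
            if set(site) - set("ACGT"):
                continue  # skip windows with N or other ambiguous bases
            yield strand, site


# Toy region (hypothetical sequence): one valid site on the forward strand.
region_seq = "ACGTACGTACGTACGTACGTAGG"
print(list(candidate_grnas(region_seq)))
# [('+', 'ACGTACGTACGTACGTACGTAGG')]
```

The surviving 23-mers would then be written to a FASTA file and aligned with the Bowtie2 command above to check for off-target hits.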
Many thanks to our guest blogger, Aneesh Karve!
Aneesh Karve is CTO at Quilt Data. Quilt is a collaborative database. Aneesh's research interests span machine learning, proteomics, and user interfaces for math.
Resources on the Addgene Blog
- Catch Up on Your CRISPR Background with CRISPR 101
- Learn about CRISPR Delivery Methods
- Validate Your Genome Edit