QC Sequencing Technologies at Addgene

By Alyssa Shepard

Previously, we provided a general overview of the QC process at Addgene. All plasmids go through this same initial QC process using an Illumina MiSeq, but that’s not the only technology Addgene uses to ensure accuracy of deposited plasmids. To resolve QC issues, assembly issues, and difficult-to-sequence plasmids, other sequencing and bioinformatic technologies are used. 

Sequencing technologies and bioinformatic analysis improve every year. At Addgene, we want to maintain the highest quality and accuracy during the QC process, so we work hard to stay up to date with these improvements. In this post, we discuss the sequencing technologies that Addgene uses in the QC process, when these technologies are used, and how Addgene keeps up with constantly evolving technologies.

Sequencing technologies used at Addgene

Sanger sequencing (“First-generation sequencing”) 

Sanger sequencing, developed in 1977 by Fred Sanger, was the first widely used DNA sequencing technology. The process relies on “sequencing-by-synthesis,” where each base is recognized and recorded as a DNA strand is synthesized (Figure 1). Sanger sequencing requires four separate reactions, where a fraction of one of the four bases (A, C, T, or G) is supplied as a dideoxyribonucleoside triphosphate (ddNTP) so that after its addition, DNA synthesis stops. The resulting DNA fragments are pooled and then separated by size using electrophoresis, typically capillary electrophoresis. Capillary electrophoresis reads a fluorescent label on the terminating nucleotide and records it in a “trace” file to determine the sequence.

Steps for Sanger sequencing accompanied by cartoon depictions of each step. Step one, template strand added to reactions with primer, polymerase, dNTPs, and fluorescently-labeled ddNTPs. Step two, strand amplification. ddNTPs stop chain extension. Step three, terminated DNA strands are pooled. Step four, capillary gel electrophoresis and fluorescence detection. Step five, sequence analysis and reconstruction.
Figure 1: Overview of Sanger sequencing technology. Created with BioRender.com.
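The chain-termination idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function name `sanger_read` is hypothetical, the primer is ignored, and exactly one terminated fragment is assumed to exist for every position of the newly synthesized strand.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_read(template: str, seed: int = 0) -> str:
    """Toy model of Sanger sequencing (hypothetical helper, not a real tool).

    Returns the sequence of the newly synthesized strand, read 5' to 3',
    which is the reverse complement of the template.
    """
    # The polymerase builds the strand complementary to the template.
    newly_made = "".join(COMPLEMENT[base] for base in reversed(template.upper()))

    # Each ddNTP incorporation stops synthesis, so the pool contains one
    # fragment ending at every position (assumption: perfect coverage).
    fragments = [newly_made[:n] for n in range(1, len(newly_made) + 1)]

    # Pooling the reactions mixes the fragments in no particular order.
    random.Random(seed).shuffle(fragments)

    # Electrophoresis separates fragments by size, shortest first.
    by_size = sorted(fragments, key=len)

    # The detector records the label on the terminating (last) base.
    return "".join(fragment[-1] for fragment in by_size)


print(sanger_read("ATGC"))  # GCAT
```

Reading the last base of each fragment in size order is what reconstructs the sequence, which is why the fragments must be resolved at single-base resolution.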

Sanger sequencing was considered the gold standard until about 2008 when a new, higher-throughput sequencing-by-synthesis technology hit the market: short-read sequencing, also known as next-generation sequencing (NGS). However, Sanger sequencing remains useful for small sample sizes and shorter regions due in part to its low cost, quick results, and high accuracy. On the downside, you can only run one sample at a time, which can become costly and impractical for whole-genome (and often, whole-plasmid) sequencing.

Prior to short-read sequencing, Addgene relied on Sanger sequencing for QC. We occasionally still use Sanger sequencing to confirm an insert that is repetitive or otherwise hard to sequence using NGS or long-read technologies.

Short-read sequencing (“Second-generation sequencing”) 

Short-read sequencing also relies on sequencing by synthesis, but modifies the original Sanger procedure to improve throughput by increasing the number of reactions. The main steps of short-read sequencing are highlighted in Figure 2.

 Steps for short-read sequencing accompanied by cartoon depictions of each step. Step one, input DNA is fragmented. Step two, sequencing adapters attached (plus barcodes if multiple samples). Step three, samples added to a chip (flow cell). Complementary oligos anchor DNA from adapters. Step four, templates amplified via PCR. Step five, each cycle, one fluorescently-labeled ddNTP is incorporated into the complementary DNA and the instrument images the chip to identify the bases in sequence. Step six, bioinformatic analysis to deconvolute data.
Figure 2: Overview of short-read sequencing technology. Created with BioRender.com.

After sequencing results are deconvoluted to identify the bases (and, if necessary, separate different samples), the resulting data, or reads, are essentially a lot of puzzle pieces. An assembly algorithm is employed to put these puzzle pieces together. Once they have the full picture of the puzzle, the QC team compares it to the plasmid information or depositor-provided sequences to confirm the correct plasmid components. For more information on this process, see our blog post on NGS plasmid quality control.
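To make the puzzle analogy concrete, here is a toy greedy overlap assembler. It is a minimal sketch for illustration, not the algorithm used in Addgene's pipeline; production assemblers such as SPAdes are built on de Bruijn graphs and handle sequencing errors, repeats, and both strands. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

If the function returns more than one contig, the reads did not overlap enough to form a single sequence, which is the kind of result that would call for a closer look.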

NGS is generally cheaper if you have many samples and can output more data, and it typically results in high accuracy for full plasmids. However, NGS might not be the best choice for small sample sizes, small regions of interest, or complex sequences such as GC-rich regions, secondary structures, or repetitive regions. Short-read sequencing can also be time-consuming, depending on the library prep and instrumentation used.

Here at Addgene, we sequence every plasmid we receive using NGS technology because it is compatible with high-throughput workflows. However, in some instances we use long-read sequencing or Sanger sequencing instead, either to resolve repetitive regions or to achieve higher accuracy for a particular sample.

Long-read sequencing (“Third-generation sequencing”) 

In the past decade or so, the newest kid on the block in the sequencing world has been long-read sequencing. Whereas the individual reads generated in short-read and Sanger sequencing are limited to hundreds of base pairs, long-read sequencing can read sequence lengths in the thousands or even millions of base pairs. Several companies have worked on this technology, but Oxford Nanopore is one of the most popular options.

Nanopore sequencing deviates from the synthesis-based sequencing methods (Figure 3). It uses (aptly named) “nanopores” that are embedded in the flow cell membrane and have an electrical current running through them. DNA or RNA is pulled through the pore via motor proteins. As each nucleotide passes through the nanopore, it causes a characteristic drop in the electrical current recorded by the instrument. Base-calling software can then identify the nucleotide that passed through based on these characteristic drops.

Steps for nanopore sequencing accompanied by cartoon schematics of each step. Step one, DNA is unwound by the motor protein and one strand is translocated through the pore to the positive side of the membrane. Step two, each base gives a characteristic reduction in the ionic current and is read by base-calling software. For step one, a nanopore is shown with the motor protein above, and the DNA strand passing through. The bottom of the nanopore is anchored to the membrane, which is positively charged on the bottom, negatively charged on the top. The second step shows an example graph of the ionic current output. The bases are on the x-axis, and a line shows the characteristic drop for each base.
Figure 3: Overview of Nanopore sequencing technology. Created with BioRender.com.

Nanopore can provide longer read lengths of 10 kb to 100 kb (or 4 Mb in ultra-long read mode), enough to sequence an entire plasmid in one read. Runs can be completed in anywhere from under an hour to 72 hours, with real-time data output. This method is great for resolving complex and repetitive sequences, but can be costly, low-throughput, and less accurate than short-read sequencing.

At Addgene, we routinely use Nanopore technology to sequence our packaged viral DNA for quality control. This gives us a faster turnaround time than short-read sequencing, as well as the ability to sequence an entire viral genome in one read. We also use Nanopore sequencing to QC plasmids that are highly repetitive, such as those containing multiple copies of the same gene cassette.

Bioinformatics at Addgene

After a run on our Illumina MiSeq, the data is imported into our assembly pipeline that runs with Seqera Nextflow. The workflow converts the raw sequencing data to FASTQ format, cleans and normalizes reads using the BBTools suite, and assembles circularized plasmid sequences via a combination of open-source software such as SPAdes, Shovill, apc, and NOVOPlasty. The pipeline contains three de novo assemblers, so if one fails to generate a circular plasmid sequence, it moves to the next. If all three attempts fail, the QC team manually checks and assembles the raw data. 
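The "try the next assembler if one fails" logic can be sketched in a few lines of Python. This is an illustration only: Addgene's actual pipeline runs in Nextflow, and the function `assemble_with_fallback`, the `Assembler` type, and the stand-in assemblers below are hypothetical. Treating a crash the same as a failed assembly is also an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: an assembler takes a FASTQ path and returns a
# circular plasmid sequence, or None if it could not produce one.
Assembler = Callable[[str], Optional[str]]


def assemble_with_fallback(
    fastq_path: str,
    assemblers: Sequence[Tuple[str, Assembler]],
) -> Optional[Tuple[str, str]]:
    """Try each assembler in order and return (name, sequence) for the
    first one that succeeds. Returns None if all of them fail, which
    signals that the sample needs manual assembly."""
    for name, run in assemblers:
        try:
            sequence = run(fastq_path)
        except Exception:
            # Assumption: a crashed assembler counts as a failed attempt.
            sequence = None
        if sequence is not None:
            return name, sequence
    return None


# Stand-in assemblers for demonstration; real ones would call external tools.
def first_assembler(path: str) -> Optional[str]:
    return None  # pretend no circular sequence was found


def second_assembler(path: str) -> Optional[str]:
    return "ATGCCGTAGG"  # pretend this one succeeded


result = assemble_with_fallback(
    "sample.fastq",
    [("first", first_assembler), ("second", second_assembler)],
)
print(result)  # ('second', 'ATGCCGTAGG')
```

Keeping the assemblers in an ordered list makes the priority explicit and makes it easy to add or reorder tools without touching the fallback logic.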

Addgene’s de novo assembly pipeline usually generates the correct plasmid sequence. However, if there are major differences from the expected sequence, the QC team can manually revisit the assembly in a program called Geneious. Within Geneious, the raw FASTQ reads are used for either guided or new de novo assembly. If a full anticipated sequence is provided, it’s used as a scaffold to map the reads, which boosts accuracy, especially in repeated regions. If this doesn’t resolve the assembly issues, we can redo de novo assembly within Geneious.

While this may seem like it is simply repeating the steps that gave us our initial incorrect assembly, de novo assembly within Geneious gives us additional insight into the assembly. Geneious will show how reads align, reveal multiple contigs, and highlight sequence identities in problem areas, helping us pinpoint the issue and resolve the assembly.

Keeping up the pace

Our QC team faces a large volume of QC analysis every week and is always exploring ways to improve its processes, quality, and efficiency through methods like automation. All assemblies from our pipeline are now run through an auto-align process that matches sequence results to the full-length predicted sequences submitted by depositors. A sample automatically passes QC if its assembled sequence is a 100% match to the predicted sequence in both base identity and plasmid length. Any mismatch or indel flags the sequence for manual review by one of our QC team members.
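A pass-or-review check of this kind might look like the sketch below. The function names are hypothetical, and two details are assumptions rather than a description of Addgene's script: because plasmids are circular, the comparison tolerates a different start position, and it also accepts the reverse-complement strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_match(assembled: str, expected: str) -> bool:
    """True if the two circular sequences are identical in length and
    base identity, regardless of start position or strand."""
    a, e = assembled.upper(), expected.upper()
    if len(a) != len(e):
        return False  # any indel changes the plasmid length
    doubled = e + e  # every rotation of `e` is a substring of e + e
    return a in doubled or reverse_complement(a) in doubled


def qc_decision(assembled: str, expected: str) -> str:
    """Auto-pass on a perfect match, otherwise flag for a person."""
    return "pass" if circular_match(assembled, expected) else "manual review"


expected = "ATGCCGTAGG"
print(qc_decision("CGTAGGATGC", expected))  # pass (same plasmid, rotated)
print(qc_decision("CGTAGGATGA", expected))  # manual review (one mismatch)
print(qc_decision("CGTAGGATG", expected))   # manual review (one-base deletion)
```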

This process saves the QC team time, but does not leave much flexibility for discrepancies in backbone sequences. Most theoretical sequences submitted to Addgene differ slightly from the sequenced plasmid due to in silico variations and lack of verification. To address this issue, the QC team developed a separate script that lets them manually enter known innocuous differences. This script checks for a 100% match in base identity and length, but disregards the specified discrepancies. Any sequence that the script finds to be a perfect match can then continue in the auto-align process. This approach is typically reserved for deposits that contain many plasmids within the same backbone.
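One way to picture such a script is a position-by-position comparison with an allow-list. The sketch below is hypothetical and is not Addgene's script: the function name, the `allowed` mapping (0-based position to the base that is known to be harmless there), and the assumption that both sequences start at the same coordinate are all choices made for illustration.

```python
from typing import Dict


def matches_with_exceptions(
    assembled: str,
    expected: str,
    allowed: Dict[int, str],
) -> bool:
    """True if `assembled` equals `expected` in length and base identity,
    ignoring substitutions listed in `allowed`.

    `allowed` maps a 0-based position to the base that is accepted at
    that position in the assembled sequence (hypothetical format).
    """
    if len(assembled) != len(expected):
        return False  # length must still match exactly
    pairs = zip(assembled.upper(), expected.upper())
    for position, (seen, predicted) in enumerate(pairs):
        if seen != predicted and allowed.get(position) != seen:
            return False  # an unlisted discrepancy: send to manual review
    return True


expected = "ATGCCGTAGG"
assembled = "ATGCAGTAGG"  # differs from `expected` at position 4 (C to A)

print(matches_with_exceptions(assembled, expected, {4: "A"}))  # True
print(matches_with_exceptions(assembled, expected, {}))        # False
```

Listing the accepted base at each position, rather than only the position, keeps the check strict: a different, unexpected change at the same site would still be flagged.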

The flexible auto-alignment, plus additional automation in the backend, has enabled the QC team to keep up with the pace of deposits while maintaining quality.

Maintaining high-quality plasmids for open science

Addgene is committed to maintaining high QC standards for all deposited materials. The QC team stays informed of emerging technologies in sequencing and bioinformatic analysis. At Addgene, we use a mix-and-match approach for sequencing methods and choose the best method based on the plasmid. By working with the Data team, the QC team has also improved efficiency in many of their methods to pass materials through the QC process quickly while maintaining accuracy. 

This post was written by Alyssa Shepard, with significant contributions from Alyssa Cecchetelli, Cary Valley, and Holly McQueary.


References and Resources

References

Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P., & Barron, A. E. (2011). Landscape of Next-Generation Sequencing Technologies. Analytical Chemistry, 83(12), 4327–4341. https://doi.org/10.1021/ac2010857 

Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12), 5463–5467. https://doi.org/10.1073/pnas.74.12.5463 

Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135–1145. https://doi.org/10.1038/nbt1486 

Slatko, B. E., Gardner, A. F., & Ausubel, F. M. (2018). Overview of Next‐Generation sequencing Technologies. Current Protocols in Molecular Biology, 122(1). https://doi.org/10.1002/cpmb.59 

Wang, Y., Zhao, Y., Bollas, A., Wang, Y., & Au, K. F. (2021). Nanopore sequencing technology, bioinformatics and applications. Nature Biotechnology, 39(11), 1348–1365. https://doi.org/10.1038/s41587-021-01108-x
