This post was updated on Dec 4, 2017.
At Addgene, we continually use the Basic Local Alignment Search Tool (BLAST) provided by NCBI. BLAST helps us compare the sequencing results of the plasmids in our repository with known reference sequences, such as full plasmid sequences provided by the laboratories that deposit their plasmids with us or other entries in NCBI’s numerous databases.
As our repository has grown over the years (we now have over 60,000 plasmids!), so has the number of sequencing results we analyze as part of our quality control process. In a busy week, we may need to analyze more than 200 plasmids. Consequently, our team has refined our use of the BLAST web browser interface to be as efficient as possible.
If you frequently find yourself on the BLAST website verifying plasmids or validating new clones, try these tips to make the most of your time and your sequences! You might also enjoy seeing how our quality control process has changed with the introduction of next generation sequencing!
Choosing a BLAST Program
Of the five BLAST programs available, we primarily use Standard Nucleotide BLAST (blastn), Standard Protein BLAST (blastp), and Translated BLAST (blastx). NCBI has a terrific getting-started guide for BLAST, which includes a simple explanation of the different BLAST programs, databases, and elements of the BLAST search pages.
At Addgene, we use blastn to identify any discrepancies in Sanger sequences, such as mismatches, deletions, or insertions. We use blastp or blastx to compare our sequencing results to protein sequences to check open reading frames (ORFs) and determine the potential effect of any nucleotide discrepancies. The blastp and blastx programs are optimized differently and you may want to select one (or both) depending on the information you want to verify. We will delve into these differences below.
Optimizing blastn Searches
On the Standard Nucleotide BLAST page, the first decision to make is whether to compare a Sanger sequencing result to a single known reference sequence or to a BLAST sequence database. If you know the expected nucleotide sequence, check the “Align two or more sequences” checkbox and paste your reference sequence into the Subject Sequence box that appears. Aligning two nucleotide sequences is probably the fastest BLAST search to perform and will save you time compared to other types of BLAST searches.
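For readers who eventually script this comparison instead of using the web form, the same two-sequence alignment can be sketched with the standalone BLAST+ blastn program. This is a minimal sketch under stated assumptions, not a description of Addgene's pipeline: it assumes BLAST+ is installed locally, the file names read.fasta and reference.fasta are hypothetical, and the function only builds the command without running it.

```python
import shlex

def pairwise_blastn_command(query_fasta, subject_fasta, task="blastn"):
    """Build (but do not run) a BLAST+ command that aligns two sequences."""
    return [
        "blastn",
        "-query", query_fasta,      # Sanger read in FASTA format (hypothetical file)
        "-subject", subject_fasta,  # expected reference sequence (hypothetical file)
        "-task", task,              # "megablast" or "blastn"
        "-outfmt", "7",             # tabular output with comment lines
    ]

cmd = pairwise_blastn_command("read.fasta", "reference.fasta")
print(shlex.join(cmd))
```

Passing the list to subprocess.run would execute the search. The task is set explicitly because the standalone blastn program, like the web page, defaults to megablast.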
If you do not know the exact reference sequence for your result, choose one of the BLAST sequence databases from the dropdown menu. Typically, we use the default nucleotide database “Nucleotide collection (nr/nt)” as it contains a composite of GenBank, EMBL, DDBJ, and PDB sequences and may be the most comprehensive for searching.
Timesaving Tip #1: If you know the species that your sequencing result should match, enter the common or scientific name into the Organism box. This small piece of information can significantly reduce your wait time for blastn, blastp, and blastx searches!
Now, before you click the BLAST button, consider the Program Selection parameter, as this will affect the amount of time to perform the search as well as the overall alignment results. The default setting is “Optimize for Highly similar sequences (megablast)”, which is very fast and works best when the identity between your sequence and the reference/database sequence is ≥ 95%. [Our QC process would be trouble-free and much faster if 95% of our results were always correct!]
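To make the 95% figure concrete, here is a small sketch of how percent identity can be computed from an alignment, counting matching columns over the full alignment length (gaps included). The function names and the toy sequences are illustrative assumptions, and the 95% cutoff is only the rule of thumb quoted above, not a hard limit enforced by BLAST.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which both sequences carry the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"  # a gap column never counts as a match
    )
    return 100.0 * matches / len(aligned_query)

def suggested_program(identity, cutoff=95.0):
    """Rule of thumb from the text: megablast at or above the cutoff, else blastn."""
    return "megablast" if identity >= cutoff else "blastn"

# Toy alignment: one mismatch in ten columns.
pid = percent_identity("ACGTACGTAC", "ACGTACGAAC")
print(f"{pid:.1f}% identity -> {suggested_program(pid)}")
```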
Since Sanger sequencing reactions are imperfect and sequence near the beginning or end of a reaction is often unreliable, we suggest using the “Somewhat similar sequences (blastn)” program for blastn so that you can extract practically every single reliable base pair from your results.
This option is not as fast as megablast, but can return longer alignments to compare with your sequencing trace file. Unlike megablast, the regular blastn program uses a smaller word size and lower scoring penalties for mismatches and gaps in the alignment. If you are curious about the differences in the blastn programs, check out the BLAST Help webpage.
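The effect of word size can be illustrated with a toy example. BLAST starts an alignment from short exact matches ("words"), so a read with scattered errors may contain no exact match as long as a large word but plenty as long as a small one. The sketch below is a deliberate simplification: it assumes the read and reference are already in register with no insertions or deletions and only compares words at the same position, whereas BLAST indexes words across all positions. The word sizes 28 and 11 are, to our understanding, the defaults for megablast and blastn; the sequence and helper names are made up for illustration.

```python
SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}

def add_mismatches(seq, positions):
    """Return seq with the base at each given position changed to a different base."""
    bases = list(seq)
    for pos in positions:
        bases[pos] = SWAP[bases[pos]]
    return "".join(bases)

def seed_starts(query, subject, word_size):
    """Start positions where query and subject share an exact word at the same offset."""
    limit = min(len(query), len(subject)) - word_size + 1
    return [
        i for i in range(limit)
        if query[i:i + word_size] == subject[i:i + word_size]
    ]

# Made-up 60-base reference and a "noisy read" with a mismatch every 20 bases.
reference = (
    "ACGTTGCAAG" "CTTAGGCTAA" "CGTTAGCCAT"
    "GGATCCGTAC" "GATCGTTAAC" "GGCCTATAGC"
)
read = add_mismatches(reference, [10, 30, 50])

print(len(seed_starts(read, reference, 28)))  # 0: no error-free stretch is 28 bases long
print(len(seed_starts(read, reference, 11)))  # 18: the 19-base clean stretches hold 11-base words
```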
Optimizing blastx Searches
Once you have used blastn to determine the reliable portion of a Sanger sequencing result and noted any potential mismatches or gaps in the nucleotide sequence, you can run a Translated BLAST (blastx) search to check for expected ORFs, mutations or truncations. A primary advantage of blastx is that you do not have to decide on a reading frame for your sequencing result – blastx checks all six possible frames against the database. Another benefit is that a frameshift mutation present in the ORF is readily apparent when viewing blastx results.
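To show what "all six possible frames" means, here is a minimal six-frame translation sketch using the standard genetic code. It is only an illustration of the idea, not blastx itself: the function names and the short example sequence are assumptions, incomplete trailing codons are dropped, and any codon containing an ambiguous base such as N is reported as X.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence (N is left unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become X, stops become *."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translations(seq):
    """Return translations keyed by frame: +1, +2, +3 and -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames

for frame, protein in six_frame_translations("ATGGCCAAGTAA").items():
    print(frame, protein)
```

A single-base insertion or deletion moves the downstream protein sequence from one of these frames into another, which is why a frameshift stands out in blastx output.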
When using blastx at Addgene, we use the default “Non-redundant protein sequences (nr)” database as it contains the largest number of protein sequences. Just below the BLAST button, you may have noticed the “Algorithm parameters” link. Click this link to view the advanced BLAST options, including the one setting we suggest changing for blastx. Like nucleotide sequences, proteins often contain repeated or low complexity regions, which a standard blastx search filters out by default. An alignment that omits these regions can be confusing, for example when you try to verify the starting methionine of a gene but the blastx alignment begins at a more distal amino acid. We consistently run our blastx searches with the “Low complexity regions” filter unchecked so that these regions are included in the search and the alignment length is maximized. While this recommendation is not infallible, we have found that turning off this default setting saves analysis time.
Timesaving Tip #2: blastx searches are inherently slower than blastn or blastp searches because the nucleotide sequence must be translated in all six possible reading frames and each translation compared against the database. If you know the expected protein sequence, use the “Align two or more sequences” option to drastically reduce the waiting time for search results.
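For a rough sense of what a low complexity region looks like, the sketch below flags windows of a protein sequence whose amino acid composition has low Shannon entropy. This is a crude stand-in, not the filter NCBI applies (which uses the more elaborate SEG algorithm); the window length, the threshold, the function names, and the example sequence are all illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a sequence window."""
    total = len(window)
    return sum(
        -(n / total) * math.log2(n / total) for n in Counter(window).values()
    )

def low_complexity_windows(protein, window=12, threshold=2.2):
    """Start positions of windows whose entropy is at or below the threshold."""
    return [
        i
        for i in range(len(protein) - window + 1)
        if shannon_entropy(protein[i:i + window]) <= threshold
    ]

# Made-up protein with a poly-glutamine stretch in the middle.
protein = "MKTAYIAKQR" + "Q" * 12 + "LVEDSGHWPF"
flagged = low_complexity_windows(protein)
print(f"windows flagged: {flagged[0]} to {flagged[-1]}" if flagged else "none flagged")
```

If a region like this sits at the start of your ORF, a filtered search may begin the alignment after it, which is the situation described above.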
Optimizing blastp Searches
Depending on the sequencing result, we choose between a Standard Protein BLAST (blastp) search and a blastx search to verify the expected protein sequence in a plasmid. If you know which reading frame to choose for your sequencing result and can easily translate it, we recommend using blastp over blastx. The primary advantage is the time savings, but an added benefit is that blastp searches do not filter low complexity regions by default, so you do not have to remember to adjust any blastp algorithm parameters. We use the default scoring matrix, BLOSUM62, but you may want to check the descriptions of the other matrices to see if another would be more advantageous for your search.
Timesaving Tip #3: The available protein databases are unlikely to contain an exact entry for your favorite gene fused to an epitope tag or fusion protein. If your sequencing primer was chosen to confirm that a tag or fusion protein is in-frame, we recommend using blastx with the “Align two or more sequences” option and pasting your expected protein sequence into the Subject Sequence box.
Depending on your sequencing result and desired analysis, BLAST may not always be your optimal choice. For difficult sequence alignments that BLAST is unable to handle, we often choose Clustal for pairwise or multiple sequence alignments of nucleotide or protein sequences. We also use COBALT for aligning multiple protein sequences, particularly for comparing different isoforms. Beyond our favorites, many other sequence alignment tools are available.
Try these resources for lists of alternatives to BLAST:
- ExPASy - http://www.expasy.org/genomics/sequence_alignment
- EMBL-EBI - http://www.ebi.ac.uk/services
Do you have any tips for using BLAST to confirm your plasmid sequencing results or comments on our suggestions? Share your thoughts here to help other labs speed up their plasmid and cloning verification steps and free up more time for using your plasmids instead!
All BLAST images are modified screen shots from the NCBI BLAST website.
Additional Resources on the Addgene Blog:
- Plasmids 101: How to Verify Your Plasmid
- Plasmids 101: An Inside Look at NGS Plasmid Quality Control
- Learn about our SnapGene-powered plasmid maps.