At Addgene, we continually use the Basic Local Alignment Search Tool (BLAST) provided by NCBI. BLAST helps us compare the sequencing results of the plasmids in our repository with known reference sequences, such as full plasmid sequences provided by the laboratories that deposit their plasmids with us or other entries in NCBI’s numerous databases.
As our repository has grown over the years (we now have over 33,000 plasmids!), so has the number of sequencing results we analyze as part of our quality control process. In a busy week, we may need to analyze more than 100 sequences each day. Consequently, our team has refined how we use the BLAST web interface to make it as efficient as possible.
If you frequently find yourself on the BLAST website verifying plasmids or validating new clones, try these tips to make the most of your time and your sequence!
Choosing a BLAST Program
Of the five BLAST programs available, we primarily use Standard Nucleotide BLAST (blastn), Standard Protein BLAST (blastp), and Translated BLAST (blastx), depending on the plasmid region sequenced. NCBI has a terrific getting-started guide for BLAST, which includes a simple explanation of the different BLAST programs, databases, and elements of the BLAST search pages.
At Addgene, we use blastn to compare our sequencing results with empty vector or gene sequences to first determine how much of the result is trustworthy. Then we identify any discrepancies in the nucleotide sequence, such as mismatches, deletions, or insertions. We use blastp or blastx to compare our sequencing results to protein sequences to check open reading frames (ORFs) and determine the potential effect of any nucleotide discrepancies. The blastp and blastx programs are optimized differently and you may want to select one (or both) depending on the information you want to verify. We will delve into these differences below.
Optimizing blastn Searches
On the Standard Nucleotide BLAST page, the first decision we make is whether to compare our sequencing result to a single known reference sequence or to a BLAST sequence database. If you know the expected nucleotide sequence, check the “Align two or more sequences” checkbox and paste your reference sequence into the Subject Sequence box that appears. Aligning two nucleotide sequences is probably the fastest BLAST search to perform and will save you time compared to other types of BLAST searches.
If you do not know the exact reference sequence for your result, choose one of the BLAST sequence databases from the dropdown menu. Typically, we use the default nucleotide database “Nucleotide collection (nr/nt)” as it contains a composite of GenBank, EMBL, DDBJ, and PDB sequences and may be the most comprehensive for searching.
Timesaving Tip #1: If you know the species that your sequencing result should match, enter the common or scientific name into the Organism box. This small piece of information can significantly reduce your wait time for blastn, blastp, and blastx searches!
Now, before you click the BLAST button, consider the Program Selection parameter, as it affects both how long the search takes and the overall alignment results. The default setting is “Optimize for Highly similar sequences (megablast)”, which is very fast and works best when the identity between your sequence and the reference/database sequence is ≥ 95%. [Our QC process would be trouble-free and much faster if 95% of our results were always correct!]
Since sequencing reactions are imperfect and the sequence near the beginning or end of a read is often unreliable, we routinely select the “Somewhat similar sequences (blastn)” program for blastn so that we can extract practically every reliable base pair from our results.
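To make the 95% figure concrete: percent identity is usually reported as the number of identical positions divided by the alignment length, gaps included. The Python sketch below computes it for two already-aligned strings. The function name and the toy sequences are illustrative assumptions, not something taken from BLAST itself.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent identity over an alignment; '-' marks a gap in either row."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    # A column counts as a match only if both rows hold the same residue.
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"
    )
    return 100.0 * matches / columns


# One mismatch in ten columns
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # 90.0
# One gap in eight columns
print(percent_identity("ACGT-CGT", "ACGTACGT"))      # 87.5
```

By this measure, a read with more than one discrepancy per 20 aligned bases already falls below the range where megablast works best.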
This option is not as fast as megablast, but can return longer alignments to compare with your sequencing trace file. Unlike megablast, the regular blastn program uses a smaller word size and lower scoring penalties for mismatches and gaps in the alignment. If you are curious about the differences in the blastn programs, check out the BLAST Help webpage.
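For readers who prefer the command line, the same choice is available in NCBI's standalone BLAST+ package through the -task option of the blastn executable. The Python sketch below only assembles the argument list for a two-sequence alignment. It assumes BLAST+ is installed separately, and the FASTA file names are hypothetical placeholders.

```python
# Subset of the task names accepted by the BLAST+ blastn executable.
# In BLAST+, megablast defaults to a word size of 28 and blastn to 11.
SUPPORTED_TASKS = ("megablast", "dc-megablast", "blastn")


def build_blastn_command(query_fasta, subject_fasta, task="blastn"):
    """Return an argv list for aligning one read against one reference.

    task="megablast" corresponds to "Highly similar sequences";
    task="blastn" corresponds to "Somewhat similar sequences".
    """
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    return [
        "blastn",
        "-task", task,
        "-query", query_fasta,      # sequencing result (hypothetical file)
        "-subject", subject_fasta,  # known reference (hypothetical file)
        "-outfmt", "6",             # tabular output, one line per alignment
    ]


cmd = build_blastn_command("read.fasta", "reference.fasta")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once BLAST+ and the input files are in place. Switching the task argument to "megablast" reproduces the faster default.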
Optimizing blastx Searches
Once we have used blastn to determine the reliable portion of a sequencing result and noted any potential mismatches or gaps in the nucleotide sequence, we typically run a Translated BLAST (blastx) search to check for expected ORFs, mutations or truncations. A primary advantage of blastx is that you do not have to decide on a reading frame for your sequencing result – blastx checks all six possible frames against the database. Another benefit is that a frame shift mutation present in the ORF is readily apparent when viewing blastx results.
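To illustrate what "all six possible frames" means, here is a minimal Python sketch of six-frame translation using the standard genetic code. It is not how blastx is implemented. The function names are illustrative, and the example read is an invented toy sequence.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; ambiguous codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    frames = {}
    reverse = reverse_complement(seq)
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translations("ATGGCCAAGTAA").items()):
    print(f"{frame:+d}: {protein}")
```

In this toy read, only frame +1 begins with a methionine and ends at a stop codon (MAK*). A single inserted or deleted base would move the downstream residues into a different frame, which is why frame shifts stand out in blastx output.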
When using blastx at Addgene, we use the default “Non-redundant protein sequences (nr)” database as it contains the largest number of protein sequences. Just below the BLAST button, you may have noticed the “Algorithm parameters” link. Click on this link to view the advanced BLAST options, where we suggest one customization for blastx. Similar to nucleotide sequences, proteins often have repeated or highly homologous regions, which by default are filtered out of a standard blastx search. An alignment that omits these regions can be confusing, such as when you attempt to verify the starting methionine of a gene but the blastx results start the alignment at a more distal amino acid. We consistently run our blastx searches with the “Low complexity regions” filter unchecked so that these regions are included in the search to maximize the alignment length. While this recommendation is not infallible, we have found that turning off this default setting saves analysis time.
Timesaving Tip #2: blastx searches are inherently slower than blastn or blastp searches because the nucleotide sequence must be translated in all six possible reading frames and each translation searched. If you know the expected protein sequence, use the “Align two or more sequences” option to drastically reduce the waiting time for search results.
Optimizing blastp Searches
Depending on the sequencing result, we often choose between a Standard Protein BLAST (blastp) search and a blastx search to verify the expected protein sequence in a plasmid. If you know which reading frame to choose for your sequencing result and can easily translate it, we recommend using blastp over blastx. The primary advantage is time savings, but an added benefit is that blastp searches do not filter low complexity regions by default, meaning that you do not have to remember to adjust any blastp algorithm parameters. We use the default scoring matrix, BLOSUM62, but you may want to check the descriptions of the other matrices to see if another would be more advantageous for your search.
Timesaving Tip #3: Note that the available protein databases are unlikely to contain an exact entry for your favorite gene fused to an epitope tag or fusion protein. If your sequencing primer was chosen to confirm that a tag or fusion protein is in-frame, we recommend using blastx with the “Align two or more sequences” option and pasting your expected protein sequence into the Subject Sequence box.
Depending on your sequencing result and desired analysis, BLAST may not always be your optimal choice. For difficult sequence alignments that BLAST is unable to handle, we often turn to Clustal for pairwise or multiple sequence alignments of nucleotide or protein sequences. We also use COBALT for aligning multiple protein sequences, particularly for comparing different isoforms. Beyond our favorites, many other sequence alignment tools are available.
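One way to prepare that expected protein sequence is to join the tag and the protein into a single FASTA record before pasting it in. The Python sketch below does this, assuming an N-terminal FLAG tag (DYKDDDDK). The function name, the record name, and the short protein fragment are hypothetical placeholders for your own sequences.

```python
def fusion_fasta(name, parts, width=60):
    """Join protein segments in order and format them as one FASTA record."""
    sequence = "".join(part.strip().upper() for part in parts)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


FLAG_TAG = "DYKDDDDK"
# Hypothetical fragment standing in for the start of your protein
protein_start = "MAKQ"

print(fusion_fasta("FLAG-myprotein", [FLAG_TAG, protein_start]), end="")
```

Because the tag and protein sit in one continuous subject sequence, an in-frame fusion should appear as a single alignment that spans the junction.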
Try these resources for lists of alternatives to BLAST:
- ExPASy - http://www.expasy.org/genomics/sequence_alignment
- EMBL-EBI - http://www.ebi.ac.uk/services
Do you have any tips for using BLAST to confirm your plasmid sequencing results or comments on our suggestions? Share your thoughts here to help other labs speed up their plasmid and cloning verification steps and free up more time for using your plasmids instead!
- All BLAST images are modified screen shots from the NCBI BLAST website