Help


Introduction


BLAST stands for Basic Local Alignment Search Tool. The BLAST service of CNGB is built on the standalone version of NCBI BLAST+ 2.6.0, downloaded from the NCBI FTP server. It provides sequence searching over public data from CNGB applications, BGI projects and external data sources.

In the name "the BLAST service of CNGB", the word BLAST stands for sequence searching in general. More types of sequence searching will be integrated in the future.

Documentation


Query input

The query sequence(s) to be used for a BLAST search should be pasted into the "Query text" text area, or saved as a file and provided through the "Query file" field. If both "Query text" and "Query file" are filled, the contents of the two fields are simply combined to form the query input.

BLAST accepts a number of different types of input and automatically determines the format of the input. Accepted input types are FASTA and bare sequence.

  1. FASTA

    A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

    >P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP

    Blank lines are not allowed in the middle of FASTA input.

    Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:

    A  adenosine          C  cytidine             G  guanine
    T  thymidine          N  A/G/C/T (any)        U  uridine 
    K  G/T (keto)         S  G/C (strong)         Y  T/C (pyrimidine) 
    M  A/C (amino)        W  A/T (weak)           R  G/A (purine)        
    B  G/T/C              D  G/A/T                H  A/C/T      
    V  G/C/A              -  gap of indeterminate length

    For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

    A  alanine               P  proline       
    B  aspartate/asparagine  Q  glutamine      
    C  cysteine              R  arginine      
    D  aspartate             S  serine      
    E  glutamate             T  threonine      
    F  phenylalanine         U  selenocysteine
    G  glycine               V  valine        
    H  histidine             W  tryptophan        
    I  isoleucine            Y  tyrosine
    K  lysine                Z  glutamate/glutamine
    L  leucine               X  any
    M  methionine            *  translation stop
    N  asparagine            -  gap of indeterminate length

  2. Bare sequence

    This may be just lines of sequence data, without the FASTA definition line, e.g.:

    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP

    It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:

      1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
     61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp

    Blank lines are not allowed in the middle of bare sequence input.
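The input rules above can be checked on the client side before a query is submitted. The following is a minimal sketch in Python, not the parser used by the service: the function names `normalize_query` and `invalid_letters` are hypothetical, and stripping digits and spaces from every sequence line is an assumption made for illustration. The two alphabets are copied from the code tables above.

```python
import re

# Letter codes taken from the nucleic acid and amino acid tables above.
NUCLEOTIDE_CODES = set("ACGTUNKSYMWRBDHV-")
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def normalize_query(text):
    """Split pasted input into (defline, sequence) records.

    FASTA records keep their defline; bare sequence gets an empty defline.
    Digits and whitespace inside sequence lines are dropped, and letters
    are mapped to upper case.
    """
    records = []
    defline, chunks = "", []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            # A new defline closes the record collected so far, if any.
            if chunks or defline:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:].strip(), []
        else:
            chunks.append(re.sub(r"[\s\d]", "", line).upper())
    records.append((defline, "".join(chunks)))
    return records


def invalid_letters(sequence, alphabet):
    """Return the sorted letters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)
```

For example, `normalize_query("  1 qikdllvsss tdldttlvlv")` yields one record with an empty defline and the upper-case sequence, and `invalid_letters("ACGTJ", NUCLEOTIDE_CODES)` reports the unsupported letter `J`.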

Database

The BLAST service of CNGB provides sequence data from numerous CNGB projects for searching. You can search and filter the BLAST databases through the "Search/Filter Databases" panel to find the database you need. Up to 30 search/filter results are listed in the "Result" panel. Click an item in the "Result" panel to add it to the "Database" field, and click a label in the "Database" field to remove that item.

The "Quick search" in the "Search/Filter Databases" panel behaves as follows:

  • When "Project", "Level" and "Keywords" are all empty, no results are returned.
  • When only "Project" is filled, random results from all levels of the selected project are returned.
  • When only "Level" is filled, random results from the current level across all projects are returned.
  • When only "Keywords" is filled, results of searching any level across all projects are returned.
  • When only "Project" and "Level" are filled, random results from the current level of the selected project are returned.
  • When only "Project" and "Keywords" are filled, results of searching all levels of the selected project are returned.
  • When only "Level" and "Keywords" are filled, results of searching the current level across all projects are returned.
  • When all three fields are filled, results of searching the current level of the selected project are returned.

Note: each "Project" has its own set of "Level" values. When a "Project" is selected, the "Level" list shows only the items available for that project.
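The "Quick search" rules reduce to two independent choices: whether keywords are given decides between a keyword search and random results, and the "Project" and "Level" fields only narrow the scope. The sketch below restates that decision table in Python. The function `quick_search_mode` and its return strings are hypothetical and are not part of the service.

```python
def quick_search_mode(project="", level="", keywords=""):
    """Describe what the Quick search returns for a combination of fields."""
    if not (project or level or keywords):
        return "no results"
    # Keywords switch from random sampling to an actual search.
    kind = "keyword search" if keywords else "random results"
    # Empty scope fields mean "do not restrict".
    level_scope = f'level "{level}"' if level else "all levels"
    project_scope = f'project "{project}"' if project else "all projects"
    return f"{kind} in {level_scope} of {project_scope}"
```

For instance, `quick_search_mode(project="PIRD")` describes random results from all levels of the PIRD project.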

For some projects, such as PIRD, the "Quick search" may not be suitable for finding a BLAST database that meets special conditions, such as a particular age range. For these projects, an "Advanced filter" sub-panel that filters on a limited set of fields is under development.

Query subrange

A segment of the query sequence can be used for BLAST searching. Enter the range in the "Query subrange" box to specify the position of this segment. For example, to limit matches to the region from 24 (the start) to 200 (the stop) of a query sequence, enter "24-200" in the field. If one of the limits you enter is out of range, the intersection of the [start,stop] and [1,length] intervals is searched, where length is the length of the whole query sequence.
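The interval intersection described above can be sketched as follows. The function names are hypothetical, and returning `None` when the two intervals do not overlap is an assumption, since the behavior of the service in that case is not stated here.

```python
def parse_subrange(text):
    """Parse a "start-stop" string such as "24-200" into two integers."""
    start, stop = text.split("-")
    return int(start), int(stop)


def effective_subrange(start, stop, length):
    """Intersect [start, stop] with [1, length] (1-based, inclusive).

    Returns the (start, stop) pair that is searched, or None when the
    intervals do not overlap (an assumption made for this sketch).
    """
    low = max(start, 1)
    high = min(stop, length)
    if low > high:
        return None
    return low, high
```

With a query of 150 residues, `effective_subrange(*parse_subrange("24-200"), 150)` gives `(24, 150)`: the stop is clipped to the end of the query.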

Genetic code

The genetic code used to translate the query in blastx and tblastx searches. Please refer to this NCBI page.
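To illustrate why the choice of genetic code matters, the sketch below translates a nucleotide sequence with two of the NCBI translation tables: table 1 (standard) and table 2 (vertebrate mitochondrial). It is a simplified illustration only. It translates a single forward reading frame, whereas blastx and tblastx consider all six frames, and the function name `translate` is hypothetical.

```python
from itertools import product

# Codons in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acids for each codon, in the same TCAG order.
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def translate(dna, code=1):
    """Translate the first forward frame of `dna` using the given table."""
    table = dict(zip(CODONS, GENETIC_CODES[code]))
    dna = dna.upper().replace("U", "T")
    # Codons containing ambiguity codes are translated as X (any).
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))
```

The codon TGA is a stop under table 1 but tryptophan under table 2, so `translate("ATGTGA", code=1)` gives `M*` while `translate("ATGTGA", code=2)` gives `MW`.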

Word size

BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied.
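The seeding step for nucleotide searches can be sketched as an exact word lookup. This is a toy illustration of the idea, not the BLAST implementation: the function name `exact_word_hits` is hypothetical, and the extension phase that follows a hit is omitted.

```python
from collections import defaultdict


def exact_word_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a whole word matches exactly."""
    # Index every word of the query by its 0-based start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the subject and look each word up in the query index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits
```

For the pair `"ACGTACGTTT"` and `"GGACGTACGG"`, a word size of 6 finds two seeds, while a word size of 8 finds none. A larger word size therefore makes the search faster but less sensitive.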

Expect value

This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value of 10 means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the expect value ascribed to a match is greater than this threshold, the match will not be reported. Lower thresholds are more stringent, leading to fewer chance matches being reported.
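As a rough illustration, the expect value of a match with bit score S' in a search space of size m × n is E = m · n · 2^(-S'). The sketch below applies this formula and then filters matches by the threshold. It is a simplification: BLAST uses length-corrected (effective) values of m and n, and the function names and the dictionary layout of a match are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' and search space m * n."""
    return query_length * database_length * 2.0 ** (-bit_score)


def reportable(matches, threshold=10.0):
    """Keep the matches whose E-value does not exceed the threshold."""
    return [m for m in matches if m["evalue"] <= threshold]


# Hypothetical matches against a database of one million residues.
matches = [
    {"id": "hit_a", "evalue": expect_value(60.0, 300, 1_000_000)},
    {"id": "hit_b", "evalue": expect_value(20.0, 300, 1_000_000)},
]
kept = [m["id"] for m in reportable(matches, threshold=10.0)]
```

Here `hit_a` is kept and `hit_b` is dropped, because a bit score of 20 in this search space is expected to occur by chance far more than 10 times.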

Matrix

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with.

Match/Mismatch scores

Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved (States et al., 1991).
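The effect of the reward/penalty ratio can be seen by scoring an ungapped alignment directly. The sketch below is an illustration with hypothetical function names. The break-even identity is plain arithmetic on the two scores: it is the fraction of matching positions at which the total score is zero, so alignments below it score negative.

```python
def ungapped_score(a, b, reward=1, penalty=-3):
    """Score two aligned sequences of equal length, position by position."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


def break_even_identity(reward, penalty):
    """Fraction of identical positions at which the total score is zero."""
    return -penalty / (reward - penalty)
```

With one mismatch in eight positions, `ungapped_score("ACGTACGT", "ACGTACGA")` is 4 under 1/-3 but 6 under 1/-1. The break-even identity drops from 0.75 for 1/-3 to 0.5 for 1/-1, which is why the more lenient ratio tolerates more divergent sequences.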

Composition based stats

Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.

Gap cost

The pull-down menu shows the gap costs available for the chosen matrix. Only a limited number of combinations is offered for these parameters. Increasing the gap costs produces alignments with fewer gaps.
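BLAST gap costs are affine: a gap of length k costs the existence cost plus k times the extension cost. The sketch below shows the arithmetic. The function name is hypothetical, and the defaults of 11 and 1 are the pair commonly used with BLOSUM62, which may differ from the options offered in the menu for other matrices.

```python
def gap_cost(length, existence=11, extension=1):
    """Affine cost of a single gap: existence + extension * length."""
    return existence + extension * length


# One gap of length 3 is cheaper than three separate gaps of length 1.
one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
```

Here `one_long_gap` is 14 and `three_short_gaps` is 36. Raising either cost makes every gap more expensive relative to the substitution scores, so fewer gaps are introduced into the alignments.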

DUST/SEG

This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

Mask lower case letters

With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
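To preview which regions of a query would be filtered by this option, the lower-case stretches can be located before submission. The sketch below is a client-side illustration with hypothetical function names. Replacing the masked letters with N (or X for proteins) only visualizes the masked region and is not a description of how BLAST handles it internally.

```python
import re


def lower_case_intervals(sequence):
    """Return 1-based inclusive (start, stop) pairs of lower-case stretches."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", sequence)]


def mask_lower_case(sequence, mask_char="N"):
    """Replace every lower-case letter with `mask_char` for display."""
    return "".join(mask_char if c.islower() else c for c in sequence)
```

For the query `"ACGTacgtACGT"`, `lower_case_intervals` returns `[(5, 8)]` and `mask_lower_case` returns `"ACGTNNNNACGT"`.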

Database List


The BLAST service of CNGB will continue to integrate more high-quality public data for sequence searching. Currently, the projects are:

  1. NCBI BLAST database

    The NCBI BLAST data of June has been added to the service, and the data will be updated at the end of every season.

  2. FishT1K

    All available sequence data (scaffold and contig) is provided for searching. When the FishT1K database is updated, the sequence resources available to BLAST will be updated too.

  3. OneKP

    All available sequence data (scaffold) is provided for searching. When the OneKP database is updated, the sequence resources available to BLAST will be updated too.

  4. PIRD

    All available CDR3 sequence data (samples and knowledge repository) is provided for searching. When the PIRD database is updated, the sequence resources available to BLAST will be updated too.