Frequently Asked Questions: Blat

Blat vs. Blast

			Question: "What are the differences between Blat and Blast?"
 
Response: Blat is an alignment tool like BLAST, but it is structured differently. On
DNA, Blat works by keeping an index of an entire genome in memory.
Thus, the target database of Blat is not a set of GenBank sequences, but instead an
index derived from the assembly of the entire genome. The
index -- which uses less than a gigabyte of RAM -- consists of all non-overlapping 11-mers except for those heavily
involved in repeats. This smaller size means that Blat is far more easily
mirrored. DNA Blat
is designed to quickly find sequences of 95% and greater similarity of length
40 bases or more. It may miss more divergent or shorter sequence alignments.
 
			On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences 
			of 80% and greater similarity to the query of length 20+ amino acids. The protein index requires slightly more than 2 gigabytes of RAM. 
			In practice -- due to sequence divergence rates over evolutionary time -- DNA
			Blat works well within humans and primates, while protein Blat 
			continues to find good matches within terrestrial vertebrates and even earlier 
			organisms for conserved proteins. Within humans, protein Blat gives a much better 
			picture of gene families (paralogs) than DNA Blat. However, BLAST and 
			psi-BLAST at NCBI can find much more remote matches.
			 
From a practical standpoint, Blat has several advantages over BLAST:

- speed (no queues, response in seconds) at the price of lesser homology depth
- the ability to submit a long list of simultaneous queries in fasta format
- five convenient output sort options
- a direct link into the UCSC browser
- alignment block details in natural genomic order
- an option to launch the alignment later as part of a custom track

Blat is commonly used to look up the location of a
sequence in the genome or determine the exon structure of an mRNA, but expert
users can run large batch jobs and make internal parameter sensitivity
changes by installing command line Blat on their own Linux server.

Blat use restrictions

			Question: "I received a high-volume traffic warning from your Blat
			server informing me that I had exceeded the server use
			limitations. Can you give me information on the UCSC
			Blat server use parameters?"
 
Response: Due to the high demand on our Blat servers, we restrict
service for users who programmatically query Blat or do
large batch queries. Program-driven use of Blat is
limited to a maximum of one hit every 15
seconds and no more than 5,000 hits per day. Please limit
batch queries to 25 sequences or fewer.
 
			For users with high-volume Blat demands, we recommend
			downloading Blat for local use. For more information, 
			see Downloading Blat source and 
			documentation.
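
The program-driven limits stated above (one hit every 15 seconds, no more than 5,000 hits per day) could be respected by a client script along the following lines. This is only a sketch: the `blatQuota` struct and the `secondsToWait` and `recordHit` functions are hypothetical names invented for illustration, not part of Blat or the Genome Browser source, and the caller is assumed to reset `hitsToday` at the start of each day.

```c
#include <assert.h>
#include <stddef.h>

/* Limits quoted in the answer above. */
#define MIN_SECONDS_BETWEEN_HITS 15
#define MAX_HITS_PER_DAY 5000

/* Hypothetical bookkeeping for a client script; not part of Blat. */
struct blatQuota
    {
    long lastHitTime;   /* Time of the previous request in seconds, -1 if none yet. */
    int hitsToday;      /* Requests already sent during the current day. */
    };

long secondsToWait(const struct blatQuota *quota, long now)
/* Return how many seconds to wait before the next request is allowed,
 * or -1 if the daily limit has already been reached. */
{
assert(quota != NULL);
if (quota->hitsToday >= MAX_HITS_PER_DAY)
    return -1;
if (quota->lastHitTime < 0)
    return 0;
long elapsed = now - quota->lastHitTime;
if (elapsed >= MIN_SECONDS_BETWEEN_HITS)
    return 0;
return MIN_SECONDS_BETWEEN_HITS - elapsed;
}

void recordHit(struct blatQuota *quota, long now)
/* Update the bookkeeping after a request has been sent. */
{
assert(quota != NULL);
quota->lastHitTime = now;
quota->hitsToday += 1;
}
```

A script would call `secondsToWait` before each query, sleep for the returned number of seconds (or stop for the day on -1), send the query, and then call `recordHit`.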

Downloading Blat source and documentation

			Question: "Is the Blat source available for download? Is there
			documentation available?"
 
			Response: Blat source and executables are freely available for
			academic, nonprofit and personal use. Commercial licensing
			information is available on the 
			Kent Informatics website.
 
			Blat source may be downloaded from 
			http://www.soe.ucsc.edu/~kent 
			(look for the blatSrc* zip file with the most recent 
			date). For 
			Blat executables, go to 
			http://hgdownload.cse.ucsc.edu/admin/exe/ and choose your machine type.
			 
			Documentation on Blat program specifications is available
			here.

Replicating web-based Blat parameters in command-line version

			Question: "I'm setting up my own Blat server and would like to use 
			the same parameter values that the UCSC web-based Blat 
			server uses."
 
			Response: Use the following settings to replicate
			the search results of the UCSC Blat server. Note that
			you may still observe some slight differences between
			command line results and web-based results, depending
			on the search being performed.
 
			faToTwoBit: 
			 
			gfServer (this is how the UCSC web-based blat servers are configured):
			 
For enabling DNA/DNA and DNA/RNA
matches, only the host, port and twoBit files are needed.
The same port is used for both untranslated blat (gfClient)
and PCR (webPcr). You'll need a separate blat server on a separate
port to enable translated blat (protein searches or translated searches in protein-space).

blat server (capable of PCR):
    gfServer start blatMachine portX -stepSize=5 -log=untrans.log database.2bit

translated blat server:
    gfServer start blatMachine portY -trans -mask -log=trans.log database.2bit
 
			gfClient: 
			 
			Set -minScore=0 and 
			-minIdentity=0. This will result in some 
			low-scoring, generally spurious hits, but for 
			interactive use it's sufficiently easy to ignore them 
			(because results are sorted by score) and sometimes 
			the low-scoring hits come in handy. 
			 
			standalone blat: 
			 
blat search:
    blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl
 
 Notes on repMatch:
 The default setting for gfServer dna matches is: repMatch = 1024 * (tileSize/stepSize).
 The default setting for blat dna matches is: repMatch = 1024 (if tileSize=11).
 To get command-line results that are equivalent to web-based results, repMatch must
			    be specified when using blat.
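
As a worked check of the gfServer formula above, the sketch below evaluates 1024 * (tileSize/stepSize) for given tile and step sizes. The function name `gfServerStyleRepMatch` is hypothetical, and rounding to the nearest integer is an assumption chosen because it reproduces the -repMatch=2253 value used in the standalone blat command for tileSize=11 and stepSize=5; the actual gfServer source may round differently.

```c
#include <assert.h>

int gfServerStyleRepMatch(int tileSize, int stepSize)
/* Evaluate 1024 * (tileSize/stepSize), rounded to the nearest integer.
 * Integer arithmetic is used so that no math library is required.
 * The rounding rule is an assumption, not taken from the gfServer source. */
{
assert(tileSize > 0 && stepSize > 0);
return (1024 * tileSize + stepSize / 2) / stepSize;
}
```

With tileSize=11 and stepSize=5 this gives 2253, matching the command above; with stepSize equal to tileSize it gives 1024, the stated blat default.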
 
			For more information on the parameters available for
			blat, gfServer, and gfClient, see the 
			blat
			specifications.

Using the -ooc flag

			Question: "What does the -ooc flag do?"
 
Response: Using any -ooc option in blat, such
as -ooc=11.ooc, serves to speed up
searches, much as repeat-masking a sequence does. The
11.ooc file contains sequences
determined to be over-represented in the genome
sequence. To speed up searches, these sequences are not
used when seeding an alignment against the genome. For
reasonably-sized sequences, this will not create a
problem and will significantly reduce processing time.
 
			By not using the 11.ooc file, you will increase 
			alignment time, but will also slightly increase 
			sensitivity. This may be important if you are aligning 
			shorter sequences or sequences of poor quality. For example,
			if a particular sequence consists primarily of 
			sequences in the 11.ooc file, it will 
			never be seeded correctly for an alignment if the 
			-ooc flag is used.  
			 
In summary,
if you are not finding certain sequences and can afford
the extra processing time, you may want to run blat
without the 11.ooc file.

Replicating web-based Blat percent identity and score calculations

			Question: "Using my own command-line Blat server, how can I 
			replicate the percent identity and score calculations
			produced by web-based Blat?"
 
			Response: There isn't an option to command-line Blat that gives
			you the percent ID and the score. Instead, you will
			have to write your own program to produce the 
			calculations, incorporating some of the functions from 
			the Genome Browser source code.
 
			To calculate the percent ID, incorporate the following
			code and function into a program that processes your 
			Blat PSL output. The parameter isMrna should
			be set to TRUE, regardless of whether the input
			sequence is mRNA or protein. 
			 
			The percent identity score is calculated like this:
			 
			 
			
    			100.0 - pslCalcMilliBad(psl, TRUE) * 0.1
			
			 
			Here is the source for pslCalcMilliBad:
			 
			 
			
int pslCalcMilliBad(struct psl *psl, boolean isMrna)
/* Calculate badness in parts per thousand. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;
int qAliSize, tAliSize, aliSize;
int milliBad = 0;
int sizeDif;
int insertFactor;
int total;
qAliSize = sizeMul * (psl->qEnd - psl->qStart);
tAliSize = psl->tEnd - psl->tStart;
aliSize = min(qAliSize, tAliSize);
if (aliSize <= 0)
    return 0;
sizeDif = qAliSize - tAliSize;
if (sizeDif < 0)
    {
    if (isMrna)
        sizeDif = 0;
    else
        sizeDif = -sizeDif;
    }
insertFactor = psl->qNumInsert;
if (!isMrna)
    insertFactor += psl->tNumInsert;
total = (sizeMul * (psl->match + psl->repMatch + psl->misMatch));
if (total != 0)
    milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor + 
	round(3*log(1+sizeDif)))) / total;
return milliBad;
}
 	
The complexity in milliBad arises primarily from how it
handles inserts. Ignoring the inserts, the calculation
is simply mismatches expressed as parts per thousand.
However, the algorithm also factors in insertion penalties,
which are relatively weak compared to, say, those of BLAST,
but still present. When huge inserts are allowed (which
is necessary to accommodate introns), it is typically
necessary to resort to logarithms, as this calculation
does.
			 
			The pslIsProtein function called by 
			pslCalcMilliBad is:
			 
			
boolean pslIsProtein(const struct psl *psl)
/* is psl a protein psl (are its blockSizes and scores in protein space) */
{
int lastBlock = psl->blockCount - 1;
return  (((psl->strand[1] == '+' ) &&
     (psl->tEnd == psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])) ||
    ((psl->strand[1] == '-') &&
     (psl->tStart == (psl->tSize-(psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])))));
}
 
			This function automatically determines whether or not 
			the PSL output file contains alignment information for 
			a protein query. Alternatively, you could write the 
			program such that the user specifies if the query is a 
			protein or not.
			 
			The score calculation is generated by the following
			function:
 
int pslScore(const struct psl *psl)
/* Return score for psl. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;
return sizeMul * (psl->match + ( psl->repMatch>>1)) -
         sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert;
}
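
To make the arithmetic concrete, here is a small worked sketch for a nucleotide query, where sizeMul is 1. It is not the Genome Browser code: `pslCounts`, `dnaScore` and `percentIdentity` are hypothetical names, the struct holds only the counts that the score needs, and `percentIdentity` expects a milliBad value that has already been computed as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the PSL counts that the score needs. */
struct pslCounts
    {
    unsigned match;       /* Matching bases outside repeats. */
    unsigned misMatch;    /* Mismatching bases. */
    unsigned repMatch;    /* Matching bases inside repeats. */
    unsigned qNumInsert;  /* Number of inserts in the query. */
    unsigned tNumInsert;  /* Number of inserts in the target. */
    };

int dnaScore(const struct pslCounts *c)
/* Same arithmetic as pslScore with sizeMul fixed at 1 (nucleotide query). */
{
assert(c != NULL);
return (int)(c->match + (c->repMatch >> 1)) - (int)c->misMatch
       - (int)c->qNumInsert - (int)c->tNumInsert;
}

double percentIdentity(int milliBad)
/* Convert badness in parts per thousand into a percent identity. */
{
return 100.0 - milliBad * 0.1;
}
```

For example, an alignment with match=950, misMatch=20, repMatch=30, qNumInsert=2 and tNumInsert=3 scores 950 + 15 - 20 - 2 - 3 = 940. If the aligned query span is no longer than the aligned target span and isMrna is TRUE, the size-difference term is zero, so milliBad is 1000 * (20 + 2) / 1000 = 22 and the percent identity is 97.8.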
 
			For help with creating a C program to perform these
			calculations, you may want to use the libraries from 
			the Genome Browser source code. See our 
			FAQ on source code 
			licensing and downloads for information on obtaining
			the source. The file kent/src/lib/psl.c 
			contains the pslCalcMilliBad, 
			pslIsProtein and pslScore 
			functions and also a useful function called 
			pslLoadAll that loads the psl file into a 
			linked list structure. The definition of the psl struct
			can be found in kent/src/inc/psl.h. 

Replicating web-based Blat "I'm feeling lucky" search results

			Question: "How do I generate the same search results as web-based
			Blat's "I'm feeling lucky" option using 
			command-line blat?"
 
			Response: The code for the "I'm feeling lucky" Blat
			search orders the results based on the sort output 
			option that you selected on the query page. It then 
			returns the highest-scoring alignment of the first 
			query sequence.
 
			If you are sorting results by "query, start" 
			or "chrom, start", generating the "I'm
			feeling lucky" result is straightforward:
			sort the output file by these columns, then select the 
			top result. 
			 
			To replicate any of the sort options involving score, 
			you first must calculate the score for each result in 
			your PSL output file, then sort the results by score or 
			other combination (e.g. "query, 
			score" and "chrom, score").
			See the section on Replicating 
			web-based Blat percent identity and score 
			calculations for information on calculating the
			score.
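
Once every PSL row has a score, the selection step described above (the highest-scoring alignment of the first query sequence) might look like the sketch below. The `scoredHit` struct and the `luckyHit` function are hypothetical names, not part of the Genome Browser source, and the sketch assumes that "first query" means the query named in the first record of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one PSL row reduced to its query name and a score
 * that has already been calculated. */
struct scoredHit
    {
    const char *qName;
    int score;
    };

const struct scoredHit *luckyHit(const struct scoredHit *hits, size_t count)
/* Return the highest-scoring hit among those belonging to the first query
 * in the array, or NULL if the array is empty. Ties keep the earliest hit. */
{
if (hits == NULL || count == 0)
    return NULL;
assert(hits[0].qName != NULL);
const struct scoredHit *best = &hits[0];
for (size_t i = 1; i < count; ++i)
    {
    assert(hits[i].qName != NULL);
    if (strcmp(hits[i].qName, hits[0].qName) != 0)
        continue;   /* Belongs to a different query sequence. */
    if (hits[i].score > best->score)
        best = &hits[i];
    }
return best;
}
```

A single linear pass is enough here because only the top result is needed; a full sort would only be required to reproduce the complete ordered listing.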
			 
			Alternatively, you can try filtering your Blat PSL 
			output using either the pslReps or 
			pslCDnaFilter program available in the Genome
			Browser source code. For information on obtaining the
			source code, see our FAQ 
			on source code licensing and downloads. 

Using Blat for short sequences with maximum sensitivity

			Question: "How do I configure blat for short sequences with 
		   	 maximum sensitivity?"
 
			Response: Here are some guidelines for configuring standalone
			blat and gfServer/gfClient for these conditions:
 

- The formula to find the shortest query size that will guarantee a match (if matching tiles are not marked as overused) is: 2 * stepSize + tileSize - 1
  For example, with stepSize set to 5 and tileSize set to 11, matches of query size 2*5+11-1 = 20 bp will be found, if the query matches the target exactly. The stepSize parameter can range from 1 to tileSize. The tileSize parameter can range from 6 to 15.
- Use -fine.
- Use a large value for repMatch (e.g. -repMatch=1000000) to reduce the chance of a tile being marked as over-used.
- Do not use an .ooc file.
- Do not use -fastMap.
- Do not use masking command-line options.
			 
			The above changes will make BLAT more sensitive, but
			will also reduce speed and increase memory usage.
			It may be necessary to process one chromosome
			at a time to reduce the memory requirements. 
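
The minimum query size formula from the guidelines above can be evaluated for any combination of settings with a helper like the one below. The function name `minGuaranteedQuerySize` is hypothetical; the range checks simply encode the parameter limits stated above (tileSize from 6 to 15, stepSize from 1 to tileSize).

```c
#include <assert.h>

int minGuaranteedQuerySize(int tileSize, int stepSize)
/* Shortest query size that guarantees a match for an exactly matching
 * query, per the formula 2 * stepSize + tileSize - 1. Assumes matching
 * tiles are not marked as overused. */
{
assert(tileSize >= 6 && tileSize <= 15);
assert(stepSize >= 1 && stepSize <= tileSize);
return 2 * stepSize + tileSize - 1;
}
```

For stepSize=5 and tileSize=11 this returns 20, as in the example above; lowering stepSize to 1 with the same tileSize brings the value down to 12, at the cost of a larger index.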
			 
			A note on filtering output: increasing the
			-minScore parameter value beyond one-half of
			the query size has no further effect.  Therefore, use
			either the pslReps or pslCDnaFilter
			program available in the Genome Browser source code to
			filter for the size, score, coverage, or quality
			desired.  For information on obtaining the
			source code, see our FAQ 
			on source code licensing and downloads.