| | | 
                | Frequently Asked Questions: Data and Downloads |  
 | 
 | 
 
 
    
        | 
	    | 
	        | 
		    | Downloading sequence and annotation data |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How do I obtain the sequence and/or annotation data for a release?"
 
			Response: Sequence and annotation data downloads are usually made available within 
			the first week of the release of a new assembly. The download directories
			are automatically updated nightly to incorporate additions and modifications
			to the data.
 
			We recommend that you download data via our FTP site at 
			ftp://hgdownload.cse.ucsc.edu/, 
			particularly if you plan to download multiple files or files of large size. 
			To do so:
 
    ftp hgdownload.cse.ucsc.edu 
    user name: anonymous
    password: your email address
    go to the goldenPath directory, pick an assembly directory, then a data directory
 
			To download multiple files from the UNIX ftp command 
			line, use the "mget" command. You may want to use the
			"prompt" command to toggle the interactive 
			mode if you do not want to be prompted for each file 
			that you download.
 
    mget [filename1] [filename2] ...
    - or -
    mget -a (to download all the files in the directory) 
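
The same anonymous session can also be scripted. The following is a minimal sketch using Python's standard ftplib module. The helper names (golden_path_dir, download_files) and the file name in the example call are illustrative assumptions, not part of any UCSC tool.

```python
from ftplib import FTP
import os

HOST = "hgdownload.cse.ucsc.edu"  # FTP host named in the text above


def golden_path_dir(assembly, data_dir):
    """Build the remote path /goldenPath/<assembly>/<data_dir>."""
    return "/goldenPath/{}/{}".format(assembly, data_dir)


def download_files(assembly, data_dir, filenames, email, dest="."):
    """Log in anonymously and fetch each named file in binary mode."""
    with FTP(HOST) as ftp:
        # Anonymous login: the password is your email address.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(golden_path_dir(assembly, data_dir))
        for name in filenames:
            with open(os.path.join(dest, name), "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)


# Example call (not executed here; the file name is hypothetical):
# download_files("hg17", "chromosomes", ["chr22.fa.gz"], "you@example.org")
```

Passing an explicit list of file names plays the role of "mget [filename1] [filename2] ..." without any interactive prompting.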
 
			You can also download data from our 
			Downloads
			page or our DAS 
		 	server. To download a specific subset of the data or to configure the output
			format of the data, use the Table 
			Browser. For information on extracting a large set 
			of sequences from an assembly, see 	
			Extracting sequence in batch from
		    	an assembly. 
			 
			For more information on using the UCSC DAS server, see 
			Downloading data from the UCSC DAS server.
		     |  |  
 |  |  
 
    
        | 
	    | 
	        | 
		    | Extracting sequence in batch from an assembly |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I have a lot of coordinates for an assembly and want to
			extract the corresponding sequences. What is the best 
			way to proceed?"
 
			Response:
 
			There are two ways to extract genomic sequence in batch 
			from an assembly: 
			A. Download the appropriate fasta files from our 
			ftp server
			and extract sequence data using your own tools or the
		 	tools from our source tree. This is the recommended
		 	method when you have very large sequence datasets or 
			will be extracting data frequently.
			Sequence data for most assemblies is located in the 
			assembly's "chromosomes" subdirectory on the
			downloads server. For example, the sequence for human
			assembly hg17 can be found in 
			ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/.
			You'll find instructions for obtaining our source 
			programs and utilities 
			here. Some programs 
		   	that you may find useful are nibFrag and twoBitToFa,
			as well as other fa* programs. To obtain
			usage information about most of these programs, run the 
			program without arguments. 
			B. Use the Table browser to extract sequence. This is a 
			convenient way to obtain small amounts of sequence. 
			 
			Create a 
			custom 
			track of the genomic coordinates in 
			BED format and upload 
			into the Genome Browser. 
			Select the custom track in the Table browser, then 
			select the "sequence" output format to 
			retrieve data. We recommend that you save the file 
			locally as gzip. 
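
If you choose method A and want to use your own tools, the extraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical, and the whole sequence is held in memory, so for full chromosomes the utilities mentioned above (such as twoBitToFa) are likely a better fit. BED coordinates use a 0-based start and an exclusive end, so each record maps directly onto a Python slice.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence name: sequence string}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_bed(seqs, bed_lines):
    """Yield (chrom, start, end, sequence) for each BED record."""
    for line in bed_lines:
        # Skip blank lines and custom-track header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield chrom, start, end, seqs[chrom][start:end]


# Toy data standing in for a downloaded chromosome file and a BED file.
fasta = [">chrTest", "NNNNACGT", "ACGTTTGA"]
bed = ["chrTest 4 8", "chrTest 8 12"]
results = list(extract_bed(read_fasta(fasta), bed))
```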
			 |  |  
 |  |  
 
    
        | 
	    | 
	        | 
		    | Downloading data from the UCSC DAS server |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How do I download data using the UCSC DAS server?"
 
			Response: The UCSC DAS server provides access to genome annotation data for all current assemblies
			featured in the Genome Browser. To view a list of the assemblies available from the
			DAS server and their base URLs, see 
			http://genome.ucsc.edu/cgi-bin/das/dsn.
 
			To construct a DAS query, combine an assembly's base URL with the 
			sequence entry point and type specifiers available for that assembly. The entry point 
			specifies chromosome position, and the type indicates the annotation table 
			requested. You can view the lists of entry points and types available for an assembly 
			with requests of the form:
 
	http://genome.ucsc.edu/cgi-bin/das/[db_name]/entry_points
	http://genome.ucsc.edu/cgi-bin/das/[db_name]/types
where [db_name] is the UCSC name for the assembly, e.g. hg16, mm4. 
			For example, here is a query that returns all the records in the refGene table for the
			chromosome position chr1:1-100000 on the hg16 assembly:
 
	http://genome.ucsc.edu/cgi-bin/das/hg16/features?segment=1:1,100000;type=refGene
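
The URL patterns above can be assembled programmatically. The sketch below only builds the query strings and does not contact the server; the function names are illustrative assumptions.

```python
DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"


def das_list_url(db_name, what):
    """URL listing an assembly's 'entry_points' or 'types'."""
    if what not in ("entry_points", "types"):
        raise ValueError("what must be 'entry_points' or 'types'")
    return "{}/{}/{}".format(DAS_BASE, db_name, what)


def das_features_url(db_name, chrom, start, end, table):
    """URL for a features query on one segment and one annotation type.

    The segment is written without a 'chr' prefix, following the
    example query shown above.
    """
    entry = chrom[3:] if chrom.startswith("chr") else chrom
    return "{}/{}/features?segment={}:{},{};type={}".format(
        DAS_BASE, db_name, entry, start, end, table)


# Reproduces the refGene example for chr1:1-100000 on hg16.
url = das_features_url("hg16", "chr1", 1, 100000, "refGene")
```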
For more information on DAS, see the 
			Biodas website and the
			DAS specification. |  |  
 |  |  
 
    
        | 
	    | 
	        | 
		    | Downloading the UCSC Genome Browser source |   |  |  
	        |  | 
|---|
 |  | 
			 
			Question: "Where can I download the Genome Browser source code and 
			executables?"
 
			Response: The Genome Browser source code and executables are freely
			available for academic, nonprofit, and personal 
			use (see Licensing the Genome Browser
			or Blat for commercial licensing requirements). 
			The latest version of the source code may be downloaded 
			here.
 
			See Downloading Blat source
			and documentation for information on Blat downloads.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Download restrictions |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Do you have restrictions on the amount of downloads one can do?"
 
			Response: Generally, we'd prefer that you not hit our interactive site with programs, 
			unless they are themselves front ends for interactive sites. We can handle 
			the traffic from all the clicks that biologists are likely to generate, 
			but not from programs. Program-driven use is limited to a maximum of one 
			hit every 15 seconds and no more than 5,000 hits per day.
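
A program that must query the site can enforce these limits on itself. The class below is a minimal sketch, not an official client: the class name is hypothetical, and it assumes that "per day" may be treated as a 24-hour window starting at the first hit. The clock and sleep functions are parameters so the logic can be tested without waiting.

```python
import time


class PoliteClient:
    """Enforce a minimum delay between hits and a daily cap."""

    def __init__(self, min_interval=15.0, max_per_day=5000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval   # seconds between hits
        self.max_per_day = max_per_day     # hits per 24-hour window
        self.clock = clock
        self.sleep = sleep
        self.last_hit = None
        self.day_start = None
        self.hits_today = 0

    def wait_for_turn(self):
        """Block until the next hit is allowed, then record it."""
        now = self.clock()
        # Assumption: a "day" is 24 hours counted from the first hit.
        if self.day_start is None or now - self.day_start >= 86400:
            self.day_start, self.hits_today = now, 0
        if self.hits_today >= self.max_per_day:
            raise RuntimeError("daily hit limit reached")
        if self.last_hit is not None:
            remaining = self.min_interval - (now - self.last_hit)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_hit = now
        self.hits_today += 1
```

Call wait_for_turn() immediately before each request; it sleeps only for the part of the 15-second interval that has not already elapsed.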
 
			If you need to run batch Blat jobs, see
			Downloading Blat source
			and documentation for a copy of Blat you can run
			locally.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Opening .fa files |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I am trying to look at the final decoding of the human genome. How can I
			open the *.fa files?"
 
			Response: Microsoft Word or any program that can handle large text files will do. 
			Some of the chromosomes begin with long blocks of N's. You may want 
			to search for an A to get past them.
 
			Unless you have a particular need 
			to view or use the raw data files, you might find it more interesting to 
			look at the data using the Genome Browser. Type the name of a gene in which
			you're interested into the position box (or use the default position), 
			then click the submit button. In the resulting Genome Browser  
			display, click the DNA link on the menu bar at the top of the page. 
			Select the Extended case/color options button at the bottom of the 
			next page. Now you can color the DNA sequence to display which portions are 
			repeats, known genes, genetic markers, etc.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Data differences between downloaded data and browser display |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I downloaded the genome annotations from your MySQL database tables, but the 
			mRNA locations didn't match what was showing in the Genome Browser. Shouldn't they 
			be in synch?"
 
			Response: Yes. The Genome Browser and Table Browser are both driven by the same 
			underlying MySQL database. Check that your downloaded tables are from
			the same assembly version as the one you are viewing in the Genome Browser. If the 
			assembly
			dates don't match, the coordinates of the data within the tables may differ. 
			In a very rare instance, you could also be affected 
			by the brief lag time between the update of the live databases underlying the Genome
			Browser and the time it takes for text dumps of these databases to become available 
			in the downloads directory.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Strange characters in FASTA file |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I noticed several characters other than A, C, G, 
			T, and N in my fasta file, for example y, k, 
			s, etc. Is the file corrupted or are these characters valid?"
 
			Response: The characters most commonly seen in sequence are A, C, G, 
			T, and N, but there are 
			several other valid characters that are used in clones to indicate
			ambiguity about the identity of certain bases in the sequence. It's not uncommon to 
			see these "wobble" codes at polymorphic positions in DNA sequences. The
			following chart (IUPAC-IUB Symbols for Nucleotide Nomenclature: Cornish-Bowden
			(1985). Nucl. Acids Res. 13:3021-3030) lists nucleotide symbols, including those
			used for ambiguity:
 
          		--------------------------------------
          		Symbol    Meaning      Nucleic Acid
          		--------------------------------------
           		A            A           Adenine
           		C            C           Cytosine
           		G            G           Guanine
           		T            T           Thymine
           		U            U           Uracil
           		M          A or C
           		R          A or G        Purine
           		W          A or T
           		S          C or G
           		Y          C or T        Pyrimidine
           		K          G or T
           		V        A or C or G
           		H        A or C or T
           		D        A or G or T
           		B        C or G or T
           		X      G or A or T or C
           		N      G or A or T or C
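
To check a file for characters that fall outside this chart, the table can be encoded as a lookup. The following is a small sketch with hypothetical function names; it treats upper- and lower-case symbols alike.

```python
# Symbol -> bases it can stand for, transcribed from the chart above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def unexpected_symbols(sequence):
    """Return the set of characters that are not IUPAC nucleotide codes."""
    return {ch for ch in sequence if ch.upper() not in IUPAC}


def ambiguity_positions(sequence):
    """List (index, symbol, possible bases) for each ambiguity code."""
    return [(i, ch, IUPAC[ch.upper()])
            for i, ch in enumerate(sequence)
            if ch.upper() in IUPAC and len(IUPAC[ch.upper()]) > 1]


seq = "ACGTyACkNs"  # toy sequence containing the codes from the question
```

An empty result from unexpected_symbols suggests the file contains only valid codes rather than corruption.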
			 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Selection of GenBank ESTs |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I am interested in ESTs. How do you select which ones from GenBank to display in the 
			Genome Browser?"
 
			Response: All ESTs in GenBank on the date of the track data freeze for the given organism are 
			used - none are discarded. When two ESTs have identical sequences, both are 
			retained because this can be significant corroboration of a splice site.
 
			ESTs are aligned against the genome using the Blat program. When a single EST aligns 
			in multiple places, the alignment having the highest base identity is found. Only 
			alignments that have a base identity level within a selected percentage of 
			the best are kept. Alignments 
			must also have a minimum base identity to be kept. For more information on the 
			selection criteria specific to each organism, consult the description page accompanying
			the EST track for that organism. 
			The maximum intron length allowed by Blat is 500,000 bases, which may 
			eliminate some ESTs with very long introns that might otherwise align. If an EST 
			aligns non-contiguously (i.e. an intron has been spliced out), it is also a candidate 
			for the Spliced EST track, provided it meets various quality controls for intron and 
			exon length and match quality.  Start and stop coordinates of each alignment block are 
			available from the appropriate table within the 
			Table Browser. 
			Note that only 250 EST tracks can be viewed at a time within the browser. If more 
			than 250 tracks exist for the selected region, the display defaults to a denser display
			mode to 
			prevent the user's web browser from being overloaded. You can restore the EST track 
			display to a fuller display mode by zooming in on the chromosomal range or by using 
			the EST track filter to restrict the number of tracks displayed. 
			For tracks such as Non[Organism] ESTs and Non[Organism] mRNAs, some selection is done
			on the full set at GenBank. If a sequence is too divergent from the organism's genome 
			to generate a significant Blat hit, it is not included in the track.
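
The two identity filters described above (a minimum base identity, plus a window relative to the best alignment) can be sketched as follows. This is an illustration only: the function name is hypothetical, the window is assumed to be measured in percentage points below the best alignment, and the threshold values in the example are placeholders, since the real values are organism-specific and are listed on each EST track's description page.

```python
def keep_alignments(identities, near_best, min_identity):
    """Filter the percent identities of one EST's alignments.

    identities   -- percent base identity of each alignment of the EST
    near_best    -- keep alignments within this many percentage points
                    of the best one (assumed interpretation)
    min_identity -- absolute minimum percent identity
    """
    if not identities:
        return []
    best = max(identities)
    return [pid for pid in identities
            if pid >= min_identity and pid >= best - near_best]


# Placeholder thresholds, chosen only to demonstrate the filter.
kept = keep_alignments([99.2, 98.9, 97.0, 90.0],
                       near_best=0.5, min_identity=93.0)
```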
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | EST strand direction |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Could you help me with my interpretation of EST data? If the EST is taken
			from the minus (-) strand, does this always mean that the transcript is generated 
			on the minus strand? Are two corresponding ESTs that are assigned 
			- and + always complementary?
 
			I want to confirm the strand assignment for two 
			human ESTs: 
			 
			BQ016549 (chr22:22,310,674-22,332,143 on hg18): + strand in text and - strand in graphical 
			display 
			AA928010 (chr22:20,345,264-20,354,528 on hg18): - strand in text and + strand in graphical 
			display. 
			 
			The graphical display goes with the orientation of the gene in that location."
			 
			Response: From the examples above, it can be seen that the strand to which an EST aligns is not 
                        necessarily reflected in the direction of transcription shown by the arrows in the 
                        display. When UCSC downloads mRNAs and ESTs from GenBank and aligns them to a genome 
			assembly using Blat, each EST aligns to the + or - strand (forward or reverse direction)
			of the genome, which we record as + or - in the strand field of the corresponding database 
			table, e.g. all_ests or chrN_est. The strand information (+/-) therefore 
                        indicates the direction of the match between the EST and the matching genomic 
                        sequence. It bears no relationship to the direction of transcription of the RNA with 
                        which it might be associated. Determining the direction of transcription for ESTs is 
                        not an easy task so we do some calculations to make the best guess for the 
                        transcription direction.
 
			ESTs are sequenced from either the 5' or the 3' end. When sequenced from the 5' end, the resulting
			sequence is the same as that of the mRNA which it represents. With a 3' end read, the resulting 
		 	sequence matches the opposite strand of the cDNA clone. Therefore, it is the reverse complement of 
			the actual mRNA sequence.  A problem occurs if the EST contributor reverse-complements
		        the 3'-read sequence before depositing it into GenBank, with the idea that people will want 
			the mRNA (transcription-direction) sequence. It is not always possible to determine if this has 
			been done. Therefore, we do some calculations to try to determine the correct direction of 
			transcription for the EST sequence. 
 			 
			If an EST alignment produces canonical introns (with gt-ag splice-site pairs), this is used 
			to determine the transcription direction. For example, when an EST is aligned to the genome, a 
			canonical intron would look like this:
			 
			NNNNexonNNNNgtnnnnintronnnnnnnnagNNNNexon
			 
			Here, the two nucleotides on either end of the intron show the canonical gt-ag splice site pairs. 
			To find transcription direction, we use a method that relies on finding gt-ag canonical pairs in one 
			direction more often than in the opposite direction. The calculation is:
			 
			gt/ag introns minus ct/ac introns = intronOrientation 
			 	
			The sign of this calculated intronOrientation field (stored in the estOrientInfo table) shows the 
			orientation of the transcript relative to the EST. Therefore, if intronOrientation is positive, 
			then the EST appears in the display with the arrows pointing in the same direction as the EST 
			alignment. If intronOrientation is negative, then the arrows point in the opposite direction. If 
			no introns exist or all of the introns are non-canonical, then intronOrientation is set to zero.  
			In both BQ016549 and AA928010 (in the example above), the intronOrientation is negative; therefore, 
			the arrows on the Genome Browser display point in the opposite direction to that indicated by the 
			alignment on the EST details page. Note: A low intronOrientation number can cause an incorrect
			assignment of transcription direction when calculated in this way.
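
The calculation described above can be sketched in Python. The function names are hypothetical, the introns are assumed to be given as genomic sequence in the direction of the EST alignment, and returning None for a zero score is an assumption, since the text does not say how the display treats that case.

```python
def intron_orientation(intron_seqs):
    """Count gt/ag introns minus ct/ac introns.

    Each item is the genomic sequence of one intron as it appears in
    the EST alignment; its first two and last two bases are the
    splice-site pair. Non-canonical introns contribute nothing.
    """
    score = 0
    for intron in intron_seqs:
        ends = (intron[:2].lower(), intron[-2:].lower())
        if ends == ("gt", "ag"):
            score += 1
        elif ends == ("ct", "ac"):  # gt/ag seen from the opposite strand
            score -= 1
    return score


def arrow_direction(alignment_strand, orientation):
    """Strand shown by the display arrows, or None when undetermined."""
    if orientation == 0:
        return None  # assumption: no direction can be inferred
    if orientation > 0:
        return alignment_strand
    return "-" if alignment_strand == "+" else "+"
```

For an EST such as BQ016549, which aligns to the + strand and has a negative intronOrientation, arrow_direction returns "-", matching the behavior described above.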
			 
			The alignment details pages and the Table Browser do not take the intron orientation
			into account. They show only the alignment of the 
			GenBank sequence (as given) to the genome. If the alignment is used to 
			retrieve DNA sequence from the genome, the DNA sequence will look 
			similar to the GenBank sequence (not its complement).
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Missing RefSeq ID |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Why isn't my refseq ID in your database?"
 
			Response: It may have been added after we last downloaded data from GenBank, or it may have 
			been replaced or removed. You can check the submission date and status of an accession
			on the 
			NCBI 
			Entrez Nucleotide site.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Finished vs. draft segments |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Do chrN.fa tables contain both finished and draft segments? If so, 
			how do you determine which segments are finished?"
 
			Response: Yes, these tables contain both finished and draft segments. Use the 
			corresponding chrN_gold table to look them up. The quality of the draft 
			varies. In 
			general, the larger the contig it is in, the better the quality. The 
			quality of the last 500 bases on either end of a contig tends to be 
			lower than the rest of the contig.
 
			How do you determine the accuracy? The 
			base-calling program Phred analyzes 
			the traces from the sequencing machines 
			and assigns a quality score to these. These quality scores are used by the 
			Phrap assembly program, which gives 
			quality scores for the bases on the assembly as well.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | chrN_random tables |   |  |  
	        |  | 
|---|
 |  | 
			Question: "What are the chrN_random_[table] files in the human assembly? Why are they 
			called random? Is there something biologically random about the sequence in 
			these tables or are they just not placed within their given chromosomes?"
 
			Response: In the past, these tables contained data related to sequence that is 
			known to be in a particular chromosome, but could not be reliably ordered 
			within the current sequence.
 
			Starting with the April 2003 human assembly, these tables also include data for 
			sequence that is not in a finished state, but whose location in the chromosome is 
			known, in addition to the unordered sequence.  
			Because this sequence is not quite finished, it could not be included in the 
			main "finished" ordered and oriented section of the chromosome.  
			 
			Also, in 
			a very few cases in the April 2003 assembly, the random files contain data related to sequence for alternative 
			haplotypes.  
			This is present primarily in chr6, where we have included two alternative
			versions of the MHC region in chr6_random. There are a few clones in 
			other chromosomes that also correspond to a different haplotype.  Because the 
			primary reference sequence can only display a single haplotype, these 
			alternatives were included in random files.  In subsequent assemblies, 
			these regions have been moved into separate files (e.g. chr6_hla_hap1).
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Chromosome Un |   |  |  
	        |  | 
|---|
 |  | 
			Question: "What is ChrUn?"
 
			Response: ChrUn contains clone contigs that can't be confidently placed on a 
			specific chromosome. For the chrN_random and chrUn_random files, we 
			essentially just concatenate together all the contigs into short 
			pseudo-chromosomes. The coordinates of these are fairly arbitrary, 
			although the relative positions of the coordinates 
			are good within a contig. You can find more information about the data organization 
			and format on the Data
			Organization and Format page.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Chromosome M |   |  |  
	        |  | 
|---|
 |  | 
			Question: "What is chromosome M (chrM)?"
 
			Response: Mitochondrial DNA.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | N characters at beginning of human chr22 |   |  |  
	        |  | 
|---|
 |  | 
			Question: "When I download human chr22 from your web site, the unzipped file contains only 
			N's."
 
			Response: There is a large block of N's at the beginning and end of chr22. Search 
			for an A to bypass the initial group of N's.
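
If you would rather locate the end of the leading block programmatically, a short script can report where the first called base occurs. This is a minimal sketch with a hypothetical function name; it looks for the first base that is not N rather than specifically for an A.

```python
def first_called_base(fasta_lines):
    """Return the 0-based offset of the first non-N base, or None."""
    offset = 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            continue  # skip the FASTA header line
        for i, ch in enumerate(line):
            if ch not in "Nn":
                return offset + i
        offset += len(line)
    return None


# Toy input standing in for the unzipped chromosome file.
example = [">chr22", "NNNN", "NNAC", "GT"]
position = first_called_base(example)
```

Because the function reads line by line, it can be passed an open file object without loading the whole chromosome into memory.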
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Erroneous duplicated chrY_random region on Mouse Build 34 (mm6) |   |  |  
	        |  | 
|---|
 |  | 
			Question: "On the mm6 assembly, I've found duplicate contigs 
			that are placed on both chrY and chrY_random. Is this 
			intentional?"
 
			Response: On the mm6 assembly, chrY_random erroneously contains 
			a region duplicated from chrY. Because NCBI 
			discovered this assembly problem after the UCSC 
			Genome Browser was processed, we were not able to 
			remove it from mm6 prior to the browser's release. 
			The duplicated section occupies chrY:1-696,521 and 
			chrY_random:29,615,053-30,311,573 (the end of the
			chromosome) and includes the following repeated 
			fragments:
 
			AC139318.5 
			AC134433.3
			AC145392.2
			AC148319.2
			AC145571.3 
			AC145393.4 
 
			The fragments
			are assembled into the contig NT_111995 for 
			chrY_random and also appear (under different names) 
			as regions on contigs MmY_110865_34, MmY_78990_34 
			and NT_078925.
			 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Problems with Mouse Build 32 (mm4) |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I have heard that the Build 32 mouse assembly isn't 
			as good as the Build 30 assembly. Can you clarify?"
 
			Response: Unfortunately, there appear to be some problems with 
			the Build 32 assembly. Ensembl has conducted an analysis 
			of the assembly and has attributed the
			problems to incorrect mapping information that led to 
			the generation of artificial duplications and some 
			incorrect flips in orientation. You can read more 
			information about the problems Ensembl identified and 
			review a list of the chromosomes and genes most likely 
			to be affected by these issues on the  Ensembl
			Mus musculus web page.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Mapping chimp chromosome numbers to human chromosome numbers |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How do the chimp and human chromosome numbering
			schemes compare?"
 
			Response: The following table shows the mapping of chromosomes in 
			the chimp draft assemblies to human chromosomes.
		 	Starting with the panTro2 assembly, the numbering scheme
			has been changed to reflect a new standard that 
			preserves orthology with human chromosomes. Initially 
			proposed by E.H. McConkey in 2004, the new numbering 
			convention was subsequently endorsed by the 
			International Chimpanzee Sequencing and Analysis 
			Consortium. This standard assigns the identifiers 
			"2a" and "2b" to the two chimp chromosomes that fused in
			the human genome to form chromosome 2 and renumbers the
			other chromosomes to more closely match their human 
			counterparts. As a result, chromosomes 2 and
			23 (present in the panTro1 assembly) do not exist in
			later versions.
 
 
            		            
			    
			    | Human Chr | Chimp Chr (panTro1) | Chimp Chr (panTro2) |
			    |---|---|---|
			    | 1 | 1 | 1 |
			    | 2 (part) | 12 | 2a |
			    | 2 (part) | 13 | 2b |
			    | 3 | 2 | 3 |
			    | 4 | 3 | 4 |
			    | 5 | 4 | 5 |
			    | 6 | 5 | 6 |
			    | 7 | 6 | 7 |
			    | 8 | 7 | 8 |
			    | 9 | 11 | 9 |
			    | 10 | 8 | 10 |
			    | 11 | 9 | 11 |
			    | 12 | 10 | 12 |
			    | 13 | 14 | 13 |
			    | 14 | 15 | 14 |
			    | 15 | 16 | 15 |
			    | 16 | 18 | 16 |
			    | 17 | 19 | 17 |
			    | 18 | 17 | 18 |
			    | 19 | 20 | 19 |
			    | 20 | 21 | 20 |
			    | 21 | 22 | 21 |
			    | 22 | 23 | 22 |
			    | X | X | X |
			    | Y | Y | Y |
			     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Converting genome coordinates between assemblies |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I've been researching a specific area of the human genome on the current assembly, 
			and now you've just released a new version. Is there an easy way to locate 
			my area of interest on the new assembly?"
 
			Response: You can migrate data from one assembly to another by using the 
			blat alignment tool 
			or by converting assembly coordinates.  There are two conversion 
			tools available on the Genome Browser web site: the
			Convert utility and the LiftOver tool.
			The Convert utility, 
			which is accessed from the menu on the Genome Browser 
			annotation tracks page, supports forward, reverse, and 
			cross-species conversions, but does not accept batch 
			input. 
			The LiftOver tool,
			accessed via the Utilities link on the Genome Browser 
			home page, also supports forward, reverse, and 
			cross-species conversions, as well as batch conversions.
 
			If you wish to update a large number of coordinates	
			to a different assembly and have access to a Linux 
			platform, you may find it useful to try the command-line
			version of the LiftOver tool. The executable file for 
			this utility can be downloaded 
			here. 
			LiftOver requires a UCSC-generated over.chain 
			file as input. Pre-generated files are available for 
			selected assemblies from the 
			Downloads page.
			If the desired file is not available, send a request to
			the 
			genome mailing list 
			and we may be able to  provide you with one.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Linking gene name with accession number |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I have the accession number for a gene and would like to
			link it to the gene name. Is there a table that shows both
			pieces of information?"
 
			Response: If you are looking at the RefSeq Genes, the 
			refFlat 
			table contains both the gene name (usually a 
			HUGO Gene Nomenclature Committee ID) and its accession 
		    	number. For the Known Genes, 
			use the kgAlias table.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Obtaining a list of Known Genes |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How can I obtain a complete list of all the genes in
			the UCSC Known Genes table for a particular organism?"
 
			Response: To obtain a complete copy of the entire Known Genes 
			data set for an organism, open the Genome Browser
			Downloads page, 
			jump to the section specific to the organism, click the
			Annotation database link in that section, then click the
			link for the knownGene.txt.gz table.
 
			Data for a specific region or chromosome may be
			obtained from the Table Browser by selecting the 
			"Genes and Gene Prediction Tracks" group, the
			"Known Genes" track and the 
			"knownGene" table. Set the position to the 
			region of interest, then click the "get 
			output" button.
			 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Repeat-masking data |   |  |  
	        |  | 
|---|
 |  | 
			Question: "What version of RepeatMasker do you use on your data?
			Which flags do you use?"
 
			Response: UCSC uses the latest versions of RepeatMasker and
			repeat libraries available on the date when the 
			assembly data is processed. RepeatMasker version 
			information can usually be found in the README text for
			the assembly's bigZips 
			downloads directory.
 Masking is done using the RepeatMasker -s 
			flag. For mouse repeats, we also use -m. 
			In addition to RepeatMasker, we use the Tandem Repeat 
			Finder (trf) program, masking out repeats of period 12 
			or less. The repeats are just "soft" masked. 
			Alignments are allowed to extend through repeats, but 
			not initiate in them.
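
Soft-masked sequence can be inspected or converted with a few lines of code. The sketch below assumes, as in UCSC's sequence downloads, that soft-masked bases are written in lower case while unmasked bases are upper case; the function names are hypothetical.

```python
import re


def masked_intervals(sequence):
    """Return 0-based, end-exclusive (start, end) runs of lower-case bases."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]


def hard_mask(sequence):
    """Replace every soft-masked (lower-case) base with N."""
    return re.sub(r"[a-z]", "N", sequence)


example = "ACGTacgtACgtA"  # toy sequence with two masked runs
```

Converting to a hard mask discards the repeat sequence itself, so keep the soft-masked file if you may need the underlying bases later.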
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Availability of repeat-masked data |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Are the repeat annotation files available for every chromosome?"
 
			Response: Yes, you can obtain the repeat-masked files via the Table Browser or from the 
			organism's annotation database downloads directory. The RepeatMasker annotation 
			tables are named 
			chrN_rmsk (where N represents the chromosome number) and the 
			Tandem Repeat Finder (TRF) tables are named simpleRepeat.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | RepeatMasker version differences - UCSC vs. RepeatMasker website |   |  |  
	        |  | 
|---|
 |  | 
			Question: "When I run RepeatMasker independently from the 
			RepeatMasker web server, my results vary from those of 
			UCSC.  What's the cause?"
 
			Response: UCSC occasionally uses updated versions of the 
			RepeatMasker software and repeat libraries that are not 
			yet available on the RepeatMasker website (see 
			Repeat-masking data for more
			information).
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Obtaining promoter sequence |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How can I fetch promoter sequence upstream of a gene?"
 
			Response: The UCSC Genome Browser offers several ways to obtain this information, 
			depending on your requirements.
 
			The Genome Browser downloads site provides prepackaged downloads of 1000 bp, 2000 bp, 
			and 5000 bp upstream sequence for RefSeq genes that have annotated
			5' UTRs. You can obtain these from the bigZips downloads
			directory for the assembly of interest. 
			 
			To fetch the upstream sequence for a specific gene, use the 
			Table Browser.
			Enter the genome, assembly, and select the knownGene table. Paste the gene name 
			or accession number in the identifier field. Choose sequence for the output format 
			type, then click the get output button. On the next page, select genomic.  On the 
			final page, you will have the opportunity to configure the amount of upstream 
			promoter sequence to fetch, along with several other options. Click Get Sequence 
			when you've finished configuring the output. 
			 
			You can also use the Genome Browser to obtain sequence for a specific gene. 
			Open the Genome Browser window to display the gene in which you're
			interested. Click the entry for the gene in the RefSeq or Known Genes track, then
			click the Genomic Sequence link. Alternatively, you can click the DNA link in 
			the top menu bar of the Genome Browser tracks window to access options for displaying 
			the sequence. 
			 
			The Stanford Human Promoters track on the 
			UCSC 
			Custom Annotation Tracks page shows promoters for some of the human assemblies.
		     |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Data from Evolutionary Conservation Score tracks |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Where can I download the conservation score data from the Human/Mouse 
			Evolutionary Conservation Score track?"
 
			Response: The conservation score data are stored in a group of tables in the annotation 
			database downloads directory. 
			The naming conventions of the tables vary among releases. In earlier
			assemblies, table names are of the form chrN_humMusL, chrN_zoom1_humMusL, and
			chrN_zoom2500_humMusL. In later releases, the tables are named using 
			specific release numbers, such as chrN_hg16Mm3. The tables within a given set
			differ by the number of bases/score interval and are used to generate the browser
			displays at different zooming levels.
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Minus strand coordinates - axtNet |   |  |  
	        |  | 
|---|
 |  | 
			Question: "I downloaded the axtNet alignments between the latest human and mouse assemblies. 
			I found that some of the alignments listed in the axtNet
			files do not agree with what is shown in the browser."
 
			Response: Is this alignment on the minus strand? Minus strand coordinates in axt files 
			are handled differently from how they are handled in the Genome Browser. To convert 
			axt minus strand coordinates to Genome Browser coordinates, use:
 
      	start = chromSize + 1 - axtEnd
      	end = chromSize + 1 - axtStart
 
			See an explanation of coordinate transforms in the genomeWiki. |  |  
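
The conversion formulas above can be written as a small function. This is a sketch with a hypothetical function name; chrom_size is the length of the chromosome on which the minus-strand alignment lies.

```python
def axt_minus_to_browser(chrom_size, axt_start, axt_end):
    """Convert minus-strand axt coordinates to Genome Browser coordinates."""
    start = chrom_size + 1 - axt_end
    end = chrom_size + 1 - axt_start
    return start, end


# Toy example: a 1,000-base chromosome and an alignment at 1-100.
converted = axt_minus_to_browser(1000, 1, 100)
```

Applying the function to its own output returns the original coordinates, which is a quick way to check a conversion.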
 |  |  
 
        | 
	    | 
	        | 
		    | Mapping UCSC STS marker IDs to those of other groups |   |  |  
	        |  | 
|---|
 |  | 
			Question: "How do I map the STS genetic marker IDs in the genome browser to the 
			IDs assigned by other groups?"
 
			Response:We assign our own IDs to each of the STS markers, but we also track 
			the UniSTS IDs for each marker in the downloadable stsInfo2 table.  
			To determine the location of a specific marker, look up the marker's name
			in the stsAlias table to find the UCSC ID assigned to the 
			marker, then use this ID to look up the marker's position in the stsMap table. 
			For example, D10S249 has UCSC ID 2880 and is located at chr10:240791-241019.
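			The two-step lookup can be sketched with a join. The example below uses Python's 
			built-in sqlite3 module and toy in-memory tables rather than the real database. 
			The reduced column sets (alias, identNo, chrom, chromStart, chromEnd) and the 
			0-based chromStart value are assumptions made for illustration, so check the 
			actual table schemas before adapting the query.

```python
import sqlite3


def locate_marker(conn, marker_name):
    """Resolve a marker name to its UCSC ID via stsAlias, then to its position via stsMap."""
    row = conn.execute(
        "SELECT m.identNo, m.chrom, m.chromStart, m.chromEnd "
        "FROM stsAlias AS a JOIN stsMap AS m ON m.identNo = a.identNo "
        "WHERE a.alias = ?",
        (marker_name,),
    ).fetchone()
    if row is None:
        return None
    ident_no, chrom, chrom_start, chrom_end = row
    # Assumption: chromStart is stored 0-based, so add 1 for a browser-style position.
    return ident_no, f"{chrom}:{chrom_start + 1}-{chrom_end}"


# Toy tables holding only the marker used as the example in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stsAlias (alias TEXT, identNo INTEGER)")
conn.execute(
    "CREATE TABLE stsMap (chrom TEXT, chromStart INTEGER, chromEnd INTEGER, identNo INTEGER)"
)
conn.execute("INSERT INTO stsAlias VALUES ('D10S249', 2880)")
conn.execute("INSERT INTO stsMap VALUES ('chr10', 240790, 241019, 2880)")

print(locate_marker(conn, "D10S249"))  # (2880, 'chr10:240791-241019')
```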
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | deCODE map data |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Where can I get more information about the deCODE map?"
 
			Response:You can obtain this information by combining two tables. 
			The stsMap table contains the physical position of all STS markers, 
			including those on the deCODE map. This table also contains information about 
			each marker's position on the genome-wide maps, including the deCODE map. A second table, 
			stsInfo2, contains additional information about each marker, including aliases 
			and primer sequence information. The two tables are related by an 
			ID (the identNo field in both tables).
 |  |  
 |  |  
 
        | 
	    | 
	        | 
		    | Direct MySQL access to data |   |  |  
	        |  | 
|---|
 |  | 
			Question: "Is it possible to run SQL queries directly on the
			database rather than using the Table Browser interface?"
 
			Response:In response to requests from Genome Browser users, we have set up a MySQL
			database for public access at genome-mysql.cse.ucsc.edu.  This new server
			allows MySQL access to the same set of data currently available on our
			public Genome Browser site. The data are synchronized weekly with the main
			databases on http://genome.ucsc.edu.
			During the synchronization, the MySQL server may be briefly 
			out of sync with the main website. 
			The weekly synchronization takes place on Monday mornings
			from 4:00 am to 9:00 am Pacific Time.
 
			To connect to the database, you must use a computer on which the MySQL
			client libraries have been installed. We recommend you use the most current 
			version of v5.0 MySQL clients, which may be downloaded from
			http://dev.mysql.com/downloads/mysql/5.0.html. 
			Connect to the MySQL server
			using the command: 
			
 
    mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

			The -A flag is optional but is recommended for speed.
			Once connected to the database, you may use a wide range of MySQL commands
			to query the database. 
			As a courtesy to others, please observe the following
			guidelines when using the database:
			 
			
			Avoid excessive or heavy queries that may impact the server performance.
			Inappropriate query use will result in a restriction of access. If you plan
			to execute a query that you think may be excessive, contact UCSC first to
			avoid the possibility of having your access blocked.
			
			Bot access and excessive program-driven use are not permitted.
			
			Attachments by local mirror sites are prohibited.
			 
			The MySQL database can also be used by the numerous utilities
			in the kent source tree.  Add the following
			specifications to your $HOME/.hg.conf file (remember to chmod your .hg.conf file to 600 permissions):
			 
    db.host=genome-mysql.cse.ucsc.edu
    db.user=genomep
    db.password=password 
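			As a minimal sketch, the configuration file can also be written with owner-only 
			permissions from a script. The helper name write_hg_conf is hypothetical, and the 
			demonstration writes to a temporary directory rather than $HOME so that it does 
			not overwrite an existing .hg.conf. It assumes a POSIX system where file modes apply.

```python
import os
import stat
import tempfile

# The three settings listed in the FAQ entry.
HG_CONF_LINES = [
    "db.host=genome-mysql.cse.ucsc.edu",
    "db.user=genomep",
    "db.password=password",
]


def write_hg_conf(path):
    """Write the db.* settings to path and restrict the file to mode 600."""
    # Create the file with mode 600 from the start; this truncates any existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(HG_CONF_LINES) + "\n")
    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path


# Demonstration in a temporary directory (not the real $HOME).
demo_dir = tempfile.mkdtemp()
demo_path = write_hg_conf(os.path.join(demo_dir, ".hg.conf"))
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

			To write the real file, pass os.path.expanduser("~/.hg.conf") as the path, keeping 
			in mind that any existing file at that location is replaced.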
			If you prefer a more structured graphical interface to the UCSC database
			tables, use the 
			Table Browser.
			 
			System problems should be reported to 
			genome-www@soe.ucsc.edu.  
			Send questions regarding the database contents or queries to 
			genome@soe.ucsc.edu.
			Messages sent to this address will be posted to the 
			moderated genome mailing list, which is archived on a public 
			Web-accessible pipermail archive.  This archive may be 
			indexed by non-UCSC sites such as Google.
		     |  |  
 |  |  
 
        | 
            | 
                | 
                    | Name of fourth column in BED output |   |  |  
                |  | 
|---|
 |  | 
                        Question: "When using the Table Browser to extract exons from a Gene track, what does the 'Name' column (fourth BED column) refer to?"
 
                        Response:
 The fourth column of the BED output contains a lot of information separated by underscores. For example:
 
 
     uc009vjk.2_cds_1_0_chr1_324343_f

 This information is represented as follows:
 
     ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand
			
			UCSC ID: our identification for the transcripts in the UCSC Genes track.
			
			Sequence Type: exons, introns, cds, utr5, etc.
			
			Sequence Type Number: for every transcript, there is a row for 
			each item of the requested sequence type (for example, each cds or intron), and this 
			number identifies which item the row represents; the first is numbered 0. So, if you 
			requested exons and a particular transcript has 10 exons, you will see a row for 
			each one, numbered 0-9 in this position.
			
			Bases Added: number of bases added to the regions requested.
			
			Chromosome: the chromosome the item is on.
			
			Position of First Base of Item: if you have specified bases added to the 
			requested features (for example, Exons plus 10 bases on each end), then 
			columns 2 and 3 of the output are not the exact coordinates of the exon; 
			they start 10 bases before and end 10 bases after it. This part of 
			the name is therefore an easy way to see where the actual feature starts as 
			displayed in the browser. The coordinates in our tables almost always have 
			0-based starts (as they do in columns 2 and 3 of this output) but display 
			as 1-based in the browser (for more info see this FAQ). 
			The start position listed in this part of the fourth column, however, is 1-based: 
			it is the exact coordinate on which the feature starts as displayed in the browser.
			
			Strand: forward(f) or reverse(r) strand.
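			The field layout described above can be unpacked with a regular expression. 
			This is a sketch under stated assumptions, not an official parser: the pattern 
			assumes the chromosome name starts with "chr", that the strand code is "f" or "r", 
			and that the sequence type contains no underscore. The names NAME_PATTERN and 
			parse_bed_name are hypothetical.

```python
import re

# ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand
# Assumptions: chromosome names start with "chr"; strand is "f" or "r".
NAME_PATTERN = re.compile(
    r"^(?P<ucsc_id>.+?)_(?P<seq_type>[A-Za-z0-9]+)_(?P<seq_num>\d+)"
    r"_(?P<bases_added>\d+)_(?P<chrom>chr.+)_(?P<first_base>\d+)_(?P<strand>[fr])$"
)


def parse_bed_name(name):
    """Split the fourth BED column from the Table Browser into its named parts."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized name format: {name!r}")
    fields = match.groupdict()
    for key in ("seq_num", "bases_added", "first_base"):
        fields[key] = int(fields[key])
    # first_base is 1-based (as displayed in the browser); subtract 1 to get
    # the 0-based start convention used in BED columns 2 and 3.
    fields["feature_start_0based"] = fields["first_base"] - 1
    return fields


parsed = parse_bed_name("uc009vjk.2_cds_1_0_chr1_324343_f")
print(parsed["ucsc_id"], parsed["seq_type"], parsed["chrom"], parsed["first_base"])
# uc009vjk.2 cds chr1 324343
```

			Matching the identifier lazily and the chromosome greedily lets names that contain 
			underscores on either side (for example, an accession such as NM_000123 or a 
			chromosome such as chr1_random) still parse, although unusual names may remain ambiguous.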
			 |  |  
 |  |   |  |