Chapter 2. Files

Input File Formats

Haploview currently accepts input data in five formats: standard linkage format, completely or partially phased haplotypes, HapMap Project data dumps, HapMap PHASE format, and PLINK outputs. The program can also fetch phased HapMap data automatically from the HapMap website. In addition, it accepts a separate file with marker position information, as well as several auxiliary input files. The five formats are explained in depth below.

Linkage Format

Linkage data should be in the Linkage Pedigree (pre MAKEPED) format, with columns of family, individual, father, mother, gender, affected status and genotypes. The file should not have a header line (i.e. the first line should be for the first individual, not the names of the columns). Please note that Haploview can only interpret biallelic markers; markers with more than two alleles (e.g. microsatellites) will not work correctly. A sample line from such a file might look like:

3    12    8    9    1    2    1 2    3 3    0 0    4 2
a     b    c    d    e    f    -----------g------------

(a) pedigree name

A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.

(b) individual ID

An alphanumeric identifier for this individual. It should be unique within the individual's family (see above).

(c) father's ID

Identifier corresponding to father's individual ID or "0" if unknown father. Note that if a father ID is specified, the father must also appear in the file.

(d) mother's ID

Identifier corresponding to mother's individual ID or "0" if unknown mother. Note that if a mother ID is specified, the mother must also appear in the file.

(e) sex

Individual's gender (1=MALE, 2=FEMALE).

(f) affection status

Affection status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED).

(g) marker genotypes

Each marker is represented by two columns (one for each allele, separated by a space) and coded either ACGT or 1-4 where: 1=A, 2=C, 3=G, 4=T. A 0 in any of the marker genotype positions (as in the genotypes for the third marker above) indicates missing data.

It is also worth noting that this format can be used with non-family based data. Simply use a dummy value for the pedigree name (1, 2, 3...) and fill in zeroes for father and mother ID. It is important that the "dummy" value for the ped name be unique for each individual. Affection status can be used to designate cases vs. controls (2 and 1, respectively).

Files should also follow the following guidelines:

  • Families should be listed consecutively within the file (i.e. all the lines with the same pedigree ID should be adjacent)
  • If an individual has a nonzero parent ID, that parent should be included in the file on a separate line.
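
The column layout and the two guidelines above can be checked with a short script before loading a file. The following is a minimal sketch in Python, not part of Haploview; the function names parse_ped_line and check_pedigree are made up for illustration, and alleles are normalized to letters on the assumption that the 1-4 and ACGT codings are interchangeable as described above.

```python
# Illustrative helper, not Haploview code. Names are hypothetical.
ALLELE_CODES = {"0": None, "1": "A", "2": "C", "3": "G", "4": "T",
                "A": "A", "C": "C", "G": "G", "T": "T"}

def parse_ped_line(line):
    """Split one pedigree line into six fixed fields plus allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 fixed columns followed by allele pairs")
    alleles = []
    for code in fields[6:]:
        if code.upper() not in ALLELE_CODES:
            raise ValueError(f"unrecognized allele code {code!r}")
        alleles.append(ALLELE_CODES[code.upper()])  # None marks missing data
    keys = ("family", "individual", "father", "mother", "sex", "status")
    record = dict(zip(keys, fields[:6]))
    record["genotypes"] = list(zip(alleles[0::2], alleles[1::2]))
    return record

def check_pedigree(records):
    """Report families split across the file and parents with no line."""
    problems = []
    family_runs = []  # family IDs in order of first appearance of each run
    ids = {(r["family"], r["individual"]) for r in records}
    for r in records:
        if r["family"] in family_runs and family_runs[-1] != r["family"]:
            problems.append(f"family {r['family']} is not listed consecutively")
        if not family_runs or family_runs[-1] != r["family"]:
            family_runs.append(r["family"])
        for parent in (r["father"], r["mother"]):
            if parent != "0" and (r["family"], parent) not in ids:
                problems.append(f"parent {parent} of {r['individual']} is missing")
    return problems
```

Running parse_ped_line on the sample line above yields four genotype pairs, the third of which is missing. Passing that single record to check_pedigree reports both parents (8 and 9) as absent, since the sample shows only one line of the family.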

Phased Haplotypes

Haplotype data for Haploview's input must be formatted in columns of Family, Individual and Genotypes. There should be two lines (chromosomes) for each individual. This is the standard format of Genehunter's TDT output. See the sample below:

FAM1    FAM1M01    0    4    2    2
FAM1    FAM1M01    0    4    2    2
FAM1    FAM1F02    3    h    1    2
FAM1    FAM1F02    3    h    1    2

The data format uses the numerals 1-4 to represent genotypes, the number zero to represent missing data, and the letter "h" to represent a heterozygous allele. That is, if an individual is heterozygous at a locus, both alleles should be "h" if the phasing (which allele falls on which chromosome) is uncertain.
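
As a rough illustration of the two-lines-per-individual layout, the Python sketch below pairs consecutive lines and lists the markers whose phase is unknown. It is not Haploview code; the names read_haps and unphased_sites are hypothetical, and the sketch assumes whitespace-separated columns as in the sample.

```python
# Illustrative reader for the phased haplotype layout. Names are hypothetical.
def read_haps(lines):
    """Pair consecutive lines into (family, individual, chrom1, chrom2)."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("each individual needs exactly two lines")
    people = []
    for first, second in zip(rows[0::2], rows[1::2]):
        if first[:2] != second[:2]:
            raise ValueError(f"lines for {first[1]} and {second[1]} are not paired")
        people.append((first[0], first[1], first[2:], second[2:]))
    return people

def unphased_sites(chrom1, chrom2):
    """Zero-based marker indexes where both alleles are coded 'h'."""
    return [i for i, (a, b) in enumerate(zip(chrom1, chrom2))
            if a == "h" and b == "h"]
```

Applied to the sample above, read_haps returns two individuals, and unphased_sites flags the second marker of FAM1F02 as heterozygous with unknown phase.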

HapMap Project Data Dumps

Data from the HapMap Project can be dumped by region using the GBrowse interface. The saved data file is in a marker-per-line format which can be loaded in Haploview.

GBrowse dumps only one file, which has one marker per line and which includes familial relationships among the HapMap samples as well as marker position information. The file format has several header lines (beginning with "#") which Haploview parses. Open the file by selecting the "Browse HapMap Data" option and choosing the downloaded file.

If you wish to load data from another source in HapMap style format, you will need to specify pedigree information in the header of the file you've created. This can be done by creating lines of the following format at the top of your file:

#@ FAM01 NA0001 0 0 1 1

This data is the same as the pedfile format discussed above. That is, the fields are family, individual, father, mother, gender and affected status. You would then replace the NAXXXX identifiers in the header row of the HapMap file with your identifiers, subject to two important constraints: they must be unique across the entire dataset (not just within a family), and they must begin with the characters NA.
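
If you are generating such a file yourself, the pedigree header lines can be produced and checked against both constraints with a few lines of Python. This is only a sketch; the function name hapmap_pedigree_header is hypothetical and nothing here is provided by Haploview.

```python
# Illustrative generator for "#@" pedigree header lines. Name is hypothetical.
def hapmap_pedigree_header(samples):
    """samples: iterable of (family, individual, father, mother, sex, status)."""
    seen = set()
    lines = []
    for family, individual, father, mother, sex, status in samples:
        if not individual.startswith("NA"):
            raise ValueError(f"{individual} does not begin with NA")
        if individual in seen:
            raise ValueError(f"{individual} is not unique across the dataset")
        seen.add(individual)
        lines.append(f"#@ {family} {individual} {father} {mother} {sex} {status}")
    return lines
```

Calling hapmap_pedigree_header([("FAM01", "NA0001", "0", "0", 1, 1)]) reproduces the example line shown above.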

HapMap PHASE Format

Data in the HapMap PHASE format can be loaded into Haploview using three separate files. The first is the data file containing binary allele information. The second is a sample file containing a single column of the individual IDs used in the dataset. The third is a legend file containing four columns: marker, position, 0, and 1. Only the legend file requires a header; it is used to decode the information in the data file. These files can be loaded in as GZIP compressed files using the "Files are GZIP compressed" checkbox on the initial loading screen. For more information on the HapMap PHASE format, please see the HapMap PHASE readme.
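
To show how the legend decodes the binary data, here is a small Python sketch. It assumes (this is not stated above) that each row of the data file is one haplotype written as whitespace-separated 0/1 values, one per legend marker in legend order; consult the HapMap PHASE readme for the authoritative layout. The names read_legend and decode_haplotype are hypothetical.

```python
# Illustrative decoder. The row layout is an assumption, see the text above.
def read_legend(lines):
    """Return [(marker, position, allele_for_0, allele_for_1)], skipping the header."""
    legend = []
    for line in lines[1:]:  # the first line is the required header
        if line.strip():
            marker, position, allele0, allele1 = line.split()
            legend.append((marker, int(position), allele0, allele1))
    return legend

def decode_haplotype(row, legend):
    """Translate one row of 0/1 values into allele letters using the legend."""
    bits = row.split()
    if len(bits) != len(legend):
        raise ValueError("row length does not match the number of legend markers")
    alleles = []
    for bit, entry in zip(bits, legend):
        if bit not in ("0", "1"):
            raise ValueError(f"unexpected value {bit!r} in data row")
        alleles.append(entry[2 + int(bit)])  # column 2 holds '0', column 3 holds '1'
    return alleles
```

For example, with a made-up legend whose two markers carry alleles A/G and C/T, the data row "0 1" decodes to A at the first marker and T at the second.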

HapMap Download

Data in the HapMap PHASE format can also be automatically downloaded into Haploview using the "HapMap Download" tab in the load screen by specifying the HapMap Release, chromosome, analysis panel, and start and end positions (in kb). These options can also be automatically filled in by querying the GeneCruiser database with a gene or SNP ID. More information about the GeneCruiser database can be found at the GeneCruiser website.

Marker Information File

The marker info file has two columns: marker name and position. The positions can be either absolute chromosomal coordinates or relative positions. It might look something like this:

marker01 190299
marker02 190950
marker03 191287

An optional third column can be included in the info file to make additional notes for specific SNPs. SNPs with additional information are highlighted in green on the LD display. For instance, you could make note that the first SNP is a coding variant as follows:

marker01 190299 CODING_SNP
marker02 190950
marker03 191287
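
A reader for this layout, including the optional note column, might look like the following Python sketch. The function name read_info is hypothetical, and the sketch assumes positions are whole numbers, as in the samples above.

```python
# Illustrative reader for the marker info file. Name is hypothetical.
def read_info(lines):
    """Return [(name, position, note_or_None)] for each non-blank line."""
    markers = []
    for line in lines:
        parts = line.split(None, 2)  # at most three columns; the note may hold spaces
        if not parts:
            continue
        if len(parts) < 2:
            raise ValueError(f"missing position for marker {parts[0]}")
        note = parts[2].strip() if len(parts) == 3 else None
        markers.append((parts[0], int(parts[1]), note))
    return markers
```

For the second sample above, only marker01 carries a note (CODING_SNP); the other two entries have None in the third slot.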

PLINK Format

Output files from PLINK can be loaded into Haploview using the PLINK tab on the initial loading screen. PLINK files must contain a header and at least one column header must be titled "SNP" and contain the marker IDs for the results in the file. PLINK loading also requires a standard PLINK map or binary map file corresponding to the markers in the output file. The map file can be either three or four headerless columns (the Morgan distance column is optional). The map file can also be embedded in the results file as the first few columns of the file using the "Integrated Map Info" checkbox. You can load in non-SNP based files as well by checking the "Non-SNP" box. These files do not require a map file. You can choose to only load in one chromosome from your results file using the "Only load results from Chromosome" checkbox and selecting a chromosome from the dropdown list. You can also select which columns to load from your results file by checking the "Select Columns" checkbox. For a great deal more information on PLINK outputs, please see Shaun Purcell's PLINK website.

Batch Load File

The "-batch" flag on the command line allows you to run Haploview automatically (in nogui mode) on several files. Batch input files should have one genotype file per line, along with an info file (if desired) separated by a space. Filenames must conform to the following rules:

  • Pedfile names must end in ".ped"
  • Phased haplotype file names must end in ".haps"
  • HapMap file names must end in ".hmp"
  • Info file names must end in ".info"

The following example shows 2 pedfiles (with info files) and a hapmap file:

sample1.ped   sample1.info
sample2.ped   sample2.info
sample3.hmp
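
The naming rules can be verified before starting a long batch run. The Python sketch below is illustrative only; parse_batch_line is a hypothetical name, and the sketch assumes file names contain no spaces, since a space separates the genotype file from the info file.

```python
# Illustrative check of one batch file line. Name is hypothetical.
import os

GENOTYPE_EXTENSIONS = {
    ".ped": "linkage pedigree",
    ".haps": "phased haplotypes",
    ".hmp": "HapMap dump",
}

def parse_batch_line(line):
    """Return (genotype_file, format_description, info_file_or_None)."""
    parts = line.split()
    if not parts or len(parts) > 2:
        raise ValueError("expected a genotype file and an optional info file")
    extension = os.path.splitext(parts[0])[1]
    if extension not in GENOTYPE_EXTENSIONS:
        raise ValueError(f"unrecognized genotype file extension in {parts[0]}")
    info = parts[1] if len(parts) == 2 else None
    if info is not None and not info.endswith(".info"):
        raise ValueError(f"info file {info} must end in .info")
    return parts[0], GENOTYPE_EXTENSIONS[extension], info
```

Each of the three example lines above passes this check; the third returns None for the info file.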

Output Files

For any given tab the information in the display can be saved. For the data check and association test tabs, a simple tab-delimited text file is generated from the tables. For the LD and Haplotype tabs, data can either be dumped to text files or the image can be saved to a PNG.

LD Text Output File

LD text output is a tab delimited set of columns containing the various measures of LD used by the program. Details for each column are shown below:

  • L1 and L2 are the two loci in question, referenced by their number or name (if marker info file is provided)
  • D' is the value of D prime between the two loci.
  • LOD is the log of the likelihood odds ratio, a measure of confidence in the value of D'
  • r2 is the squared correlation coefficient between the two loci
  • CIlow is the 95% confidence lower bound on D'
  • CIhi is the 95% confidence upper bound on D'
  • Dist is the distance (in bases) between the loci, and is only displayed if a marker info file has been loaded
  • T-int is a statistic used by the HapMap Project to measure the completeness of information represented by a set of markers in a region

Details about additional options for this output type can be found below in the Export Options section.
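
Because the output is tab-delimited, it can be post-processed with standard tools. The Python sketch below filters marker pairs by r2. It assumes the file begins with a header row using the column names listed above; the exact spelling of the r2 column in a real file should be checked, which is why it is a parameter. The function name strong_pairs and the threshold are hypothetical.

```python
# Illustrative filter for LD text output. A header row is assumed.
import csv
import io

def strong_pairs(text, min_r2=0.8, r2_column="r2"):
    """Return (L1, L2, r2) for every pair whose r2 is at least min_r2."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    pairs = []
    for row in reader:
        r2 = float(row[r2_column])
        if r2 >= min_r2:
            pairs.append((row["L1"], row["L2"], r2))
    return pairs
```

To use it on a saved file, read the file contents into a string and pass that string as the first argument.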

LD PNG Output

When saving the LD table to a PNG, Haploview saves an image using the current display settings. This includes color scheme, zoom and proportional spacing. Thus, in order to save a less detailed image to a PNG, first zoom out, then export the tab. Note that Haploview cannot save large datasets at the higher zoom levels. For more information see the Export Options section below.

Haplotype Text Output File

Haplotype output shows a block, its markers, the haplotypes and their population frequencies, the crossover percentages to the next block and the multiallelic D prime. Crossover percentages are shown as a matrix with this block's haplotypes as the rows and the next block's haplotypes as the columns. An example might look like:

BLOCK 1.  MARKERS: 1 2 3 4
3312 (0.825)    |0.800  0.025   0.000|
1144 (0.163)    |0.031  0.125   0.007|
3342 (0.013)    |0.006  0.000   0.006|
Multiallelic Dprime: 0.802
BLOCK 2.  MARKERS: 10 11 12
441 (0.837)
222 (0.150)
242 (0.013)

In this example, the first block has 4 markers with 3 haplotypes displayed and the second block has 3 markers and 3 haplotypes. The tag SNPs for each block are (3,4) and (10,11) respectively. The crossover percentage matrix can be read as follows: 80% of all samples have the pattern 3312-441, 3.1% have the pattern 1144-441 and so forth.
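
The layout shown in the example can be read back into a simple structure. The following Python sketch is based only on the example above, so any variation in real output (extra annotations, different spacing) is not handled; parse_block_output is a hypothetical name.

```python
# Illustrative parser for haplotype text output, modeled on the example only.
import re

# haplotype, (frequency), then an optional |crossover row|
HAP_LINE = re.compile(r"^(\S+)\s+\(([\d.]+)\)(?:\s*\|(.*)\|)?\s*$")

def parse_block_output(text):
    """Return one dict per block with markers, haplotypes and D prime."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BLOCK"):
            markers = line.split("MARKERS:")[1].split()
            blocks.append({"markers": markers, "haplotypes": [], "dprime": None})
        elif line.startswith("Multiallelic Dprime:"):
            blocks[-1]["dprime"] = float(line.split(":")[1])
        else:
            match = HAP_LINE.match(line)
            if match:
                row = match.group(3)
                crossover = [float(x) for x in row.split()] if row else []
                blocks[-1]["haplotypes"].append(
                    (match.group(1), float(match.group(2)), crossover))
    return blocks
```

In the parsed result, the crossover row of a haplotype in one block lines up with the haplotype order of the next block, so the first entry of the 3312 row is the fraction for the pattern 3312-441.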

Haplotype PNG Output

Saving the haplotype tab to a PNG produces an image using the current display settings (such as haplotype frequency cutoff).

Single Marker Association Text Output File

Single marker association results are saved in a tab-delimited text file with the following columns:

  • # is the marker number.
  • Name is the marker ID specified if an info file is loaded.
  • Chi Square is the chi square value for the marker.
  • p value is the significance level for the above chi square.

Trio (TDT) data only:

  • Overtransmitted is the allele overtransmitted to affected offspring.
  • T:U is the ratio of transmissions to non transmissions of the overtransmitted allele (see above).

Case-Control data only:

  • Major Alleles are the major alleles in the case and control populations respectively.
  • Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.

Haplotype Association Text Output

Haplotype association text output is a tab-delimited file, broken into sections by block. The columns are:

  • Haplotype is the sequence of alleles for this haplotype in this block.
  • Frequency is the population frequency for this haplotype.
  • Chi Square is the chi square value for the haplotype.
  • p value is the significance level for the above chi square.

Trio (TDT) data only:

  • T:U is the ratio of transmissions to non transmissions of the haplotype to affected offspring.

Case-Control data only:

  • Case Control Ratios are the ratios (shown as either counts or quotients, depending on selected options) for the case and control populations, respectively.

Permutation Text Output File

The output from the permutations tab shows the number of permutations performed and then a tab-delimited table with one row per permuted test and the following columns:

  • Name is the test name, which is either a marker name or a comma-separated list of marker names, then a tab, then a comma-separated set of alleles for those markers.
  • Chi Square is the observed association chi square for that test.
  • Permutation p-value shows the significance of the test among the permutation tests.

Tagger Text Output File

The Tagger text output begins with several pieces of summary information. More details on this can be found in the Tagger section. The rest of the output is divided into two sections. The first lists each marker, with the following columns:

  • Marker is the marker name.
  • Best Test is the test with the highest r2 to this marker.
  • r^2 w/test is the r2 between this marker and its test.

The second part consists of a list of the tests and the alleles they capture best.

Tagger Tests Dump

This file is the same format used by Haploview for custom association tests and exported by Tagger. It is discussed below in the auxiliary files section.

Tagger Tags Dump

This file is the same format used by Haploview for custom association tests and exported by Tagger. It is discussed below in the auxiliary files section.

Marker Check Text Output File

The marker check data is a tab-delimited file with the following columns:

  • # is the marker number.
  • Name is the marker ID specified (only if an info file is loaded).
  • Position is the marker position specified (only if an info file is loaded).
  • ObsHET is the marker's observed heterozygosity.
  • PredHET is the marker's predicted heterozygosity (i.e. 2*MAF*(1-MAF)).
  • HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that its deviation from H-W equilibrium could be explained by chance.
  • %Geno is the percentage of non-missing genotypes for this marker.
  • FamTrio is the number of fully genotyped family trios for this marker (0 for datasets with unrelated individuals).
  • MendErr is the number of observed Mendelian inheritance errors (0 for datasets with unrelated individuals).
  • MAF is the minor allele frequency (using founders only) for this marker.
  • Alleles are the major and minor alleles for this marker.
  • Rating is "BAD" if the marker failed any of the above tests and blank otherwise.
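
The heterozygosity and genotyping columns are simple to reproduce by hand. The Python sketch below works through them for a single marker. It is illustrative rather than a description of Haploview's internals: the function names are hypothetical, and a genotype is treated as missing when either allele is missing, which is an assumption.

```python
# Illustrative arithmetic for ObsHET, PredHET and %Geno. Names are hypothetical.
def predicted_het(maf):
    """Predicted heterozygosity, 2*MAF*(1-MAF), as defined for PredHET."""
    return 2 * maf * (1 - maf)

def called(genotypes):
    """Keep genotypes with both alleles present (None marks a missing allele)."""
    return [g for g in genotypes if g[0] is not None and g[1] is not None]

def observed_het(genotypes):
    """Fraction of called genotypes whose two alleles differ."""
    usable = called(genotypes)
    if not usable:
        return 0.0
    return sum(1 for a, b in usable if a != b) / len(usable)

def percent_genotyped(genotypes):
    """Percentage of genotypes that are not missing."""
    return 100.0 * len(called(genotypes)) / len(genotypes)
```

For example, a marker with a minor allele frequency of 0.25 has a predicted heterozygosity of 2 * 0.25 * 0.75 = 0.375.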

PLINK Table Text Output File

The PLINK text output is a tab-delimited file of the current view of the data in the PLINK tab. Please note that while the filtering state is preserved in this output, the sorting state is not.

Export Options

The "Export Options" item in the File Menu allows adjustment of several parameters and allows the user to save any tab without having to switch to it. Specifically, the LD tab allows the markers to be filtered so that only some of them are output:

All

The default setting (and only one available for most tabs) is to use all the markers.

Marker Range

Generates the LD text or PNG file for only a specific range of markers.

Adjacent Markers

Generates the LD text file for only adjacent markers. This can be useful to view the T-int stat, which measures LD information content in the gaps between markers.

There is also an option to generate a "compressed" LD PNG, which is useful for very large datasets. The image is shrunk to an arbitrary zoom level which allows Haploview to save the PNG with minimal memory usage. Images can also be exported as high quality SVG files for use in publication. Please note that SVG images are quite large and may require a large amount of memory.

Auxiliary Input Files

Blocks File

You can specify a set of blocks by loading a blocks file. Each line is a space separated list of markers with one block per line. For example:

1 2 3 4
9 10 11 12 13 14 15

This would create one block from markers 1-4 and another from markers 9-15. The first marker in the file is number 1 (not 0).
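
Since marker numbers in a blocks file start at 1, scripts that use zero-based indexing need to shift them. The Python sketch below reads a blocks file and performs that conversion; the names read_blocks and to_zero_based are hypothetical.

```python
# Illustrative reader for a blocks file. Names are hypothetical.
def read_blocks(lines):
    """Return one list of 1-based marker numbers per non-blank line."""
    blocks = []
    for line in lines:
        if not line.strip():
            continue
        markers = [int(token) for token in line.split()]
        if min(markers) < 1:
            raise ValueError("marker numbers start at 1, not 0")
        blocks.append(markers)
    return blocks

def to_zero_based(blocks):
    """Shift marker numbers for use as zero-based list indexes."""
    return [[m - 1 for m in block] for block in blocks]
```

Applied to the example above, read_blocks returns two blocks, and to_zero_based turns the first into the indexes 0 through 3.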

Analysis Track File

You can add an analysis track along the top of the LD display by loading a file with two columns, <position> <value>. Haploview will plot the values continuously with respect to the positions of the markers, so the positions should use the same coordinates as the marker info file. For example:

1000 0.3
2000 1.7
3000 11.0
4000 2.3
5000 4.6

This would plot a line from position 1000 to 5000. The values can be of any units or magnitude, as Haploview scales the analysis track to the bounds of the values.
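
To give a feel for what scaling to the bounds of the values means, the Python sketch below maps each value onto the range 0 to 1 using the smallest and largest values in the file. This linear min-max mapping is an assumption made for illustration; the exact scaling Haploview applies is not described here. The name scale_track is hypothetical.

```python
# Illustrative min-max scaling. The actual scaling method is an assumption.
def scale_track(lines):
    """Return [(position, fraction)] with fraction in the range 0..1."""
    points = []
    for line in lines:
        if line.strip():
            position, value = line.split()
            points.append((int(position), float(value)))
    low = min(value for _, value in points)
    high = max(value for _, value in points)
    span = high - low
    if span == 0:
        # A flat track has no range to scale; place it at the bottom.
        return [(position, 0.0) for position, _ in points]
    return [(position, (value - low) / span) for position, value in points]
```

With the example file above, the value 0.3 at position 1000 maps to the bottom of the track and the value 11.0 at position 3000 maps to the top.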

Custom Association Tests File

You can specify a set of custom association tests for Haploview to perform. The format takes both single marker tests and multi-marker tests (which require you to specify alleles for those markers). The format is one test per line, with each line containing one of the following: a single marker name, or several comma-separated names, then a tab, then comma-separated alleles for each marker. This format is exported by Haploview using the "Dump Tests" button in the Tagger Results panel and by Paul de Bakker's Tagger webpage.

For instance, the following example would create 5 tests: markers 1, 2 and 3 individually, all the alleles (haplotypes) of the block 4,5,6 and the CAA haplotype of the block 12,13,14:

marker1
marker2
marker3
marker4,marker5,marker6
marker12,marker13,marker14     2,1,1

N.B. Using a Custom Association Tests File requires a marker info file, since the tests file reads the marker names as specified in the info file.
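
A line of this file can be split into its marker names and alleles as in the Python sketch below. It follows the format description (names, a tab, then alleles) and is not taken from Haploview; the name parse_test_line is hypothetical.

```python
# Illustrative parser for one custom association test line. Name is hypothetical.
def parse_test_line(line):
    """Return (marker_names, alleles); alleles is empty when none are given."""
    names_part, _, alleles_part = line.rstrip("\n").partition("\t")
    markers = [name.strip() for name in names_part.split(",") if name.strip()]
    alleles = [a.strip() for a in alleles_part.split(",") if a.strip()]
    if not markers:
        raise ValueError("a test needs at least one marker name")
    if alleles and len(alleles) != len(markers):
        raise ValueError("give exactly one allele per marker")
    return markers, alleles
```

A single-marker line such as marker1 yields one name and no alleles, while the last line of the example yields three names with the alleles 2, 1 and 1 (C, A and A under the 1-4 coding).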

Tagger Marker Include/Exclude File

You can specify a list of markers for Tagger to include or exclude from those markers available for selection as tag SNPs. In either case the format is the same: one marker name per line. The following file could be used to either include or exclude markers 1, 7 and 9:

marker1
marker7
marker9

N.B. Using a Tagger Include/Exclude File requires a marker info file, since it reads the marker names as specified in the info file.