Chapter 1. Using Haploview

Loading a Dataset

Data can be loaded in six formats. Ped and Haps files can also load an optional marker info file and PLINK files normally require an accompanying map or binary map file. Further options are presented on the load screen:

  • Haploview saves time by only computing pairwise LD statistics for markers within a certain distance of each other. The default is 500KB. Enter a value of zero to force all pairwise computations.
  • Haploview excludes individuals with less than 50% complete genotypes. This threshold can be adjusted in the load dialog. Additional details about excluded individuals are available from the marker check tab.
  • When loading a file dumped from the HapMap project website, it is possible to automatically display SNP and gene tracks from the HapMap above the data by checking the "Download and show HapMap info track" box. More information is available with the LD Display help. [hapmap file only]
  • If you wish to perform association tests, you must inform the program now and select either family trios or case/controls. For family datasets, either a standard TDT or parenTDT is available. More details are available under association. [pedfile only]
  • If your data is from the X chromosome in the linkage formats, tick the box so that Haploview will correctly process your data. In other formats, select the X chromosome in the dropdown menu. X chromosome data is not supported by the phased haplotype format. All functionality now works with the X chromosome.
  • Haploview will maximize the information available from a pedigree for both LD analyses and association tests. For the former it creates a maximal set of unrelated individuals, using trio data only for obligate parent/offspring phasing. For TDT association testing, all available transmissions from parent-offspring will be used. More detailed information about specific situations is available in the FAQ.
  • Haploview can be configured to support proxy host settings using the "Proxy Settings" button on the load screen.
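
The distance cutoff in the first option above can be illustrated with a short Python sketch. This is not Haploview's own code; the function name pairs_to_compare is hypothetical, and treating the cutoff as inclusive is an assumption. It only shows which marker pairs a maximum-distance rule would keep, and how a value of zero selects every pair.

```python
from itertools import combinations

def pairs_to_compare(positions, max_distance=500_000):
    """Return index pairs (i, j), i < j, that lie within max_distance bases.

    max_distance == 0 means "compare every pair", mirroring the load option.
    Whether the real cutoff is inclusive is an assumption made here.
    """
    pairs = []
    for i, j in combinations(range(len(positions)), 2):
        if max_distance == 0 or abs(positions[j] - positions[i]) <= max_distance:
            pairs.append((i, j))
    return pairs

positions = [1_000, 200_000, 600_000, 1_200_000]
print(pairs_to_compare(positions))     # only pairs within 500KB
print(pairs_to_compare(positions, 0))  # every pair
```

With many markers the number of pairs grows quadratically, which is why a distance cutoff saves time.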

Haploview allocates 512MB of memory by default. This is usually sufficient to handle datasets with several thousand markers. If you are running the program on very large datasets (>20,000 markers) you may need to force more memory (presuming your computer has sufficient resources available). This can be accomplished using the following command:

java -jar Haploview.jar -memory 2000

Where "2000" in this case specifies 2000 megabytes of memory and can be adjusted as necessary. Previous versions of Haploview required a slightly different command to adjust available memory, which still works:

java -Xmx2000M -cp Haploview.jar edu/mit/wi/haploview/Haploview

Data Quality Checks

Marker Checks

After loading a file, Haploview shows some basic data quality checks for the markers. Markers are filtered out based on some default criteria which can be adjusted as necessary. Markers can be added or removed from analyses by hand via the checkboxes. The data in this table can be sorted by clicking on any of the column headers. Compound sorts can be done by clicking on the first column header then CTRL clicking on the next one.

  • # is the marker number.
  • Name is the marker ID specified (only if an info file is loaded).
  • Position is the marker position specified (only if an info file is loaded).
  • ObsHET is the marker's observed heterozygosity.
  • PredHET is the marker's predicted heterozygosity (i.e. 2*MAF*(1-MAF)).
  • HWpval is the Hardy-Weinberg equilibrium p value, which is the probability that its deviation from H-W equilibrium could be explained by chance.
  • %Geno is the percentage of non-missing genotypes for this marker.
  • FamTrio is the number of fully genotyped family trios for this marker (0 for datasets with unrelated individuals).
  • MendErr is the number of observed Mendelian inheritance errors (0 for datasets with unrelated individuals).
  • MAF is the minor allele frequency (using founders only) for this marker.
  • Alleles are the major and minor alleles for this marker.
  • Rating is checked if the marker passes all the tests and unchecked if it fails one or more tests (highlighted in red).
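
As a rough illustration of how several of these columns relate to each other, the following Python sketch derives ObsHET, PredHET, MAF and %Geno from genotype counts at a single marker. The function name marker_summary and its arguments are hypothetical, and the counts are assumed to come from founders only. The Hardy-Weinberg p value is not reproduced here.

```python
def marker_summary(n_hom_a, n_het, n_hom_b, n_missing):
    """Summarize one biallelic marker from genotype counts (founders assumed)."""
    genotyped = n_hom_a + n_het + n_hom_b
    total = genotyped + n_missing
    freq_b = (2 * n_hom_b + n_het) / (2 * genotyped)
    maf = min(freq_b, 1 - freq_b)  # the minor allele is whichever is rarer
    return {
        "ObsHET": n_het / genotyped,
        "PredHET": 2 * maf * (1 - maf),
        "MAF": maf,
        "%Geno": 100.0 * genotyped / total,
    }

print(marker_summary(n_hom_a=50, n_het=40, n_hom_b=10, n_missing=0))
```

A large gap between ObsHET and PredHET is the kind of deviation that the HWpval column tests for.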

You can adjust the filtering thresholds and click "Rescore" to refilter the markers using the new values. These thresholds can be reset to their default values by clicking "Reset Values". Markers can also be selected/unselected by hand by clicking the "Rating" checkbox or using the "Select All" and "Deselect All" buttons. Any marker which fails one of the quality tests will have the relevant field(s) highlighted in red.

Duplicate Markers

If two markers in an input file have the same chromosomal position, Haploview will ignore the less completely genotyped marker by default and highlight both in yellow on the check markers panel. When running in nogui mode Haploview always ignores the less completely genotyped version of two markers with the same position. If you want to use both from the command line, you'll need to adjust one of the positions.

If two markers in an input file have the same name, Haploview renames the second one in the file by appending ".X" to the marker name, where "X" is a running integer count starting with 1. So if you have marker1, marker1 and marker2, Haploview would adjust this to: marker1, marker1.1 and marker2. Note that if the markers with the same name have different positions, Haploview won't deselect any of them; if they do have identical positions, it will filter all but one out as described above.
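
The renaming rule can be sketched in a few lines of Python. This is an illustration only, not Haploview's implementation; the function name rename_duplicates is hypothetical, and the behavior for a third or later duplicate (marker1.2, marker1.3, ...) is an assumption based on the "running integer count" wording.

```python
def rename_duplicates(names):
    """Append .1, .2, ... to repeated marker names, keeping the first as-is."""
    seen = {}
    renamed = []
    for name in names:
        if name in seen:
            seen[name] += 1
            renamed.append(f"{name}.{seen[name]}")
        else:
            seen[name] = 0
            renamed.append(name)
    return renamed

print(rename_duplicates(["marker1", "marker1", "marker2"]))
```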

Filtered Individuals

The top of the tab contains information about individuals filtered during the loading of the file. It will show overview information about the number of singletons and trios used and the number of independent families loaded. Further information can be shown by clicking the "Advanced Views" button. This will present a list of up to four buttons depending on the nature of the loaded dataset. The "Individual Summary" button will show genotyping percentage by family and individual. If individuals have been excluded, the "Excluded Individuals" button will present a list of excluded individuals as well as the reason for exclusion. If Mendel errors are present, view detailed Mendel error information by clicking the "Mendel Errors" button. If male heterozygotes are present in X chromosome data, information about them can be viewed by clicking "Male Heterozygotes". All of these advanced views can also be exported using the "Export to File" button. Details about individual filtering can be found in the FAQ.

LD Display

Perusing the LD Display

  • The color scheme option (Display menu) allows you to choose among several LD color schemes. The following tables provide details on the color schemes, and a key to the meaning of the currently selected scheme can be dropped down from the "Key" menu in the upper right corner of the screen.

    Table 1.1. Standard Color Scheme

                 D' < 1                D' = 1
    LOD < 2      white                 blue
    LOD ≥ 2      shades of pink/red    bright red

    Table 1.2. Confidence Bounds Color Scheme

    Strong Evidence of LD               dark grey
    Uninformative                       light grey
    Strong Evidence of Recombination    white

    Table 1.3. r2 Color Scheme

    r2 = 0         white
    0 < r2 < 1     shades of grey
    r2 = 1         black

    Table 1.4. Alternate D'/LOD Color Scheme

                 Low D'    High D'
    Low LOD      white     shades of pink
    High LOD     white     black

    (r2 and Alt D'/LOD courtesy of Will Fitzhugh)

    Table 1.5. 4 Gamete Color Scheme

    4 distinct 2-marker haplotypes      white
    < 4 distinct 2-marker haplotypes    black
  • In order to help keep the display uncluttered, D prime values of 1.0 are never printed (the box is left empty). The display of LD values can be switched on or off with the "Show LD values" option in the Display menu.
  • The zoom option (Display menu) allows you to select one of three zoom modes. The two zoomed out versions can be useful for browsing large datasets.
  • Large datasets also show a "map" in the lower left corner which gives an overview of the D prime display and allows you to navigate quickly. Clicking on an area of the map will cause the main display to jump to that area. This map also shows the currently defined blocks as small black lines across the top.
  • Markers with additional notes (as loaded from the info file) are highlighted: the names are green in the zoomed-in view, and the lines from the SNP position to the LD chart are green in the zoomed-out view. Details can be viewed by right clicking on the marker number (as mentioned below).
  • Right clicking on the marker number (or the equivalent space in the zoomed out views) shows the marker name, minor allele frequency and any additional notes specified in the info file. This can be especially helpful in the zoomed out views which do not display marker names. The last such piece of popup information clicked will be shown at the top of the LD plot. This reminder can be dismissed by left clicking anywhere on the LD plot.
  • Right clicking on any pairwise LD comparison will show a more detailed summary of the LD between the two markers in question. This information is also shown at the top of the screen as described above and can be dismissed by left clicking anywhere on the LD plot.
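
The lookups in Table 1.1 and Table 1.3 can be written as small Python functions. This is only a restatement of the tables, not Haploview's drawing code; the function names are hypothetical, and the exact shade chosen within "shades of pink/red" or "shades of grey" is not modeled.

```python
def standard_color(d_prime, lod):
    """Color category under the standard scheme (Table 1.1)."""
    if lod < 2:
        return "blue" if d_prime == 1 else "white"
    return "bright red" if d_prime == 1 else "shades of pink/red"

def r2_color(r2):
    """Color category under the r2 scheme (Table 1.3)."""
    if r2 == 0:
        return "white"
    if r2 == 1:
        return "black"
    return "shades of grey"

print(standard_color(d_prime=1.0, lod=1.2))
print(standard_color(d_prime=0.8, lod=5.0))
print(r2_color(0.35))
```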

Additional Data Tracks

Analysis Track

A graph of any variable versus chromosomal location can be added above the LD plot with the "Load Analysis Track" option. Simply create a file with two columns: <position> <value> . Haploview will plot the values in a continuous line along the top of the screen, along with a scale bar on the Y-axis. You can load several analysis tracks which will all be plotted in the same box at the top of the LD plot.
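
A file of this shape can be produced with a few lines of Python. The sketch below assumes that the two columns are separated by whitespace and that rows are best written in position order; neither detail is confirmed above, and the function name format_analysis_track is hypothetical.

```python
def format_analysis_track(points):
    """Return the text of a two-column "<position> <value>" file.

    points: iterable of (position, value) pairs. Rows are sorted by
    position; the single-space separator is an assumption.
    """
    return "".join(f"{position} {value}\n" for position, value in sorted(points))

text = format_analysis_track([(2000, 0.5), (1000, 1.25)])
print(text, end="")
```

The returned string can then be written to a file of your choosing and loaded with "Load Analysis Track".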

HapMap Gene/SNP Track

The "Download HapMap info track" option (with an internet connection) allows you to connect to the HapMap Project server and download and display a track with HapMap genotyped SNPs and gene names. If an info file is specified, the default boundaries are the positions of the first and last markers (which is only valid if the info file is in genomic coordinates). You must specify the proper chromosome and genomic build in the dialog box. If you are using a file downloaded from the HapMap website the program will specify the correct default chromosome, build and start/end positions. This track display can be configured with the "HapMap Info Track Options" item in the "Display" menu. Available tracks include HapMap SNPs, Entrez genes, recombination rate, contigs, and GC content.

Blocks and Haplotypes

Blocks

Haploview generates blocks whenever a file is opened, but these blocks can be edited and redefined in a number of ways. In the Analysis menu, you can clear all the blocks in order to start over, define blocks based on one of several automated methods or customize the parameters of those algorithms. Additionally, the blocks can be edited by hand.

Confidence Intervals [DEFAULT]

The default algorithm is taken from Gabriel et al, Science, 2002. 95% confidence bounds on D prime are generated and each comparison is called "strong LD", "inconclusive" or "strong recombination". A block is created if 95% of informative (i.e. non-inconclusive) comparisons are "strong LD". This method by default ignores markers with MAF < 0.05. The MAF cutoff and the confidence bound cutoffs can be edited by choosing "Customize Block Definitions" (Analysis menu). This definition allows for many overlapping blocks to be valid. The default behavior is to sort the list of all possible blocks and start with the largest and keep adding blocks as long as they don't overlap with an already declared block.
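
The two steps described above (the 95% rule and the largest-first selection of non-overlapping blocks) can be sketched in Python as follows. This is a simplified illustration rather than Haploview's implementation: the function names and the string labels are hypothetical, each pairwise comparison is assumed to be classified already, and block "size" is measured here in marker indices, which is an assumption.

```python
def is_block(calls, min_strong_fraction=0.95):
    """calls: labels for every marker pair in a candidate block, each one of
    'strong_ld', 'inconclusive' or 'strong_recombination'."""
    informative = [c for c in calls if c != "inconclusive"]
    if not informative:
        return False  # assumption: no informative pairs means no block
    strong = sum(1 for c in informative if c == "strong_ld")
    return strong / len(informative) >= min_strong_fraction

def pick_blocks(candidates):
    """candidates: (first, last) marker index pairs that passed is_block.
    Take the largest first and skip any that overlap a chosen block."""
    chosen = []
    for first, last in sorted(candidates, key=lambda b: b[1] - b[0], reverse=True):
        if all(last < c_first or first > c_last for c_first, c_last in chosen):
            chosen.append((first, last))
    return sorted(chosen)

print(is_block(["strong_ld"] * 20 + ["inconclusive"] * 5))
print(pick_blocks([(0, 4), (3, 6), (5, 7)]))
```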

Four Gamete Rule

This is a variant on the algorithm described in Wang et al, Am. J. Hum. Genet., 2002. For each marker pair, the population frequencies of the 4 possible two-marker haplotypes are computed. If all 4 are observed with at least frequency 0.01, a recombination is deemed to have taken place. Blocks are formed by consecutive markers where only 3 gametes are observed. The 1% cutoff can be edited to make the definition more or less stringent.
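
The per-pair test can be expressed compactly in Python. The sketch assumes the four two-marker haplotype frequencies have already been estimated, and the function name recombination_seen is hypothetical; how Haploview then extends blocks across consecutive markers is not shown.

```python
def recombination_seen(hap_freqs, cutoff=0.01):
    """Return True when all four two-marker haplotypes reach the cutoff.

    hap_freqs: the four estimated haplotype frequencies for one marker pair.
    """
    if len(hap_freqs) != 4:
        raise ValueError("expected exactly four haplotype frequencies")
    return all(freq >= cutoff for freq in hap_freqs)

print(recombination_seen([0.50, 0.30, 0.19, 0.01]))    # all four observed
print(recombination_seen([0.50, 0.30, 0.195, 0.005]))  # fourth below cutoff
```

Raising the cutoff makes it harder to declare a recombination, so blocks become longer; lowering it has the opposite effect.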

Solid Spine of LD

This internally developed method searches for a "spine" of strong LD running from one marker to another along the legs of the triangle in the LD chart (this would mean that the first and last markers in a block are in strong LD with all intermediate markers but that the intermediate markers are not necessarily in LD with each other).
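
The spine condition can be sketched in Python as below. This is an interpretation of the description above, not the program's source: the function name is_spine is hypothetical, the definition of "strong LD" for a pair is taken as given, and requiring the first and last markers to be in strong LD with each other (the apex of the triangle) is an assumption.

```python
def is_spine(strong_pairs, first, last):
    """Check the spine condition for markers first..last (indices).

    strong_pairs: set of (i, j) tuples with i < j that are in strong LD.
    Intermediate markers need not be in strong LD with each other.
    """
    def strong(i, j):
        return (min(i, j), max(i, j)) in strong_pairs

    if not strong(first, last):  # assumption: the apex pair must be strong
        return False
    return all(strong(first, k) and strong(k, last)
               for k in range(first + 1, last))

# Markers 1 and 2 are not in strong LD with each other, yet the spine holds.
strong_pairs = {(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)}
print(is_spine(strong_pairs, 0, 3))
```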

Markers can be removed from blocks by clicking on the marker number (along the top of the D prime graph). Blocks can be defined by hand by clicking and dragging along the marker number row. A newly defined block that overlaps an existing block takes precedence and deletes the existing block.

Haplotypes

Display

View haplotypes for selected blocks by clicking on the "Haplotypes" tab or selecting "Haplotypes" from the Display menu. Haplotypes are estimated using an accelerated EM algorithm similar to the partition/ligation method described in Qin et al, 2002, Am J Hum Genet. This creates highly accurate population frequency estimates of the phased haplotypes based on the maximum likelihood as determined from the unphased input.

The haplotype display shows each haplotype in a block with its population frequency and connections from one block to the next. In the crossing areas, a value of multiallelic D' is shown. This represents the level of recombination between the two blocks. Note that the value of multiallelic D' is computed for only the haplotypes ("alleles") currently displayed. This usually does not have a strong effect, as the rare haplotypes contribute only slightly to the overall value. Above the haplotypes are marker numbers along with a tick beneath haplotype tag SNPs (htSNPs).

Display Controls

The display can be edited using the controls at the bottom of the screen to display only more common haplotypes or to adjust the connecting lines. By default, alleles are displayed using A,C,G,T along with the special symbol 'X' which represents a fairly rare situation in which only one allele is unambiguously observed in phased data. The 'X' represents the allele of unknown identity. The display can also be changed to show the alleles numerically from 1-4 with 8 being the equivalent of 'X', or as blue and red boxes, with blue being the major allele and red the minor.

Tag SNPs

Haplotype tag SNPs are no longer displayed by default in the Haplotypes tab. It is recommended that all tagging be done via the Tagger tab. The block-by-block tags can be displayed by ticking the "Show tags in blocks" option in the Display menu.

Tagger

Introduction

We have developed a tagging strategy that combines the simplicity of pairwise methods with the potential efficiency of multimarker approaches. We avoid overfitting and unbounded haplotype tests in the association phase by (a) using only those multiallelic combinations in which the alleles are themselves in strong LD, and (b) explicitly recording the allelic hypotheses that are to be tested in the subsequent association analysis. Attractive practical features include the ability to force in or exclude sets of tags.

Haploview's tagging functionality is based on Paul de Bakker's Tagger. The program and more information about it are available at the Tagger website. There are a number of differences between the implementations, although they are constructed around the same concept. Tagger currently searches a much broader space of available multi-marker tests (up to 6-mers) whereas Haploview allows only 2- or 3-marker tests in the interest of computational efficiency.

Features

Haploview's Tagger operates in either pairwise or aggressive mode. In either case it begins by selecting a minimal set of markers such that all alleles to be captured are correlated at an r2 greater than a user-editable threshold with a marker in that set. Certain markers can be forced into the tag list or explicitly prohibited from being chosen as tags. You can also specify which markers in the dataset you want to be captured.
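
One common way to approximate such a minimal set is a greedy selection, sketched below in Python. It is offered only as an illustration of pairwise tagging and is not claimed to be the algorithm Haploview or Tagger uses. The function name, the dictionary layout for r2, the 0.8 default and the use of ">=" at the threshold are all assumptions.

```python
def greedy_pairwise_tags(r2, to_capture, candidates, threshold=0.8):
    """Pick tags greedily until every capturable SNP is covered.

    r2: dict mapping (snp_a, snp_b) with snp_a < snp_b to pairwise r2.
    to_capture: SNPs that should be captured.
    candidates: SNPs allowed as tags (force-excluded SNPs are left out).
    Returns (tags, uncaptured).
    """
    def captures(tag, snp):
        if tag == snp:
            return True  # a SNP always captures itself
        return r2.get((min(tag, snp), max(tag, snp)), 0.0) >= threshold

    uncaptured = set(to_capture)
    tags = []
    while uncaptured and candidates:
        best = max(candidates,
                   key=lambda t: sum(captures(t, s) for s in uncaptured))
        covered = {s for s in uncaptured if captures(best, s)}
        if not covered:
            break  # the remaining SNPs cannot be tagged by any candidate
        tags.append(best)
        uncaptured -= covered
    return tags, uncaptured

r2 = {("A", "B"): 0.9, ("A", "C"): 0.85, ("C", "D"): 0.3}
print(greedy_pairwise_tags(r2, ["A", "B", "C", "D"], ["A", "B", "C", "D"]))
```

Leaving a SNP out of candidates while keeping it in to_capture corresponds to excluding it as a tag while still asking for it to be captured.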

Aggressive tagging introduces two additional steps. The first is to try to capture SNPs which could not be captured in the pairwise step (N.B. these must have been "excluded" since otherwise they would simply be chosen to capture themselves) using multi-marker tests constructed from the set of markers chosen as pairwise tags. After this, it tries to "peel back" the tag list by replacing certain tags with multi-marker tests. Tagger avoids overfitting by only constructing multi-marker tests from SNPs which are in strong LD with each other, as measured by a pairwise LOD score. This LOD cutoff can be adjusted to loosen or tighten this requirement; in general, the default cutoff of 3.0 is appropriate for selecting tags from a HapMap-sized reference panel of 120 chromosomes.

Much more information about the development of this algorithm is available at the Tagger website.

Tagger Configuration Panel

N.B. Haploview's Tagger requires either an info file or a hapmap style input file, because it references the marker names specified in those files. If you load a pedigree or phased haplotypes input file without an info file, the Tagger panels will not be available.

This panel shows all SNPs available for tag selection. SNPs which are deselected in the Check Markers tab will not be in this list. There are three checkboxes for each SNP:

Force Include

Checking this box will force this SNP to be chosen as a tag SNP.

Force Exclude

Checking this box will prohibit this SNP from being chosen as a tag SNP.

Capture this Allele?

If this box is checked, Haploview will include this SNP in the list of alleles to be captured by the chosen tag set.

N.B. The include and exclude checkboxes are mutually exclusive, and "Capture this Allele" must be checked in order to either include or exclude a marker.

Directly below the marker list are buttons to quickly manipulate the table above. Use "Include All" to check all of the "Force Include" boxes, and "Exclude All" to check all of the "Force Exclude" boxes. "Uncapture All" will uncheck the "Capture this Allele?" column for all markers, "Exclude A/T and C/G SNPs" will check the "Force Exclude" boxes for SNPs with strand issues, and "Reset Table" will return the table to its initial state.

Beneath these buttons are several additional tagging options. You can choose among the pairwise and two aggressive tagging strategies discussed above. You can also set the r2 and LOD thresholds as previously mentioned. Additionally, you can specify the maximum number of tags to pick, as well as the minimum distance (in base pairs) between picked tags.

You can load a set of SNPs to include or exclude using the "Load Includes" and "Load Excludes" buttons. These buttons take in a file with a single column of SNPs to include or exclude. The "Alleles to Capture" button also takes in a file with a single column of SNPs to be captured. Design scores can also be loaded in using the "Design Scores" button. Design score files should contain two columns containing the SNP and the design score to assign to that SNP. A minimum design score threshold can also be specified.

All of the Tagger thresholds can be reset to their default values using the "Reset Thresholds" button. Clicking "Run Tagger" will run the tagging algorithm. When finished it will switch from the Configuration to the Results Panel.

Tagger Results Panel

This panel is split into a "Tests" section on the left and a marker-by-marker report on the right. The marker report lists all SNPs, the test which best captures them, and their r2 with that test. SNPs which were unchecked from the "Capture this allele?" list on the Configuration panel are greyed out. SNPs which could not be successfully tagged are shown in red.

The first list in the "Tests" section shows all the tests (both single marker and multi-marker alleles) chosen by Haploview. Selecting tests in this list will show which alleles are captured by those tests in the second list in the panel. Beneath these lists is a summary of the tagging results.

Captured N alleles with mean r2 of X.

This shows how many of the SNPs in the dataset have been successfully tagged by the set of chosen tests. The mean r2 represents the mean for only those SNPs successfully captured.

Captured N percent of alleles with r2 >0.8

This shows what fraction of the alleles captured by the tests have an r2 >= 0.8. Of course, if your tagging r2 threshold is >= 0.8 this value will always be 100%.

Using N SNPs in M tests.

This shows that N unique SNPs have been chosen to create M tests, which can either be one of the set of N SNPs or some combination of those SNPs.

The "Dump Tests File" button exports a file with the list of tests in the format used by Haploview's custom association test file and Tagger's export. This file contains the list of all tests (single SNPs and multi-marker tests) selected by Tagger for subsequent association analysis. In pairwise-only tagging this file will be identical to the "Tags" file, below.

The "Dump Tags File" button exports a file with the list of Tag SNPs in the format used by Haploview's custom association test file and Tagger's export. It is the concise list of SNPs selected by Tagger for genotyping. In pairwise-only tagging this file will be identical to the "Tests" file, above.

The "Export Tab to Text" option in the File menu will export a summary file showing the best tag for each marker and the list of tests along with the alleles tagged by each test.

Association Tests

If selected when loading the data, Haploview computes single locus and multi-marker haplotype association tests. For case/control data, the chi square and p-value for the allele frequencies in cases vs. controls are shown. For family trios, all probands (affected individuals with genotyped parents) are used to compute TDT values. If the parenTDT option is selected, additional information is gained from parental phenotypes. More information about this method can be found in the Citations list in the About Haploview section.
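
For orientation, the textbook forms of these two statistics can be computed as in the Python sketch below. These are the standard formulas (a 2x2 allelic chi square and the basic TDT statistic, each with one degree of freedom); whether Haploview applies any correction or differs in detail is not stated here, so treat the sketch as an assumption. The function names are hypothetical.

```python
import math

def chi2_p_value_1df(chi2):
    """Upper-tail p-value of a chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def allelic_chi_square(case_a, case_b, control_a, control_b):
    """2x2 chi square on allele counts (no continuity correction)."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    chi2 = numerator / denominator
    return chi2, chi2_p_value_1df(chi2)

def tdt_chi_square(transmitted, untransmitted):
    """Basic TDT: counts of an allele transmitted vs. not transmitted
    from heterozygous parents to affected offspring."""
    chi2 = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    return chi2, chi2_p_value_1df(chi2)

print(allelic_chi_square(60, 40, 40, 60))
print(tdt_chi_square(30, 10))
```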

The haplotype association test is performed on the set of blocks selected on the LD and haplotype tabs. Results are shown only for those haplotypes above the display threshold on the haplotype tab. Counts for both TDT and case control association tests are obtained by summing the fractional likelihoods of each individual for each haplotype. In other words, if a particular individual has been determined by the EM to have a 40% likelihood of haplotype A and 60% likelihood of haplotype B, 0.4 and 0.6 would be added to the counts for A and B respectively.
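
The summing of fractional likelihoods can be shown with a minimal Python sketch. The function name haplotype_counts and the input layout (one dictionary of haplotype likelihoods per individual) are hypothetical.

```python
from collections import defaultdict

def haplotype_counts(individual_likelihoods):
    """Sum each individual's fractional haplotype likelihoods into counts."""
    counts = defaultdict(float)
    for likelihoods in individual_likelihoods:
        for haplotype, probability in likelihoods.items():
            counts[haplotype] += probability
    return dict(counts)

# One individual split 40/60 between A and B, one unambiguously A.
print(haplotype_counts([{"A": 0.4, "B": 0.6}, {"A": 1.0}]))
```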

Additional information about the way in which pedigrees are filtered for TDT purposes can be found in the FAQ.

Haploview is not intended to be the only way of testing association results, but to provide a straightforward way to do simple association tests. It's always a good idea to try out multiple approaches to analyzing your data.

You can load a set of custom association tests in the format exported by Haploview and Tagger. This format is discussed below.

Permutation Testing

Haploview provides a framework for permuting your association results in order to obtain a measure of significance corrected for multiple testing bias. You can choose to permute one of several test sets:

Single Markers Only

Permute just the association tests for the individual SNPs in your dataset.

Single Markers and Haplotypes in Blocks

Permute the individual SNPs as above, along with all the haplotypes shown in the Haplotypes tab.

Haplotypes in blocks only

Permute only the haplotypes in the Haplotypes tab, ignoring the single marker results.

Custom Tests from File

Permute the set of tests loaded from an external file. Note that this choice is only available if you provided a tests file when you loaded your dataset.

Specify how many permutations to do and press the "Do Permutations" button to start the permutations. While the permutations are running, Haploview shows the following:

  • A progress bar which tracks the progress of the permutations.
  • The highest permuted chi square so far.
  • The fraction of permutations whose strongest association exceeds the best observed chi square.

You can stop the permutations at any time with the "Stop" button. Once the permutations are complete, Haploview displays:

  • A table listing all tests (single SNP and haplotype) along with their association chi squares and permuted p-values.
  • A histogram of the highest chi square from each of the permutations.
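
The relationship between the per-permutation maxima and the permuted p-values can be sketched in Python as follows. The sketch assumes that each permutation has already been run and reduced to its highest chi square; the function name is hypothetical, and the use of a strict "exceeds" comparison without a +1 adjustment is an assumption rather than a description of Haploview's exact formula.

```python
def permuted_p_values(observed, permutation_maxima):
    """Corrected p-value per test: the fraction of permutations whose best
    chi square exceeds that test's observed chi square.

    observed: dict mapping test name to observed chi square.
    permutation_maxima: highest chi square from each permutation.
    """
    n = len(permutation_maxima)
    return {
        test: sum(1 for best in permutation_maxima if best > chi2) / n
        for test, chi2 in observed.items()
    }

observed = {"snp1": 9.0, "snp2": 2.0}
maxima = [1.0, 3.0, 10.0, 4.0]
print(permuted_p_values(observed, maxima))
```

Because every test is compared against the best result of each permutation, the resulting p-values account for the number of tests performed.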

You can save the permutation summary by using the "Export Tab to Text" option in the File menu.

PLINK

Haploview can now take in PLINK outputs. These files require a separate map file or binary map file corresponding to each marker in the output file in order to load. Any output file from PLINK can be loaded provided that it contains a SNP column corresponding to the map file. The map file can contain SNPs that are not present in the associated output file and the SNPs need not be in the same order in the two files. PLINK output is displayed in a single tab containing a sortable table of results and a variety of filtering options below the table. In SNP-based files, you can also load in additional columns using the "Load Additional Results" button.

Filtering options include chromosome and position. The "Filter" dropdown box can be used to further filter on any unrecognized columns in the table. Multiple filters can be applied consecutively by changing the filter options and pressing the "Filter" button. Currently active filters can be viewed using the "View X Active Filters" button where X is the number of currently active filters. This will open a small popup window where filters can be viewed, added, or removed as needed. All filters are jointly applied (logically equivalent to an "AND" operator as opposed to an "OR"). You can also go directly to a specified marker name in the table using the "Go" button on the second row, and remove columns from the table by selecting a column from the "Remove Columns:" dropdown menu and using the adjacent "Remove" button. The "Reset" button can be used to revert the table and filters back to their original state (though the sorting state is retained). Please note that non-SNP based files which can be loaded in without a map file do not have the chromosomal or marker filters.

You can use the Fisher method for combining p-values in your PLINK results using the "Combine P-Values" button under the filtering options. This will bring up a dialog that allows you to choose between 2 and 5 p-value columns for use in the Fisher-combined algorithm. Once you click "Go", a new column designated as "P_COMBINED" will appear as the last column in the table.
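
For reference, Fisher's method combines k p-values into the statistic -2 * sum(ln p), which follows a chi square distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below implements that standard formula using the closed form of the chi square tail for an even number of degrees of freedom. It is an illustration of the method, not Haploview's code, and the function name fisher_combined_p is hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine p-values (each in (0, 1]) with Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi square upper tail with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

print(fisher_combined_p([0.05, 0.05]))
```

A p-value of exactly zero cannot be combined this way, since its logarithm is undefined.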

You can create graphical plots from your results table using the "Plot" button under the filtering options. Use the Plot Options dialog to configure the plot. At the top of the dialog, enter an optional title for your plot. Then choose which columns to use on the X and Y axes of the plot as well as the scale for each using the appropriate drop-down boxes. If you have loaded a SNP based file with an accompanying map file, selecting "Chromosomes" as the X-Axis will plot your results across the chromosomes and will color code them separately. You can also select "Index" for either axis which will simply plot sequential numbers for each result shown in the table for that axis. Additionally, you can specify up to two thresholds for use in the plot, along with which axis to place them on and which direction they should be. Threshold 1 or the "Suggestive" threshold for the -log10 scale will create a blue line and Threshold 2 or the "Significant" threshold for the -log10 scale will create a red line. Datapoints which pass the thresholds will be larger in size than the standard datapoints. Directly beneath the thresholds, you can choose the base datapoint size for results in your plot using the dropdown box. To the right of that, you can use the optional "Color Key" dropdown menu to select a column to be used as a coloring key in the plot. Please note that this functionality will only work when the chosen color key column has 50 or fewer unique values. On the next line, you can use the "Show Gridlines?" checkbox to select whether to show or hide the gridlines in the plot. To the right of that, you can specify the initial width and height of the plot (in pixels). Finally, on the next line you can use the "Export to SVG" checkbox and browse to a location to save your plot to a high quality SVG file. Please note that the SVG option generally takes a great deal of processing power and memory and should only be used when very high quality images are required. In most cases, you can save the plot images as PNG files using the right click context menu described below.

Once the plot has loaded, you can hover over individual data points to see information about that point in a tooltip popup. SNP based data will display the corresponding marker name, chromosome, position, and the value that is being plotted. Non SNP-Based data will either display the corresponding FID and IID values if they are available, or simply the X and Y values for that datapoint. You can also click a datapoint to be taken back to that result in the results table. For many more plotting options including export options, right click anywhere on the plot.

By highlighting a specific result in the table and clicking "Go to Selected Region", you can bring up a dialog to automatically fetch that region from the HapMap website and load it into Haploview. The dialog allows you to specify the size of the region and the HapMap analysis panel that you wish to download. You can also optionally choose columns from the PLINK tab to annotate in the LD Plot. Once the region has been successfully loaded into Haploview, the initially selected marker will be highlighted in blue on the Check Marker and PLINK tabs. SNPs that appear in the PLINK tab are marked in green on the LD Plot, and the specific result that you selected is further highlighted in white. You can view the annotated data from the PLINK table by right clicking on the marker number in the LD Plot. You can use the "Force in PLINK SNPs" button in the Tagger panel to force include all the SNPs contained in the PLINK results tab. Please note that using "Go to Selected Region" requires an active internet connection.

The "Export Tab to Text" option in the File menu will export a text file containing the current view of the results table. This file will preserve any sorting and filtering that you've enabled in the table.