Data can be loaded in six formats. Ped and Haps files can also load an optional marker info file and PLINK files normally require an accompanying map or binary map file. Further options are presented on the load screen:
Haploview allocates 512MB of memory by default. This is usually sufficient to handle datasets with several thousand markers. If you are running the program on very large datasets (>20,000 markers) you may need to force more memory (presuming your computer has sufficient resources available). This can be accomplished using the following command:
java -jar Haploview.jar -memory 2000
Where "2000" in this case specifies 2000 megabytes of memory and can be adjusted as necessary. Previous versions of Haploview required a slightly different command to adjust available memory, which still works:
java -Xmx2000M -cp Haploview.jar edu/mit/wi/haploview/Haploview
After loading a file, Haploview shows some basic data quality checks for the markers. Markers are filtered out based on some default criteria which can be adjusted as necessary. Markers can be added or removed from analyses by hand via the checkboxes. The data in this table can be sorted by clicking on any of the column headers. Compound sorts can be done by clicking on the first column header then CTRL clicking on the next one.
You can adjust the filtering thresholds and click "Rescore" to refilter the markers using the new values. These thresholds can be reset to their default values by clicking "Reset Values". Markers can also be selected or deselected by hand by clicking the "Rating" checkbox or using the "Select All" and "Deselect All" buttons. Any marker which fails one of the quality tests will have the relevant field(s) highlighted in red.
If two markers in an input file have the same chromosomal position, Haploview will ignore the less completely genotyped marker by default and highlight both in yellow on the check markers panel. When running in nogui mode Haploview always ignores the less completely genotyped version of two markers with the same position. If you want to use both from the command line, you'll need to adjust one of the positions.
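The duplicate-position rule can be illustrated with a minimal sketch. This is not Haploview's code: the function name, the tuple layout, and the tie-breaking rule (keep the first marker when both are equally well genotyped) are assumptions made for illustration.

```python
def keep_best_per_position(markers):
    """Return the names of markers kept after resolving shared positions.

    markers: list of (name, position, genotyped_fraction) tuples.
    For each position, only the most completely genotyped marker is kept.
    Assumption: on a tie, the marker that appears first in the file wins.
    """
    best = {}  # position -> index of the best marker seen so far
    for i, (_name, pos, frac) in enumerate(markers):
        if pos not in best or frac > markers[best[pos]][2]:
            best[pos] = i
    kept = set(best.values())
    # Preserve the original file order in the result.
    return [m[0] for i, m in enumerate(markers) if i in kept]


example = [("rs1", 100, 0.95), ("rs2", 100, 0.99), ("rs3", 200, 0.90)]
print(keep_best_per_position(example))
```

Here rs1 and rs2 share position 100, so only the better-genotyped rs2 survives alongside rs3.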
If two markers in an input file have the same name, Haploview renames the second one in the file by appending ".X" to the marker name, where "X" is a running integer count starting with 1. So if you have marker1, marker1 and marker2, Haploview would adjust this to: marker1, marker1.1 and marker2. Note that if the markers with the same name have different positions, Haploview won't deselect any of them; if they do have identical positions, it will filter all but one out as described above.
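The renaming rule can be sketched as follows. The function is hypothetical, and it assumes the running count is kept separately for each repeated name; the text above only shows the case of a single duplicate.

```python
def rename_duplicates(names):
    """Append '.1', '.2', ... to repeated marker names, in file order.

    Assumption: the counter is tracked per name, not across the whole file.
    """
    seen = {}  # name -> number of times it has appeared so far
    renamed = []
    for name in names:
        count = seen.get(name, 0)
        renamed.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return renamed


print(rename_duplicates(["marker1", "marker1", "marker2"]))
# ['marker1', 'marker1.1', 'marker2']
```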
The top of the tab contains information about individuals filtered during the loading of the file. It will show overview information about the number of singletons and trios used and the number of independent families loaded. Further information can be shown by clicking the "Advanced Views" button. This will present a list of up to four buttons depending on the nature of the loaded dataset. The "Individual Summary" button will show genotyping percentage by family and individual. If individuals have been excluded, the "Excluded Individuals" button will present a list of excluded individuals as well as the reason for exclusion. If Mendel errors are present, you can view detailed Mendel error information by clicking the "Mendel Errors" button. If male heterozygotes are present in X chromosome data, information about them can be viewed by clicking "Male Heterozygotes". All of these advanced views can also be exported using the "Export to File" button. Details about individual filtering can be found in the FAQ.
The color scheme option (Display menu) allows you to choose among several LD color schemes. The following tables provide details on the color schemes, and a key to the meaning of the currently selected scheme can be dropped down from the "Key" menu in the upper right corner of the screen.
Table 1.2. Confidence Bounds Color Scheme
Strong Evidence of LD | dark grey
Uninformative | light grey
Strong Evidence of Recombination | white
(r2 and Alt D'/LOD courtesy of Will Fitzhugh)
A graph of any variable versus chromosomal location can be added above the LD plot with the "Load Analysis Track" option. Simply create a file with two columns: <position> <value> . Haploview will plot the values in a continuous line along the top of the screen, along with a scale bar on the Y-axis. You can load several analysis tracks which will all be plotted in the same box at the top of the LD plot.
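A small sketch of producing such a two-column track file is shown below. The helper name is hypothetical, and two details are assumptions rather than documented requirements: the columns are separated by a single space, and the rows are sorted by position.

```python
def format_analysis_track(points):
    """Build the text of an analysis track file.

    points: iterable of (position, value) pairs.
    Each output line is '<position> <value>'.
    Assumptions: space-separated columns, rows sorted by position.
    """
    lines = [f"{int(pos)} {val}" for pos, val in sorted(points)]
    return "\n".join(lines) + "\n"


text = format_analysis_track([(2000, 0.5), (1000, 1.25)])
print(text, end="")
# To save it for use with "Load Analysis Track":
# with open("track.txt", "w") as handle:
#     handle.write(text)
```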
The "Download HapMap info track" option (with an internet connection) allows you to connect to the HapMap Project server and download and display a track with HapMap genotyped SNPs and gene names. If an info file is specified, the default boundaries are the positions of the first and last markers (which is only valid if the info file is in genomic coordinates). You must specify the proper chromosome and genomic build in the dialog box. If you are using a file downloaded from the HapMap website the program will specify the correct default chromosome, build and start/end positions. This track display can be configured with the "HapMap Info Track Options" item in the "Display" menu. Available tracks include HapMap SNPs, Entrez genes, recombination rate, contigs, and GC content.
Haploview generates blocks whenever a file is opened, but these blocks can be edited and redefined in a number of ways. In the Analysis menu, you can clear all the blocks in order to start over, define blocks based on one of several automated methods or customize the parameters of those algorithms. Additionally, the blocks can be edited by hand.
The default algorithm is taken from Gabriel et al, Science, 2002. 95% confidence bounds on D prime are generated and each comparison is called "strong LD", "inconclusive" or "strong recombination". A block is created if 95% of informative (i.e. non-inconclusive) comparisons are "strong LD". This method by default ignores markers with MAF < 0.05. The MAF cutoff and the confidence bound cutoffs can be edited by choosing "Customize Block Definitions" (Analysis menu). This definition allows for many overlapping blocks to be valid. The default behavior is to sort the list of all possible blocks and start with the largest and keep adding blocks as long as they don't overlap with an already declared block.
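The two rules in this paragraph (the 95% criterion and the largest-first selection of non-overlapping blocks) can be sketched roughly as below. This is an illustration, not Haploview's implementation: the function names are hypothetical, "at least 95%" is assumed for the cutoff, and block size is assumed to mean the number of markers spanned.

```python
def is_block(calls, threshold=0.95):
    """Decide whether a set of pairwise calls qualifies as a block.

    calls: list of strings, each 'strong LD', 'inconclusive' or
    'strong recombination'. Inconclusive comparisons are ignored.
    """
    informative = [c for c in calls if c != "inconclusive"]
    if not informative:
        return False
    strong = sum(c == "strong LD" for c in informative)
    return strong / len(informative) >= threshold


def pick_blocks(candidates):
    """Greedily choose non-overlapping blocks, largest first.

    candidates: list of (first, last) marker indices, inclusive.
    Assumption: ties in size are broken by the earlier start position.
    """
    chosen = []
    ordered = sorted(candidates, key=lambda b: (-(b[1] - b[0]), b[0]))
    for first, last in ordered:
        # Accept only if it lies entirely before or after every chosen block.
        if all(last < f or first > l for f, l in chosen):
            chosen.append((first, last))
    return sorted(chosen)


print(pick_blocks([(0, 4), (3, 6), (5, 6), (8, 9)]))
# [(0, 4), (5, 6), (8, 9)]
```

In the example, the candidate (3, 6) is rejected because it overlaps the larger block (0, 4), which was declared first.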
This is a variant on the algorithm described in Wang et al, Am. J. Hum. Genet., 2002. For each marker pair, the population frequencies of the 4 possible two-marker haplotypes are computed. If all 4 are observed with at least frequency 0.01, a recombination is deemed to have taken place. Blocks are formed by consecutive markers where only 3 gametes are observed. The 1% cutoff can be edited to make the definition more or less stringent.
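The pairwise check described above reduces to a one-line test. The function below is a hypothetical sketch; it takes the four estimated two-marker haplotype frequencies as given and does not estimate them.

```python
def recombination_observed(haplotype_freqs, cutoff=0.01):
    """Return True if all four two-marker haplotypes reach the cutoff.

    haplotype_freqs: the four population frequencies for one marker pair.
    cutoff: minimum frequency for a haplotype to count as observed
    (0.01 is the default stated in the text).
    """
    if len(haplotype_freqs) != 4:
        raise ValueError("expected exactly four haplotype frequencies")
    return all(freq >= cutoff for freq in haplotype_freqs)


print(recombination_observed([0.50, 0.30, 0.15, 0.05]))    # True
print(recombination_observed([0.60, 0.30, 0.095, 0.005]))  # False
```

Raising the cutoff makes it harder to declare a recombination, so blocks tend to grow; lowering it has the opposite effect.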
This internally developed method searches for a "spine" of strong LD running from one marker to another along the legs of the triangle in the LD chart (this would mean that the first and last markers in a block are in strong LD with all intermediate markers but that the intermediate markers are not necessarily in LD with each other).
Markers can be removed from blocks by clicking on the marker number (along the top of the D prime graph). Blocks can be defined by hand by clicking and dragging along the marker number row. A newly defined block that overlaps an existing block takes precedence, and the existing block is deleted.
View haplotypes for selected blocks by clicking on the "Haplotypes" tab or selecting "Haplotypes" from the Display menu. Haplotypes are estimated using an accelerated EM algorithm similar to the partition/ligation method described in Qin et al, 2002, Am J Hum Genet. This creates highly accurate population frequency estimates of the phased haplotypes based on the maximum likelihood as determined from the unphased input.
The haplotype display shows each haplotype in a block with its population frequency and connections from one block to the next. In the crossing areas, a value of multiallelic D' is shown. This represents the level of recombination between the two blocks. Note that the value of multiallelic D' is computed for only the haplotypes ("alleles") currently displayed. This usually does not have a strong effect, as the rare haplotypes contribute only slightly to the overall value. Above the haplotypes are marker numbers along with a tick beneath haplotype tag SNPs (htSNPs).
The display can be edited using the controls at the bottom of the screen to display only more common haplotypes or to adjust the connecting lines. By default, alleles are displayed using A,C,G,T along with the special symbol 'X' which represents a fairly rare situation in which only one allele is unambiguously observed in phased data. The 'X' represents the allele of unknown identity. The display can also be changed to show the alleles numerically from 1-4 with 8 being the equivalent of 'X', or as blue and red boxes, with blue being the major allele and red the minor.
We have developed a tagging strategy that combines the simplicity of pairwise methods with the potential efficiency of multimarker approaches. We avoid overfitting and unbounded haplotype tests in the association phase by (a) using only those multiallelic combinations in which the alleles are themselves in strong LD, and (b) explicitly recording the allelic hypotheses that are to be tested in the subsequent association analysis. Attractive practical features include the ability to force in or exclude sets of tags.
Haploview's tagging implementation is based on Paul de Bakker's Tagger, which is available, along with more information, at the Tagger website. The two implementations are built around the same concept but differ in a number of ways. Tagger currently searches a much broader space of available multi-marker tests (up to 6-mers), whereas Haploview allows only 2- or 3-marker tests in the interest of computational efficiency.
Haploview's Tagger operates in either pairwise or aggressive mode. In either case it begins by selecting a minimal set of markers such that all alleles to be captured are correlated at an r2 greater than a user-editable threshold with a marker in that set. Certain markers can be forced into the tag list or explicitly prohibited from being chosen as tags. You can also specify which markers in the dataset you want to be captured.
Aggressive tagging introduces two additional steps. The first is to try to capture SNPs which could not be captured in the pairwise step (N.B. these must have been "excluded" since otherwise they would simply be chosen to capture themselves) using multi-marker tests constructed from the set of markers chosen as pairwise tags. After this, it tries to "peel back" the tag list by replacing certain tags with multi-marker tests. Tagger avoids overfitting by only constructing multi-marker tests from SNPs which are in strong LD with each other, as measured by a pairwise LOD score. This LOD cutoff can be adjusted to loosen or tighten this requirement; in general, the default cutoff of 3.0 is appropriate for selecting tags from a HapMap-sized reference panel of 120 chromosomes.
Much more information about the development of this algorithm is available at the Tagger website.
N.B. Haploview's Tagger requires either an info file or a hapmap style input file, because it references the marker names specified in those files. If you load a pedigree or phased haplotypes input file without an info file, the Tagger panels will not be available.
This panel shows all SNPs available for tag selection. SNPs which are deselected in the Check Markers tab will not be in this list. There are three checkboxes for each SNP:
Force Include: Checking this box will force this SNP to be chosen as a tag SNP.
Force Exclude: Checking this box will prohibit this SNP from being chosen as a tag SNP.
Capture this Allele?: If this box is checked, Haploview will include this SNP in the list of alleles to be captured by the chosen tag set.
N.B. The include and exclude checkboxes are mutually exclusive, and "Capture this Allele" must be checked in order to either include or exclude a marker.
Directly below the marker list are buttons to quickly manipulate the table above. Use "Include All" to check all of the "Force Include" boxes, and "Exclude All" to check all of the "Force Exclude" boxes. "Uncapture All" will uncheck the "Capture this Allele?" column for all markers, "Exclude A/T and C/G SNPs" will check the "Force Exclude" boxes for SNPs with strand issues, and "Reset Table" will return the table to its initial state.
Beneath these buttons are several additional tagging options. You can choose from among pairwise and two aggressive tagging strategies discussed above. You can also set the r2 and LOD thresholds as previously mentioned. Additionally, you can specify the maximum number of tags to pick, as well as the minimum distance (in base pairs) between picked tags. You can load a set of SNPs to include or exclude using the "Load Includes" and "Load Excludes" buttons. These buttons take in a file with a single column of SNPs to include or exclude. The "Alleles to Capture" button also takes in a file with a single column of SNPs to be captured. Design scores can also be loaded in using the "Design Scores" button. Design score files should contain two columns containing the SNP and the design score to assign to that SNP. A minimum design score threshold can also be specified. All of the Tagger thresholds can be reset to their default values using the "Reset Thresholds" button.
Clicking "Run Tagger" will run the tagging algorithm. When finished it will switch from the Configuration to the Results Panel.
This panel is split into a "Tests" section on the left and a marker-by-marker report on the right. The marker report lists all SNPs, the test which best captures them, and their r2 with that test. SNPs which were unchecked from the "Capture this allele?" list on the Configuration panel are greyed out. SNPs which could not be successfully tagged are shown in red.
The first list in the "Tests" section shows all the tests (both single marker and multi-marker alleles) chosen by Haploview. Selecting tests in this list will show which alleles are captured by those tests in the second list in the panel. Beneath these lists is a summary of the tagging results.
This shows how many of the SNPs in the dataset have been successfully tagged by the set of chosen tests. The mean r2 represents the mean for only those SNPs successfully captured.
This shows what fraction of the alleles captured by the tests have an r2 >= 0.8. Of course, if your tagging r2 threshold is >= 0.8 this value will always be 100%.
This shows that N unique SNPs have been chosen to create M tests, which can either be one of the set of N SNPs or some combination of those SNPs.
The "Dump Tests File" button exports a file with the list of tests in the format used by Haploview's custom association test file and Tagger's export. This file contains the list of all tests (single SNPs and multi-marker tests) selected by Tagger for subsequent association analysis. In pairwise-only tagging this file will be identical to the "Tags" file, below.
The "Dump Tags File" button exports a file with the list of Tag SNPs in the format used by Haploview's custom association test file and Tagger's export. It is the concise list of SNPs selected by Tagger for genotyping. In pairwise-only tagging this file will be identical to the "Tests" file, above.
The "Export Tab to Text" option in the File menu will export a summary file showing the best tag for each marker and the list of tests along with the alleles tagged by each test.
If selected when loading the data, Haploview computes single locus and multi-marker haplotype association tests. For case/control data, the chi square and p-value for the allele frequencies in cases vs. control are shown. For family trios, all probands (affected individual with genotyped parents) are used to compute TDT values. If the parenTDT option is selected, additional information is gained from parental phenotypes. More information about this method can be found in the Citations list in the About Haploview section.
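For orientation, the textbook forms of the two single-marker statistics are sketched below. This assumes the standard uncorrected Pearson chi-square on a 2x2 table of allele counts and the standard TDT statistic; the function names are hypothetical and Haploview's own computation may differ in detail (for example in how parenTDT adds parental information).

```python
import math


def chi2_p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square on allele counts in cases vs. controls."""
    n = case_a + case_b + ctrl_a + ctrl_b
    denom = ((case_a + case_b) * (ctrl_a + ctrl_b)
             * (case_a + ctrl_a) * (case_b + ctrl_b))
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    return chi2, chi2_p_value_1df(chi2)


def tdt(transmitted, untransmitted):
    """TDT statistic from heterozygous-parent transmission counts."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    return chi2, chi2_p_value_1df(chi2)


print(allelic_chi_square(60, 40, 40, 60))  # chi-square of 8.0
print(tdt(30, 10))                         # chi-square of 10.0
```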
The haplotype association test is performed on the set of blocks selected on the LD and haplotype tabs. Results are shown only for those haplotypes above the display threshold on the haplotype tab. Counts for both TDT and case control association tests are obtained by summing the fractional likelihoods of each individual for each haplotype. In other words, if a particular individual has been determined by the EM to have a 40% likelihood of haplotype A and 60% likelihood of haplotype B, 0.4 and 0.6 would be added to the counts for A and B respectively.
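The fractional counting described above can be sketched in a few lines. The input layout (one dictionary of haplotype likelihoods per individual) is an assumption made for illustration.

```python
from collections import defaultdict


def fractional_haplotype_counts(individuals):
    """Sum each individual's haplotype likelihoods into overall counts.

    individuals: list of dicts mapping haplotype -> likelihood
    (for example, as estimated by an EM algorithm).
    """
    counts = defaultdict(float)
    for likelihoods in individuals:
        for haplotype, prob in likelihoods.items():
            counts[haplotype] += prob
    return dict(counts)


# One individual split 40/60 between A and B, one certain to carry A.
print(fractional_haplotype_counts([{"A": 0.4, "B": 0.6}, {"A": 1.0}]))
```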
Additional information about the way in which pedigrees are filtered for TDT purposes can be found in the FAQ.
Haploview is not intended to be the only way of testing association results, but to provide a straightforward way to do simple association tests. It's always a good idea to try out multiple approaches to analyzing your data.
You can load a set of custom association tests in the format exported by Haploview and Tagger. This format is discussed below.
Haploview provides a framework for permuting your association results in order to obtain a measure of significance corrected for multiple testing bias. You can choose to permute one of several test sets:
Permute just association tests to the individual SNPs in your dataset.
Permute the individual SNPs as above, along with all the haplotypes shown in the Haplotypes tab.
Permute only the haplotypes in the Haplotypes tab, ignoring the single marker results.
Permute the set of tests loaded from an external file. Note that this choice is only available if you provided a tests file when you loaded your dataset.
Specify how many permutations to do and press the "Do Permutations" button to start the permutations. While the permutations are running, Haploview shows the following:
You can stop the permutations at any time with the "Stop" button. Once the permutations are complete, Haploview displays:
You can save the permutation summary by using the "Export Tab to Text" option in the File menu.
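As a rough illustration of how permutation can correct for multiple testing, the sketch below shuffles case/control labels and compares each observed statistic with the best statistic from each permuted dataset. This is a generic max-statistic scheme under stated assumptions (case/control data, genotypes coded as minor-allele counts, an uncorrected allelic chi-square); it is not a description of Haploview's internals, and all names are hypothetical.

```python
import random


def allelic_chi2(genotypes, labels):
    """Allelic chi-square for one SNP; genotypes are minor-allele counts (0-2)."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    a = sum(g for g, lab in zip(genotypes, labels) if lab)      # case minor
    c = sum(g for g, lab in zip(genotypes, labels) if not lab)  # control minor
    b, d = 2 * n_case - a, 2 * n_ctrl - c                       # major alleles
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom


def permutation_p_values(genotype_rows, labels, n_perm=1000, seed=0):
    """Corrected p-value per SNP: the fraction of permutations whose best
    chi-square across all SNPs is at least the SNP's observed chi-square.

    genotype_rows: one list of minor-allele counts per individual.
    labels: 1 for case, 0 for control, one per individual.
    """
    rng = random.Random(seed)
    columns = list(zip(*genotype_rows))
    observed = [allelic_chi2(col, labels) for col in columns]
    exceed = [0] * len(columns)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        best = max(allelic_chi2(col, shuffled) for col in columns)
        for j, obs in enumerate(observed):
            if best >= obs:
                exceed[j] += 1
    return [count / n_perm for count in exceed]


rows = [[2, 1], [2, 0], [2, 1], [2, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
status = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p_values(rows, status, n_perm=500))
```

The second SNP has identical genotypes in cases and controls, so its observed chi-square is zero and every permutation matches or beats it.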
Haploview can now take in PLINK outputs. These files require a separate map file or binary map file corresponding to each marker in the output file in order to load. Any output file from PLINK can be loaded provided that it contains a SNP column corresponding to the map file. The map file can contain SNPs that are not present in the associated output file and the SNPs need not be in the same order in the two files. PLINK output is displayed in a single tab containing a sortable table of results and a variety of filtering options below the table. In SNP-based files, you can also load in additional columns using the "Load Additional Results" button.
Filtering options include chromosome and position. The "Filter" dropdown box can be used to further filter on any unrecognized columns in the table. Multiple filters can be applied consecutively by changing the filter options and pressing the "Filter" button. Currently active filters can be viewed using the "View X Active Filters" button where X is the number of currently active filters. This will open a small popup window where filters can be viewed, added, or removed as needed. All filters are jointly applied (logically equivalent to an "AND" operator as opposed to an "OR"). You can also go directly to a specified marker name in the table using the "Go" button on the second row, and remove columns from the table by selecting a column from the "Remove Columns:" dropdown menu and using the adjacent "Remove" button. The "Reset" button can be used to revert the table and filters back to their original state (though the sorting state is retained). Please note that non-SNP based files which can be loaded in without a map file do not have the chromosomal or marker filters.
You can use the Fisher method for combining p-values in your PLINK results using the "Combine P-Values" button under the filtering options. This will bring up a dialog that allows you to choose between 2 and 5 p-value columns for use in the Fisher-combined algorithm. Once you click "Go", a new column designated as "P_COMBINED" will appear as the last column in the table.
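For reference, Fisher's method combines k p-values through the statistic X = -2 * sum(ln p), which follows a chi-square distribution with 2k degrees of freedom when the tests are independent. The sketch below is a standalone illustration, not Haploview's code; it uses the closed form of the chi-square tail for an even number of degrees of freedom, so no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form upper tail of a chi-square with 2k degrees of
    freedom: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k < 2:
        raise ValueError("need at least two p-values")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must be in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


print(fisher_combined_p([0.05, 0.05]))  # about 0.0175
```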
You can create graphical plots from your results table using the "Plot" button under the filtering options. The Plot Options dialog lets you specify a title and various plotting options. At the top of the dialog, enter an optional title for your plot. Then choose which columns to use on the X and Y axes, as well as the scale for each, using the appropriate drop-down boxes. If you have loaded a SNP-based file with an accompanying map file, selecting "Chromosomes" as the X-Axis will plot your results across the chromosomes and color code each chromosome separately. You can also select "Index" for either axis, which will simply plot sequential numbers for each result shown in the table for that axis.
Additionally, you can specify up to two thresholds for use in the plot, along with the axis on which to place each one and its direction. Threshold 1 (the "Suggestive" threshold for the -log10 scale) is drawn as a blue line, and Threshold 2 (the "Significant" threshold for the -log10 scale) is drawn as a red line. Datapoints which pass the thresholds will be larger than the standard datapoints. Directly beneath the thresholds, you can choose the base datapoint size for results in your plot using the dropdown box. To the right of that, you can use the optional "Color Key" dropdown menu to select a column to be used as a coloring key in the plot. Please note that this functionality will only work when the chosen color key column has 50 or fewer unique values.
On the next line, you can use the "Show Gridlines?" checkbox to select whether to show or hide the gridlines in the plot. To the right of that, you can specify the initial width and height of the plot (in pixels). Finally, on the next line you can use the "Export to SVG" checkbox and browse to a location to save your plot as a high quality SVG file.
Please note that the SVG option generally takes a great deal of processing power and memory and should only be used when very high quality images are required. In most cases, you can save the plot images as PNG files using the right click context menu described below.
Once the plot has loaded, you can hover over individual data points to see information about that point in a tooltip popup. SNP-based data will display the corresponding marker name, chromosome, position, and the value that is being plotted. Non-SNP-based data will either display the corresponding FID and IID values if they are available, or simply the X and Y values for that datapoint. You can also click a datapoint to be taken back to that result in the results table. For many more plotting options, including export options, right click anywhere on the plot.
By highlighting a specific result in the table and clicking "Go to Selected Region", you can bring up a dialog to automatically fetch that region from the HapMap website and load it into Haploview. The dialog allows you to specify the size of the region and the HapMap analysis panel that you wish to download. You can also optionally choose which columns from the PLINK tab to annotate in the LD Plot. Once the region has been successfully loaded into Haploview, the initially selected marker will be highlighted in blue on the Check Markers and PLINK tabs. SNPs that appear in the PLINK tab are marked in green on the LD Plot, and the specific result that you selected is further highlighted in white. You can view the annotated data from the PLINK table by right clicking on the marker number in the LD Plot. You can use the "Force in PLINK SNPs" button in the Tagger panel to force include all the SNPs contained in the PLINK results tab. Please note that using "Go to Selected Region" requires an active internet connection.
The "Export Tab to Text" option in the File menu will export a text file containing the current view of the results table. This file will preserve any sorting and filtering that you've enabled in the table.