On May 31, 2007, Nobel Laureate James Watson received his personal genome sequence in a ceremony at the Baylor College of Medicine. This genome sequence describes the six billion base pairs of DNA that James Watson received from his two parents, the unique combination of which is responsible for his genetic individuality. Dr. Watson is making his genome sequence available to the public in the hope that it will encourage the development of an era of "personalized medicine," in which the information contained in our genomes is used to identify and prevent diseases to which we are genetically prone before they appear, and to create personalized medical therapies that have the maximum benefit and the minimum risk. This simple browser allows you to view the places where Watson's sequence differs from the "reference" human genome sequence, as well as the genes and some of the common diseases associated with them.
Dr. Watson's genome was sequenced at 6x coverage using 454 Life Sciences technology. This means that each position on the genome was sequenced roughly six times on average. However, because of the probabilistic nature of the technology, some positions have been seen more than six times, and some fewer times or not at all. The 454 technology produces short stretches of sequence called "reads" that are roughly 100 bases long. The functional units of the genome, the genes, are roughly 50,000 bases long, or 500 times the size of a 454 read. To interpret the Watson sequence, the reads were matched to the human genome project's published reference sequence in order to reassemble them into gene-length pieces. The entire Watson sequence, with the exception of the ApoE gene, variants of which are associated with late-onset Alzheimer's disease, has been released to the public through this web site and the NCBI Trace Repository, and will be available from many other web sites in the future.
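To get a feel for what "roughly six times" means, here is a minimal sketch that models the number of reads covering a position as a Poisson random variable with a mean of 6. The Poisson model is an assumption made for illustration and is not how the site describes the data. Real coverage is more uneven, so these figures are only rough guides.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k reads at a position under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

MEAN_COVERAGE = 6.0  # the 6x figure quoted in the text

# Chance that a position was never sequenced at all.
p_unseen = poisson_pmf(0, MEAN_COVERAGE)

# Chance that a position was seen fewer than six times (0 to 5 reads).
p_fewer_than_six = sum(poisson_pmf(k, MEAN_COVERAGE) for k in range(6))

# Chance that a position was seen more than six times (7 or more reads).
p_more_than_six = 1.0 - sum(poisson_pmf(k, MEAN_COVERAGE) for k in range(7))

print(f"never sequenced:      {p_unseen:.2%}")
print(f"fewer than six reads: {p_fewer_than_six:.2%}")
print(f"more than six reads:  {p_more_than_six:.2%}")
```

Under this simplified model, well under 1% of positions would be missed entirely, while a large share would be covered by fewer than six reads.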
This site displays all places where Watson's DNA sequence differs from the reference sequence. These differences are called variants or polymorphisms. Because each of the differences shown here involves only a single base pair change, they are known as single nucleotide polymorphisms, or SNPs.
The two forms of the variant (the form found in the reference sequence and the form found in Watson's sequence) are known as alleles. Each SNP has two possible alleles.
You can search this site by typing the name of a chromosome (try "chr3"), the name of a gene (try "HTR2A"), the name of a polymorphism (try "rs726455"), or a disease name (try "parkinson disease"). This will take you to a visual display of a region or regions that contain information about your search term. Click on the graphical display of the chromosome to jump around, or use the arrows to scroll a screen's worth at a time. The table at the bottom of the page allows you to turn on and off different types of information. Mouse over or click on a genome feature in order to get more information about it.
This site combines the following information:
The Entrez Genes track contains information about protein-coding genes as they appear on the reference sequence. Each gene consists of a series of protein-coding parts known as exons (shown as yellow boxes), connected by intervening sequences known as introns (shown as black lines). Genes also have non-coding regions, shown as grey rectangles. Genes that are on the forward strand are depicted as right-pointing arrows, while those that are on the reverse strand point to the left. Some genes have multiple splice forms (alternative combinations of exons), in which case you will see a stack of similar genes on top of each other.
These genes are taken from NCBI Entrez. The Entrez accession number (database identifier) is printed above each gene. A brief description giving the common name of the gene and its function (if known) is printed in red below each gene.
Hover the mouse over a gene to get more information about it. Click on it to be taken to Entrez for even more information.
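The gene layout described for this track (exons joined by introns, drawn as an arrow that follows the strand) can be sketched as a small data structure. This is only an illustration: the class name, field names, accession, and coordinates below are hypothetical and are not taken from Entrez or from this site.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GeneModel:
    accession: str                 # hypothetical identifier, for illustration only
    strand: str                    # "+" for forward strand, "-" for reverse strand
    exons: List[Tuple[int, int]]   # 1-based inclusive (start, end) coordinates

    def introns(self) -> List[Tuple[int, int]]:
        """Return the gaps between consecutive exons, in genomic order."""
        ordered = sorted(self.exons)
        return [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
        ]

    def arrow(self) -> str:
        """Direction in which the gene would be drawn."""
        return "->" if self.strand == "+" else "<-"

# A made-up three-exon gene on the forward strand.
gene = GeneModel("HYPOTHETICAL_1", "+", [(100, 200), (300, 400), (500, 600)])
print(gene.arrow(), gene.introns())
```

Alternative splice forms would simply be several `GeneModel` objects that share a region but list different exons.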
This track shows places where James Watson's sequence differs from the reference sequence. Each difference is shown as a blue triangle at the position where the difference occurs. There are two types of difference. One is the case in which the base was sequenced multiple times and two forms of the base were found. In this case, James Watson is heterozygous at this site: one form corresponds to Watson's paternal chromosome, and the other corresponds to his maternal chromosome. The other is the case in which the base was sequenced multiple times and only one form of the base was found. In this case, Watson is possibly homozygous: both his parental chromosomes have the same base at this position, and this base is different from the one recorded for the reference genome.
The number of times each base was seen is reported in parentheses in a description printed below the triangle. If the description reads "(C:4 T:2)", this base was sequenced six times: four times it was read as "C" and twice it was read as "T". It is important to understand that our ability to call homozygous bases is limited by the randomness of the sequencing process. For example, if the description reports "(G:0 A:4)", the base was read four times and all four times it was an "A" rather than the reference sequence's "G". This means that Watson might be homozygous for "A" in this location (he inherited an "A" from both his parents), but it also might mean that he is heterozygous and we just got unlucky on this base and rolled an "A" four times in a row. If we had read the base a fifth time, it might have shown a "G".
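The "rolled an A four times in a row" argument can be made concrete with a short sketch. It parses a description in the format shown above and computes how likely it is that a truly heterozygous site shows only one particular allele in every read. The calculation assumes that each read is equally likely to come from either parental chromosome and that there are no sequencing errors. Both are simplifications, and the function names are made up for this example.

```python
import re
from typing import Dict

def parse_counts(description: str) -> Dict[str, int]:
    """Turn a description such as "(C:4 T:2)" into {"C": 4, "T": 2}."""
    return {base: int(n) for base, n in re.findall(r"([ACGT]):(\d+)", description)}

def prob_het_shows_one_allele(depth: int) -> float:
    """Chance that every one of `depth` reads from a heterozygous site
    shows the same particular allele, assuming a 50/50 draw per read."""
    return 0.5 ** depth

counts = parse_counts("(G:0 A:4)")
depth = sum(counts.values())
print(counts, f"depth={depth}")
print(f"chance a G/A heterozygote shows only A: {prob_het_shows_one_allele(depth):.4f}")
```

With four reads the chance is 1 in 16, which is why the site says "possibly homozygous" rather than "homozygous". Each additional matching read halves that chance.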
At higher magnifications, a line to the right of each of the alleles shows the number of times it was seen. If both alleles were seen equal numbers of times, the lines will be equal length. If one allele was seen twice as often as the other, its line will be twice as long.
It is also important to understand that a position in which no triangle appears means only that no difference from the reference genome was detected there. This can mean one of three things: (1) that he is homozygous for the reference allele; (2) that he is heterozygous, carrying the reference allele on one chromosome, but we got unlucky and didn't see the non-reference form of the base; or (3) that the sequencing failed to capture this base at all and we have no information. To see whether this base was sequenced at all, turn on the Sequence Coverage track.
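The cases described for this track can be summarized as a small decision rule over the per-base read counts. The sketch below is an illustration of that reasoning only, not the site's actual variant-calling procedure. The function name and the wording of the returned labels are invented for this example.

```python
from typing import Dict

def interpret_position(reference_base: str, counts: Dict[str, int]) -> str:
    """Classify one position from the number of times each base was read."""
    seen = [base for base, n in counts.items() if n > 0]
    if not seen:
        # No reads at all: check the Sequence Coverage track.
        return "no information"
    if len(seen) >= 2:
        # Two forms of the base were found.
        return "heterozygous"
    if seen[0] == reference_base:
        # No triangle is drawn; a second allele may simply have been missed.
        return "matches reference"
    # Only a non-reference base was found.
    return "possibly homozygous non-reference"

print(interpret_position("C", {"C": 4, "T": 2}))
print(interpret_position("G", {"G": 0, "A": 4}))
print(interpret_position("G", {"G": 5}))
print(interpret_position("G", {}))
```

Note that "matches reference" and "possibly homozygous non-reference" are both provisional labels: in either case a second allele could exist but have gone unsampled.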
About 90% of James Watson's 2 million sequence variants are identical to common polymorphisms that are already known to occur frequently in the human population. These have been given the same names as the previously described polymorphisms in the dbSNP database (see the dbSNP track). More than 200,000 (about 10%) of James Watson's sequence variants have never been seen before. These are rarer polymorphisms that occur less than 5% of the time in the human population. It is not surprising to see so many rare polymorphisms in one individual, because the total number of distinct rare polymorphisms is quite large. A rare polymorphism has a name that begins with "NOVEL."
Mouse over the difference to get more information about it, including the actual sequence surrounding the variant base. Click on the difference to get even more information.
The OMIM Association track shows statistical associations between diseases and common polymorphisms, extracted from Online Mendelian Inheritance in Man (OMIM) by Itsik Pe'er and colleagues using a software system named MutaGeneSys. An association means that in a particular population group, a certain allele (variant form) is associated with increased risk of acquiring a disease. For many polymorphisms, the at-risk allele is different for different populations. Having an at-risk allele doesn't necessarily mean that someone carrying it will come down with the corresponding disease. It just increases their risk, just as being exposed to second-hand cigarette smoke increases the risk that one will come down with lung disease but does not guarantee it.
Above each OMIM Association is an "rs" number, a dbSNP accession number (database ID) that corresponds to the common polymorphism that is associated with this disease. Below the association is a short description of the disease or the function of the affected gene.
Mouse over the association to get more information about it, and how it relates to James Watson's sequence. You will see a legend like the following:
rs482934 is associated with ARMD1; MACULAR DEGENERATION, AGE-RELATED, 1 for allele 'T' in HapMap panel CEU.
The reference sequence has a G at this position.
JW is potentially homozygous for T at this position.
This means that the "T" allele of this polymorphism is associated with age-related macular degeneration in members of the HapMap "CEU" panel (Utah residents with ancestry from northern and western Europe). The reference sequence has a "G" at this position, but James Watson is potentially homozygous for "T." You can interpret this as meaning that Watson has the at-risk allele, but bear in mind that we cannot know for sure that he does not also have the low-risk "G" allele. Since James Watson has made it into his 7th decade without coming down with age-related macular degeneration, it is unlikely that he will develop the disease.
Please see the HapMap web site for a description of each of the four population panels that appear in OMIM associations. The population that is most relevant to interpreting James Watson's disease risks is CEU, because this population most closely matches Watson's European ancestry. OMIM associations for the CEU panel are shown as dark purple diamonds. Others are shown in grey.
The OMIM Association track represents only a small slice of what is currently known about genetic risks for common diseases, because many new associations are being discovered every day. Nor does it contain information about rare variants which confer high risk of certain diseases to a very small part of the population. We hope to add this information in the near future.
This track provides allele frequency information on a subset (approximately 4.5 million) of common polymorphisms that have been characterized by the Human Haplotype Mapping (HapMap) Project. This information describes whether a polymorphism is common or rare in a particular population, and whether it is more common in some populations than others.
HapMap has data on the frequency of alleles in the following four populations:
| Abbreviation | Description |
|---|---|
| CEU | CEPH (Utah residents with ancestry from northern and western Europe) |
| CHB | Han Chinese in Beijing, China |
| JPT | Japanese in Tokyo, Japan |
| YRI | Yoruba in Ibadan, Nigeria |
James Watson is primarily of European descent, and so the HapMap population most relevant to interpreting his polymorphisms is the CEU population.
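Allele frequency, the quantity this track reports, is conventionally computed by counting alleles across the genotypes of a population sample: each person contributes two alleles per position. The sketch below shows that arithmetic. The genotype counts are invented for illustration and are not HapMap data, and the function name is hypothetical.

```python
def allele_frequency(n_hom_a: int, n_het: int, n_hom_b: int) -> float:
    """Frequency of allele A in a sample, given the number of people who are
    homozygous A/A, heterozygous A/B, and homozygous B/B."""
    total_alleles = 2 * (n_hom_a + n_het + n_hom_b)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    return (2 * n_hom_a + n_het) / total_alleles

# Made-up genotype counts for two populations (NOT real HapMap numbers).
example_counts = {
    "population 1": (10, 20, 30),
    "population 2": (30, 20, 10),
}

for name, (aa, ab, bb) in example_counts.items():
    print(f"{name}: frequency of allele A = {allele_frequency(aa, ab, bb):.3f}")
```

An allele that is common in one panel and rare in another, as in this invented example, is the kind of difference between populations that the track is meant to reveal.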
The dbSNP track corresponds to a large collection of confirmed and predicted polymorphisms stored in the NCBI's dbSNP database. There are more than 9 million SNPs in this database. Most of James Watson's 2 million polymorphisms are identical to SNPs already known to this database. However, roughly 200,000 of the polymorphisms found in his genome are novel ones that have not previously been identified.
This track links genes to a database of biological pathways called Reactome. By following these links, you can learn more about the structure and function of the genes shown in the Entrez Genes track.
At very high magnification (100-200 bp) this track shows the sequenced bases of the reference human genome. At lower magnification, it shows a plot of the percentage of G and C nucleotides. Genes tend to have a higher GC content than non-genic areas.
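The GC percentage plotted by this track is a simple count over a stretch of sequence. Here is a minimal sketch of how such a plot could be computed in fixed-size windows. The window size and the example sequence are arbitrary choices for illustration, and the function names are not taken from the browser's code.

```python
from typing import List

def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the called bases (A, C, G, T) in a sequence."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # nothing but gaps or unknown bases
    return 100.0 * (sequence.count("G") + sequence.count("C")) / called

def gc_windows(sequence: str, window: int) -> List[float]:
    """GC percentage for consecutive, non-overlapping windows."""
    return [
        gc_percent(sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]

example = "GGGGCCCCAAAATTTTGCGCATAT"  # made-up sequence
print(gc_percent(example))
print(gc_windows(example, 8))
```

Larger windows give a smoother plot, which suits a zoomed-out view. Smaller windows show more local detail.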
The article describing the James Watson sequence was published in Nature:
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008 Apr 17;452(7189):872-6.