| Questions list
|
|
| Cells of living organisms contain a program of instructions (the genome), which enables them to stay alive and reproduce. The instructions of this program (the genes) are coded in chemical form along giant molecules, the DNA molecules which form the chromosomes. The genome corresponds to the totality of genes of an organism, i.e. all the DNA or all the chromosomes. |
|
- What is the DNA sequence ?
|
| The instructions contained in the DNA are coded in a chemical alphabet composed of 4 characters, the nucleotides (or bases), which are symbolized by the letters A, T, G and C. The molecules of DNA consist of a sequence of millions of these elementary characters, like a necklace in which each pearl can be one of 4 possible colors. The order of bases in each sequence is the way that biological information is stored, analogous to the storage of information in a computer as a succession of bits. In other words, DNA is the chemical memory of living organisms. |
| The illustration below shows the analogy between two sequences of equal length (a succession of the letters A, T, G and C) and necklaces composed of 4 different colors of pearls. The order of bases (or pearls) is different in these two sequences (or necklaces); therefore they contain different information. |
 |
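| The same analogy can be sketched in a few lines of Python. The two sequences below are made up for illustration; they contain exactly the same letters in a different order, and the helper names are hypothetical. |

```python
# Two toy sequences of equal length, invented for illustration only.
seq_a = "ATGCCGTA"
seq_b = "ATGCGCTA"

ALPHABET = "ACGT"  # the 4 characters of the chemical alphabet


def is_dna(seq):
    """Return True if every character is one of the 4 bases."""
    return all(base in ALPHABET for base in seq)


def differing_positions(a, b):
    """0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def bits_needed(seq):
    """With 4 possible characters, each base can be stored in 2 bits."""
    return 2 * len(seq)


print(sorted(seq_a) == sorted(seq_b))      # same "pearls": True
print(differing_positions(seq_a, seq_b))   # different order: [4, 5]
print(bits_needed(seq_a))                  # 16
```

| Both necklaces use the same pearls, but because the order differs at two positions they do not carry the same information. |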
| To understand the instructions contained in a DNA molecule, it is first necessary to find out the succession (order) of characters (the sequence). This involves a sort of reading of information which we call sequencing. It differs from reading a text, in which comprehension is immediate, because sequencing of DNA requires supplementary interpretation in order (1) to identify the instructions and (2) to understand their biological meaning. |
|
|
| A knowledge of the instructions (genes) is an indispensable step in the understanding of biological phenomena at the molecular and cellular level. However, this knowledge (the sequence) is only the first step in this quest. |
| As the understanding of biological phenomena progresses, we see more and more applications in the fields of medicine, the pharmaceutical industry, biotechnology, agribusiness, and all sorts of other areas which depend on biological processes (agriculture, environmental studies). The sequence is therefore a necessary but not sufficient starting point for applications in these domains. |
|
|
| A fragment of DNA to be sequenced consists of a series of hundreds of copies of the 4 basic characters (or building blocks), the nucleotides (A, C, G and T) in a definite order. Sequencing of this molecule means determining this order. |
| The principle of sequencing is to produce, from a fixed point, partial copies of the molecule, cut off randomly. Fragments of all possible intermediate sizes are synthesized from the fixed point. |
 |
| Then one separates the fragments by size by electrophoretic migration in a porous gel. These gels permit separation of two consecutive fragments which differ by a single nucleotide. Since we can identify the nucleotide at the cut-off point on each of these partial fragments, from the shortest to the longest, it becomes possible to reconstitute the order of nucleotides of the fragment. |
 |
| In practice, to identify the nucleotide at the end of each fragment, the DNA to be sequenced is copied in the presence of a chemical substance which stops the copy at random positions, but always after the same one of the 4 nucleotides, A, T, G or C. Thus 4 series of fragments are prepared in parallel. In each series, all the fragments are cut off after only one type of nucleotide; for example, one series contains all the intermediate fragments which end with A. Furthermore, the substance which causes the cut-off is fluorescent, so it can be detected in automatic sequencers using an optical system which scans the lower part of the electrophoresis gel. The signals obtained are interpreted by a computer program which reconstitutes the sequence of the DNA fragment analyzed. An automatic sequencer can determine the sequence of 500 to 1000 nucleotides per run, or per "read." |
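| The logic of this reading can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the partial copies are treated as plain prefixes of the text being sequenced (the chemistry and the complementary strand are ignored), and all function names are hypothetical. |

```python
def terminated_fragments(template):
    """All partial copies that start at the fixed point (position 0)."""
    return [template[:n] for n in range(1, len(template) + 1)]


def split_into_series(fragments):
    """Group the fragments into 4 series according to their last base."""
    series = {base: [] for base in "ACGT"}
    for frag in fragments:
        series[frag[-1]].append(frag)
    return series


def read_gel(series):
    """Order all fragments by size and read off the base ending each one."""
    labelled = [
        (len(frag), base)
        for base, frags in series.items()
        for frag in frags
    ]
    # Every size occurs exactly once, so sorting by size gives the base order.
    return "".join(base for _, base in sorted(labelled))


template = "GATTACA"  # toy fragment, invented for illustration
series = split_into_series(terminated_fragments(template))
print(series["A"])       # ['GA', 'GATTA', 'GATTACA']
print(read_gel(series))  # GATTACA
```

| Sorting by size plays the role of the gel, and the series to which a fragment belongs plays the role of the fluorescent label. |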
|
|
| So we can read a sequence of 500 to 1000 characters (or bases) in a single sequencing experiment or "read." Since the DNA molecules are much longer than this, however, it is necessary to place the reads in the proper order, one after the other. In order to place them in their correct sequence, we perform redundant reads. By reading a large number of small fragments, a sequence which overlaps in places is obtained. |
 |
| Next, we compare these reads in order to recognize and align the parts which have been sequenced several times. These can be recognized because they have the same sequence. Using these shared sequences, we can align the reads, reassemble them and reconstitute much longer sequences. |
 |
| We can thus reconstitute the complete sequence of the fragment we started with. This assembly operation, which is carried out by computer programs, allows final deciphering of the sequence of molecules consisting of several million to several tens of millions of bases. For genomes like the human genome, it is necessary to proceed with a redundancy factor of 8 to 10 in order to reassemble large fragments. In other words, the sequencing of a large fragment of DNA requires cutting it into small segments (see molecular cloning) and then performing a sufficient number of reads which, if arranged end to end, would cover the DNA fragment to be sequenced about 10 times. Nevertheless, even at this level of redundancy, a few gaps remain. The number and size of the gaps increase as the redundancy of the reads decreases. |
 |
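| The assembly step can be sketched with a toy greedy algorithm: repeatedly merge the two reads that share the longest overlap. This is an illustration only, not the software used by the sequencing centers; it assumes error-free reads from a single strand and no repeated regions, and the function names are hypothetical. |

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlap remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps: the remaining pieces are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


original = "ATGGCGTACGTTAGC"  # toy molecule, invented for illustration
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]  # overlapping reads, any order
print(greedy_assemble(reads))                     # ['ATGGCGTACGTTAGC']
print(greedy_assemble(["AAAACCCC", "GGGGTTTT"]))  # two pieces: a gap remains
```

| When two reads share no sequence, the sketch leaves them as separate pieces, which corresponds to a gap in the assembled sequence. |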
| To determine the complete sequence of large molecules (the chromosomes) which contain the totality of the genes of a species (genome) such as humans, it is thus necessary to perform tens of millions of sequencing operations (reads). It is possible, however, to obtain a rough outline using a lower level of redundancy. In this case, the reassembled fragments are rather small. For example, with a redundancy level of 5x, reconstituted fragments of about 5000 bases are obtained for the human genome. A genome sequence obtained in this way would therefore consist of several hundred thousand fragments. |
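| The order of magnitude of "tens of millions" of reads follows from the figures quoted in this document (a genome of 3.5 billion nucleotides, reads of 500 to 1000 bases, a redundancy of about 10). The sketch below reproduces that arithmetic. The second function is an added assumption that does not come from this text: it uses a simple Poisson idealization, in which reads fall at random along the genome, to show why gaps become more frequent as redundancy decreases. |

```python
import math

GENOME_SIZE = 3_500_000_000  # nucleotides, figure quoted in this document


def reads_needed(genome_size, redundancy, read_length):
    """Number of reads which, placed end to end, cover the genome
    'redundancy' times."""
    return math.ceil(genome_size * redundancy / read_length)


def uncovered_fraction(redundancy):
    """Assumed Poisson model: chance that a given base is in no read."""
    return math.exp(-redundancy)


for read_length in (500, 1000):
    print(read_length, reads_needed(GENOME_SIZE, 10, read_length))
# 500 70000000
# 1000 35000000

for redundancy in (5, 10):
    print(redundancy, uncovered_fraction(redundancy))
```

| Under this idealized model the uncovered fraction is small at a redundancy of 10 and roughly 150 times larger at a redundancy of 5; real figures depend on how the fragments are actually produced. |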
| For economic reasons it is much more efficient to carry out these millions of large-scale operations in appropriate structures: sequencing centers in which the work is organized and partially automated in order to perform several thousand or tens of thousands of reads per day at a cost which is considerably lower than in traditional research laboratories. Centers of this type exist in the USA, Great Britain, Japan, Germany, China and France. |
|
- What is the public human genome sequencing project ?
|
| At the beginning of the 1980s, the international scientific community set itself the goal of the complete sequencing of the human genome (23 pairs of chromosomes, 3.5 billion nucleotides, which is equivalent to 2000 books of 500 pages each) for the beginning of the third millennium. Because of the large size of this genome, the big sequencing centers financed with public funds agreed to divide up the work, with each one taking specific chromosomal regions or specific chromosomes. Each center committed itself to placing its data in publicly accessible databases as soon as they were obtained. (see organization) |
|
- What is the working draft of the human genome sequence ?
|
| For practical reasons, it was decided to proceed in successive stages. The objectives of each stage were designed to fill specific needs. The most urgent need was to carry out an inventory of the genes in the human genome. Therefore, the objective of the first step of the program was to produce a rough working draft of the human genome which would permit the identification of a large majority of the genes. |
 |
| Nevertheless, this working draft consists of a large number of fragments (several hundred thousand) which are ordered in groups of 20 to 30. There is a total of about 20,000 groups of 20-30 fragments of 5000-6000 bases each. At the end of June 2000 the consortium of public sequencing centers had completed 90% of the working draft of the sequence of the human genome. |
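| As a quick consistency check, the figures in this paragraph can be multiplied out. The sketch below uses only the numbers quoted above; the variable names are hypothetical. |

```python
GROUPS = 20_000
FRAGMENTS_PER_GROUP = (20, 30)      # lower and upper figures quoted above
BASES_PER_FRAGMENT = (5_000, 6_000)

low_fragments = GROUPS * FRAGMENTS_PER_GROUP[0]
high_fragments = GROUPS * FRAGMENTS_PER_GROUP[1]
low_bases = low_fragments * BASES_PER_FRAGMENT[0]
high_bases = high_fragments * BASES_PER_FRAGMENT[1]

print(low_fragments, high_fragments)  # 400000 600000
print(low_bases, high_bases)          # 2000000000 3600000000
```

| The result, 400,000 to 600,000 fragments totalling roughly 2 to 3.6 billion bases, agrees with "several hundred thousand" fragments and with the size of the genome given elsewhere in this document. |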
 |
| The objectives of the second stage are (1) to order and orient all the fragments of each of the 20,000 groups, (2) to fill the gaps, which are often small but numerous (several hundred thousand), and (3) to attain a level of quality which will avert the necessity of continuous and costly verifications. This second step should be finished in 2003, although the complete sequence of many chromosomes will be available earlier. |
 |
| A private company, Celera Genomics, also has the objective of sequencing the human genome. |
|
- Is the public project competing with Celera Genomics ?
|
| The short-term objectives (for the year 2000) are similar for the two projects: the achievement of an incomplete but usable working draft of the human genome sequence. This first version needs to be improved using supplementary data in order to obtain good-quality coverage of the whole genome that is as complete as possible. This is an objective that the public project is trying to attain by 2003; Celera Genomics has not specified its intentions for this objective. |
| The strategies followed by the two projects are quite different but complementary. The public project proceeds by sequencing large fragments which have been previously ordered on a map. Because its position on this map is known, the sequence of a fragment is usable as soon as it is determined. |
| On the other hand, Celera Genomics proposed to sequence the human genome using a strategy of global shotgun sequencing, which skips the step in which the fragments are first ordered on a map. This strategy requires the availability of a huge quantity of data before proceeding to a valid assembly of the whole. In the case of the human genome, this involves several tens of millions of fragments. Celera has tested its strategy with success on the genome of the vinegar fly (Drosophila), and has published an incomplete but good-quality sequence of this organism in collaboration with public laboratories. This achievement nevertheless required the production of sequences which, placed end to end, would cover the Drosophila genome 14 times. Celera does not plan to produce an equivalent quantity of human genome sequence. In the strategy used by Celera, the assembled fragments can also be connected to one another in the form of a scaffold; the company was able to reconstitute the genome of Drosophila in the form of 20 unlinked scaffolds. |
 |
| In order to perform the assembly of the human genome sequence, Celera has used the data from the public project, which were produced using a pre-established map of the genome and are therefore quite complementary to its own. This combination of two different sets of data has allowed Celera to construct scaffolds of the order of a million bases which cover the majority of the genome, although the exact fraction covered has not been revealed. |
|
- Why a company genome project ? What use is it ?
|
| Ever since we learned to read DNA sequences in the 1970s, humans have dreamed of knowing their own genome, even though it might not be possible to understand the meaning of the instructions contained in it. |
| A whole series of repercussions resulting from the interpretation and exploitation of this information is expected in the coming decades. The most important of these will be in the fields of science and medicine, and we should not forget that the scientific advances themselves will lead to a multitude of new applications. These applications will not come immediately and will require many years of research, but this research could never be undertaken at all without the genome sequence. |
| One of the first results of the genome sequence will be the ability to identify and produce a complete inventory of human genes. At the moment, about 10,000 to 12,000 genes are known. The most recent estimates performed at Genoscope suggest that the total number of human genes is between 30,000 and 35,000, which is much lower than previous estimates. |
| In the case of genomes of multicellular organisms, the identification of the genes does not immediately follow scrutiny of the genome sequence. This requires the use of computer programs, and at their present level of advancement, these analyses are imperfect and often require experimental validation. Despite these drawbacks, however, the identification of genes permits better orientation of research in both medical and fundamental domains. By orienting this work downstream, knowledge of the sequence leads to a considerable gain of time. |
| Very often, genetic studies permit determination of an interval of the genome (on a chromosome) in which a gene responsible for a genetic disease has been located. An inventory of the genes in the defined interval (made possible by the sequence analysis) makes it possible to choose the genes which are most likely (because of known or predicted properties of the products of the genes) to be implicated in the pathology, and thus to begin work on the best candidates. |
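| The selection of candidate genes in an interval can be pictured as a simple filter followed by a ranking. The sketch below is purely illustrative: the gene names, positions and predicted functions are invented, and the ranking by keyword is an assumed stand-in for the expert judgment described above. |

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int  # position on the chromosome, in bases
    end: int
    predicted_function: str


def genes_in_interval(genes, start, end):
    """Genes that overlap the interval [start, end]."""
    return [g for g in genes if g.start <= end and g.end >= start]


def rank_candidates(genes, keywords):
    """Put first the genes whose predicted function mentions a keyword."""
    def score(gene):
        return sum(k in gene.predicted_function.lower() for k in keywords)
    return sorted(genes, key=score, reverse=True)


# Hypothetical inventory of genes on one chromosome.
inventory = [
    Gene("GENE_A", 100_000, 150_000, "ion channel"),
    Gene("GENE_B", 400_000, 480_000, "unknown"),
    Gene("GENE_C", 520_000, 560_000, "mitochondrial iron metabolism"),
    Gene("GENE_D", 900_000, 950_000, "transcription factor"),
]

in_interval = genes_in_interval(inventory, 350_000, 600_000)
ranked = rank_candidates(in_interval, ["mitochondrial"])
print([g.name for g in ranked])  # ['GENE_C', 'GENE_B']
```

| Only the genes inside the interval are kept, and the one whose predicted properties match what is known of the disease is examined first. |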
| In this way, many thousands of genes responsible for genetic diseases can be found more rapidly now that we have the genome sequence. An understanding of these genes can lead to a diagnosis based on the DNA. For the most serious diseases, genetic diagnosis can be performed before birth in at-risk families. Identification of the responsible gene can also lead to an understanding of the physiological mechanism which causes the disease and, in some cases, to an exploration of new therapeutic possibilities. For example, a new treatment for Friedreich's ataxia based on knowledge of the gene and its function was recently developed by a French group at the Hôpital Necker. |
|
|
| A large number of human diseases are genetic, or partly genetic in origin. The influence of this genetic component is variable. |
| For many rare diseases (such as cystic fibrosis or Duchenne muscular dystrophy) an alteration (mutation) in a single gene generally leads to the appearance of a series of symptoms which are characteristic of the disorder, whereas in the majority of common diseases like diabetes, hypertension and neuro-psychiatric disorders, the effects of the genes are modified by the influence of the rest of the genome as well as environmental factors. For this reason, we make a distinction between the rare, purely genetic diseases which are still called Mendelian or monogenic disorders and which can be predicted as soon as the responsible gene is known (or simply localized), and the more common diseases which have a multifactorial origin, and for which the presence of a predisposing factor in an individual does not necessarily lead to the appearance of the disease. |
| Although there are many of them, purely genetic diseases are rare. The most frequent one is cystic fibrosis, which affects about one newborn in 2500. The gene for this serious disease, which greatly reduces the life expectancy of affected individuals, was isolated in 1989. About a thousand genes responsible for genetic diseases are known to date. On the other hand, only a few genes for predisposition to common diseases are known. |
| As we have seen, common diseases (diabetes, cardiovascular diseases, psychiatric disorders) also have a genetic component. Knowledge of the human genome sequence will lead to increased success in identifying factors which predispose individuals to these diseases. |
|
- When did the Human Genome Project start ?
|
| The Human Genome Project began at the beginning of the 1990s. It started with a mapping phase, to which French groups were major contributors. France's participation in the sequencing phase began in 1996; most of this phase has been carried out by American and British groups. |
|
|
| A public consortium of sequencing centers from 6 different countries coordinates the project. The public sequencing project sequences fragments which cover each of the 23 pairs of chromosomes. Each center announced the regions (chromosomes or parts of chromosomes) it intended to sequence. The international distribution of objectives is as follows: |
| United States | 55-60% |
| United Kingdom | 33% |
| Japan | 10% |
| France | 2.5% |
| Germany | 1.5% |
| China | 1% |
|
| The distribution of effort by the various sequencing centers is as follows : |
 |
| Two chromosomes, chromosomes 21 and 22, have already been completely sequenced. Because of the priority given to the working draft in 1999, the complete sequence of the other chromosomes will be achieved at a later date. |
|
- What is the French contribution to the international public project ?
|
| France's contribution mainly involves chromosome 14. The totality of this work has been performed at Genoscope. In contrast to the other centers, Genoscope produces a sequence in which the reassembled fragments are ordered and oriented; it is therefore already more complete than the working draft. |
|
- What is the cost of the project ?
|
| It is still difficult to estimate what the final cost of the genome project will be, mainly because of the cost of the finishing phase, 80% of which remains to be done. The cost of the first phase, up to the achievement of the working draft, is about 300 million dollars. The final cost will be about double this figure (600 million dollars). |
| The cost of the French contribution will be of the order of the budget for Genoscope (80 million francs from the Ministry of Research and Technology), i.e. of the order of 1.6% of the final cost, for 2.5% of the sequence. |
|
- Who are the members of the public consortium ?
|
| Abbreviation | Centre |
| AECOM | Albert Einstein College of Medicine |
| BCM | Baylor College of Medicine |
| Beijing | Human Genome Center, Institute of Genetics, Chinese Academy of Sciences |
| CGM | Center for Genetics in Medicine (Perkin Elmer/Washington Univ.) |
| GBF | Gesellschaft für Biotechnologische Forschung mbH |
| GS | Genoscope |
| GTC | GTC Sequencing Center |
| IMB | Institute for Molecular Biotechnology, Jena, Germany |
| LAHGC | Lita Annenberg Hazen Genome Center, Cold Spring Harbor |
| MPIMG | Max Planck Institute for Molecular Genetics |
| JGI | Joint Genome Institute, U.S. Department of Energy |
| JST | Japan Science and Technology Corporation |
| RIKEN | RIKEN Genome Sciences Center |
| SC | The Sanger Centre |
| SDSTC | Stanford DNA Sequencing and Technology Development Center |
| SHGC | Stanford Human Genome Center |
| TIGR | The Institute for Genomic Research |
| UUGC | University of Utah Genome Center |
| UOAGTC | University of Oklahoma, Advanced Genome Technology Center |
| UTSW | University of Texas, Southwestern Medical Center |
| UWGC | University of Washington Genome Center |
| UWMSC | University of Washington Multimegabase Sequencing Center |
| WIBR | Whitehead Institute for Biomedical Research/MIT |
| WUGSC | Washington University, Genome Sequencing Center |
| YMGC | The National Yang Ming University Genome Center |
|
|
- Where can I find more information ?
|
| The following is a list of English-language websites for information on the Human Genome Project and related topics. |
- Main US Human Genome Project site. Links organized by subject.
- Sanger Centre genome site (British).
- "Genes and disease"; Includes links to diseases of various systems (immune system, nervous system, etc.).
- Glossary of genetic terms.
- Many links useful to scientists and medical persons. Includes keyword search possibilities.
- Research techniques (PCR, chips, cDNA, etc.); illustrated.
- Recent advances and topics of current media interest.
- Genetics course for laypersons and "User's Guide" to genetics.
- Ethical, Legal and Social Issues site, with many links.
- Genetic disease information.
- "To Know Ourselves"; The basic genome project explanation for the general public, from the US Dept. of Energy.
- University of Kansas site, useful for teachers and students.
- Online Mendelian Inheritance in Man.
- Pharmaceutical Research and Manufacturer's Association site. Links to legal, informatics, therapeutic and environmental sites.
|