1 Department of Biostatistics, Boston University School of Public Health;
2 Department of Oral Medicine, Infection and Immunity, Harvard School of Dental Medicine;
3 Harvard Partners Center for Genetics and Genomics, Harvard Medical School, 77 Avenue Louis Pasteur, NRB 255C, Boston, MA 02115;
4 Informatics Program, Children's Hospital, Boston;
Correspondence: * corresponding author, marco_ramoni{at}harvard.edu
| Abstract |
|---|
KEY WORDS: functional genomics; bioinformatics; machine learning; Bayesian statistics; oral cancer; head and neck squamous cell carcinoma (HNSCC)
| Introduction |
|---|
The availability of a complete reference sequence of the human genome has facilitated the development of new technologies, such as DNA microarrays and large-scale genotyping, that provide us with multiple views of the genome and multiple access points to its nature. These access points may be instrumental in understanding the code of life, but, to deliver on this promise, they need to be integrated into a single, global picture, able to capture the interplay between structure and function of the genome.
The genome is not static, and very important functions are revealed only by its dynamic behavior. Even basic biological processes, such as the cell cycle, can be deconvoluted and understood only by observing the genome in action and by studying the behavior of genes over time. Today, microarray technology allows us to take snapshots of the expression of every gene at a particular instant, and we are given the opportunity and the challenge to combine these snapshots into "movies" that capture the global behavior of the genome. These global views reveal clusters of genes displaying similar behaviors and suggest new roles for some genes by associating their behavior with that of others. They also offer the even greater opportunity to go beyond similarities and dive into the control mechanisms of the genome underpinning the regulation processes and the functional interplay among genes.
This paper outlines how Bayesian methods can offer interesting and unique solutions to these critical problems involved in the decoding of the semantics of the genome: integration, functional analysis, and control identification. It also briefly reviews the current state of the art in the genomic analysis of oral cancer to identify where Bayesian methods can be most relevant.
| Bayesian Methods |
|---|
Suppose, for example, that we need to guess the outcome of an experiment that consists of tossing a coin. How many biased coins have we ever seen? Probably not many, and hence we are ready to believe that the coin is fair and that the outcome of the experiment can be either heads or tails with the same probability. On the other hand, imagine that someone tells us that the coin was forged so that it is more likely to land heads. How can we take this information into account in the analysis of our data? This question becomes critical when we consider data in domains of application for which knowledge corpora have been developed. Scientific and medical data are both examples of this situation.
Bayesian methods provide a principled way to incorporate this external information into the data analysis process. For this to happen, however, Bayesian methods have to change entirely the vision of the data analysis process with respect to the classic approach. In a Bayesian approach, the data analysis process starts with a given probability distribution. Since this distribution is given before any data are considered, it is called the prior distribution. In our previous example, we would represent the fairness of the coin as a uniform prior probability distribution, assigning probability 0.5 to each of the two sides of the coin. On the other hand, if we learn, from some external source of information, that the coin is biased, then we can model a prior probability distribution that assigns a higher probability to the event that the coin lands heads.
The Bayesian data analysis process consists of using the sample data to update this prior distribution into a posterior distribution. (See Ramoni and Sebastiani, 1999, for an introduction to Bayesian data analysis.) The basic tool for this updating is a theorem proved by Thomas Bayes, an 18th-century clergyman. The role of Bayes' theorem in this approach is so critical that the whole approach is named after it.
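The coin example can be made concrete with a small sketch. It assumes, purely for illustration, that belief about the probability of heads is encoded as a Beta distribution, a standard conjugate choice that the text above does not itself prescribe; the prior parameters and toss counts are hypothetical.

```python
def update_beta(prior_heads, prior_tails, heads, tails):
    """Update a Beta(prior_heads, prior_tails) prior with observed tosses.

    Assumption: the Beta prior is an illustrative conjugate choice.
    With a binomial likelihood, the posterior is again a Beta
    distribution whose parameters are the prior pseudo-counts plus
    the observed counts.
    """
    return prior_heads + heads, prior_tails + tails


def beta_mean(a, b):
    """Expected probability of heads under a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 7 heads and 3 tails.
heads, tails = 7, 3

# Prior belief that the coin is fair (symmetric pseudo-counts).
fair_posterior = update_beta(5, 5, heads, tails)

# Prior belief, from an external source, that the coin favors heads.
biased_posterior = update_beta(8, 2, heads, tails)

print(beta_mean(*fair_posterior))    # posterior mean under the fair prior
print(beta_mean(*biased_posterior))  # posterior mean under the biased prior
```

The same data lead to different posterior beliefs under the two priors, which is exactly how external information enters the analysis.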
The ability to integrate data with external information, a trademark of Bayesian analysis, provides a natural framework for integrating various forms of information available about the genome, and it has already been exploited to develop linkage models (Wang et al., 2000) of complex diseases, such as autism (Wang et al., 2000; Vieland et al., 2001). The intuition behind this approach is that the conclusions of a study (expressed as posterior probabilities) can be used as prior probabilities for another study, with a principled and seamless integration of the flow of information. Functional differences and similarities between genes can be integrated with information about their structural properties, as long as the conclusions of each component are expressed in terms of posterior probability. When searching for genes associated with a particular disease, for instance, one can update the posterior probability of linkage (PPL) using various phenotypic variables. These phenotypes can be clinical manifestations of a disease or complex patterns of gene expression, inferred through microarray analysis, that are common to subgroups of patients. In this way, each common pattern of gene expression can be used as a phenotype in the genetic studies. The power of the Bayesian approach is that an otherwise undetectable linkage may be established with the use of complex phenotypic information (Vieland et al., 2003).
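The idea that one study's posterior becomes the next study's prior can be sketched in a few lines. This is a generic illustration of sequential Bayesian updating in odds form, not the PPL computation used in the cited linkage work; the prior probability and the Bayes factors are hypothetical numbers.

```python
def update_probability(prior, bayes_factor):
    """Return the posterior probability of a hypothesis.

    Uses the odds form of Bayes' theorem:
    posterior odds = prior odds * Bayes factor.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical prior probability of the hypothesis before any study.
prior = 0.02

# Hypothetical Bayes factors from two independent sources of evidence,
# e.g. a clinical phenotype and an expression-derived phenotype.
after_first = update_probability(prior, 5.0)
after_second = update_probability(after_first, 4.0)

# Updating in two steps gives the same result as a single update
# with the product of the two Bayes factors.
combined = update_probability(prior, 5.0 * 4.0)
print(after_first, after_second, combined)
```

Neither source alone moves the probability very far, but chaining them accumulates the evidence, which mirrors the argument that complex phenotypic information can reveal a linkage that a single analysis would miss.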
| Functional Genomics |
|---|
Differential expression analysis
Current technology allows for the simultaneous expression analysis of thousands of genes with the use of devices known as microarrays, and they are today considered one of the most promising technologies to decode the functional mechanism of the genome (Lander, 1999). These massively parallel methods to study the expression of large numbers of genes are based on hybridization either to cDNA arrays (Schena et al., 1995) or to synthetic oligonucleotide arrays (Lipshutz et al., 1999). Notwithstanding some substantial technical differences, both approaches rely on high-resolution arrays measuring the expression level of each gene as a function of the RNA transcript abundance. This abundance is, in turn, measured by the emission intensity of the region where the gene transcript is located in the scanned image of the microarray, and the signal is filtered to remove noise generated by the microarray background and non-specific hybridization. (A general review of statistical methods in functional genomics can be found in Sebastiani et al., 2003.)
The simplest functional genomic study we can conduct is a comparative experiment aimed at identifying those genes that are differentially expressed between two biological conditions, such as normal vs. cancerous tissues. For instance, we can compare the gene expression levels of a cancer cell line against a healthy cell line and identify the genes differentially expressed in the two cell lines. Early analyses of these array data identified differentially expressed genes by taking the ratio of the intensities and choosing an arbitrary threshold value above (below) which the genes were taken to be differentially expressed (Schena et al., 1995). More sophisticated techniques take into account the noise in the gene expression data measured with microarrays by modeling the intensity values with probability distributions. In the first statistical analysis of these data, Chen et al. (1997) proposed a method to identify statistically significant changes between two conditions, under several distributional assumptions.
Bayesian approaches to this problem have been emerging over the past few years. Newton et al. (2001) proposed a Bayesian approach to differential analysis using a hierarchical model that helps to identify differentially expressed genes on the basis of the posterior odds of their average expression change. A similar approach has been proposed by Baldi and Long (2001) by modeling expression values as independent log-normal distributions, parameterized by means and variances with conjugate prior distributions. Microarray experiments are typically characterized by a small sample size, due to the high costs of the technology and the intrinsic paucity of some biological samples. The choice of appropriate distributional assumptions may be critical if reproducible results are to be achieved at low sample size. Bayesian Analysis of Differential Gene Expression (BADGE) (available from http://genomethods.org/badge) is a Bayesian method for the analysis of microarray data designed to yield high reproducibility at low sample size. BADGE models gene expression measurements by log-normal and gamma distributions and uses model averaging to compute the posterior probability of differential expression and to build molecular classification models. BADGE accounts for gene expression variability without arbitrary normalization, and provides a common framework for both detection of genes with different expression and molecular classification.
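To give a flavor of what a posterior probability of differential expression looks like, the sketch below uses a deliberately simplified model. It is not the hierarchical model of Newton et al., the framework of Baldi and Long, or BADGE. It assumes that replicate log-ratios for one gene are normally distributed with a known noise standard deviation, and it compares "no change" (mean zero) against "some change" (mean drawn from a zero-centered normal prior). All parameter values are hypothetical.

```python
from math import exp, pi, sqrt


def normal_density(x, variance):
    """Density of a zero-mean normal distribution at x."""
    return exp(-x * x / (2.0 * variance)) / sqrt(2.0 * pi * variance)


def prob_differential(log_ratios, noise_sd=0.5, effect_sd=1.0, prior=0.5):
    """Posterior probability that a gene is differentially expressed.

    Simplifying assumptions (illustrative only):
      - replicate log-ratios are normal with known sd `noise_sd`;
      - under "no change" the true mean log-ratio is 0;
      - under "change" the true mean is drawn from N(0, effect_sd^2).
    The sample mean is then N(0, noise_sd^2 / n) under "no change" and
    N(0, effect_sd^2 + noise_sd^2 / n) under "change".
    """
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    var_null = noise_sd ** 2 / n
    var_alt = effect_sd ** 2 + var_null
    bayes_factor = normal_density(mean, var_alt) / normal_density(mean, var_null)
    posterior_odds = bayes_factor * prior / (1.0 - prior)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical replicate log-ratios for two genes.
changed = prob_differential([2.0, 2.1, 1.9])
unchanged = prob_differential([0.05, -0.05, 0.0])
print(changed, unchanged)
```

Unlike a fixed fold-change cutoff, the result is a probability that reflects both the size of the change and the number and noise of the replicates.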
Functional clustering
A more ambitious approach to functional genomics tackles the problem of portraying a functional picture of the genome of an entire organism. The chief tool of this quest has been correlation-based hierarchical clustering (Eisen et al., 1998). Given a set of expression values measured for a set of genes under different experimental conditions, this approach recursively clusters pairs of genes according to the correlation of their measurements under the same experimental conditions. The intuition behind this approach is that correlated genes are acting together, since they belong to the same functional categories. The critical point of this approach is that it always ends up generating a single tree and leaves to the investigator the burden of identifying the actual functional clusters by relying on the available domain knowledge.
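A minimal sketch of correlation-based agglomerative clustering follows. It assumes average linkage over Pearson correlations, which is one common variant and not necessarily the exact procedure of Eisen et al.; the gene names and expression profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_similarity(members_a, members_b, profiles):
    """Mean pairwise correlation between the genes of two clusters."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def hierarchical_clustering(profiles):
    """Repeatedly merge the two most correlated clusters into one tree.

    Each cluster is stored as (tree, member gene names); the result is
    a nested tuple describing the single tree that is always produced.
    """
    clusters = [(name, [name]) for name in profiles]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(
                clusters[ij[0]][1], clusters[ij[1]][1], profiles
            ),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical expression profiles over four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
tree = hierarchical_clustering(profiles)
print(tree)
```

Note that the procedure stops only when everything has been merged into one tree; it gives no indication of where to cut the tree to obtain the actual clusters.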
A specific property of Bayesian statistics can provide a principled solution to this problem. Contrary to its classic counterpart, Bayesian hypothesis-testing computes directly the probability of a hypothesis rather than the probability of committing an error in assuming it. Within this framework, we have developed a Bayesian clustering method able to identify the set of most probable processes responsible for sequences of observations (Ramoni et al., 2002a,b). The idea underpinning this Bayesian approach is that the observed data are generated by processes. The aim of the algorithm is to find the set of processes most likely, a posteriori, to have generated the sequences of observations in the database.
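The contrast with classic testing can be illustrated with a toy decision: should two genes be placed in the same cluster? The sketch below is a strong simplification of the cited method, which models temporal dynamics. Here each gene is reduced to hypothetical counts of "up" and "down" observations, each process is a single probability of "up" with a uniform Beta(1, 1) prior, and the two models are given equal prior probability. All of these are assumptions made for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(ups, downs, a=1.0, b=1.0):
    """Log marginal likelihood of a binary sequence under a Beta(a, b) prior."""
    return log_beta(a + ups, b + downs) - log_beta(a, b)


def prob_same_process(counts_1, counts_2):
    """Posterior probability that two genes share one generating process.

    Compares "one shared process" against "two separate processes",
    assuming equal prior probability for the two models.
    """
    (u1, d1), (u2, d2) = counts_1, counts_2
    log_same = log_marginal(u1 + u2, d1 + d2)
    log_diff = log_marginal(u1, d1) + log_marginal(u2, d2)
    return 1.0 / (1.0 + exp(log_diff - log_same))


# Hypothetical (up, down) counts for pairs of genes.
similar = prob_same_process((8, 2), (7, 3))
dissimilar = prob_same_process((9, 1), (1, 9))
print(similar, dissimilar)
```

The output is directly the probability of the hypothesis "same process", so merging can stop as soon as no merge makes the model more probable, rather than always running to a single tree.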
We have applied this method to cluster observations on 517 genes in a study of the responses of human fibroblasts to serum. The data were collected with the use of competitive cDNA microarrays. These microarrays measure the expression level of a gene simultaneously in a basal or control condition and in an experimental condition. The overall expression induced by the experimental condition is measured as the ratio of the two intensity levels, and these are the data used as input by clustering algorithms. Fig. 1 shows the clustering model obtained on these data by our Bayesian clustering method. The tree on the left represents the steps of the clustering algorithm and reports the four clusters identified by the algorithm. The squares in the center of the picture represent the gene expression measurements. Each row displays the expression levels of a gene in each time point condition, represented by the columns. Green cells and red cells represent expression levels higher than 1 and lower than 1, respectively. Color intensity represents the distance between the measurement and 1.
Bayesian networks are not new to genetic research. As a matter of fact, networks based on directed acyclic graphs originated from the genetics studies by Sewall Wright (1921), who developed a method called Path Analysis, a recognized ancestor of Bayesian networks. The application of Bayesian networks to functional genomics, on the other hand, is very recent. Bayesian networks hold the promise of answering very interesting questions in functional genomics, and, in principle, they seem to be the right technology to take advantage of the massively parallel analysis of whole-genome data to discover how genes interact, control each other, and align themselves in pathways of activation. While clustering algorithms attempt to locate groups of genes that have similar expression patterns over a set of experiments to discover genes that are co-regulated, Bayesian networks dive into the regulatory circuitry of genetic expression to discover the web of dependencies among genes.
A Bayesian network has two components: a directed acyclic graph, in which nodes represent stochastic variables and arrows represent dependencies among the variables; and a joint probability distribution for the network variables (Pearl, 1988). The graph encodes conditional independencies among the variables that are used to factorize the joint probability distribution into modules. Each node in the network is associated with a conditional probability distribution that shapes the association between the node and all other nodes with arrows pointing to it. The advantage of a Bayesian network is to break down an otherwise unmanageable joint probability distribution over the domain variables into a set of smaller components, easier to define and cheaper to use. However, the fact that the network breaks down the overall association into modules does not lead to information disintegration. Being parts of the same network, the components of the Bayesian network can be interrelated according to well-established algorithms for probabilistic inference.
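The factorization can be shown on a three-node example. The network below is hypothetical: a regulator gene `R` with arrows to two target genes `T1` and `T2`, each variable being either over-expressed (`True`) or not (`False`), with made-up probabilities.

```python
from itertools import product

# Hypothetical network: R -> T1 and R -> T2.
# parents[node] lists the nodes with arrows pointing to that node.
parents = {"R": (), "T1": ("R",), "T2": ("R",)}

# Conditional probability that each node is True, indexed by the
# values of its parents (made-up numbers for illustration).
cpt = {
    "R": {(): 0.3},
    "T1": {(True,): 0.9, (False,): 0.2},
    "T2": {(True,): 0.8, (False,): 0.1},
}


def joint_probability(assignment):
    """Joint probability as a product of per-node conditional probabilities."""
    p = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[q] for q in node_parents)
        p_true = cpt[node][parent_values]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


# P(R=True, T1=True, T2=False) = P(R) * P(T1 | R) * P(not T2 | R)
example = joint_probability({"R": True, "T1": True, "T2": False})
print(example)

# The factorized terms still define a proper joint distribution.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(total)
```

The full joint distribution over three binary variables would need seven free parameters; the factorized form needs only five, and the saving grows rapidly with the number of genes.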
The promise of Bayesian networks in functional genomics goes even further, since intensive research efforts have been addressed, during the past decade, to define conditions under which Bayesian networks actually uncover the causal model underlying the data (Pearl, 1995). The most ambitious question, therefore, is: Given a set of microarray data, can we discover a causal model of interaction among different genes? The challenge is the common problem of sound statistical methods when faced with microarray data: a large number of variables with a small number of measurements. In the context of Bayesian networks, this situation results in the inability to discriminate among the sets of possible models, since the small amount of data is insufficient to identify a single most probable model.
Friedman et al. (2000) address these problems using partial models of Bayesian networks and a measure of confidence in a learned model. The strategy they follow is to search a space of under-specified models, each comprising a set of Bayesian networks, and to select a class of models rather than a single one. They also adopt a measure of confidence based on bootstrapping to evaluate the reliability of each discovered dependency in the database, to avoid the risk of ascribing a causal role to a gene when not enough information is actually available to support the claim.
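The bootstrap idea can be sketched independently of any particular network-learning algorithm. In the sketch below, learning a network on each resample is replaced by a much simpler stand-in: a dependency between two genes is declared present when the absolute Pearson correlation exceeds a threshold. This stand-in, the threshold, and the data are all assumptions made for illustration and are not the procedure of Friedman et al.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bootstrap_confidence(x, y, threshold=0.8, n_resamples=200, seed=0):
    """Fraction of bootstrap resamples in which the dependency is detected.

    Experiments (positions in x and y) are resampled with replacement;
    the stand-in detector |correlation| >= threshold is applied to each
    resample. Resamples with no variation count as "not detected".
    """
    rng = random.Random(seed)
    n = len(x)
    detected = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        if abs(pearson(xs, ys)) >= threshold:
            detected += 1
    return detected / n_resamples


# Hypothetical expression values over eight experiments.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene_b = [2.0 * v + 1.0 for v in gene_a]           # tightly coupled to gene_a
gene_c = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]  # loosely related at best

strong = bootstrap_confidence(gene_a, gene_b)
weak = bootstrap_confidence(gene_a, gene_c)
print(strong, weak)
```

A dependency that appears in nearly every resample is well supported by the data; one that appears only occasionally should not be given a causal interpretation.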
Hartemink et al. (2002) tackled the under-determination problem by turning the unsupervised search for the most probable network structure into a supervised one. They drew on established biological knowledge to select a small number of networks and then limited their comparisons to these networks only.
We have taken a slightly different approach, adopting the strategy used in differential gene expression analysis and converting the ratio measures generated by cDNA microarrays into discrete variables by thresholding the measures at two-fold up and two-fold down, the same thresholds used by the authors of the original paper. Fig. 2 shows a Bayesian network learned from the fibroblast response-to-serum dataset used for the functional clustering displayed in Fig. 1.
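The discretization step can be sketched as follows. The state labels (-1, 0, 1) and the choice to treat a ratio exactly at the threshold as changed are assumptions made for this illustration; the ratios are hypothetical.

```python
def discretize(ratio, fold=2.0):
    """Map an expression ratio to one of three discrete states.

    Returns 1 if the ratio is at least `fold` (up-regulated),
    -1 if it is at most 1 / fold (down-regulated), and 0 otherwise.
    Assumption: boundary values are counted as changed.
    """
    if ratio >= fold:
        return 1
    if ratio <= 1.0 / fold:
        return -1
    return 0


# Hypothetical ratios (experimental / control) for one gene over time.
ratios = [1.1, 2.5, 3.0, 0.9, 0.4]
states = [discretize(r) for r in ratios]
print(states)
```

Working with three states per gene keeps the conditional probability tables of the network small, which matters when only a handful of measurements per gene is available.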
| Genomic Analysis of Head and Neck Cancer |
|---|
Following a supervised but multi-class design, Ha and colleagues (Ha et al., 2003) compared malignant lesions, pre-malignant lesions, distant, histopathologically normal mucosa from patients with pre-malignant or malignant lesions, and normal mucosa from the upper aerodigestive tract of patients with non-cancer diagnoses. Using a slightly different experimental design, Vigneswaran and colleagues (Vigneswaran et al., 2003) validated their previously used cDNA microarray by characterizing the genomic profile of metastatic lesions in oral squamous cell carcinoma.
Belbin and colleagues (Belbin et al., 2002) applied an unsupervised design to cDNA microarrays containing 9216 clones to analyze gene expression profiles of tumor blocks obtained from 17 patients. Using unsupervised clustering, they were able to identify two distinct subgroups of patients by means of 375 genes differentially expressed between the two groups. Mendez and colleagues (Mendez et al., 2002) used Affymetrix GeneChip® microarrays to examine the expression profiles of 26 invasive squamous cell carcinomas of the oral cavity and oropharynx, 2 pre-malignant lesions, and 18 normal oral tissue samples. They were able to confirm that oral carcinomas are distinguishable from normal oral tissue based on genome-wide transcriptional expression patterns, but they were unable to account for other differences among the tumor tissues.
| Discussion |
|---|
Bayesian clustering methods, in contrast, can introduce a new dimension to this integration process. They can leverage the combination of genomic and phenotypic information to understand the interplay between clinical outcomes and identify new classifications able to disambiguate the well-known heterogeneity of HNSCC.
Bayesian networks can take this process of integration even further, seamlessly combining structural, functional, and phenotypic information into a single, coherent molecular landscape. Their ability to discover control mechanisms can be used to identify downstream functional regulations induced by structural abnormalities and explain gene expression changes in genes with no structural changes and their long-range effects on clinical phenotypes.
| References |
|---|
Alevizos I, Mahadevappa M, Zhang X, Ohyama H, Kohno Y, Posner M, et al. (2001). Oral cancer in vivo gene expression profiling assisted by laser capture microdissection and microarray analysis. Oncogene 20:6196-6204.
Baldi P, Long AD (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509-519.
Belbin TJ, Singh B, Barber I, Socci N, Wenig B, Smith R, et al. (2002). Molecular classification of head and neck squamous cell carcinoma using cDNA microarrays. Cancer Res 62:1184-1190.
Chen Y, Dougherty ER, Bittner ML (1997). Ratio-based decisions and the quantitative analysis of cDNA microarray images. Biomed Optics 2:364-374.
Eisen MB, Spellman PT, Brown PO, Botstein D (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:14863-14868.
Friedman N, Linial M, Nachman I, Peer D (2000). Using Bayesian networks to analyze expression data. J Comput Biol 7:601-620.
Gonzales-Moles MA, Rodriguez-Archilla A, Ruiz-Avila I, Martinez AB, Morales-Garcia P, Gonzalez-Moles S (2002). p16 expression in squamous carcinomas of the tongue. Onkologie 25:433-436.
Ha PK, Benoit NE, Yochem R, Sciubba J, Zahurak M, Sidransky D, et al. (2003). A transcriptional progression model for head and neck cancer. Clin Cancer Res 9:3058-3064.
Hartemink AJ, Gifford DK, Jaakkola TS, Young RA (2002). Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput 437-439.
Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, et al. (1999). The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.
Kuo WP, Whipple ME, Jenssen TK, Todd R, Epstein JB, Ohno-Machado L, et al. (2003). Microarrays and clinical dentistry. J Am Dent Assoc 134:456-462.
Lander ES (1999). Array of hope. Nat Genet 21(Suppl 1):3-4.
Leethanakul C, Patel V, Gillespie J, Pallente M, Ensley JF, Koontongkaew S, et al. (2000a). Distinct pattern of expression of differentiation and growth-related genes in squamous cell carcinomas of the head and neck revealed by the use of laser capture microdissection and cDNA arrays. Oncogene 19:3220-3224.
Leethanakul C, Patel V, Gillespie J, Shillitoe E, Kellman RM, Ensley JF, et al. (2000b). Gene expression profiles in squamous cell carcinomas of the oral cavity: use of laser capture microdissection for the construction and analysis of stage-specific cDNA libraries. Oral Oncol 36:474-483.
Leethanakul C, Knezevic V, Patel V, Amornphimoltham P, Gillespie J, Shillitoe EJ, et al. (2003). Gene discovery in oral squamous cell carcinoma through the Head and Neck Cancer Genome Anatomy Project: confirmation by microarray analysis. Oral Oncol 39:248-258.
Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ (1999). High density synthetic oligonucleotide arrays. Nat Genet 21(Suppl 1):20-24.
Mendez E, Cheng C, Farwell DG, Ricks S, Agoff SN, Futran ND, et al. (2002). Transcriptional expression profiles of oral squamous cell carcinomas. Cancer 95:1482-1494.
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001). On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 8:37-52.
Ohyama H, Zhang X, Kohno Y, Alevizos I, Posner M, Wong DT, et al. (2000). Laser capture microdissection-generated target sample for high-density oligonucleotide array hybridization. Biotechniques 29:530-536.
Pearl J (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference. San Francisco, CA: Morgan Kaufmann.
Pearl J (1995). Causal diagrams for empirical research. Biometrika 82:669-710.
Ramoni M, Sebastiani P (1999). Bayesian methods for intelligent data analysis. In: Intelligent data analysis, an introduction. Berthold M, Hand D, editors. New York, NY: Springer, pp. 128-166.
Ramoni M, Sebastiani P, Cohen P (2002a). Bayesian clustering by dynamics. Mach Learn 47:91-121.
Ramoni MF, Sebastiani P, Kohane IS (2002b). Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 99:9121-9126.
Schena M, Shalon D, Davis RW, Brown PO (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.
Sebastiani P, Gussoni E, Kohane I, Ramoni M (2003). Statistical challenges in functional genomics (with discussion). Stat Sci 18:33-70.
Vieland VJ, Wang K, Huang J (2001). Power to detect linkage based on multiple sets of data in the presence of locus heterogeneity: comparative evaluation of model-based linkage methods for affected sib pair data. Hum Hered 51:199-208.
Vieland VJ, Sheffield V, Wassink T, Beck J, Goedken R, Childress D, et al. (2003). A new genome screen for autism based on the posterior probability of linkage (PPL) and incorporating language-based phenotypes finds evidence of linkage to several genomic locations, each supported by independent sources of information (abstract). Am J Hum Genet 73(Suppl 5):196.
Vigneswaran N, Wu J, Zacharias W (2003). Upregulation of cystatin M during the progression of oropharyngeal squamous cell carcinoma from primary tumor to metastasis. Oral Oncol 39:559-568.
Villaret DB, Wang T, Dillon D, Xu J, Sivam D, Cheever MA, et al. (2000). Identification of genes overexpressed in head and neck squamous cell carcinoma using a combination of complementary DNA subtraction and microarray analysis. Laryngoscope 110(3 Pt 1):374-381.
Wang K, Huang J, Vieland VJ (2000). The consistency of the posterior probability of linkage. Ann Hum Genet 64(Pt 6):533-553.
Wright S (1921). Correlation and causation. J Agr Res 20:557-585.