Center for Health and Community, Center to Address Disparities in Children's Oral Health, Department of Preventive and Restorative Dental Sciences, Division of Oral Epidemiology and Dental Public Health, University of California, San Francisco, CA 94143-1361, USA; sgansky{at}itsa.ucsf.edu
| Abstract |
|---|
KEY WORDS: models, statistical; decision support techniques; neural networks (computer); dental caries; oral health
| Introduction |
|---|
In current business applications, KDD touches lives daily when customers swipe supermarket savings cards, sending their buying habits to data warehouses. This gave retailers the (apocryphal?) data mining discovery that diapers and beer share men's late-night supermarket baskets. In the future, similar encounters in clinicians' offices might collect health information in data warehouses (subject to patient confidentiality protections), which can be mined to identify at-risk patients and better treatment modalities. Such possibilities are gradually becoming reality (e.g., Page et al., 2002). Some potential oral health applications for KDD include: large surveys (e.g., NHANES), longitudinal cohort studies (e.g., Veterans Administration Longitudinal Study on Aging), disease registries (e.g., National Cancer Institute's Surveillance, Epidemiology and End Results [SEER] program; birth defects registry; craniofacial treatment outcomes registry), health services research (e.g., claims data, fraud detection), provider and workforce databases, digital diagnostics (e.g., radiology, microbiology), and molecular biology (e.g., polymerase chain reactions, microarrays).
| KDD Methods |
|---|
Logistic regression models linear relationships between predictors (inputs) and a binary response (output) (e.g., Harrell, 2001). The binary logit model can be written as:
$$\operatorname{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \sum_{p=1}^{P} \beta_p x_{pi} + e_i$$
where $\pi_i$ is the probability of the i-th person having the response ($y_i$), the $\beta$s are the corresponding parameters for the P predictor variables ($x_{pi}$), and $e_i$ is the error for the i-th person. For example, if log10 MS and fluoride levels relate linearly to the probability of developing caries, this model would fit well. Logistic regression coefficients ($\beta$s) are easy to interpret (as natural logarithms of odds ratios), a very desirable property. If the actual likelihood surface is not a hyperplane, logistic regression will not fit well, since it misses bumps or non-linearities.
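The odds-ratio reading of a logit coefficient can be checked numerically. The sketch below uses made-up coefficients (they are assumptions for illustration, not estimates from the Rochester data) and shows that raising one input by one unit multiplies the odds by the exponential of its coefficient.

```python
import math

def logit_probability(intercept, coefs, x):
    """Predicted probability from a binary logit model (no error term)."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only (not fitted to study data).
intercept = -6.0
coefs = [0.8, -1.2]  # assumed order: log10 MS, fluoride

p_low = logit_probability(intercept, coefs, [5.0, 1.0])
p_high = logit_probability(intercept, coefs, [6.0, 1.0])  # log10 MS one unit higher

# The ratio of the two odds equals exp(coefficient of the changed input).
odds_ratio = odds(p_high) / odds(p_low)
print(odds_ratio, math.exp(coefs[0]))
```

The two printed values agree up to floating-point error, which is the sense in which each coefficient is the natural logarithm of an odds ratio.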
CART models (e.g., Stewart and Stamm, 1991; Hastie et al., 2001) adapt well to fit interactions, since they group individuals with similar probabilities of caries (to produce terminal nodes with the highest purity or homogeneity of outcome classes). Unlike logit models, CART models are robust to outliers and do not require specific data transformations or hierarchical interaction specification. CART models are step-function-type likelihood approximations (analogous to Riemann sums approximating integrals); these models are highly interpretable for easy clinician use.
ANNs, extremely flexible weighted combinations of non-linear functions, use a hidden layer with hidden units/nodes/neurons and activation functions to link inputs to the hidden layer and from the hidden layer to outputs. A feed-forward or multilayer perceptron ANN is:
$$g_0^{-1}\left(E[y_i]\right) = w_0 + \sum_{r=1}^{R} w_r H_r$$
where
$$H_r = \tanh\left(w_{0r} + \sum_{p=1}^{P} w_{pr} x_{pi}\right)$$
with $r = 1, 2, \ldots, R$ indexing the neurons, $H_r$ denoting the r-th neuron, $w_{pr}$ denoting the coefficients of the p-th input $x_{pi}$ for the r-th neuron, $w_r$ denoting the coefficient of the r-th neuron in the output, and $g_0^{-1}$ denoting the inverse activation function (in this case, $\tanh^{-1}$). In ANN terminology (Schwarzer et al., 2000), a P-R-S model has P inputs (predictors), 1 hidden layer with R neurons, and S outputs (outcomes). Neurons are a function of weighted sums of inputs plus a constant ("bias"), $w_0$. Similarly, outputs are a function of weighted sums of neurons plus bias; logistic and hyperbolic tangents are common activation functions. (Logistic regression is a P-0-1 feed-forward ANN with logistic activation function.) Weight decay, a model complexity penalty term for maximization, can be added to examine potential overfitting. In a simulation study of a 1-15-1 ANN, weight decays of 0, 0.002, and 0.005 were used with 0.005 stabilizing prediction (Schwarzer et al., 2000). Varying random seeds, R, and weight decays stabilizes global optimization (Ripley, 1996). ANNs are iteratively optimized with training data, and the final model is fitted to validation data so that future performance can be assessed. Training estimates weights, but they have no clear interpretation; thus, ANNs have very poor interpretability. Since ANNs with large R fit any arbitrary surface, ANNs should not be overfitted to the training data. Common mistakes with ANNs are: too many parameters for the sample size, not using validation, not using a model complexity penalty, incorrect misclassification estimation, implausible probability functions, incorrectly described network complexity, inadequate flexible statistical competitors (e.g., CART), and insufficient comparisons with statistical competitors (e.g., receiver operating characteristic curves) (Schwarzer et al., 2000).
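To make the P-R-S notation concrete, here is a minimal sketch of a single-hidden-layer forward pass and of the weight count that drives the "too many parameters for the sample size" pitfall. The function names are hypothetical, and the hyperbolic tangent hidden units with a logistic output are assumptions chosen for illustration rather than a description of any particular software.

```python
import math

def count_weights(P, R, S):
    """Weights (including biases) in a P-R-S network with one hidden layer."""
    return (P + 1) * R + (R + 1) * S

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """One forward pass: tanh hidden neurons, logistic output (assumed)."""
    # Each neuron is an activation of a weighted sum of inputs plus a bias.
    hidden = [math.tanh(b + sum(w * v for w, v in zip(ws, x)))
              for ws, b in zip(hidden_w, hidden_b)]
    # The output is an activation of a weighted sum of neurons plus a bias.
    eta = out_b + sum(w * h for w, h in zip(out_w, hidden))
    return 1.0 / (1.0 + math.exp(-eta))

# A 5-3-1 network has (5 + 1) * 3 + (3 + 1) * 1 weights.
n_weights_531 = count_weights(5, 3, 1)
# A 1-15-1 network, as in the simulation study cited above.
n_weights_1_15_1 = count_weights(1, 15, 1)

# With all weights at zero the network is uninformative and predicts 0.5.
p_zero = forward([0.3, 1.2, -0.7, 0.0, 2.0],
                 hidden_w=[[0.0] * 5 for _ in range(3)],
                 hidden_b=[0.0] * 3,
                 out_w=[0.0] * 3,
                 out_b=0.0)
print(n_weights_531, n_weights_1_15_1, p_zero)
```

Counting weights this way makes it easy to compare network size with the available sample before any training is attempted.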
| KDD Process |
|---|
3000), cross-validation (if sample size < 3000), bootstrap (resampling with replacement), or jackknife (leave-one-out) methods. Finally, implementation could involve changes in the KDD process, new clinical interventions, or changes in health policy.
Goals of this paper are to demystify knowledge discovery and data mining (KDD) by explaining the process, to identify possible pitfalls and practical issues, and to compare the performance of KDD methods (logit, CART, and ANN) in analyzing Rochester caries study data.
| Materials and Methods |
|---|
KDD methods
Logistic regression, CART, and ANN caries prediction methods were compared. Logistic regression used stepwise selection, with alpha = 0.05 to enter and 0.20 to stay, and the Akaike Information Criterion to judge the need for additional predictors. CART used the Gini index-splitting criterion and the proportion correctly classified for pruning back the maximal tree. A 5-3-1 multilayer perceptron ANN model (22 degrees of freedom) with inverse hyperbolic tangent activation functions, Levenberg-Marquardt optimization, 5 preliminary runs, average error selection, and no weight decay function was used. Sensitivity analyses varied random seed (5 different values), number of hidden neurons (2, 3, 4), and weight decay parameter (0, 0.001, 0.005, 0.010, 0.250).
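The Gini index-splitting criterion can be illustrated with a small sketch that searches one continuous input for the cut-point giving the purest pair of child nodes. The data values below are invented for illustration, and the helper names are hypothetical; real CART software also handles multiple inputs, pruning, and missing data.

```python
def gini(labels):
    """Gini impurity of a node with binary (0/1) outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted Gini) for the best split 'value < threshold'."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit node
    distinct = sorted(set(values))
    # Candidate cut-points are midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Invented toy data: a bacterial count on the log10 scale and caries status.
toy_values = [5.1, 5.8, 6.2, 6.9, 7.2, 7.5]
toy_labels = [0, 0, 0, 0, 1, 1]
threshold, impurity = best_split(toy_values, toy_labels)
print(threshold, impurity)
```

A tree is grown by applying this search to every input at every node and keeping the split with the lowest weighted impurity.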
Training and validation were performed with a 70%/30% randomly split sample stratified on primary dentition caries. All methods used the same training data to develop the prediction models and the hold-out (not used to develop the models) validation data to score or validate the models. Additionally, five-fold cross-validation [CV(5)] was performed, randomly forming 5 groups leading to 5 analyses, each with 4/5 of the total data (i.e., each 5th was left out of one analysis). Results were then aggregated to calculate mean square error (MSE), also called the Brier score (B), between observed and expected output:
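$$B = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{\pi}_i\right)^2$$
where n is the sample size, $y_i$ is the observed output, and $\hat{\pi}_i$ is the expected (predicted) output for the i-th person.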
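The cross-validation and Brier score steps can be sketched as follows. This is a simplified illustration: the folds are formed by simple random assignment (not stratified), the "model" is a placeholder that predicts the training prevalence, and all function names and data are hypothetical.

```python
import random

def brier_score(observed, predicted):
    """Mean square error between observed 0/1 outcomes and predicted probabilities."""
    n = len(observed)
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_brier(xs, ys, fit, predict, k=5, seed=0):
    """Fit on k-1 folds, predict the held-out fold, and pool the predictions."""
    preds = [None] * len(ys)
    for held_out in k_fold_indices(len(ys), k, seed):
        held = set(held_out)
        train = [i for i in range(len(ys)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, xs[i])
    return brier_score(ys, preds)

# Placeholder model: always predicts the prevalence seen in the training folds.
def fit_prevalence(xs, ys):
    return sum(ys) / len(ys)

def predict_prevalence(model, x):
    return model

demo_b = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1])
toy_x = list(range(20))
toy_y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
cv_b = cross_validated_brier(toy_x, toy_y, fit_prevalence, predict_prevalence)
print(demo_b, cv_b)
```

Any fitting routine can be substituted for the placeholder, provided the same folds are used for every method being compared.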
Visualization
Area under the curve (AUC) from receiver operating characteristic (ROC) curves, which plot sensitivity vs. the false-positive fraction (1 - specificity), was also calculated; AUC is equivalent to concordance (c index). The (positive) likelihood ratio is sensitivity / (1 - specificity). ROC curves show the tradeoff between sensitivity and specificity, so that a cut-off balancing the two can be chosen.
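The equivalence between AUC and concordance suggests a direct way to compute it: compare every case with every non-case and count how often the case has the higher predicted probability. The sketch below does this on invented data and also computes the positive likelihood ratio at one cut-off; the function names are hypothetical.

```python
def auc_concordance(observed, predicted):
    """AUC as the c index: share of case/non-case pairs ranked correctly."""
    cases = [p for y, p in zip(observed, predicted) if y == 1]
    controls = [p for y, p in zip(observed, predicted) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(controls))

def sensitivity_specificity(observed, predicted, cutoff):
    """Classify as positive when the predicted probability is >= cutoff."""
    tp = sum(1 for y, p in zip(observed, predicted) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(observed, predicted) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(observed, predicted) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(observed, predicted) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def positive_likelihood_ratio(sensitivity, specificity):
    return sensitivity / (1.0 - specificity)

# Invented outcomes and predicted probabilities, for illustration only.
roc_y = [1, 1, 1, 0, 0, 0, 0, 0]
roc_p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

auc = auc_concordance(roc_y, roc_p)
sens, spec = sensitivity_specificity(roc_y, roc_p, cutoff=0.5)
lr_pos = positive_likelihood_ratio(sens, spec)
print(auc, sens, spec, lr_pos)
```

The pairwise loop is quadratic in sample size, which is acceptable for small studies; rank-based formulas give the same value more efficiently.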
Cumulative captured-response curves are similar to ROC curves, but graph sensitivity vs. the percent testing positive (identified as high-risk). Thus, sensitivity for KDD methods can be compared at a specific percent-positive cut-off, which may be useful when resources for those labeled high-risk are limited. A related graph is the lift chart, which displays the gain each KDD method has over baseline vs. the percent testing positive.
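One point on a cumulative captured-response curve, and the matching lift, can be computed as sketched below. Lift is taken here as captured response divided by the fraction testing positive, which is one common definition and an assumption about the chart described above; the data and function names are invented for illustration.

```python
def captured_response(observed, predicted, fraction_positive):
    """Sensitivity when the top fraction of predicted risks is labeled high-risk."""
    n = len(observed)
    k = max(1, round(fraction_positive * n))
    # Rank everyone from highest to lowest predicted probability.
    ranked = sorted(zip(predicted, observed), key=lambda t: t[0], reverse=True)
    captured = sum(y for _, y in ranked[:k])
    return captured / sum(observed)

def lift(observed, predicted, fraction_positive):
    """Gain over baseline: a random selection captures fraction_positive of cases."""
    return captured_response(observed, predicted, fraction_positive) / fraction_positive

# Invented outcomes and predicted probabilities, for illustration only.
lift_y = [1, 1, 1, 0, 0, 0, 0, 0]
lift_p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

capture_25 = captured_response(lift_y, lift_p, 0.25)
capture_50 = captured_response(lift_y, lift_p, 0.50)
lift_25 = lift(lift_y, lift_p, 0.25)
lift_50 = lift(lift_y, lift_p, 0.50)
print(capture_25, capture_50, lift_25, lift_50)
```

Evaluating these functions over a grid of fractions from 0 to 1 traces the full curves for each method.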
To visualize the input contribution, we divided ANN predicted probabilities into quintiles (fifths) and showed the distributions of the standardized predictors in each quintile via boxplots.
| Results |
|---|
The resultant training classification tree is presented in Fig. 2. In this example, the overall prevalence of caries in the primary dentition was 15% (root node). Each input variable was searched to partition the root node. All children with log10 MS less than 7.08 (~ 10 million CFU/mL) were in the left node, with the remainder in the right node. Node-specific prevalences were 15% and 91%, respectively. Circles identify nodes with prevalence less than or equal to the overall prevalence of 15%, while squares identify nodes with prevalence greater than the overall prevalence. Continuing, the left node was split into two nodes according to log10 LB. The node with log10 LB less than 3.05 was further split with log10 MS for identification of a group with very low prevalence. This illustrates tree models' recursive nature, since predictors can be re-used. Next, the node with log10 LB greater than or equal to 3.05 was split with fluoride. Finally, the node with log10 MS greater than or equal to 7.08 was split with fluoride. There were 6 terminal nodes (3 high prevalence and 3 low prevalence); 1 high-prevalence node had very high prevalence, while 2 low-prevalence nodes had very low prevalence.
| Discussion |
|---|
| Conclusions |
|---|
| Appendix: Glossary of Knowledge Discovery and Data Mining Methods and Related Terms (italicized words are cross-referenced) |
|---|
Bagging (bootstrap aggregation): ensemble tree model method to reduce misclassification error using bootstrap (with replacement) resampling
Boosting (e.g., adaptive resampling and combining [ARCing] or adaptive boosting [AdaBoost]): ensemble tree model method to reduce misclassification error using increased weights for misclassified observations to allow for better prediction in subsequent trees on those records
Bootstrap: drawing (resampling) a large number (e.g., 500 to 10,000) of new sets of data with the original sample size (n) from the original data with replacement and re-analyzing those bootstrap resamples to simulate variability and assess robustness
Classification and regression tree (CART) model: recursive partitioning method (re-assessing all inputs at each stage) to split the data into 2 groups at each stage based on inputs that minimize the output class misclassification error
Cross-validation or K-fold cross-validation [CV(K)]: randomly dividing the data into K mutually exclusive and exhaustive subsets (e.g., 5 or 10), re-analyzing each subset, and aggregating across the K subsets to estimate robustness
Ensemble tree model or committee of trees: classifier using majority vote (modal) class assignment or mean predicted probability from a group of tree models grown under different conditions to reduce classification error
Hierarchical clustering: groups records together based on closeness/similarity, starting with each record in its own cluster and ending with all records in one cluster (or vice versa) and allowing the reader to choose classification from those in the middle; formed in either a step-down (divisive) or step-up (agglomerative) direction
Input: predictor, explanatory, or independent variable
Jackknife: assessing analysis robustness by leaving out one observation (i.e., sample size is n - 1), analyzing the data again, repeating n times until each observation has been left out once, and then comparing with the original analysis with all the data; equivalent to n-fold cross-validation [CV(n)]
K-means clustering: iteratively determines K groups based on closeness/similarity to group center (mean) and minimal within-group variability
k-nearest neighbor (knn) clustering: iteratively identifies groups based on the k closest neighbors to each point, assigning modal or majority class among the k neighbors
Multivariate adaptive regression splines (MARS): iterative modeling method using combinations of linear basis functions of inputs (predictors) to fit non-linear relationships smoothly
Output: response, outcome, or dependent variable
Random Forests: ensemble tree method using randomly selected subsets of inputs while also providing interpretability through summary measures of input variable importance
Regression model (linear or logistic): classic statistical model to predict output value or probability (linearly or log-linearly) from inputs
Split sample: randomly grouping data into training and testing samples, stratifying on output, building prediction models with the training sample, and testing that resultant model with the holdout testing sample to provide an unbiased error estimate and assess robustness
Supervised learning: modeling in which the output class is known
Support vector machines (SVM): computationally intensive "black box" method to find the non-linear multidimensional boundary (hyperplane) transformed as a linear hyperplane that best splits classes
Unsupervised learning: modeling in which the output class is not known; data are clustered according to similar input variables
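As a small illustration of the bootstrap and jackknife entries in this glossary, the sketch below resamples a toy data set and recomputes a simple statistic (the mean) each time. The data, the choice of statistic, and the function names are assumptions for illustration; in practice the whole model-fitting step would be repeated on each resample.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of resamples of the original size drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def jackknife_means(data):
    """Means of the n leave-one-out samples, each of size n - 1."""
    n = len(data)
    total = sum(data)
    return [(total - x) / (n - 1) for x in data]

# Invented data, for illustration only.
toy = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0]
boot = bootstrap_means(toy, n_resamples=200)
jack = jackknife_means([1.0, 2.0, 3.0])

# The spread of the bootstrap means approximates the variability of the mean.
boot_se = statistics.stdev(boot)
print(len(boot), boot_se, jack)
```

The spread of the resampled statistics indicates how much an analysis would vary across comparable samples.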
| Acknowledgments |
|---|
| Footnotes |
|---|
| References |
|---|
Amariti ML, Restori M, De Ferrari F, Paganelli C, Faglia R, Legnani G (2000). A histological procedure to determine dental age. J Forensic Odontostomatol 18:1–5.
Billings RJ, Gansky SA, Mundorff-Shrestha SA, Leverett DH, Featherstone JDB (2003). Pathological and protective caries risk factors in a children's longitudinal study (abstract). Caries Res 37:277–278.
Brickley MR, Shepherd JP (1996). Performance of a neural network trained to make third-molar treatment-planning decisions. Med Decis Making 16:153–60.
Brickley MR, Shepherd JP (1997). Comparisons of the abilities of a neural network and three consultant oral surgeons to make decisions about third molar removal. Br Dent J 182(2):59–63.
Brickley MR, Shepherd JP, Armstrong RA (1998). Neural networks: a new technique for development of decision support systems in dentistry. J Dent 26:305–309.
Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996). From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Menlo Park, CA: AAAI Press, pp. 1–36.
Goodey RD, Brickley MR, Hill CM, Shepherd JP (2000). A controlled trial of three referral methods for patients with third molars. Br Dent J 189:556–560.
Harrell FE Jr (2001). Regression modeling strategies with applications to linear models, logistic regression and survival analysis. New York: Springer Verlag.
Hastie T, Tibshirani R, Friedman JH (2001). The elements of statistical learning: data mining, inference, and prediction. New York: Springer Verlag.
Hausen H (1997). Caries prediction: state of the art. Community Dent Oral Epidemiol 25:87–96.
Kattan MW, Hess KR, Beck JR (1998). Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Comput Biomed Res 31:363–373.
Leverett DH, Featherstone JDB, Proskin HM, Adair SM, Eisenberg AD, Mundorff-Shrestha SA, et al. (1993a). Caries risk assessment by a cross-sectional discrimination model. J Dent Res 72:529–537.
Leverett DH, Proskin HM, Featherstone JDB, Adair SM, Eisenberg AD, Mundorff-Shrestha SA, et al. (1993b). Caries risk assessment in a longitudinal discrimination study. J Dent Res 72:538–543.
Lux CJ, Stellzig A, Volz D, Jager W, Richardson A, Komposch G (1998). A neural network approach to the analysis and classification of human craniofacial growth. Growth Dev Aging 62(3):95–106.
Nilsson T, Lundgren T, Odelius H, Sillen R, Noren JG (1996). A computerized induction analysis of possible co-variations among different elements in human tooth enamel. Artif Intell Med 8:515–526.
Page RC, Krall EA, Martin J, Mancl L, Garcia RI (2002). Validity and accuracy of a risk calculator in predicting periodontal disease. J Am Dent Assoc 133:569–576.
Ripley BD (1996). Pattern recognition and neural networks. New York: Cambridge University Press.
Schwarzer G, Vach W, Schumacher M (2000). On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Stat Med 19:541–561.
Speight PM, Elliott AE, Jullien JA, Downer MC, Zakrzewska JM (1995). The use of artificial intelligence to identify people at risk of oral cancer and precancer. Br Dent J 179(10):382–7.
Stewart PW, Stamm JW (1991). Classification tree prediction models for dental caries from clinical, microbiological, and interview data. J Dent Res 70:1239–1251.
Waldrop MM (2001). Data mining. MIT Technology Review: Ten emerging technologies that will change the world. January/February. Internet Web site accessed 23 March 2004. <http://www.technologyreview.com/articles/mag_toc_jan01.asp>