Adv Dent Res 17:100-103, December, 2003
© 2003 International and American Associations for Dental Research
Comparative Genomics and Structure Prediction of Dental Matrix Proteins
R.K. Krishnaraju1,*,
T.C. Hart3, and
T.K. Schleyer2,3
1 Center for Biomedical Informatics,
2 Center for Dental Informatics,
3 School of Dental Medicine, University of Pittsburgh, PA 15261, USA;
Correspondence: * corresponding author, present address, Bldg. 10, Rm 1N-103, Neurosensory Mechanisms Branch, National Institute of Dental and Craniofacial Research, National Institutes of Health, Bethesda, MD 20892, USA; kkrishna{at}mail.nih.gov
Abstract
Non-collagenous matrix proteins secreted by ameloblasts (amelogenin) and odontoblasts (osteocalcin) play important roles in the mineralization of enamel and dentin. In this study, comparative genomics approaches were used to identify the functional domains of amelogenin and to model the three-dimensional structure of osteocalcin. Multiple sequence analysis of amelogenin in different species showed a high degree of sequence conservation at the nucleotide and protein levels. At the protein level, motifs (sequence patterns that occur repeatedly in a group of related proteins or genes), conserved domains, secondary structural characteristics, and functional sites of amelogenin from lower phyla were similar to those of the higher-level mammals, reflecting the high degree of sequence conservation during vertebrate evolution. Osteocalcin, produced by both odontoblasts and osteoblasts, also showed sequence similarity between species. Three-dimensional structure predictions developed by modeling of conserved domains of osteocalcin supported a role for glutamic acid residues in the calcium mineralization process.
KEY WORDS: amelogenin, osteocalcin, 3-D structure, sequence alignment
Introduction
The extracellular matrix of dentin primarily consists of type I collagen, non-collagenous matrix proteins, and proteoglycans. In dentin, collagen serves as a lattice for deposition of calcium and phosphate, and extracellular matrix proteins control the growth of hydroxylapatite crystals in the mineralization process. Non-collagenous matrix proteins in the dental tissue are synthesized and secreted by ameloblasts and odontoblasts. Ameloblast proteins include amelogenin, ameloblastin, enamelin, and tuftelin, and amelogenin constitutes 90% of the total. These proteins primarily function in enamel mineralization (Moradian-Oldak, 2001). Proteins secreted by odontoblasts, such as dentin sialophosphoprotein (DSPP), dentin matrix proteins, osteonectin, and osteocalcin, are involved in the dentin mineralization process. Osteonectin and osteocalcin are also secreted by osteoblasts, and are important in bone mineralization (Papagerakis et al., 2002).
In the present study, we performed comparative sequence analysis of matrix proteins, mainly amelogenin and osteocalcin, to computationally identify important features such as functional domains, motifs, post-translationally modified sites, and relationships among the organisms, and to predict protein structural elements. Domains are regions of a protein that execute a specific function, such as DNA binding or kinase activity. Motifs are sequence patterns that occur repeatedly in a group of related proteins or genes and are consistently associated with a particular function in proteins with similar or identical functions. Comparative sequence analysis is a powerful approach for detecting functional regions in genomic and protein sequences. It facilitates identification of conserved domains, motifs, and distantly related sequences of different organisms, and provides evolutionary insights into the underlying biology of organisms (Rubin et al., 2000). These approaches allow one to characterize proteins within a family and to assign functions reliably to family members whose functions are unknown or not well-understood (Heger and Holm, 2000). In this work, multiple sequence alignment and conserved sequence pattern recognition methods are used extensively to find sequence similarity and conserved domains. Phylogenetic analysis using sequence data is discussed as a means to study sequence-relatedness. Secondary and tertiary structures that determine the function of a protein are predicted with structure prediction methods.
Methods
Human amelogenin (accession AAK77213) and osteocalcin (accession P02818) sequences were used as references to find homologous sequences from other species in GenBank. To this end, a standard protein BLAST search (Basic Local Alignment Search Tool, a pair-wise alignment tool) was run with default settings to find protein sequences similar to the query in the non-redundant (nr) protein database of GenBank (Altschul et al., 1990). Top-scoring BLAST hits with an expectation value (e value) less than 407 were chosen for further analysis. To find the shared similarity between multiple sequences, the top-scoring homologous sequences retrieved by BLAST were aligned with the multiple sequence alignment software CLUSTALW 1.8 (Thompson et al., 1994). In the alignment, similarity among amino acids was determined with the following alignment parameters: BLOSUM (Henikoff) matrices with a gap-opening penalty of 10.0 and a gap extension penalty of 0.05. Motifs (sequence patterns that occur repeatedly in a group of related proteins) were discovered in the sequences with the Multiple Expectation maximization for Motif Elicitation (MEME), version 3.0, program (GCG Wisconsin Package, Accelrys Inc., San Diego, CA, USA) by a search of the nr protein database (Bailey and Elkan, 1995). Identified consensus motifs were mapped to the multiple sequence alignment by means of the alignment editor GeneDoc (http://www.psc.edu/biomed/genedoc/). A phylogenetic tree was generated from the multiple sequence alignment by means of a PHYLIP (Phylogeny Inference Package)-based method to find the closely related organisms (http://www.genebee.msu.su/services/phtree_full.html). Post-translationally modified sites such as phosphorylation, glycosylation, and N-myristoylation sites on the homologous sequences were identified by a search of Prosite, a database of protein families and domains (Sigrist et al., 2002).
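The hit-selection step can be illustrated with a short sketch. It is not the procedure used in this study: it assumes that hits have been exported in the 12-column tabular layout written by current BLAST+ releases (`-outfmt 6`), and the function name `filter_hits`, the cutoff value, and the sample rows are hypothetical placeholders.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(tabular_text, max_evalue):
    """Return hits whose E-value is below max_evalue, lowest E-value first."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    hits = [row for row in reader if float(row["evalue"]) < max_evalue]
    return sorted(hits, key=lambda row: float(row["evalue"]))

# Invented rows for illustration only; the identifiers are placeholders.
sample = "\n".join([
    "query\tsubjA\t98.4\t191\t3\t0\t1\t191\t1\t191\t1e-105\t380",
    "query\tsubjB\t71.0\t180\t40\t4\t5\t184\t3\t176\t2e-60\t230",
    "query\tsubjC\t31.2\t48\t30\t1\t60\t107\t12\t56\t3.1\t28.5",
])

kept = filter_hits(sample, max_evalue=1e-5)  # cutoff chosen arbitrarily here
print([hit["sseqid"] for hit in kept])
```

Lower E-values indicate matches that are less likely to arise by chance, so the weak third hit is discarded while the two strong hits are passed on to the alignment step.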
The secondary structures of the protein sequences were predicted by means of the PSIPRED structure prediction method (Jones, 1999). Similarities between two homologous sequences were compared by BLAST 2 Sequences (Tatusova and Madden, 1999). Three-dimensional (3-D) structures of the protein were derived by comparative modeling of conserved domains (Marchler-Bauer et al., 2002) and viewed with Cn3D, a 3-D structure viewer (NCBI).
Results and Discussion
Comparison of amelogenin sequences
Amelogenin sequences from different species that share identity with the human query sequence are shown in the multiple sequence alignment (Fig. 1). Sequences in the alignment have conserved residues, insertions, substitutions, and low-complexity regions rich in histidine, proline, and glutamine. Across the entire sequence, high levels of homology are found in short conserved regions of the protein (motifs), as determined by the MEME motif discovery method. The list includes the following motifs: DKTKREEVD, SYGYEPMGGW, GYINFS/LYE, LKWYQSMIR, MGTWILFACLLGAAF, DLPLEAW, MMPVPGQ/HHSMTPTQHHQPN, LHHQIIPVL/VSQ, S/AHA/TLQPHHHI/LPV/MVPAQQPV, and QQPFQPQ. These repeating fragments of amino acid sequences are important in maintaining structural integrity and/or function of various proteins (Attwood, 2000). Exon-4-containing amelogenin isoforms were found in the human, mouse, rat, hamster, and guinea pig. The human exon-4-coded amino acid residues, FSYENSHSQAINVDRTAL, showed significant sequence similarity to those of the mouse, rat, hamster, and guinea pig. In Fig. 1, human and mouse sequences are not shown.

Fig. 1 Multiple sequence alignment of the amelogenin protein sequence. The alignment includes the following species with their sequence accession numbers in parentheses: human (AAK77213), rat (NP_062027), golden hamster (AAC24751), mouse (P45559), guinea pig (CAA09957), bovine (P02817), goat (AAG43996), pig (P45561), horse (BAA84219), dog (BAB85804), Japanese serow (BAB83510), short-tailed opossum (Q28462), platypus (O97646), porcupine (O97647), rattlesnake (AAD22553), crocodile (AAC78133V), and frog (AAC78134). The hyphen indicates gaps due to insertions in some species. Some species (dog, Japanese serow, opossum, platypus, and porcupine) had incomplete sequences in the database. Color codes indicate different protein motifs on the alignment. ClustalW parameters: Similarity of amino acids to each other was determined by BLOSUM (Henikoff) matrices with a gap-opening penalty of 10.0 and a gap extension penalty of 0.05.
Prosite database searches for functional domains indicated the presence of N-glycosylation sites consisting of amino acids Asn, Phe/Leu, Ser, and Tyr (residues 30-33, Fig. 1), and N-myristoylation sites spanning amino acids Gly, Ala to Ala, Met (GAafAM) at the amino-terminal end in all of the sequences studied. Myristoylation (transfer of myristate from myristoyl-coenzyme A in amide linkage to the amino-terminal glycine residue of a protein) is essential for the biological function of many proteins. Attachment of the myristoyl residue to glycine residues provides hydrophobicity and promotes protein-protein interactions (Johnson et al., 1994). In yeast two-hybrid experiments, it has been reported that the amino-terminal end of amelogenin is involved in self-assembly during enamel formation (Paine and Snead, 1997). The presence of N-myristoylation sites in the amino-terminal end of amelogenin may be responsible for the hydrophobic nature of the region and may play a crucial role in enamel formation. Glycosylation is a complex type of post-translational modification in which sugars are attached either to the amide nitrogen atom in the side-chain of asparagine (termed an N-linkage) or to the oxygen atom in the side-chain of serine or threonine (termed an O-linkage). When bound to proteins, oligosaccharides act as signals in several biological processes, ranging from the control of protein folding to the mediation of cell adhesion (Petrescu et al., 2004). While no experimental evidence points to glycosylation of amelogenin, it is interesting to note that the predicted glycosylation sites, NF/LSY (Asn, Phe/Leu, Ser, and Tyr), are conserved across all the species. On the carboxy-terminal end, protein kinase C (PKC) and casein kinase II (CK II) phosphorylation sites, consisting of Ser, Thr, Asp, Lys/Arg, Glu and Thr, Asp, Lys/Arg, Glu, respectively, were identified. Phosphorylation of serine residues in the amino-terminal region at positions 16 and 25, in porcine and mouse samples, respectively, was shown to be involved in nanosphere formation during enamel formation (Moradian-Oldak et al., 2002).
Our analyses revealed the presence of functional sites with conserved serine residues (GYINFS/LYE, LKWYQSMIR), with potential for phosphorylation by PKC, in the amino-terminal region of the proteins. Conservation of these functional sites may indicate a role for these residues in nanosphere formation. In this work, we observed the presence of functional sites (namely, glycosylation, myristoylation, and phosphorylation sites) within the conserved motifs of the proteins, suggesting that the motifs are indicative of functional sites. In all these sequences there are low-complexity regions rich in proline, glutamine, and histidine residues in the carboxy-terminal end of the protein. Histidine and predominantly glutamine residues are responsible for the hydrophilic nature of the region (Moradian-Oldak et al., 2002).
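Pattern-based site searches of this kind can be sketched in a few lines. The example below is only an illustration, not the Prosite scanning software: it translates a PROSITE-style pattern into a regular expression and reports overlapping matches. The helper names are hypothetical, the two patterns are quoted as they are understood to appear in PROSITE entries PS00001 (N-glycosylation) and PS00008 (N-myristoylation) and should be checked against the current release, and pattern anchors (`<`, `>`) are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count, e.g. x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # Bracketed sets such as [ST] and single letters are valid regex as-is.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_sites(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

N_GLYCOSYLATION = "N-{P}-[ST]-{P}"                    # assumed from PS00001
N_MYRISTOYLATION = "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}"  # assumed from PS00008

# Short fragments taken from the motifs and sites quoted in the text.
print(find_sites("GYINFSYE", N_GLYCOSYLATION))
print(find_sites("GYINLSYE", N_GLYCOSYLATION))
print(find_sites("GAAFAM", N_MYRISTOYLATION))
```

Positions are reported relative to the fragment passed in, not to the full-length protein. A pattern match only marks a candidate site; as noted for glycosylation, it does not show that the modification occurs.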
The phylogenetic tree obtained from the multiple sequence alignment shows sequence relatedness (Fig. 2). Species with closely related sequences are clustered together. For example, the pig amelogenin sequence is more similar to the human sequence than are the others. Secondary structure prediction of the human amelogenin sequence revealed the presence of an alpha helix region in the amino-terminal end starting from the 4th to the 15th residues (WILFACLLGAA). This region was also found conserved in all the species, suggesting similarity at the structural level as well. Pair-wise comparisons of the nucleotide or protein sequences of amelogenin with sequences of species shown in the Table revealed significant similarity at the nucleotide and the protein levels in all the species, with the exception of Xenopus and the snake. These two species did not have significant similarity at the nucleotide level, but showed similarity at the protein level. This may be due to the existence of multiple codons for a single amino acid, which allows the amelogenin sequence to diverge at the gene level while remaining conserved at the protein level. The high degree of sequence homology and conservation of functional sites among the sequences suggest a specialized function for amelogenin. Compared with amelogenin, osteocalcin, a component in the dental matrix involved in dentin formation, did not show a high degree of sequence homology, suggesting that evolutionary changes affected the protein. However, in this protein, a conserved domain consisting of 43 amino acids showed homology across all species (Fig. 3).
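The codon-degeneracy argument can be made concrete with a toy example. The two nine-base fragments below are invented for illustration and are not amelogenin sequences; the sketch uses the standard genetic code and simply shows that two coding sequences can differ at most nucleotide positions yet encode the same peptide.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a coding sequence, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def identity(a, b):
    """Fraction of identical positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical fragments that use different codons for Leu-Ser-Arg.
gene_a = "CTTTCTCGT"
gene_b = "TTGAGCAGA"

print(translate(gene_a), translate(gene_b))
print(round(identity(gene_a, gene_b), 2))                 # nucleotide level
print(identity(translate(gene_a), translate(gene_b)))     # protein level
```

Here only two of nine nucleotide positions agree, while the translated peptides are identical, which is the pattern described for the Xenopus and snake comparisons.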

Fig. 2 Phylogenetic tree of amelogenin sequences. Species closely related based on sequence similarity are clustered.
Fig. 3 Multiple sequence alignment of osteocalcin. Conserved residues are shaded. Species with accession numbers (in parentheses) are: human (P02818), mouse (P04641), rat (P04640), bovine (P02820), chicken (P02822), crab-eating macaque (P02819), horse (P83005), dog (P81455), cat (P02821), rabbit (P39056), emu (P15504), Xenopus (P40147), swordfish (P02823), sea bream (P40148), and Bluegill (P28317).
Structure Prediction
Predicting the three-dimensional structure of a protein from sequence data by comparative modeling is now relatively straightforward and provides much-needed information on which experiments can be planned. If sequence or structural similarity is established between a target (the protein of interest) and a template (a protein whose 3-D structure has been solved experimentally), it is possible to predict the 3-D structure of a protein or domain using publicly available resources (NCBI). In a search for domains with known 3-D structure in the conserved domain database, human osteocalcin aligned with human coagulation factor VIIa (3-D structure code 1DAN_L), and no structural template was found for human amelogenin. The corresponding osteocalcin domain consists of 43 amino acids (GAPVPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGPV) with glutamic acid residues that appear in the 3-D structure as high-affinity calcium-binding gamma-carboxyglutamic acid (Gla) residues. In addition, the domain has alpha helical structures similar to those of the template (Fig. 4). The structure provides evidence that osteocalcin is possibly involved in mineralization in dental and bone structures by similar mechanisms.
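As a small aid to reading the domain sequence, the glutamic acid positions can be listed directly. This is a minimal sketch with a hypothetical helper name; positions are counted from the first residue of the 43-residue domain as quoted in the text, not from the full-length protein, and the listing alone does not show which of these residues are gamma-carboxylated.

```python
# 43-residue conserved osteocalcin domain as quoted in the text.
DOMAIN = "GAPVPYPDPLEPRREVCELNPDCDELADHIGFQEAYRRFYGPV"

def residue_positions(sequence, residue):
    """Return 1-based positions at which the given one-letter residue occurs."""
    return [i for i, aa in enumerate(sequence, start=1) if aa == residue]

glu_positions = residue_positions(DOMAIN, "E")
print(len(DOMAIN), glu_positions)
```

The same helper can be pointed at other residues, for example the two cysteines ("C") in the domain.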

Fig. 4 Three-dimensional structure of the conserved domain. The conserved domain of osteocalcin is shown with the corresponding predicted 3-D structure depicting calcium-binding sites. The structure graphic was generated with the Cn3D viewer. Calcium binding with the side-chains of gamma-carboxy-glutamic acid (Cgu) is shown in yellow (pinhead shape), and residues forming the alpha helices (secondary structure) are highlighted with cylinders. The lower panel displays the alpha helices of the conserved domain of human osteocalcin predicted by the PSIPRED structure prediction method. "AA", "H", and "C" indicate amino acid letter codes, alpha-helices, and coils, respectively. The secondary structure predicted by PSIPRED is comparable with the structure shown in the graphic.
Acknowledgments
This research was supported by a National Research Service Award (institutional) from the National Library of Medicine, NIH, August 2000 to September 2002.
Footnotes
Publication supported by Software of Excellence (Auckland, NZ)
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215:403-410.
Attwood TK (2000). The quest to deduce protein function from sequence: the role of pattern databases. Int J Biochem Cell Biol 32:139-155.
Bailey TL, Elkan C (1995). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3:21-29.
Heger A, Holm L (2000). Towards a covering set of protein family profiles. Prog Biophys Mol Biol 73:321-337.
Johnson DR, Bhatnagar RS, Knoll LJ, Gordon JI (1994). Genetic and biochemical studies of protein N-myristoylation. Annu Rev Biochem 63:869-914.
Jones DT (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195-202.
Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH (2002). CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30:281-283.
Moradian-Oldak J (2001). Amelogenins: assembly, processing and control of crystal morphology. Matrix Biol 20:293-305.
Moradian-Oldak J, Bouropoulos N, Wang L, Gharakhanian N (2002). Analysis of self-assembly and apatite binding properties of amelogenin proteins lacking the hydrophilic C-terminal. Matrix Biol 21:197-205.
NCBI. Protein Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein). Cn3D 3D structure viewer (http://www.ncbi.nih.gov/Structure/CN3D/cn3d.shtml). Last accessed April 25, 2003.
Paine ML, Snead ML (1997). Protein interactions during assembly of the enamel organic extracellular matrix. J Bone Miner Res 12:221-227.
Papagerakis P, Berdal A, Mesbah M, Peuchmaur M, Malaval L, Nydegger J, et al. (2002). Investigation of osteocalcin, osteonectin, and dentin sialophosphoprotein in developing human teeth. Bone 30:377-385.
Petrescu AJ, Milac AL, Petrescu SM, Dwek RA, Wormald MR (2004). Statistical analysis of the protein environment of N-glycosylation sites: implications for occupancy, structure and folding. Glycobiology 2:103-114.
Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, et al. (2000). Comparative genomics of the eukaryotes. Science 287:2204-2215.
Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, et al. (2002). PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265-274.
Tatusova TA, Madden TL (1999). Blast 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 174:247-250.
Thompson JD, Higgins DG, Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680 (http://www.ch.embnet.org/software/ClustalW.html).