ASMC method

Article: Identification of Subfamily-specific Sites based on Active Sites Modeling and Clustering

Abstract: The ASMC method (Active Sites Modeling and Clustering) is a novel unsupervised method for classifying sequences using structural information from protein pockets. The method predicts functional amino acids by proposing profiles of active-site SDP residues (Specificity Determining Positions) and active-site CP residues (Conserved Positions). ASMC combines homology modeling of family members, structural alignment of the modeled active sites, and a subsequent hierarchical conceptual classification of the resulting alignments. Comparing the profiles of the computed clusters identifies the residues correlated with the functional divergence of subfamilies.
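To make the CP/SDP distinction concrete, here is a minimal Python sketch of how profiles from already computed clusters could be compared. It is an illustration only, not the ASMC implementation: the function name `classify_positions`, the 0.9 conservation threshold and the toy three-residue active sites are all assumptions. A position is labelled CP when one residue dominates the whole family, and SDP when each cluster is internally conserved but the clusters disagree.

```python
from collections import Counter


def classify_positions(clusters, conserved_threshold=0.9):
    """Label each aligned active-site position as "CP", "SDP" or "other".

    clusters: dict mapping a cluster name to a list of equal-length strings,
    each string being the aligned active-site residues of one sequence.
    The threshold is a hypothetical choice, not a value from the article.
    """
    all_sites = [site for members in clusters.values() for site in members]
    n_positions = len(all_sites[0])
    labels = []
    for i in range(n_positions):
        # Frequency of the most common residue over the whole family
        column = [site[i] for site in all_sites]
        _, top_count = Counter(column).most_common(1)[0]
        if top_count / len(column) >= conserved_threshold:
            labels.append("CP")
            continue
        # Otherwise look at the dominant residue inside each cluster
        consensus = []
        conserved_in_every_cluster = True
        for members in clusters.values():
            cluster_column = [site[i] for site in members]
            residue, count = Counter(cluster_column).most_common(1)[0]
            consensus.append(residue)
            if count / len(cluster_column) < conserved_threshold:
                conserved_in_every_cluster = False
        if conserved_in_every_cluster and len(set(consensus)) > 1:
            labels.append("SDP")
        else:
            labels.append("other")
    return labels


# Toy example: two clusters of three-residue active sites
toy_clusters = {
    "cluster_A": ["HDS", "HDS", "HDT"],
    "cluster_B": ["HES", "HEA", "HEG"],
}
print(classify_positions(toy_clusters))
```

In this toy input the first position is identical everywhere, the second is fixed within each cluster but differs between clusters, and the third varies freely.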

Supplementary material

The ASMC method has been validated on a benchmark of 42 Pfam families for which previously resolved holo-structures were available.

The supplementary material download contains:

  • Table I and Table II: The test set benchmark
    • The set is composed of enzyme families with well-characterized functions and at least one structure with bound natural ligands available in the PDB (Protein Data Bank). One part of the dataset comprises protein families acting on a variety of substrates (the diverse dataset), whereas the other comprises mono-activity protein families.
  • Table III: Average distance comparison
    • SDP, CP and OP residues were mapped onto the crystal structure bound to the ligand(s). For each category, the average distance between the residues and the ligand(s) is computed as described in Material&Methods. These results are compared with those of Kalinina et al. (2009) [Combining specificity determining and conserved residues improves functional site prediction. BMC Bioinformatics. 10:174].
  • Figure I: ASMC tree of family PF02274 (Amidinotransferase).
    • This family contains glycine (EC: ) and inosamine (EC: ) amidinotransferases, enzymes involved in creatine and streptomycin biosynthesis, respectively. It also includes arginine deiminases (EC: ), which catalyse the reaction arginine + H2O → citrulline + NH3. ASMC divides the superfamily into three main groups: a yellow group (three clusters), an orange group (two clusters) and a green group (three clusters). The yellow, orange and green groups contain, respectively, sequences associated with Enzyme Commission numbers EC:, EC:, EC:3.5.3.
  • Figure II: Influence of the % identity between targets and templates on ASMC performance.
    • Evolution of Sp and Se as a function of the % identity threshold between target and template sequences. For instance, we ran ASMC on the set of sequences with at least 45% identity to one of the template structures. The values shown are the means over the 42 benchmark families. The blue line indicates the average number of sequences per family for each threshold range. The gain in sensitivity with increasing sequence identity is limited by the drop in available sequences (fewer than 50 at high identity levels).
  • Table IV: Comparison of ASMC performance with a similar procedure that uses only sequence information
    • This procedure is based on a multiple sequence alignment (called here MSeqA; see Material&Methods, section 2.4). For families PF00693, PF01135 and PF01234, Sp and Se could not be calculated (indicated by nan in the table) because no cluster emerged.
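The average-distance comparison reported in Table III can be pictured with a short Python sketch. The exact distance definition is given in the article's Material&Methods and is not reproduced on this page, so the choice below (minimum atom-to-atom distance between each residue and the ligand, then the mean over the residues of one category) is an assumption, as are the function names and the toy coordinates.

```python
import math


def min_distance(residue_atoms, ligand_atoms):
    """Smallest Euclidean distance between any residue atom and any ligand atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in ligand_atoms)


def average_distance(residues, ligand_atoms):
    """Mean residue-to-ligand distance for one category (e.g. SDP, CP or OP).

    residues: list of residues, each a list of (x, y, z) atom coordinates.
    ligand_atoms: list of (x, y, z) coordinates of the bound ligand(s).
    """
    distances = [min_distance(atoms, ligand_atoms) for atoms in residues]
    return sum(distances) / len(distances)


# Toy coordinates in angstroms (hypothetical, not taken from a PDB entry)
ligand = [(0.0, 0.0, 0.0)]
sdp_residues = [
    [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)],  # closest atom is 3 A from the ligand
    [(0.0, 4.0, 0.0)],                   # closest atom is 4 A from the ligand
]
print(average_distance(sdp_residues, ligand))
```

Running the same function on each residue category gives one value per category, which is the kind of figure the table compares across methods.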