Sigillvm Vniversitatis Hafniensis (The faculty of Science) The Bioinformatics Centre University of Copenhagen

Thomas Hamelryck

PLEASE SEE MY GROUP'S WEBSITE FOR UP-TO-DATE INFORMATION

I am group leader (associate professor) of the structural bioinformatics group at the Bioinformatics Center of the University of Copenhagen; the center is led by Prof. Anders Krogh. I have a background in biotechnology and macromolecular crystallography, but am currently active in the field of structural bioinformatics. My main aim is the development of protein and RNA structure prediction, simulation and design methods that make use of probabilistic models.

Address
Thomas Hamelryck
Bioinformatics Center
University of Copenhagen
Room 1.2.22
Ole Maaloes Vej 5
2200 Copenhagen
Denmark
E-mail: thamelry -at- binf.ku.dk
Tel: +45 35321278

Research in structural bioinformatics

Bioinformatics is the study of large scale problems in molecular biology using computational tools. Structural bioinformatics studies problems that are associated with macromolecular structure. My research focus lies on the prediction of protein and RNA 3D structure, and related problems such as protein design, simulation of protein dynamics and inferential protein structure determination.

Probabilistic models of protein structure

Directional statistics. A protein's structure can be represented as a sequence of points on the unit sphere. Shown are two views on the spherical histogram of these points for a set of protein structures. Figure made by wrapping a histogram made with R (by Wouter Boomsma) on a 3D sphere with PovRay.

FB5-HMM and TorusDBN are probabilistic models of protein structure, based on Dynamic Bayesian Networks and directional statistics. The models can be used to generate protein-like conformations that are compatible with a given amino acid sequence, on a local length scale. We expect that this approach to protein structure sampling will replace fragment library approaches in the very near future, since it is conceptually elegant, computationally efficient and fully probabilistic. The latter is of great importance in Markov Chain Monte Carlo simulations of protein structure. An article describing FB5-HMM, a model of protein C-alpha geometry, made the cover of the September 2006 issue of PLoS Computational Biology. In 2008, we published a related model of the full protein backbone (called TorusDBN) in PNAS.

Loop closure

The loop closure problem. The green loop needs to be replaced with a new loop, keeping the ends fixed. First, a new, open loop (shown in red) is created, with one of the ends fixed (this is trivial). Then, the loop is closed again (shown in blue) using an iterative algorithm.

In protein structure prediction, it is often important to construct a protein segment that bridges two fixed segments. This non-trivial problem is often tackled using algorithms from the field of robotics, i.e. inverse kinematics methods. We recently designed a novel algorithm called Full Cyclic Coordinate Descent (FCCD) that is fast, easy to implement and extremely flexible. It is especially efficient for rebuilding the protein backbone using C-alpha positions only. An article describing the method has been published in BMC Bioinformatics.
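To give a feel for the inverse kinematics idea, here is a minimal sketch of classic cyclic coordinate descent on a planar chain. This is not FCCD itself, only the textbook robotics scheme it is related to; the 2D setting and the names `chain_points` and `ccd_close` are assumptions made for illustration. One end of the chain stays fixed at the origin, and each joint in turn is rotated so that the free end moves as close as possible to the target.

```python
import math

def chain_points(angles, lengths):
    """Joint positions of a planar chain that starts at the origin."""
    x = y = heading = 0.0
    points = [(x, y)]
    for angle, length in zip(angles, lengths):
        heading += angle  # each angle is relative to the previous link
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd_close(angles, lengths, target, tolerance=1e-3, max_sweeps=200):
    """Rotate one joint at a time until the free end reaches the target."""
    angles = list(angles)
    error = float("inf")
    for _ in range(max_sweeps):
        for joint in reversed(range(len(angles))):
            points = chain_points(angles, lengths)
            pivot_x, pivot_y = points[joint]
            # Vectors from the pivot to the current end and to the target.
            end_x, end_y = points[-1][0] - pivot_x, points[-1][1] - pivot_y
            goal_x, goal_y = target[0] - pivot_x, target[1] - pivot_y
            # Signed angle that turns the end vector onto the goal vector.
            angles[joint] += math.atan2(end_x * goal_y - end_y * goal_x,
                                        end_x * goal_x + end_y * goal_y)
        end = chain_points(angles, lengths)[-1]
        error = math.hypot(end[0] - target[0], end[1] - target[1])
        if error < tolerance:
            break
    return angles, error

# Hypothetical example: a four-link open "loop" closed onto a fixed end point.
lengths = [1.0] * 4
closed_angles, closing_error = ccd_close([0.0] * 4, lengths, target=(2.0, 1.5))
closed_points = chain_points(closed_angles, lengths)
print(closing_error)
```

The link lengths never change during the iteration; only the joint angles do, which mirrors the situation in a protein backbone where bond lengths are kept fixed.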

Measuring Solvent Exposure

Solvent exposure measures. Half Sphere Exposure (HSE) construction. This simple, two-dimensional measure of solvent exposure counts the number of neighbors in two domes (with radius R typically equal to 10 or 12 Å) around the C-alpha atom. It is simple and extremely fast to compute, and superior to the widely used Contact Number measure. The HSE value of the example above is (3,5).

Half Sphere Exposure (HSE) is a new method to measure amino acid solvent exposure in a protein structure. It is in many ways superior to the conventional measures and only requires the coordinates of the C-alpha atoms. We are using this measure in the context of structure prediction. An article on HSE has been published in Proteins.
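The counting step behind HSE can be sketched in a few lines. This is an illustrative sketch, not the published implementation: the function name `half_sphere_exposure` is made up, and the vector that divides the sphere into two domes is simply passed in as `direction` rather than derived from the structure.

```python
import math

def half_sphere_exposure(center, direction, others, radius=12.0):
    """Count neighbours in the two half spheres around `center`.

    `direction` is the vector that splits the sphere into an "up" and a
    "down" dome (an input to this sketch, not computed here).
    """
    up = down = 0
    for point in others:
        offset = [p - c for p, c in zip(point, center)]
        distance = math.sqrt(sum(o * o for o in offset))
        if distance == 0.0 or distance > radius:
            continue  # skip the residue itself and anything outside the sphere
        # The sign of the dot product says on which side of the dividing plane we are.
        if sum(o * d for o, d in zip(offset, direction)) >= 0.0:
            up += 1
        else:
            down += 1
    return up, down

# Hypothetical C-alpha coordinates (in Å) around a residue at the origin.
center = (0.0, 0.0, 0.0)
direction = (0.0, 0.0, 1.0)
neighbours = [(0, 0, 5), (3, 0, 4), (1, 1, -6), (2, -2, -1), (0, 8, -3), (0, 0, 20)]
hse = half_sphere_exposure(center, direction, neighbours)
print(hse)  # (2, 3): two neighbours in the upper dome, three in the lower one
```

The result is a pair of counts, which is why HSE is called a two-dimensional measure; adding the two numbers gives back a plain contact number.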

Mocapy

Mocapy, a Dynamic Bayesian Network toolkit. Mocapy is a toolkit for Maximum Likelihood (ML) and Maximum a posteriori (MAP) parameter learning and inference in Dynamic Bayesian Networks. The toolkit is specifically designed to develop probabilistic models of molecular geometry. The name comes from two main ingredients of the toolkit: Monte Carlo sampling and the Python programming language.

Mocapy is a toolkit for inference and learning in Dynamic Bayesian Networks (DBNs). A Dynamic Bayesian Network is a machine learning method that can be used to develop probabilistic models of sequences. A DBN can be considered a generalization of the better known Hidden Markov Model (HMM), but with much more modelling power. DBNs can for example be used to model protein sequences, or for speech recognition. Inference and maximum likelihood (ML) and maximum a posteriori (MAP) parameter learning are done using Gibbs sampling/Stochastic Expectation Maximization. Currently discrete (that is, Multinomial), Gaussian, Kent, Von Mises-Fisher and Dirichlet nodes are implemented. In practice this means that you can model sequences of symbols (i.e. discrete observations), floats, vectors (of any dimension) and even unit vectors (using the Kent and Von Mises-Fisher nodes). The latter makes it possible, for example, to model bond angles in molecules. Mocapy can handle large datasets and can be run on a cluster computer or a desktop computer with a single CPU. Mocapy was originally implemented in Python, making use of the numpy, SciPy and pyMPI libraries. Mocapy is freely available from sourceforge under the LGPL license, and comes with a 50+ page manual. Mocapy++, a recent, fast re-implementation of Mocapy in C++, is available as well.
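As a rough illustration of what a probabilistic model of sequences looks like, the sketch below samples from a tiny Hidden Markov Model, the simplest special case of a DBN, using only discrete nodes. It does not use the Mocapy API; the function `sample_hmm` and all probabilities are made up for the example.

```python
import random

def sample_hmm(start, transitions, emissions, length, rng):
    """Draw a hidden state path and the symbols it emits."""
    states, symbols = [], []
    state = rng.choices(range(len(start)), weights=start)[0]
    for _ in range(length):
        states.append(state)
        # Emit a symbol from the discrete distribution attached to this state.
        alphabet = list(emissions[state])
        weights = [emissions[state][symbol] for symbol in alphabet]
        symbols.append(rng.choices(alphabet, weights=weights)[0])
        # Move to the next hidden state.
        state = rng.choices(range(len(transitions[state])),
                            weights=transitions[state])[0]
    return states, symbols

# Hypothetical two-state model over the symbols "a" and "b".
start = [0.5, 0.5]
transitions = [[0.9, 0.1], [0.2, 0.8]]
emissions = [{"a": 0.8, "b": 0.2}, {"a": 0.1, "b": 0.9}]

states, symbols = sample_hmm(start, transitions, emissions, 10, random.Random(1))
print(states, "".join(symbols))
```

A DBN extends this picture by allowing several connected nodes per sequence position and other node types, such as the Gaussian or Kent nodes mentioned above.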

Bio.PDB

The Biopython toolkit. Biopython is a freely available bioinformatics toolkit implemented in Python. Bio.PDB is the structural submodule of Biopython. Logo by Henrik Vestergaard.

This is a Python library that allows you to access the data in PDB and mmCIF files. The data in the PDB file is represented by a Structure/Model/Chain/Residue/Atom data structure. The parser also does some integrity checks (e.g. do all atoms and residues have a unique name?). This Python library is part of the Biopython project, a set of freely available Python modules that deal with various aspects of bioinformatics. Be sure to try out the CVS version, which contains some additional goodies and bug-fixes. People who want to contribute are welcome, BTW. An article describing this toolkit has been published in Bioinformatics. Bio.PDB comes with extensive documentation.

Bio.PDB's features include:

  • Support for mmCIF and PDB files
  • Multiple models (e.g. in NMR structures) supported
  • Insertion codes are taken into account
  • Deals with anisotropic B-values
  • Disorder is adequately handled (of atoms or complete residues, e.g. due to point mutations)
  • It does a lot of sanity checking
  • It's quite fast (10 s for the large ribosomal subunit - 64000 atoms)
  • Fast atom neighbor lookup using a KD tree
  • Identification of polypeptides
  • Superposition of structures
  • Various analysis tools (DSSP, residue depth, etc.)
  • Coordinates are available as full-fledged Vector objects
  • Keeping a local copy of the PDB up-to-date
  • Writing PDB files
  • Calculation of Half Sphere Exposure (a new solvent exposure measure)
  • New features are added regularly!
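A short sketch of how the Structure/Model/Chain/Residue/Atom hierarchy and the neighbor lookup can be used, assuming a reasonably recent Biopython installation. The three-residue structure is made up and built in memory purely for illustration (`ca_record` is a helper written for this example, not part of Bio.PDB); in practice you would pass the name of a real PDB file to the parser.

```python
from io import StringIO

from Bio.PDB import NeighborSearch, PDBParser

def ca_record(serial, resname, resseq, x, y, z):
    """Format one fixed-column PDB ATOM record for a C-alpha atom in chain A."""
    return (
        "ATOM  {:5d}  CA  {:3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
            serial, resname, resseq, x, y, z, 1.0, 20.0
        )
        + " " * 10
        + " C"
    )

# Hypothetical mini structure: three C-alpha atoms spaced 3.8 Å apart.
pdb_text = "\n".join(
    [
        ca_record(1, "GLY", 1, 0.0, 0.0, 0.0),
        ca_record(2, "ALA", 2, 3.8, 0.0, 0.0),
        ca_record(3, "SER", 3, 7.6, 0.0, 0.0),
        "END",
    ]
) + "\n"

parser = PDBParser(QUIET=True)
structure = parser.get_structure("demo", StringIO(pdb_text))

# Walk down the hierarchy: structure -> model -> chain -> residue -> atom.
chain = structure[0]["A"]
residue_names = [residue.get_resname() for residue in chain]
ca_atoms = [residue["CA"] for residue in chain]

# KD-tree based lookup of all atoms within 5 Å of the first C-alpha atom.
search = NeighborSearch(ca_atoms)
close_atoms = search.search(ca_atoms[0].coord, 5.0)
close_ids = sorted(atom.get_parent().get_id()[1] for atom in close_atoms)

print(residue_names, close_ids)
```

Children are addressed by their identifiers at every level (model number, chain letter, residue number, atom name), which keeps lookups readable even for large structures.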

Function from structure

Function from structure. Putative active sites in protein structures can be identified by looking for recurring functional amino acid triads. The geometry of such a triad can be characterized by a set of pairwise atom distances (dotted lines).

I developed a new algorithm that makes it possible to identify recurring 3D patterns of side chains in a large set of structures (the method was applied to about 800 superfamily domains from the SCOP classification). It can also be used to identify potentially interesting sites in a single structure. The method incorporates a number of novel features that are not found in other similar methods:

  • It deals with conservative amino acid substitutions
  • It deals with shifted C-alpha positions (i.e. the side chain atom positions coincide, but the backbone position is shifted)
  • It can find mirror imaged side chain patterns
  • It takes atom label ambiguities into account
  • It is very fast and memory efficient, thanks to the use of an SR-tree data structure

The method has been used to identify various interesting novel active site similarities, and also identified a putative active site in bacterial luciferase.
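The idea of characterizing a pattern by its pairwise atom distances can be illustrated with a small sketch. This is not the published algorithm: it ignores atom labels, amino acid substitutions and the SR-tree index, and the names `distance_signature` and `same_geometry` as well as the tolerance are assumptions. It only shows that a distance-based description is unchanged by rotation, translation and mirroring.

```python
import math
from itertools import combinations

def distance_signature(points):
    """Sorted list of all pairwise distances between the given atoms."""
    return sorted(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in combinations(points, 2)
    )

def same_geometry(points_a, points_b, tolerance=0.5):
    """Compare two patterns by their distance signatures (in Å)."""
    sig_a = distance_signature(points_a)
    sig_b = distance_signature(points_b)
    return len(sig_a) == len(sig_b) and all(
        abs(x - y) <= tolerance for x, y in zip(sig_a, sig_b)
    )

# Hypothetical four-atom pattern, its mirror image, and an unrelated pattern.
pattern = [(0, 0, 0), (4, 0, 0), (1, 5, 0), (2, 1, 3)]
mirrored = [(-x, y, z) for x, y, z in pattern]
other = [(0, 0, 0), (9, 0, 0), (1, 5, 0), (2, 1, 3)]

print(same_geometry(pattern, mirrored))  # True
print(same_geometry(pattern, other))     # False
```

Because sorting discards which distance belongs to which atom pair, matching signatures are a necessary condition rather than a proof of similarity; a real search would also keep track of atom correspondences.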

The project won the Ishango prize 2001. The Ishango prize is part of the Operation Ishango campaign launched by the Brussels-Capital region to increase awareness of science and encourage young people to take up scientific careers. The competition awards two prizes of 2,500 euro to young researchers or science students working in the region, one French-speaking and one Dutch-speaking.

An article describing the method is published in Proteins.

Presentations

Teaching

I'm teaching the obligatory Structural Bioinformatics course at the Bioinformatics Center. Topics include an introduction to protein structure, prediction of function from structure, and prediction of local structure, solvent exposure and tertiary structure. I'm also teaching the Structural Bioinformatics section of the Introduction to Bioinformatics course, and an introduction to Dynamic Bayesian Networks and Mocapy as part of the advanced bioinformatics PhD course.

Former research interests

Protein-carbohydrate interactions

Protein-carbohydrate binding. Crystal structure of the D. biflorus seed lectin. The stars indicate the sugar binding sites. The spheres are metal ions.

I received my PhD from the Free University Brussels (VUB), Ultrastructure Department, in 1999 on the subject of crystal studies of protein-carbohydrate interactions. I used the legume lectins as a model system to study the general features of carbohydrate binding sites in proteins. This led for example to the discovery of a general give-and-take mechanism that these proteins use to distinguish very similar carbohydrates.

My PhD thesis (gzipped PS|PDF) contains a broad introduction to legume lectin structure.

The proteins I studied include (click on the structure identifiers to go to the PDB):

  • Lentil lectin (1LES): Used in a multi-disciplinary (NMR, molecular modeling and crystallography) study of protein-carbohydrate interactions.
  • Phytohaemagglutinin-L (1FAT): the thing that makes raw beans toxic by binding to your gut. Every year a substantial number of people get sick from eating raw or insufficiently cooked beans, in which the PHA fraction is not or not fully denatured. The cause of the illness is almost always wrongly attributed to bacterial food poisoning, which has similar symptoms. PHA-L exhibited a novel quaternary structure, which was shown to be important for binding plant hormones (cytokinins).
  • Arcelin-5 (1IOA): an insecticidal protein from wild bean strains. This protein is a "truncated" legume lectin, with some surprising features. The biggest surprise was the presence of a specific cis-peptide bond (a conserved feature of the legume lectin family) without the stabilisation of a neighboring metal ion binding site. This site was thought to be necessary for the stabilisation of the cis-peptide bond.
  • DBL in complex with adenine (1BJQ), with the Forssman disaccharide (1LU1) and with the blood group A trisaccharide (1LU2) and DB58 (1LUL): two lectins from Dolichos biflorus. The DBL structure (see picture) led to a better understanding of the specificity of proteins that bind N-acetylated sugars, and to the discovery of a general give-and-take mechanism that lectins use to distinguish closely similar carbohydrates. DB58 has a very peculiar quaternary structure. DBL and DB58 both bind plant hormones (cytokinins) as well, in an unusual binding site that depends on the quaternary structure.
  • FRIL (1QMO): Flt3 Interacting Lectin, a lectin that keeps haematopoietic progenitors alive in vitro. FRIL forms a very complicated crosslinked lattice in the crystal, which is probably important for its unique biological activity. The FRIL structure showed for the first time how weak protein-protein interactions can become important when so-called cross-linked lectin-sugar lattices are formed. These lattices are believed to be responsible for the creation of a higher-level specificity, and are thought to be of high importance for the biological effects of lectins. For some years, Phylogix, Inc. (located in Boston) developed FRIL-based therapeutics to protect and repair tissues damaged by chemotherapy.

My PhD work was awarded the shared second place by the jury of the DSM prize for chemistry and technology 1999.

Protein architecture

Protein architecture. The legume lectins are also very interesting from the point of view of protein architecture. They can form different oligomers, despite the extremely conserved character of their subunits. In the picture, the subunits to the left of the oligomers (upper left for the tetramers) are in the same orientation, emphasizing the different ways of subunit assembly. PHA-L, DBL and DB58 all provided new insights into how legume lectin monomers can assemble into multimers. The legume lectins can also form so-called higher order oligomers that can become important when they bind multivalent sugars (see the JMB articles on FRIL in the reference list).


And in addition...

Publications

BibTeX format

PubMed Query

1995

  • Casset, F., Hamelryck, T., Loris, R., Brisson, J., Tellier, C., Dao-Thi, M., Wyns, L., Poortmans, F., Pérez, S. & Imberty, A. (1995) NMR, molecular modeling and crystallographic studies of lentil lectin-sucrose interaction. J. Biol. Chem., 270, 25619-25628 (PDF)

1996

  • Dao-Thi, M.-H., Hamelryck, T. W., Poortmans, F., Voelker, T. A., Chrispeels, M. J. & Wyns, L. (1996) Crystallization of Glycosylated and Nonglycosylated Phytohemagglutinin-L. Proteins Struct. Func. Genet., 24, 134-137 (PDF)
  • Hamelryck, T.W., Dao-Thi, M., Poortmans, F., Chrispeels, M.J., Wyns, L. & Loris, R. (1996) The Crystallographic Structure of Phytohemagglutinin-L. J. Biol. Chem., 271, 20479-20485. (PDF)
  • Hamelryck, T. W., Poortmans, F., Goossens, A., Angenon, G., Van Montagu, M., Wyns, L., & Loris, R. (1996) Crystal Structure of Arcelin-5, a Lectin-like Defense Protein from Phaseolus vulgaris. J. Biol. Chem. 271, 32796-32802 (PDF)

1998

  • Dao-Thi, M., Hamelryck, T.W., Bouckaert, J., Körber, F., Burkow, V., Poortmans, F., Etzler, M., Strecker, G., Wyns, L. & Loris, R. (1998) Crystallization of two related lectins from the legume plant Dolichos biflorus. Acta Cryst., D54, 1446-1449.
  • Hamelryck, T.W., Loris, R., Bouckaert, J. & Wyns, L. (1998) Properties and Structure of the Legume Lectin Family. Trends Glycosci. Glycobiol., 10, 349-404 (PDF) (You'll need Japanese fonts for this one :-)

1999

  • Hamelryck, T.W., Loris, R., Bouckaert, J., Dao Thi M.-H., Strecker, G., Imberty, A., Fernandez, E., Wyns, L. & Etzler, M.E. (1999) Carbohydrate Binding, Quaternary Structure and a Novel Hydrophobic Binding Site in Two Legume Lectin Oligomers from Dolichos biflorus. J. Mol. Biol., 286, 1161-1177 (PDF)
  • Bouckaert, J., Hamelryck, T., Wyns, L., Loris, R. (1999) Novel structures of plant lectins and their complexes with carbohydrates. Curr. Opin. Struct. Biol., 9, 572-577. (PDF)
  • Bouckaert, J., Hamelryck, TW., Wyns, L., Loris, R. (1999) The crystal structures of Man(alpha1-3)Man(alpha1-O)Me and Man(alpha1-6)Man(alpha1-O)Me in complex with concanavalin A. J. Biol. Chem. 274, 29188-29195. (PDF)

2000

  • Hamelryck, T.W., Moore, JG., Chrispeels, MJ., Loris, R., Wyns, L. (2000) The Role of Weak Protein-Protein Interactions in Multivalent Lectin-Carbohydrate Binding: Crystal Structure of Cross-linked FRIL. J. Mol. Biol. 299, 875-883. (PDF)

2001

  • Buts, L., Dao-Thi, M., Loris, R., Wyns, L., Etzler, M., Hamelryck, T. (2001) Weak protein-protein interactions in lectins: the crystal structure of a vegetative lectin from the legume Dolichos biflorus. J. Mol. Biol. 309, 193-201. (PDF)
  • Hamelryck, T.W., Kjeldgaard, M. (2001) An Open Source Multi-purpose Programming Environment for Macromolecular Crystallography. CCP4 newsletter, 39 (PS) (HTML@CCP4)

2003

  • Hamelryck, T., Manderick, B. (2003) PDB parser and structure class implemented in Python. Bioinformatics, 19, 2308-2310. (PDF@Bioinformatics)

2005

  • Hamelryck T. (2005) An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins Struct. Func. Bioinf., 59, 38-48. (PDF)
  • Boomsma, W., Hamelryck, T. (2005) Full Cyclic Coordinate Descent: Solving the protein loop closure problem in Calpha space, BMC Bioinformatics, 6:159 (Abstract&PDF@BioMed)
  • Won, KJ., Hamelryck, T., Prugel-Bennett, A., Krogh, A. (2005) Evolving Hidden Markov Models for Protein Secondary Structure Prediction, Proceedings of the 2005 IEEE Congress on Evolutionary Computation, pp. 33-40, Edinburgh. (PDF)
  • Kent, J.T., Hamelryck, T. (2005). Using the Fisher-Bingham distribution in stochastic models for protein structure. In S. Barber, P.D. Baxter, K.V.Mardia, & R.E. Walls (Eds.), Quantitative Biology, Shape Analysis, and Wavelets, pp. 57-60. Leeds, Leeds University Press. (PDF@LASR)

2006

Note that all publications in 2006 were open access!

  • Boomsma, W., Kent, J.T., Mardia, K.V., Taylor, C.C. & Hamelryck, T. (2006) Graphical models and directional statistics capture protein structure. In S. Barber, P.D. Baxter, K.V.Mardia, & R.E. Walls (Eds.), Interdisciplinary Statistics and Bioinformatics, pp. 91-94. Leeds, Leeds University Press. (PDF@LASR)
  • Baranov, PV., Vestergaard, B., Hamelryck, T., Gesteland, RF., Nyborg, J., Atkins, JF. (2006) Diverse bacterial genomes encode an operon of two genes, one of which is an unusual class-I release factor that potentially recognizes atypical mRNA signals other than normal stop codons. Biology Direct, 1:28 (PDF@Biology Direct)
  • Paluszewski, M., Hamelryck, T. and Winter, P. Reconstructing protein structure from solvent exposure using Tabu Search. (2006) Algorithms Mol. Biol. 1:20. (PDF@AlgMolBiol).

2007

  • Won, KJ., Hamelryck, T., Prugel-Bennett, A. and Krogh, A. (2007) An evolving method for learning HMM Structure: prediction of protein secondary structure. BMC Bioinformatics, 8, 357 (PDF@BMC Bioinformatics)

2008

  • Boomsma, W., Mardia, KV., Taylor, CC., Ferkinghoff-Borg, J., Krogh, A. and Hamelryck, T. (2008) A generative, probabilistic model of local protein structure. Proc. Natl. Acad. Sci. USA, 105, 8932-8937. PDF@PNAS
  • Boomsma, W., Borg, M., Frellsen, J., Harder, T., Stovgaard, K., Ferkinghoff-Borg, J., Krogh, A., Mardia, KV. and Hamelryck, T. (2008) PHAISTOS: protein structure prediction using a probabilistic model of local structure. Proceedings of CASP8, Cagliari, Sardinia, Italy, December 3-7 2008. pp 82-83

2009

  • Hamelryck, T. (2009) Probabilistic models and machine learning in structural bioinformatics. Statistical Methods in Medical Research, Review. 18, 505-526.
  • Cock, P., Antao, T., Chang, J., Chapman, B., Cox, C., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11),1422-1423.
  • Frellsen, J., Moltke, I., Thiim, M., Mardia, KV., Ferkinghoff-Borg, J., Hamelryck, T. (2009) A probabilistic model of RNA conformational space. PLoS Computational Biology, 5(6), e1000406.
  • Borg, M., Mardia, KV., Boomsma, W., Frellsen, J., Harder, T., Stovgaard, K., Ferkinghoff-Borg, J., Røgen, P., Hamelryck, T. A probabilistic approach to protein structure prediction: PHAISTOS in CASP9. LASR 2009 - Statistical tools for challenges in bioinformatics, pp. 65-70. Leeds university press, Leeds, UK.

2010

  • Paluszewski, M., Hamelryck, T. (2010) Mocapy++ - A toolkit for inference and learning in dynamic Bayesian networks. BMC Bioinformatics, 11:126.
  • Harder, T., Boomsma, W., Paluszewski, M., Frellsen, J., Johansson, KE., Hamelryck, T. (2010) Beyond rotamers: A generative, probabilistic model of side chains in proteins. BMC Bioinformatics, 11:306.
  • Paulsen, J., Paluszewski, M., Mardia, KV., Hamelryck, T. (2010) A probabilistic model of hydrogen bond geometry in proteins. LASR 2010 - High-throughput sequencing, proteins and statistics, pp. 61-64. Leeds university press, Leeds, UK.
  • Stovgaard, K., Andreetta, C., Ferkinghoff-Borg, J., Hamelryck, T. (2010) Calculation of accurate small angle X-ray scattering curves from coarse-grained protein models. BMC Bioinformatics, 11:429.
  • Hamelryck, T., Borg, M., Paluszewski, M., Paulsen, J., Frellsen, J., Andreetta, C., Boomsma, W., Bottaro, S., Ferkinghoff-Borg, J. (2010) Potentials of mean force for protein structure prediction vindicated and generalized. PLoS ONE, 5(11): e13714.