Supercomputing Institute Research Bulletin

Summer 1997

Rational Drug Design Workshop
On April 7–11, 1997, the University of Minnesota’s Institute for Mathematics and its Applications and the Supercomputing Institute sponsored a Workshop on Mathematical and Computational Issues in Rational Drug Design. Held on the Minneapolis campus, the workshop attracted over 100 participants from the University and industrial settings. One of the hottest topics was the combinatorial library approach to lead discovery and refinement, which has dramatically changed the way computational chemists at pharmaceutical companies think about drug development.

Colin McMartin of Thistlesoft Inc. in Morris Township, N.J. estimated that the number of potential molecules to be considered in searching for a new drug is 10^400, although there was also some discussion of whether this number might be as small as 10^60. For purposes of discussion, this difference of 340 orders of magnitude is not important since, with either number, it’s clear that it is impossible to synthesize even one percent of these compounds, let alone measure their properties. Many of the talks at the workshop focused on the question of whether computation can come to the rescue. Clearly, the kind of performance that is required can only be powered by new algorithms, and this was a prominent topic of the workshop.
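
To give a sense of the scale, here is a rough back-of-the-envelope estimate. It assumes the smaller figure of 10^60 candidate molecules and a throughput of 10^7 molecules per year, the rate cited for high-throughput robotic screening; both inputs are order-of-magnitude values only.

```latex
\[
0.01 \times 10^{60} = 10^{58}\ \text{molecules},
\qquad
\frac{10^{58}\ \text{molecules}}{10^{7}\ \text{molecules/year}} = 10^{51}\ \text{years}.
\]
```

Even one percent of the smaller estimate is far out of experimental reach at that rate.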

The field is moving experimentally toward the development of thoughtfully created libraries, rather than simply the largest possible libraries. The library should be representative of the chemical functionalities and molecular shapes that are most likely to succeed. Robots make the molecules and then carry out high-throughput assays of their bioactivity. This raises challenging problems in sampling strategy. When the sampling is necessarily thin, researchers want to maximize the effectiveness of a search by making the sampled space as diverse as possible. But how is diversity measured in a nonhomogeneous space of many dimensions? Even if we agree on a measure of diversity, how densely must we sample? Then, once a sample is selected, it must be scored. The goal here is to use computations (which are in principle quantum mechanical, but are more often—in 1997—based on classical approximations) to assign a figure of merit to each molecule which somehow quantifies the probability that it will be a useful lead compound (a high-activity compound whose properties can perhaps be refined to make a fully suitable drug). This puts a new demand on computations: how many molecules can be scored overnight? While chemists call the mathematical techniques involved in this last step modeling, the techniques are pervasive and go by different names in different fields—for example, economists and climatologists call models “forecasts,” and in other fields, modeling is called “machine learning.”
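
One common way to make "as diverse as possible" concrete is a greedy max-min selection: repeatedly add the candidate that lies farthest from everything already chosen. The sketch below is an illustration only, not a method attributed to any workshop speaker. It assumes that each molecule is already represented by a vector of numerical descriptors and that plain Euclidean distance is an acceptable diversity measure; the function names and the toy `library` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_select(points, n, start=0):
    """Greedily pick n indices; each pick maximizes distance to the chosen set."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and the number of points")
    chosen = [start]
    # nearest[i] holds the distance from point i to its closest chosen point.
    nearest = [euclidean(p, points[start]) for p in points]
    while len(chosen) < n:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], euclidean(p, points[nxt]))
    return chosen


# Hypothetical two-descriptor library with three visible clusters.
library = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
picked = maxmin_select(library, 3)
print(picked)
```

With this toy data the three picks land in three different clusters. In a real library the choice of descriptors and their scaling would dominate the outcome, which is exactly the open question raised above.
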
In “conventional” drug design, researchers experimentally assay various synthesized potential drugs for biological activity. With computational assays, however, the assay can, in principle, occur prior to synthesis—i.e., one can assay virtual libraries to prioritize the possibilities for synthesis. In many cases, the assay consists of estimating (“modeling”) bioactivity on the basis of theoretical “descriptors.” Such a descriptor might be a quantitative measure of some aspect of the molecule’s 3-D geometry, or it might be a predicted chemical property, such as free energy of solvation in some medium. A crucial issue with computational assays is how rapidly we can calculate these descriptors, and which ones correlate best with biological activity. This raises numerous questions in quantitative structure–activity relationships (QSARs), such as how we discover and quantify such relationships.
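
As a minimal illustration of asking which descriptors correlate best with activity, the sketch below ranks descriptors by the magnitude of their Pearson correlation with a measured activity. The descriptor names and all numbers are invented toy values, and a real QSAR study would use multivariate models and validation rather than one-at-a-time correlations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_descriptors(descriptors, activity):
    """Return (name, r) pairs sorted by |r|, strongest correlation first."""
    scores = {name: pearson(values, activity) for name, values in descriptors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: five compounds, two descriptors each.
activity = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "solvation_free_energy": [-2.0, -4.1, -5.9, -8.0, -10.1],
    "molecular_volume": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranking = rank_descriptors(descriptors, activity)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```
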

Most drugs work by binding to a specific site, called a receptor, on a protein. A central problem is to find molecules (called ligands in this context) with high binding affinity. The workshop’s first speaker, Garland Marshall of Washington University in St. Louis, presented a stimulating overview of QSAR methods that account for the full 3-D structure of the ligand. A number of specific real-life examples were presented, and their complexity led Garland to pronounce that “Mother Nature never shaved with Occam’s Razor.” A critical issue raised by Marshall is that the internal energy of a typical bound ligand is about 4–5 kcal/mol higher than the lowest energy structure of the free ligand. Marshall’s final examples demonstrated impressive successes using neural nets to obtain QSARs with 3-D descriptors based on properties calculated from an energy-minimized complex.

Peter Willett of the University of Sheffield, England explained the use of graph-theoretical tools, especially maximal common subgraph isomorphism, for substructure searching in chemical databases. The problem is NP-complete, but clever use of chemical “screens” allows progress to be made, another example of how individual chemical or mathematical techniques are not nearly as powerful as a combination of techniques. Joe Eyermann of DuPont Merck discussed procedures for extracting ring scaffolds and their substitution patterns from a molecular graph.

W. Graham Richards of Oxford University in England presented an original approach to another mathematical question: how well can two-dimensional representations simulate a three-dimensional ligand-receptor problem? He found a 2-D structure by minimizing the difference between the distance matrix (the matrix of distances between all pairs of atoms) of a 2-D structure and that of the 3-D structure. Then he applied techniques developed for optical character recognition to work with the 2-D structures. This work was motivated by the combinatorial library revolution mentioned above, in particular by the need to rapidly calculate the similarity between all pairs of a library of thousands of molecules. Clearly there are major opportunities for biopolymer modeling to take advantage of exciting developments in data compression for speech and images.
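
The idea of choosing a 2-D layout whose distance matrix stays close to the 3-D one can be sketched with plain gradient descent on the summed squared differences. This is an illustrative sketch only and is not claimed to be Richards's actual procedure: the objective, the starting guess (dropping the z coordinate), the step size, and the four toy "atoms" are all assumptions made for the example.

```python
import math


def distance_matrix(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def stress(coords_2d, target):
    """Sum over pairs of squared differences between 2-D and target distances."""
    d = distance_matrix(coords_2d)
    n = len(coords_2d)
    return sum((d[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def flatten_to_2d(points_3d, steps=500, rate=0.02):
    """Gradient descent on the stress, starting from the x-y projection."""
    target = distance_matrix(points_3d)
    coords = [[x, y] for x, y, _z in points_3d]
    n = len(coords)
    for _step in range(steps):
        grads = [[0.0, 0.0] for _k in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # gradient undefined for coincident points; skip
                factor = 2.0 * (d - target[i][j]) / d
                grads[i][0] += factor * (coords[i][0] - coords[j][0])
                grads[i][1] += factor * (coords[i][1] - coords[j][1])
        for i in range(n):
            coords[i][0] -= rate * grads[i][0]
            coords[i][1] -= rate * grads[i][1]
    return coords, target


# Hypothetical four-atom geometry (arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.3, 0.3, 1.0)]
flat, target = flatten_to_2d(atoms)
initial = stress([[x, y] for x, y, _z in atoms], target)
final = stress(flat, target)
print(f"stress before: {initial:.3f}, after: {final:.3f}")
```

The residual stress that remains after optimization is a direct measure of how much 3-D information a flat representation cannot carry.
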
Richards also raised another issue: the new millennium problem of chemoinformatics. Succinctly put, high-throughput robotic screening can generate data on 10^7 molecules per year. How, then, should we mine this data?

Doug Rohrer of Pharmacia & Upjohn summarized that company’s approach to similarity analysis, which emphasizes electrostatics and geometry (steric interactions). Jie Liang from Clare Woodward’s research group at the University of Minnesota presented analytic methods for computing molecular shapes.
David Doherty of Minnesota Supercomputer Center Inc. discussed cooperativity and raised questions about cooperative and nonlinear effects in drug diffusion through cell membranes.
Bill Dunn of the University of Illinois at Chicago discussed the issue of variable selection (which descriptors to use) when modeling the binding of flexible ligands.

Chris Cramer of the University of Minnesota discussed solvation energy models which he developed at the University along with MSI Fellow and Director Donald G. Truhlar, Army High Performance Computing Research Center and National Science Foundation postdoctoral associate Candee Chambers, Kodak Fellow David J. Giesen, and National Institute of Standards and Technology graduate research assistant Gregory D. Hawkins. Chris summarized the group’s approach to solvation theory in three axioms: (1) replace the solvent with a field, thereby reducing the number of degrees of freedom; (2) the field is intrinsically approximate, so it is best not to get too caught up in theoretical rigor; (3) don’t neglect short-range effects; electrostatics are not the whole story.

Colin McMartin emphasized that not only are there 10^60 to 10^400 potential drug molecules, but also, in a typical application, 10^27 of these may well bind to the receptor of interest. This astoundingly large number still represents a needle in the haystack of total possibilities, and the question arises of how to explore the space. McMartin discussed his QXP (Quick Explore) program and its “lazymouse” interface. He raised an interesting software design issue in which the goal is not to minimize the number of arithmetic operations in a background calculation but rather to minimize the number of mouse clicks required to visualize a large data base. This strategy takes into account the practical reality that the human being who sits in front of the screen has a finite amount of patience.

Regine Bohacek of Ariad Pharmaceuticals in Cambridge, Mass. discussed her GROMOL program for “growing” ligands (the “key”) in the receptor site (the “lock”). This is a technique for designing combinatorial libraries when the receptor structure is known, and she presented a stunningly successful case study of its use.
Ken Dill of the University of California at San Francisco examined the validity of three basic premises in computational biology modeling: continuum media, additivity assumptions, and independence assumptions. These assumptions are deeply ingrained in chemical thinking, but Dill presented evidence that sometimes they lead us astray. This talk raised questions such as whether parameters fit to solvation data for small molecules provide useful building blocks for biomolecule modeling. Dill brought out the advantages of starting from polymer theory rather than small-molecule theory.

Dennis Sprous of Wesleyan University discussed the question: Having run a long molecular dynamics simulation, how does one extract useful information from the potentially overwhelming amount of raw output?
Markus Wagener of SmithKline Beecham in King of Prussia, Pa. discussed the use of artificial neural networks for three applications in drug design: finding QSARs, transforming data into a simpler representation that is more amenable to further analysis, and classification. He discussed libraries of up to 6.5 × 10^4 compounds with 12 descriptors for each compound. Brian Luke of the National Cancer Institute’s Frederick Research and Development Center discussed the application of parallel genetic algorithms (GAs) for exploring the full conformational space of a set of inhibitors as a step in the generation of new putative ligands.

David Rogers of Molecular Simulations Inc. in San Francisco discussed a second-generation genetic algorithm strategy in which the GA not only finds the best parameters for a QSAR, but also finds the best set of descriptors. The GA selects possible linear and nonlinear descriptors out of a pool and optimizes the linear and nonlinear parameters, as well as the choice of descriptors, with a fitness function that includes a penalty for increasing the number of descriptors. He calls this strategy genetic function approximation (GFA). Rogers also raised questions about the strategy of designing experiments on libraries to maximize what we can learn from them, not just to find a lead ligand directly. In this context he emphasized the concept of heteroscedasticity (the variance of predictions over diverse models) and whether it is better to use only one’s best model or to retain a set of diverse models. This is a dilemma we are all familiar with when, for example, we attempt to “average” the weather forecasts on three different newscasts. Since all three forecasts typically derive from a single forecast by the National Weather Service, there is little heteroscedasticity. The use of a diverse set of models may be particularly appropriate when the experimental data is underdetermined, that is, does not contain enough information to differentiate among alternative hypotheses of the system.
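
The role of a descriptor-count penalty in the fitness function can be shown with a deliberately simple score. The sketch below is not the fitness function used in GFA; the additive form, the `penalty` weight of 0.1, and the two candidate models with their squared errors are all hypothetical, chosen only to show how a slightly less accurate model with fewer descriptors can win.

```python
def penalized_fitness(squared_errors, n_descriptors, penalty=0.1):
    """Higher is better: negative mean squared error minus a per-descriptor cost."""
    mse = sum(squared_errors) / len(squared_errors)
    return -(mse + penalty * n_descriptors)


# Hypothetical candidates: per-compound squared errors and descriptor counts.
candidates = {
    "A": {"squared_errors": [0.04, 0.09, 0.01, 0.04], "n_descriptors": 2},
    "B": {"squared_errors": [0.01, 0.04, 0.01, 0.02], "n_descriptors": 5},
}

scores = {
    name: penalized_fitness(c["squared_errors"], c["n_descriptors"])
    for name, c in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Model B fits the data more closely, but model A scores higher once its three fewer descriptors are taken into account. In a genetic algorithm, a score of this kind would decide which candidate models survive into the next generation.
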

What is the best method of selecting (for further scrutiny) a subset of n drugs from a larger set of N potential drugs? The drug selection problem is formally identical to finding the optimal set of digital approximations to a set of analog signals. Jason Rush of the University of Washington in Seattle envisaged the components of a library with D descriptors as points in a D-dimensional Euclidean space and discussed various criteria for sampling this space to maximize the ratio of diversity retained to the number of points sampled. He advocated using 24-dimensional Voronoi cells, rather than hypercubes, and explained the unusual features of 24-dimensional space that appear to make this optimal. Along the way, he gave the audience a taste of the amazing properties of 24-dimensional space that seem to make the mathematics more feasible than in other dimensionalities. For example, what is the maximum number of equal spheres that can simultaneously touch a central sphere of the same size in D dimensions? The answer is well known to be 2, 6, and 12 in 1, 2, and 3 dimensions, respectively, and it is 24 in 4 dimensions. The exact answer is unknown in most dimensions from 5 to 23 (8 dimensions, where it is 240, is an exception), but it is 196,560 in 24 dimensions.
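
The touching-sphere counts for 3 and 4 dimensions can be reproduced from a lattice. The sketch below uses the checkerboard lattice D_n (integer points whose coordinates sum to an even number), a standard construction that is not taken from the talk. Its shortest nonzero vectors are the centers of the spheres touching a sphere at the origin, and in 3 and 4 dimensions their number equals the values 12 and 24 quoted above. The function names are hypothetical.

```python
from itertools import product


def dn_minimal_vectors(n):
    """Shortest nonzero vectors of D_n: even coordinate sum, squared length 2."""
    return [
        v for v in product((-1, 0, 1), repeat=n)
        if sum(v) % 2 == 0 and sum(x * x for x in v) == 2
    ]


def min_squared_separation(vectors):
    """Smallest squared distance between two distinct vectors in the list."""
    return min(
        sum((a - b) ** 2 for a, b in zip(u, v))
        for i, u in enumerate(vectors)
        for v in vectors[i + 1:]
    )


for dim in (3, 4):
    neighbors = dn_minimal_vectors(dim)
    # Squared separation of at least 2 means the neighboring spheres
    # (radius sqrt(2)/2) touch the central sphere without overlapping each other.
    print(dim, len(neighbors), min_squared_separation(neighbors))
```

The same lattice gives only 4 neighbors in 2 dimensions, where the hexagonal arrangement with 6 is better, so the construction is not optimal in every dimension. The 24-dimensional count comes from a much more intricate lattice that this sketch does not attempt.
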

Gordon Crippen of the University of Michigan discussed the use of mixed-integer arithmetic for deducing how different ligands might bind to the same receptor. This involves a competition of steric and energetic factors. He found multiple-binding models even for a single ligand and discussed the complications that this engenders. Mathias Rarey of SmithKline Beecham discussed an algorithm for docking flexible ligands.

Tom Darden of the National Institute of Environmental Health Sciences in Research Triangle Park, N.C., who can be seen on page 5 thinking deeply over Simon Kearsley’s right shoulder, discussed Ewald summation methods for electrostatics and the application of several methods to a prototype problem in electrostatics: can one calculate the free energy of a single sodium cation in water? Darden pointed out a number of issues that still require clarification before we can accept such calculations as reliable.
Mike Pique of the Scripps Research Institute in La Jolla, Calif. discussed a convolution algorithm for the rapid computation of the electrostatic potential energy between two proteins when their relative orientation and separation must be optimized. Excellent results were obtained on a 256-node Intel Paragon parallel computer. He also discussed the future of dataflow visualization environments in which scientific visualization is merging with three technical innovations: visual object-oriented programming (VOOP) with graphical editing tools; the unification of computer imaging (image-to-image or image-to-data), including texture mapping, spatial filtering, and computer graphics (data-to-image); and the Internet. The new features introduced by the Internet, he said, are an emphasis on communication, distributed resources, and computer-architecture neutrality. Pique stressed the advances of Java, which allows pharmaceutical chemists to e-mail 3-D images. Pique used the metaphor “fog of excitement” to describe the flourish of activity in this field, but he believes that taking advantage of advances in digital technology by combining the exploding resources of the Internet and desktop 3-D graphics and imaging will provide a constructive advance.

Sandor Vajda of Boston University discussed a free energy function for scoring protein-ligand binding. Wynn Walker of UCLA discussed the design of ligands for DNA, as opposed to the more common protein targets. Wei-min Lin of Indiana University–Purdue at Indianapolis presented quantitative rules for determining protein structural classes based on their secondary structure. Joel Nilsson of Uppsala University in Sweden discussed a composite-overlapping grid model for drug delivery in the human eye.
Jeff Blaney of Chiron Corporation led a panel discussion on the question: What are the new challenges that should be addressed in the next ten years? Also contributing were Gordon Crippen, Simon Kearsley of Merck, Garland Marshall, and Phil Portoghese of the University of Minnesota. Some of the suggestions were: visualization of multidimensional databases, scoring functions for intermolecular interactions, improved methods for error analysis, understanding the relationship of ligand binding to conformational change in the receptor, improved techniques for giving “statistical advice,” mining chemical databases, data scrubbing, expert data warehouses, rule-generation algorithms, fuzzy logic, and methods of dealing with noisy, sparse data. Clearly many of these topics are problems whose prominence has been promoted by the rise of digital technology.

In a compelling after-dinner speech at the banquet, Dr. Ralph Hirschmann of the University of Pennsylvania drew on his long industrial experience to present another perspective on drug design, focusing on many non-computational issues. For example, he discussed a 1997 paper in the Journal of the American Chemical Society in which the authors found that the shape of a base inserted in DNA, rather than its hydrogen-bonding ability, may be the key to the polymerase recognition process that leads to faithful copying of DNA. Although this is an experimental result, by underscoring the role of 3-D shape it further dramatizes the role that computation can play in designing biotechnological molecules that mimic one or another capability of natural biological molecules. Dr. Hirschmann, however, took exception to the use of “rational drug design” and “computer-aided drug design” as near synonyms; he claims that pharmaceutical researchers were not totally irrational before they had computers!

The workshop was preceded by a six-hour tutorial presented by Professor David Ferguson of the University of Minnesota which culminated in a visit to the University’s medicinal chemistry labs. The Institute is grateful to Professor Ferguson and to organizing committee chair Jeff Howe of Pharmacia and Upjohn, as well as the other members of the organizing committee, Rich Dammkoehler of Washington University in St. Louis, Jeff Blaney, Tony Hopfinger, and Donald Truhlar, for their contributions to making the conference a success. The Institute also thanks Avner Friedman and Robert Gulliver of the Institute for Mathematics and its Applications for their many contributions to the planning and success of the workshop. Funding by the National Science Foundation is also gratefully acknowledged.

