Markov (state) models (MSMs) and related approaches are coarse-grained models of molecular dynamics, consisting of distinct molecular structures/conformations and the transition rates between them. Markov models can systematically reconcile simulation data from a few long or many short simulations, and allow us to analyze the essential metastable structures, thermodynamics, and kinetics of the molecular system under investigation. However, the estimation, validation, and analysis of such models is far from trivial and involves sophisticated and often numerically sensitive methods.

PyEMMA 2 provides accurate and efficient algorithms for kinetic model construction and analysis. PyEMMA is a Python package that runs under all common operating systems and can be used through Python user scripts or interactively in IPython notebooks. Its functionality includes:

- Read all commonly used MD input formats (powered by mdtraj).
- Featurize your trajectories, i.e. transform Cartesian coordinates into dihedrals, distances, contact maps or custom features.
- Find the slowest collective coordinates (reaction coordinates) using time-lagged independent component analysis (TICA).
- Cluster and discretize the state space.
- Process data either in memory or streaming mode - that way you can work even with very large / many trajectories and big molecular systems.
- Estimate and validate Markov state models (MSMs). Compute their statistical distribution and uncertainty using Bayesian MSMs.
- Compute long-lived (metastable) states and structures with Perron-cluster cluster analysis (PCCA).
- Perform systematic coarse-graining of MSMs to transition models with few states.
- Estimate hidden Markov Models (HMM) and Bayesian HMMs (powered by bhmm).
- Take advantage of extensive analysis options for MSMs and HMMs, e.g. calculate committor probabilities, mean first passage times, transition rates, experimental expectation values and time-correlation functions (powered by msmtools).
- Explore mechanisms of protein folding, protein-ligand association or conformational changes using Transition Path Theory (TPT).
- Plot and visualize your results in paper-ready form.

**Installation, Documentation and Tutorials**: http://pyemma.org

**Post issues / participate in development**: https://github.com/markovmodel/pyemma

**PyEMMA Paper**: M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernandez, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz and F. Noé: PyEMMA 2: A software package for estimation, validation and analysis of Markov models. *Journal of Chemical Theory and Computation* **11**, 5525–5542 (2015)

The introduction of new hardware, which was necessary to extend simulations to longer time scales, brought with it the need to rewrite software to take advantage of the accelerators. These rewrites paid dividends, allowing MD simulations to run up to 100x faster than was previously possible, on hardware that is less expensive and more energy efficient than dedicated supercomputers.

However, these software advances aimed at new accelerator hardware have focused almost exclusively on classical fixed-charge force fields; advanced force fields like AMOEBA were largely left out. While fixed-charge force fields have enjoyed jumps in computational performance of several orders of magnitude since I started working in the field (ca. 2008), the computational performance of the AMOEBA force field has grown far more slowly.

As is a general trend, what makes AMOEBA so much more accurate than fixed-charge force fields with respect to the underlying physics -- namely, the treatment of atoms as point multipoles with a polarizable dipole rather than a simple charge -- also makes it much slower. For instance, comparing the implementation of AMOEBA in the Amber program suite with the fixed-charge Amber force field showed a performance gap of 40-50x. Comparing against the GPU implementation of the fixed-charge code widens that gap by roughly another order of magnitude. Such performance gaps make it nearly impossible to do satisfactory sampling with the AMOEBA force field, which limits the promise of its improved underlying physics and handicaps efforts aimed at improving the parametrization.

But with OpenMM, that is changing. OpenMM boasts the first, and to date only, implementation of the AMOEBA force field for GPUs. Its results are indistinguishable from those generated by the reference AMOEBA implementation in Tinker (and OpenMM is used as a backend in Tinker to provide GPU acceleration). To demonstrate the promise of the AMOEBA implementation in OpenMM's CUDA platform, I've listed comparative benchmarks below for the AMOEBA force field running in Tinker and OpenMM on the Joint Amber-CHARMM (JAC) benchmark. This system has 23,558 atoms, comprising the dihydrofolate reductase (DHFR) protein with explicit water molecules.

Number of Cores | Tinker (OpenMP) | pmemd.amoeba (MPI) |
---|---|---|
2 | 0.03 ns/day | 0.06 ns/day |
4 | 0.04 ns/day | 0.09 ns/day |
8 | 0.08 ns/day | 0.11 ns/day |
16 | 0.09 ns/day | 0.07 ns/day |

These implementations have limited scalability and cannot efficiently utilize more than 16 to 32 processors. As a result, it would take a standard server almost 25 years to simulate 1 microsecond of MD! By comparison, many fixed-charge force fields can achieve performance nearing 200 ns/day!
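The arithmetic behind that estimate is straightforward, using the best CPU rate from the table above:

```python
# Time required to simulate 1 microsecond at the best CPU AMOEBA rate
# shown in the table above (pmemd.amoeba on 8 MPI ranks).
best_rate_ns_per_day = 0.11   # ns/day
target_ns = 1000.0            # 1 microsecond = 1000 ns

days = target_ns / best_rate_ns_per_day
years = days / 365.25
print(f"{years:.1f} years")   # roughly 25 years
```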

Now let's look at OpenMM performance using the CUDA implementation of AMOEBA:

GPU Model | Performance (ns/day) |
---|---|
C2050 | 0.3886 |
K5000 | 0.5233 |
GTX 680 | 0.7468 |
GTX 780 | 1.1380 |
GTX 980 | 2.1467 |

Using the latest GPU model available (the GTX 980, which costs around $500-$600 USD in the USA), we achieve about a 20x speedup over the maximum performance attainable on a standard server! The trend in GPU performance is also telling: incremental GPU model updates yield substantial performance gains.
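The quoted speedup follows directly from the two tables:

```python
# Ratio of the fastest GPU result to the fastest CPU result above.
gtx_980 = 2.1467   # ns/day, OpenMM CUDA on a GTX 980
best_cpu = 0.11    # ns/day, best CPU result (pmemd.amoeba, 8 ranks)

speedup = gtx_980 / best_cpu
print(f"{speedup:.1f}x")   # about 19.5x, i.e. roughly 20x
```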

To improve matters even more, the AMOEBA implementation has undergone numerous performance improvements since the release of OpenMM 6.3, and now runs between 10 and 15% faster than these benchmarks show. And further improvements seem well within reach, but those will be saved for a future blog post.

Back when I was a postdoctoral researcher doing molecular simulations (~2005), a colleague would often, perhaps half-jokingly, say that 90% of our job was converting data from one file format into another. We would spend a huge amount of time poring over documentation trying to make sure we got format specifications just right in order to put together a one-off conversion script to get data into the right format to apply the tool we needed to take us on to the next step. And if we weren't manipulating file formats, we were poring over documentation to figure out how to migrate our simulation protocol from one simulation package into another to take advantage of the latest algorithm we wanted, which was only available in a simulation package we hadn't already been using. And even worse was if we wanted an analysis tool designed to cope with a third file format, requiring multiple file format conversions which could be nearly impossible to validate.

Unfortunately, I think a substantial portion of my time over the last 10 years has been devoted to solving those same problems again and again. The science, in some ways, has seemed to be the easy and fun part, while all of the file format and software usage issues provide perpetual tedious time-sinks.

With Omnia, I'm finally seeing the future arriving. With tools like ParmEd and MDTraj, we're now able to convert between a wide range of file formats, allowing easy access to more simulation packages, and more analysis options (which can now be implemented in a general, rather than package-specific, way). And if algorithms we need aren't available, OpenMM provides an easier and more flexible way to get up and running with them without having to switch to a totally different simulation package. And the community arising around Omnia means that it now makes more sense to justify writing sound, reusable code rather than one-off tools which solve our specific problem but will never be used again.

While my group's involvement with Omnia is still in its early days, it has already been useful in helping us automate setup of simulations of solutes in arbitrary solvent mixtures for ongoing studies in the lab, via our SolvationToolkit. And we're devoting substantial effort towards Omnia-related development, and shifting applications work towards using many of the Omnia tools.

I'm so excited to finally be moving away from reinventing the wheel and towards devoting more of my time to the actual science that excites me.

So why is this useful? There are two reasons. First, you might want to simulate a crystal, and not all crystals have rectangular unit cells. You need to be able to specify whatever unit cell shape the actual crystal has, and with OpenMM 6.3 you can now do that.

The other reason is less obvious, but probably more important. When you simulate a protein in water, you should make the periodic box big enough that the protein doesn't interact with itself. "Big enough" is hard to exactly define, since Coulomb interactions are very long ranged, but generally you want at least 1 nm of water between periodic copies of the protein, and more is better.

But a protein in water is free to rotate, and that can change the distance between copies. You need to make sure that, no matter how the protein rotates, there will still be enough padding. That is, you need a sufficiently large sphere of water surrounding it. You can then embed that sphere in a cube, but that's wasteful. The extra water makes your simulation run slower. Ideally, you want your periodic box to be a sphere.

That's impossible, since you can't pack spheres to fill space. But there are polyhedra you can use that at least come closer to being spherical, like the truncated octahedron and rhombic dodecahedron. By using one of those shapes, you can get away with fewer water molecules while still keeping the same minimum distance between copies of the protein.

So what does this have to do with triclinic boxes? There turns out to be an amazing mathematical result: *any* repeating shape that fills all of space is exactly equivalent to a triclinic box. Given a truncated octahedron, for example, there is an equivalent triclinic box with exactly the same volume that, when tiled periodically throughout space, produces exactly the same repeating pattern of atoms. And that means that if you have triclinic boxes, you automatically have all other shapes as well.

Given the minimum required distance between periodic copies, the optimal triclinic cell has about 71% of the volume of the optimal rectangular cell, which means your system needs only about 71% as many atoms. So although this isn't a huge optimization, it does allow you to use a somewhat smaller system that can be simulated somewhat faster.
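The 71% figure can be checked directly. The reduced triclinic vectors equivalent to a rhombic dodecahedron with minimum image distance d are well known, and the cell volume comes out to d³/√2 ≈ 0.71 d³ (a sketch; the vectors below are the standard reduced form, not taken from OpenMM's source):

```python
import math

def box_volume(a, b, c):
    """Volume of a triclinic cell = |a . (b x c)|."""
    b_cross_c = (b[1]*c[2] - b[2]*c[1],
                 b[2]*c[0] - b[0]*c[2],
                 b[0]*c[1] - b[1]*c[0])
    return abs(sum(ai*vi for ai, vi in zip(a, b_cross_c)))

d = 1.0  # minimum distance between periodic images
# Reduced triclinic vectors equivalent to a rhombic dodecahedron:
a = (d, 0.0, 0.0)
b = (0.0, d, 0.0)
c = (0.5*d, 0.5*d, 0.5*math.sqrt(2.0)*d)

ratio = box_volume(a, b, c) / d**3
print(f"{ratio:.3f}")  # 0.707, i.e. about 71% of the cube's volume
```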

Ensembler is an automated pipeline for generating diverse arrays of protein configurations from *omics*-scale genomic and structural data. The user selects a set of protein sequences (*targets*) and a set of protein structures (*templates*), and each target-template pair is then subjected to comparative modeling and a series of refinement steps. Briefly, these are the stages of the Ensembler pipeline:

1. Target sequence selection - via a search of UniProt, or defined manually

2. Template structure selection - via a search of UniProt, or by specifying PDB IDs, or defined manually

3. (*Optional*) Template loop reconstruction (using Rosetta)

4. Alignment and comparative modeling (using Modeller)

5. RMSD-based clustering to filter out non-unique models (using MDTraj)

6. Energy minimization and implicit solvent molecular dynamics simulation (using OpenMM) (default: 100 ps)

7. Solvation with explicit water (using OpenMM)

8. Explicit solvent molecular dynamics simulation (using OpenMM) (default: 100 ps)

9. (*Optional*) Package and/or compress the models, ready for transfer or set-up on other platforms such as Folding@home

Ensembler thus maximizes usage of publicly available sequence and structure data to produce configurationally diverse protein ensembles. These models can then be subjected to further structural analysis or used to seed highly parallel molecular dynamics simulations (e.g. using distributed computing frameworks such as Folding@home). In the latter case, the increased efficiency of sampling would be of particular utility for methods such as Markov state models which can aggregate trajectory data to construct kinetic models of conformational dynamics.

Documentation and installation information can be found at http://ensembler.readthedocs.org/en/latest/.

All source code is available on GitHub at

https://github.com/choderalab/ensembler

Our recently submitted manuscript describing Ensembler and its application to modeling all human tyrosine kinases can be found on bioRxiv.

We are looking for a software engineer to join Vijay Pande’s lab at Stanford University to advance the Folding@home project. As the primary software engineer, you will be responsible for coordinating the development of the software pipeline from identifying needs and designing new features, to programming and testing the code, to supporting the Folding@home users. Your technical contributions, interactions with researchers, and working relationships with industrial partners like NVIDIA and Sony will make you a critical part of advancing this unique and valuable resource for the research community.

To learn more or to apply, visit https://stanford.taleo.net/careersection/jobdetail.ftl?job=67671&lang=en#.ValymnZIYTQ.email.

In a word: flexibility. In a few more words: ease of development.

One component of my laboratory's research is integrative structural biology, where we are developing computational methods that can infer protein structures from sparse, ambiguous, and unreliable data. We do this by adding additional forces to the system to account for the experimental data. I won't go into the details of our research, but in this post I will outline the use of custom force classes, which are one of two ways to add custom forces to OpenMM simulations. In a future post, I will discuss the more complex—but also more flexible—route of generating an OpenMM plugin with custom GPU code.

The first—and simplest—way to add a new force to OpenMM is to use one of the seven CustomForce classes:

- CustomAngleForce
- CustomBondForce
- CustomExternalForce
- CustomGBForce
- CustomHbondForce
- CustomNonbondedForce
- CustomTorsionForce

Each of these classes implements a particular type of interaction. Most of the interaction types are obvious from the name, but a few are worth pointing out. CustomExternalForce represents external forces that act on each atom independently of the positions of other atoms. CustomGBForce allows for the implementation of new types of generalized Born solvation models (see customgbforces.py for examples). Finally, CustomHbondForce is a very interesting class that allows for distance- and angle-dependent hydrogen bonding models, although such terms are not common in standard MD force fields.

What makes these classes so easy to use is that they do not require writing any GPU code. One simply provides a string containing a series of expressions that describe the force to be added and OpenMM does the rest, differentiating the expression to get the forces and automatically generating the required GPU code.

As an example, the following code implements a flat-bottom harmonic potential.

```
import openmm as mm

# Flat-bottom harmonic restraint: zero inside r0, harmonic beyond it.
flat_bottom_force = mm.CustomBondForce(
    'step(r-r0) * (k/2) * (r-r0)^2')
flat_bottom_force.addPerBondParameter('r0')
flat_bottom_force.addPerBondParameter('k')

# This assumes that an openmm System object has been created:
# system = ...
system.addForce(flat_bottom_force)
```

The CustomBondForce constructor takes the energy to be evaluated as a string. It is also possible to use more complex expressions with intermediate values (see the manual for more details).

The two addPerBondParameter calls declare per-bond parameters, which are referred to in the energy expression. In this case, for each bond that is added we will need to specify the values of r0 and k. It is also possible to specify global parameters shared by all bonds using the addGlobalParameter method.

The final line adds our new force to the system. Don't forget this step! It's easy to overlook: simply creating the force is not enough; it must actually be added to the system.
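To see concretely what OpenMM evaluates from that string, here is the same energy expression transcribed into plain Python (OpenMM's step(x) is 0 for x < 0 and 1 otherwise):

```python
def step(x):
    # OpenMM's step(): 0 if x < 0, 1 otherwise
    return 0.0 if x < 0 else 1.0

def flat_bottom_energy(r, r0, k):
    # 'step(r-r0) * (k/2) * (r-r0)^2'
    return step(r - r0) * (k / 2.0) * (r - r0) ** 2

# Inside the well (r <= r0) the energy is zero; beyond r0 it is harmonic.
print(flat_bottom_energy(0.8, 1.0, 100.0))  # 0.0
print(flat_bottom_energy(1.2, 1.0, 100.0))  # ~2.0 (= 100/2 * 0.2^2)
```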

We've now defined a new custom force. But, how do we add new interactions to our system? Again, OpenMM provides a great deal of flexibility—especially when using the python interface. The code below reads in the flat-bottom restraints to add from a text file.

```
# restraints.txt consists of one restraint per line
# with the format:
# atom_index_i atom_index_j r0 k
#
# where the indices are zero-based
with open('restraints.txt') as input_file:
    for line in input_file:
        columns = line.split()
        atom_index_i = int(columns[0])
        atom_index_j = int(columns[1])
        r0 = float(columns[2])
        k = float(columns[3])
        flat_bottom_force.addBond(
            atom_index_i, atom_index_j, [r0, k])
```

The addBond call adds the new interaction. Notice that the parameters for the bond are specified as a list, and that they are given in the order of the calls to addPerBondParameter.

This is a very simple example, but new interactions can be added in a variety of ways: based on atom and residue types, based on proximity to other atoms or residues, based on position in the simulation cell, and so on.

There are only a few potential downsides or pitfalls when using custom force classes. The first is performance. Generally the performance of the custom force classes is quite good, but as with all automatically generated code, a hand-coded version would likely be faster. In practice, most such interactions take only a small part of the runtime, so spending considerable effort to speed them up is of questionable utility.

The second downside is the limited programmability. The custom force classes do not allow for loops, iteration, or if statements. Sometimes one can work around this using functions like min, max, or step (as used above). But sometimes what you want to calculate cannot easily be turned into the kind of *expression* that the custom force classes take as input. In that case, one must implement a plugin with a hand-coded custom force, which is more flexible but more involved.
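As an example of such a workaround, a two-sided flat-bottom restraint can be written without any if statements by combining two step() terms. Checked here in plain Python; the corresponding OpenMM expression string appears in the comment:

```python
def step(x):
    # OpenMM's step(): 0 if x < 0, 1 otherwise
    return 0.0 if x < 0 else 1.0

def two_sided_flat_bottom(r, lower, upper, k):
    # Equivalent OpenMM expression:
    # '(k/2) * (step(r-upper)*(r-upper)^2 + step(lower-r)*(lower-r)^2)'
    return (k / 2.0) * (step(r - upper) * (r - upper) ** 2
                        + step(lower - r) * (lower - r) ** 2)

print(two_sided_flat_bottom(1.0, 0.9, 1.1, 100.0))  # 0.0 inside the well
print(two_sided_flat_bottom(1.3, 0.9, 1.1, 100.0))  # harmonic beyond 1.1
```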

One of OpenMM's key strengths is the ease of adding new types of interactions. I have given a simple demonstration of how to add a custom bonded force. Other types of custom forces like torsions and new non-bonded interactions are similarly straightforward. In a future post, I will outline the steps required to create a custom plugin, which allows for more flexibility and possibly better performance, but at the cost of greater complexity.

The force field is a potential energy function of the atomic positions; it provides the forces for accelerating the atoms in molecular dynamics and the potential energies for sampling from a statistical mechanical ensemble. In comparison to quantum mechanical methods, force fields do *not* require an explicit treatment of the electrons, so the calculations are millions to billions of times faster.

Some of the most popular force fields today such as AMBER, CHARMM, OPLS and GROMOS use a simple functional form first defined by Lifson and Warshel (Reference 1), which is a sum of: (1) simple harmonic potentials for the vibrations of predefined bonds and angles, (2) periodic potentials for the torsions of molecular backbones, and (3) pairwise interatomic Coulomb and Lennard-Jones potentials for describing intermolecular interactions (Figure 1). In addition, there exists a diverse assortment of force fields with varying levels of detail and domains of applicability; this includes detailed polarizable models that describe some aspects of the electrons, and coarse-grained models that use a single particle for groups of multiple atoms.

**Figure 1**: Summary of interaction terms in a typical biomolecular force field. Reproduced from Reference 2.
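Written out, a typical functional form of this kind is (a representative expression; exact conventions and prefactors vary between force fields):

$$ U = \sum_{\mathrm{bonds}} \frac{k_b}{2} (b - b_0)^2 + \sum_{\mathrm{angles}} \frac{k_\theta}{2} (\theta - \theta_0)^2 + \sum_{\mathrm{torsions}} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right] + \sum_{i<j} \left( \frac{1}{4\pi\epsilon_0} \frac{q_i q_j}{r_{ij}} + 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] \right) $$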

The *empirical parameters* in these force fields such as bond lengths, force constants, torsional barriers and atomic charges are carefully fine-tuned by the force field developer to reproduce known experiments and *ab initio* quantum calculations, with the goal of increasing accuracy and predictive power. This is a highly challenging problem for many reasons, including the following:

1) The force field is a highly approximate description, so any errors resulting from incompleteness of the model are effectively incorporated (or “rolled up”) into the parameters.

2) It is challenging to determine the dependence of simulated observables on the force field parameters, because simulations are expensive and simulated observables are statistically noisy.

3) The parameterization calculations are highly complex and there exists no workflow for carrying out a force field parameterization project, so results are not reproducible.

These difficult challenges gave rise to the colloquialism, “Nobody wants to know how force fields and sausages are made.” That is, until recently.

**Figure 2**: The ForceBalance calculation procedure. The calculation begins with an initial set of force field parameters (bottom left), which is used to generate a force field and run simulations using external MD software (upper left). ForceBalance evaluates force field predictions of observables from the simulation data and compares them to saved experimental measurements or quantum calculations to evaluate the objective function and its parameter dependence (upper right). The optimization algorithm determines the next iteration of force field parameters (bottom) and the cycle is repeated until convergence. Reproduced from Reference 3.

**ForceBalance** (Reference 3) is a method / software package designed from the ground up to address the difficult challenges of force field development; it applies Lifson and Warshel’s fundamental idea of least-squares fitting of parameters to a training data set, but the calculation procedure is made automatic, efficient, systematic and reproducible. ForceBalance creates a framework where any force field parameterization project is carried out by setting up and running the program (Figure 2) – much like how MD simulation projects are individual calculations (or sequences of calculations) in OpenMM.
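The least-squares idea at the heart of this procedure can be sketched in a few lines. This toy example fits one parameter of a made-up model to reference data by gradient descent; ForceBalance's actual machinery (MD observables, regularization, sophisticated optimizers) is far more elaborate:

```python
# Toy least-squares parameter fit in the spirit of ForceBalance:
# tune a single "force constant" k so that a model observable matches
# reference data. (Hypothetical data and model, not real MD output.)
ref_data = [(1.0, 2.0), (2.0, 8.0), (3.0, 18.0)]  # (input, reference value)

def model(k, x):
    return k * x * x  # stand-in for an expensive simulation

def objective(k):
    # Least-squares objective over the training set
    return sum((model(k, x) - y) ** 2 for x, y in ref_data)

k = 0.5  # initial parameter guess
for _ in range(200):  # simple gradient descent on the objective
    grad = sum(2.0 * (model(k, x) - y) * x * x for x, y in ref_data)
    k -= 1e-3 * grad
print(round(k, 3))  # converges to 2.0, which reproduces the data exactly
```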

If you’re working on a problem that requires a more accurate force field than is available in the literature, you are encouraged to download ForceBalance and try it for your project. New users are encouraged to visit https://simtk.org/home/forcebalance, follow the link to the GitHub source code repository, and download the development version of the code. If you have questions about how to use the program, please write me a message on simtk.org or post a question to the discussion forum.

*References for further reading:*

(1) Warshel, A.; Lifson, S. Consistent Force Field Calculations. II. Crystal Structures, Sublimation Energies, Molecular and Lattice Vibrations, Molecular Conformations, and Enthalpies of Alkanes. *J. Chem. Phys.* **1970**, *53*, 582.

(2) Levitt, M. The Birth of Computational Structural Biology. *Nature Structural Biology* **2001**, *8*, 392-393.

(3) Wang, L.-P.; Martinez, T. J.; Pande, V. S. Building Force Fields: An Automatic, Systematic, and Reproducible Approach. *J. Phys. Chem. Lett.* **2014**, *5*, 1885-1891.

MDTraj is a modern, lightweight and efficient software package for analyzing molecular dynamics trajectories. It reads trajectory data from a wide variety of formats, including those used by AMBER, GROMACS, CHARMM, NAMD and TINKER. The package has a strong focus on interoperability with the wider scientific Python ecosystem.

The 1.0 release indicates substantial stabilization of the package, and a strong commitment to backward compatibility. New features since the 0.9 release include interactive WebGL-based protein visualization in IPython notebook and a full implementation of DSSP secondary structure assignment.

More information, detailed release notes, downloads and a large number of example analysis notebooks can be found at http://mdtraj.org. Get involved with the code at https://github.com/simtk/mdtraj!

I recently came across this very nice paper from Greg Wilson and the team at Software Carpentry, an organization that runs bootcamps for research scientists, teaching the tools of modern software engineering.

There's a ton of really practical advice in this paper -- it's especially useful for beginners and those who are self-taught, but there's also a lot of advice for even the most seasoned expert. I recommend taking a look! (I wish I had read it when I started graduate school.)

Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, et al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745. doi:10.1371/journal.pbio.1001745

Most force fields model electrostatics by assigning a "partial charge" to every atom. This is supposed to represent the average amount of charge found near each atom. The force field then treats atoms as point charges, using Coulomb's law to compute the interactions between them.

There are two main problems with this. First, real atoms are *not* point charges. The electrons are spread out in a diffuse cloud around the nuclei. In many cases (covalent bonds, lone pairs, etc.) there are concentrations of charge density that aren't centered at the location of any atom. Atom-centered partial charges only give a poor approximation to the electric field created by the real charge distribution.

Second, real systems are polarizable. The electrons move around, responding to the presence of other nearby charges. Every time the atoms move, the electrons rearrange themselves and the charge distribution changes. In conventional force fields the partial charges are all fixed, so they cannot accurately reproduce polarization effects.

AMOEBA improves on this in two ways. First, instead of just assigning a partial charge to each atom, it also assigns a dipole moment and a quadrupole moment. This gives it much more power to describe the electric field around a molecule. Second, it also assigns an "induced dipole" to every atom. These are recomputed at each time step based on the electric field from all other atoms in the system. This allows it to model polarization effects.

These improvements make AMOEBA much more accurate, but they also make it much more expensive to compute! The equations for the interaction between two quadrupoles are enormously more complicated than those for the interaction between two point charges. Also, the induced dipoles depend not just on fixed multipoles but also on other induced dipoles. So every induced dipole affects every other induced dipole, and we need to use an iterative process to solve for them. All of this makes AMOEBA very slow to work with.
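The iterative process can be made concrete with a toy one-dimensional sketch of mutual induction, in which each induced dipole responds to the fixed field plus the field of the other induced dipole (hypothetical numbers; the real AMOEBA equations use 3D dipole-dipole interaction tensors):

```python
# Toy 1D mutual induction between two polarizable sites.
alpha = 0.1            # polarizability of each site
coupling = 0.5         # toy field-per-unit-dipole between the sites
E_fixed = [1.0, -0.5]  # field from the fixed multipoles at each site

mu = [0.0, 0.0]        # initial guess for the induced dipoles
for _ in range(100):   # iterate until self-consistent
    mu = [alpha * (E_fixed[0] + coupling * mu[1]),
          alpha * (E_fixed[1] + coupling * mu[0])]

# At convergence each dipole satisfies mu_i = alpha * (E_i + c * mu_j)
print(mu)
```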

The CHARMM polarizable force field takes a less expensive approach. It sticks with point charges, but adds a small number of "virtual interaction sites" to represent lone pairs. This lets it do a somewhat better job of describing the actual electron density, though not as accurately as AMOEBA. It also models polarization by adding "Drude particles" to a subset of atoms. These are extra point charges, tightly bound to the atoms by harmonic springs. They never move far from the atoms' positions, but they can still move a little bit in response to the local electric field, thus creating dipole moments.

In principle one should solve for the Drude particle positions at each time step by minimizing their energy, a process that is about as expensive as computing the induced dipoles for AMOEBA. OpenMM does offer this option, but it also offers a faster option: an "extended Lagrangian". We treat the Drude particles as having mass, then let them move around under the influence of the force field, just like any other particles. To make sure they stay near their minimum energy positions, we use a special integrator with two different Langevin thermostats. It couples the center of mass of each atom/Drude-particle pair to the regular thermostat (at, say, 300K), but it couples their relative motion to a separate thermostat at very low temperature (say, 1K). This keeps the Drude particles close to their minimum energy positions while adding very little cost to the computation. It does, however, require us to use a somewhat smaller time step: usually no more than about 1 fs.

Standard fixed-charge force fields typically have a nonbonded potential consisting of electrostatic interactions between partial atomic charges and van der Waals interactions that model dispersion. Because it is computationally efficient, the Lennard-Jones equation is often used to model the van der Waals interactions. The equations for these interactions are shown below:

$$ U_{elec} = \frac 1 2 \sum _ {i=1} ^ N \sum _ {j \neq i} ^ N \frac 1 {4 \pi \epsilon_0} \frac {q_i q_j} {r_{ij}} $$

$$ U_{vdW} = \frac 1 2 \sum_{i=1}^N \sum_{j \neq i}^N 4 \epsilon_{ij} \left[ \left( \frac {\sigma_{ij}} {r_{ij}} \right) ^ {12} - \left( \frac {\sigma_{ij}} {r_{ij}} \right) ^ 6 \right] $$

This part of the calculation is by far the most expensive, as interactions between all pairs of particles must be computed. When trying to model systems in the condensed phase, we need to include the effect of the bulk solvent environment on our system of interest, meaning that we need to somehow account for $\approx 10^{23}$ water molecules! Since modern hardware is limited to simulating $\approx 10^6$ atoms, we need to have some way of simplifying our model.
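The quadratic cost is clear from a direct pure-Python transcription of the two sums above (a toy sketch with the Coulomb prefactor set to 1 and a single sigma and epsilon for all pairs; the 1/2 double sum over i ≠ j equals a single sum over unique pairs):

```python
import itertools
import math

def nonbonded_energy(positions, charges, sigma, epsilon):
    """Direct sum over all unique pairs: O(N^2) work. Toy units,
    with the Coulomb prefactor 1/(4*pi*eps0) set to 1."""
    e_elec = e_vdw = 0.0
    for (i, ri), (j, rj) in itertools.combinations(enumerate(positions), 2):
        r = math.dist(ri, rj)
        e_elec += charges[i] * charges[j] / r
        sr6 = (sigma / r) ** 6
        e_vdw += 4.0 * epsilon * (sr6 ** 2 - sr6)
    return e_elec + e_vdw

# Four particles on a line with alternating charges:
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
q = [0.5, -0.5, 0.5, -0.5]
print(nonbonded_energy(pos, q, sigma=0.3, epsilon=0.2))
```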

The two broad families of methods that researchers typically employ to simulate aqueous environments are: (a) modeling the solvent implicitly as a continuum dielectric using, for example, the *Generalized Born* (GB) model, and (b) constructing a periodic unit cell with a small amount of solvent surrounding the solute and tessellating that unit cell in all directions to approximate a bulk solution.

What sets OpenMM apart from other software packages here is, once again, its extreme customizability and flexibility. Using its Custom GB forces for example, you can rapidly prototype new implicit solvent models to improve the treatment of solvent effects on biomolecules. All you need is an equation and a set of atomic parameters, and OpenMM will do the rest, letting you test your new models efficiently using the power of GPUs in minutes.

During my 4-week visit as an OpenMM Visiting Scholar, I implemented a constant pH algorithm for implicit solvent. A brief overview of the algorithm is given in this video:

Below, I describe the algorithm in much greater detail and also highlight key features of OpenMM that enabled me to quickly implement the algorithm.

**Introduction**

Constant pH dynamics is a simulation procedure whereby the protonation states of residues and molecules are allowed to fluctuate. The role that this process plays in structural biology can be surprisingly difficult to model with classical molecular dynamics. Jacobson *et al.* [1] provide an excellent review of some of these cases. Protonation states are often inferred with programs like MCCE [2], but only in a static structural context. With this approach, a set of stable initial protonation states can then be studied and compared, either by docking or simulation.

As noted by Jacobson, however, pH plays a much more dynamic role. Notable examples include DNA binding, allostery, fibril formation, and other mechanisms. These systems can exhibit cascades of proton transport, requiring an *in situ* approach to the simulation. To account for these effects properly remains one of the great challenges in biomolecular simulation. Many useful approaches have been proposed [3-6], and we propose to build on this impressive body of work.

**Implementation of Constant pH Dynamics**

In my implementation of constant pH dynamics, I built from an existing alchemical formulation that uses a fractional parameter to describe the protonation state of the system. Using this formulation, we can gradually morph molecules from one protonation state to another. The gradual parameter adjustment allows the surrounding environment to adjust to the new protonation state. The new state is then accepted or rejected based on a Monte Carlo criterion. Below, I describe the new energy formulation and the Monte Carlo criterion used for accepting the trial protonation states.

**Energetics:** Consider a system with coordinates $r$ and protonation state(s) $\lambda$, and an energy $U_{SYS}$, which includes bonded and nonbonded terms as well as solvation. The full energy term $U_T$ includes the Henderson-Hasselbalch energy, giving

$$U_{T}(r,\lambda)=U_{SYS}\left(r,p(\lambda) \right) + G_{HH}(\lambda)$$

where $G_{HH}(\lambda)=\sum_{i} G_{HH}(\lambda_i)$ is the sum over all titratable residues and molecules, with the term for the $i$th residue following the form given by Brooks [4] as

$$\beta G_{HH}(\lambda_i)=2.3\, \lambda_i \left( pK_a(i)-pH \right)$$

The energy of the system is currently computed using a linear mixing rule for the parameter states, given by

$$p(\lambda)=(1-\lambda)p_0 + \lambda p_1 $$

where $p_0$ and $p_1$ are the initial and final parameter states, respectively, which can include partial charges, van der Waals radii, and other forcefield parameters. The current implementation is in implicit solvent; explicit solvent implementations will be available in the near term.
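To make the two ingredients above concrete, here is a minimal Python sketch (not the actual implementation; the function names, pKa, and partial-charge values are made up for illustration) of the linear mixing rule and the reduced Henderson-Hasselbalch bias, using $\ln 10 \approx 2.3$:

```python
import math

def mix(lam, p0, p1):
    """Linearly interpolate a force-field parameter between the two protonation states."""
    return (1.0 - lam) * p0 + lam * p1

def beta_g_hh(lam, pka, ph):
    """Reduced Henderson-Hasselbalch bias: beta * G_HH = ln(10) * lambda * (pKa - pH).
    ln(10) is the 2.3 prefactor in the text."""
    return math.log(10.0) * lam * (pka - ph)

# Illustrative numbers only: a glutamate-like site (pKa ~ 4.4) at pH 7,
# with made-up partial charges for the protonated (p0) and deprotonated (p1) states.
lam = 0.5
charge = mix(lam, p0=-0.10, p1=-0.60)
bias = beta_g_hh(1.0, pka=4.4, ph=7.0)   # negative: deprotonation is favored at pH 7
```

In the real code the mixing is applied to every titratable parameter in the force field; this sketch only shows the rule itself.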

**Sampling:** The perturbation procedure is a Monte Carlo technique that collaborators and I have developed, known as Nonequilibrium Candidate Monte Carlo (NCMC) [7]. The idea behind this procedure is to generate a trial move that gradually adjusts the parameter, allowing the local environment to relax to the new parameter state. We then accept this trial with a Monte Carlo criterion that ensures correct statistics. If we use a Verlet integrator in the trial move, the Metropolis acceptance criterion is

$$\textrm{acc}(r_0 \rightarrow r_1)=\min\left(1,e^{-\beta \Delta H_T}\right),$$

where $\Delta H_T$ is the change in the total Hamiltonian moving from state 0 to state 1 through the nonequilibrium trial.
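A minimal sketch of this acceptance step, assuming $\Delta H_T$ has already been accumulated over the nonequilibrium trial (the function names are my own, not from the actual code):

```python
import math
import random

def acceptance_probability(delta_h, beta=1.0):
    """Metropolis criterion min(1, exp(-beta * dH)) for the change in total
    Hamiltonian accumulated over the nonequilibrium switching trial."""
    if delta_h <= 0.0:
        return 1.0          # downhill (or neutral) trials are always accepted
    return math.exp(-beta * delta_h)

def metropolis_accept(delta_h, beta=1.0, rng=random):
    """Accept or reject a trial protonation-state switch."""
    return rng.random() < acceptance_probability(delta_h, beta)
```

If the trial is rejected, the system is restored to the state (and protonation parameters) it held before the switching attempt.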

*OpenMM Experience and Tips*

Implementing this algorithm in OpenMM was very straightforward, given that the data structures can readily be accessed through the Python API. This type of prototype development can often be very difficult in large molecular dynamics codes, but OpenMM is clearly designed with rapid prototyping in mind. It took me about a week to refresh my Python skills and learn OpenMM, and another couple of weeks to implement the algorithm using OpenMM’s Python API and begin testing the algorithm. We still have much to do, but the basic machinery seems to be in place and working as expected, including the GPU acceleration.

For the scientist whose interest is primarily developing prototypes and testing hypotheses, the Python API is an ideal development platform. Python is very easy to use and debug. It is also a very legible language, so that others can easily see your logic. If you leverage the object-oriented structure, it is easy to see how your prototype code might be implemented in C++ once the prototype is stable.

Since the Python layer is an API to OpenMM’s GPU-accelerated code, it also runs extremely fast, which makes generating validation data much easier. We found the special functions that allow rapid updating of forcefield parameters, without needing to fully update the context on the GPU, extremely useful for our particular algorithm, since we needed to communicate with the GPU frequently.

The development team has provided extremely high quality documentation with great examples in the user manual. The code is also very well written and organized, and I found the online class diagrams to be particularly helpful when looking for routines to incorporate in the code.

*Future Work*

I hope to continue to develop this code as part of my research program moving forward. A stable constant pH code for implicit and explicit solvent would be our first goal. Another is to port this procedure to the C++ layer so that it can be readily accessible to Folding@Home, which generates many simulation trajectories that can be pieced together into Markov state models via MSMBuilder. This would allow microsecond- to millisecond-timescale studies of, for example, pH-dependent folding events. OpenMM has made it easy to test out my idea, and I am excited to see what new research results come of it.

*References*

1. Schönichen, A., et al., *Considering protonation as a posttranslational modification regulating protein structure and function.* Annual review of biophysics, 2013. **42**: p. 289-314.

2. Gunner, M., X. Zhu, and M.C. Klein, *MCCE analysis of the pKas of introduced buried acids and bases in staphylococcal nuclease.* Proteins: Structure, Function, and Bioinformatics, 2011. **79**(12): p. 3306-3319.

3. Bürgi, R., P.A. Kollman, and W.F. van Gunsteren, *Simulating proteins at constant pH: an approach combining molecular dynamics and Monte Carlo simulation.* Proteins: Structure, Function, and Bioinformatics, 2002. **47**(4): p. 469-480.

4. Khandogin, J. and C.L. Brooks, *Constant pH molecular dynamics with proton tautomerism.* Biophysical journal, 2005. **89**(1): p. 141-157.

5. Mertz, J.E. and B.M. Pettitt, *Molecular dynamics at a constant pH.* International Journal of High Performance Computing Applications, 1994. **8**(1): p. 47-53.

6. Mongan, J. and D.A. Case, *Biomolecular simulations at constant pH.* Current opinion in structural biology, 2005. **15**(2): p. 157-163.

7. Nilmeier, J.P., et al., *Nonequilibrium candidate Monte Carlo is an efficient tool for equilibrium simulation.* Proceedings of the National Academy of Sciences, 2011. **108**(45): p. E1009-E1018.

8. Kong, X. and C.L. Brooks, *λ‐dynamics: A new approach to free energy calculations.* The Journal of chemical physics, 1996. **105**(6): p. 2414-2423.

9. Stern, H.A., *Molecular simulation with variable protonation states at constant pH.* The Journal of chemical physics, 2007. **126**(16): p. 164112.

10. Wagoner, J.A. and V.S. Pande, *A smoothly decoupled particle interface: New methods for coupling explicit and implicit solvent.* The Journal of chemical physics, 2011. **134**(21): p. 214103.

11. Wagoner, J.A. and V.S. Pande, *Reducing the effect of Metropolization on mixing times in molecular dynamics simulations.* The Journal of chemical physics, 2012. **137**(21): p. 214105.

12. Börjesson, U. and P.H. Hünenberger, *Explicit-solvent molecular dynamics simulation at constant pH: methodology and application to small amines.* The Journal of chemical physics, 2001. **114**(22): p. 9706-9719.


YANK is able to take advantage of one of the more exciting aspects of OpenMM: its ability to run incredibly fast implicit solvent calculations. In a paper we recently published in JCAMD, we used this feature to show how multiple binding sites and accurate free energies can be obtained for small-molecule ligands of the T4 lysozyme L99A model binding system from Brian Shoichet and Brian Matthews. Input files for this system will be available as one of the examples distributed with YANK.

Right now, YANK is still being tested in its final phases before a first public release, but stay tuned for further updates!

Learn more about YANK at our GitHub page.

First, we can estimate **equilibrium expectation values.** Each trajectory is, by construction, a realization of a Markov chain whose stationary distribution is the equilibrium distribution of the system. Assuming we're in the canonical (NVT) ensemble, the equilibrium (Boltzmann) distribution of the system is

$$ \pi(x) = \frac{1}{Z} e^{-\beta U(x)} $$

If we index the timesteps of the simulation, $t \in \{1, 2, \ldots\}$, and let $X_t$ be the position at time $t$, then $X_t$ converges in distribution to $\pi$:

$$X_t \xrightarrow{d} \pi$$

Why is this useful? Because it provides the basis by which we can calculate expectation values for the system. If we want to calculate things like the ensemble average distance between various residues, FRET absorbance, X-ray scattering intensities, or other properties that can be well-modeled as an ensemble average over $\pi$ of some (instantaneous) observable $g(X)$, this property guarantees that averages of $g$ over a trajectory will converge to the right answer in the limit that the length of the trajectory goes to infinity.

$$ \lim_{T\rightarrow \infty}\frac{1}{T} \sum_{t=1}^T g(X_t) = \mathbb{E}_\pi \left[ g(X) \right] $$
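This convergence is easy to see in a toy example. The sketch below (pure Python, not an MD code) samples a one-dimensional harmonic potential $U(x) = x^2/2$ with a Metropolis chain; the time average of the observable $g(X) = X^2$ approaches the ensemble value $1/\beta = 1$:

```python
import math
import random

def metropolis_trajectory(n_steps, beta=1.0, step=1.0, seed=0):
    """Metropolis chain whose stationary distribution is pi(x) ~ exp(-beta * x**2 / 2)."""
    rng = random.Random(seed)
    x, traj = 0.0, []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # Accept with probability min(1, exp(-beta * dU)) for dU = U(x_new) - U(x).
        du = beta * 0.5 * (x_new * x_new - x * x)
        if du <= 0.0 or rng.random() < math.exp(-du):
            x = x_new
        traj.append(x)
    return traj

traj = metropolis_trajectory(200_000)
avg_x2 = sum(v * v for v in traj) / len(traj)  # approaches <x^2>_pi = 1/beta
```

With 200,000 steps the time average lands within a few percent of the exact ensemble average, illustrating the limit above.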

But there's potentially much more information in an MD trajectory than just estimates of equilibrium expectation values. Consider questions about the system's **dynamics** like the following:

- How long does it take for the system to transition between two particular regions of conformation space?
- What are the *slow dynamical modes* in the system? Which degrees of freedom or collective variables take the most time to equilibrate?
- What predictions can be made about relaxation experiments like temperature-jump IR spectroscopy?

These types of questions concern the dynamics of the system, and cannot truly be answered by the calculation of equilibrium expectation values. Why? If we look at the structure of the mean calculation above, we can see that it treats the individual $X_t$ as if they're exchangeable. This means roughly that given an MD trajectory, we get the same estimate for the equilibrium average of $g$ if we scramble the order of the frames in the trajectory (vs. using them in the "proper" order). All of the temporal structure of the trajectories is thus disregarded.
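The exchangeability point can be demonstrated with a toy example: shuffling a correlated time series leaves its mean untouched but destroys its temporal correlation. Here is a sketch using an AR(1) process as a stand-in for a trajectory observable (the process and its parameters are illustrative, not from any simulation):

```python
import random

rng = random.Random(42)
# A toy "trajectory" with temporal correlation: an AR(1) process x_t = 0.95*x_{t-1} + noise.
x = [0.0]
for _ in range(10_000):
    x.append(0.95 * x[-1] + rng.gauss(0.0, 1.0))

def mean(v):
    return sum(v) / len(v)

def lag1_autocov(v):
    """Sample autocovariance at lag 1: carries the temporal structure of the series."""
    m = mean(v)
    return sum((a - m) * (b - m) for a, b in zip(v, v[1:])) / (len(v) - 1)

shuffled = x[:]
rng.shuffle(shuffled)

# The time average is invariant under shuffling the frames...
same_mean = abs(mean(x) - mean(shuffled)) < 1e-9
# ...but the lag-1 autocovariance collapses toward zero once the order is scrambled.
```

Any estimator built purely from the empirical distribution of frames is blind to the difference; dynamical questions need estimators that use the frame ordering.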

In an upcoming post, I'll talk about how these types of questions about the dynamics can be posed, and how Markov state models can be efficient estimators for these quantities.

Chemical bonds are modeled as simple analytical functions of the distance between bonded atoms, which are often fitted to high-level quantum mechanical potential energy scans and/or vibrational spectroscopic measurements. The most common functional forms used to model chemical bonds are even-order truncated Taylor series expansions, like simple quadratic or quartic functions, and the Morse potential. These functions are shown alongside an exact QM energy scan for the hydrogen molecule below.

Potential energy scan for Hydrogen molecule alongside 3 simple, analytical functions fitting the potential.

It is clear from this figure that for large vibrational energies, i.e., where bond vibrations exhibit strong anharmonicity, the more complex quartic and Morse potentials are better models. For most bonds, however, a simple quadratic description suffices at low temperatures around 300 K. Such descriptions are not appropriate for all simulations, however, and we may wish to develop models utilizing atypical functional forms that are better suited for our applications. [1]
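The agreement near the minimum, and the breakdown at large stretch, can be checked in a few lines of Python (the well depth, width, and bond length below are illustrative values, not fitted H2 parameters):

```python
import math

def morse(r, d_e=0.5, a=2.0, r0=0.74):
    """Morse potential d_e * (1 - exp(-a*(r - r0)))**2; parameters are illustrative."""
    return d_e * (1.0 - math.exp(-a * (r - r0))) ** 2

def quadratic(r, d_e=0.5, a=2.0, r0=0.74):
    """Quadratic bond with the same curvature at the minimum: k = 2 * d_e * a**2."""
    return 0.5 * (2.0 * d_e * a * a) * (r - r0) ** 2

r0 = 0.74
# Near the minimum the two models agree to within a few percent...
near_gap = abs(morse(r0 + 0.02) - quadratic(r0 + 0.02)) / quadratic(r0 + 0.02)
# ...but at a large stretch the quadratic keeps climbing while Morse saturates at d_e,
# i.e. only the Morse form can describe bond dissociation.
far_quad, far_morse = quadratic(r0 + 2.0), morse(r0 + 2.0)
```

Matching the curvature at the minimum ($k = 2 D_e a^2$) is exactly why the quadratic form works so well for small-amplitude vibrations at room temperature.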

In the past, such a project necessitated the development of a new program, or the expansion of an existing one, followed by extensive validation to make sure the code works. The code would then require extensive optimization, and potentially even radical redesign, before it could be used for production-level calculations that generate a statistically sufficient amount of data. With OpenMM's incredibly flexible support for custom forces, however, researchers need only assign a functional form for the valence bond terms and parameters for each of the terms they want to model. Given an equation for the bond term, OpenMM will generate an efficient kernel that computes both energies *and* analytical forces and runs directly on GPUs, the fastest computational hardware widely available for molecular simulations today. This remarkable technology allows researchers, even those who are not skilled programmers, to create pre-tested, production-ready software for brand new molecular models in minutes, drastically reducing the months-long process of writing, optimizing, and validating code.

But I digress... back to the force fields. Force fields also typically model atoms separated by two bonds using a function of the angle between those two bonds (again, typically a simple quadratic function whose frequencies can be computed or determined by vibrational analysis). Atoms separated by three bonds, forming torsions, often require a functional form like a Fourier series that permits modeling the periodic, highly flexible rotations around the central bond. Force fields also frequently have potentials designed to keep aromatic functional groups in a planar configuration using a so-called *improper* torsion, so named because it is a torsion angle defined between 4 atoms that are not connected the same way (see the figure below for an illustration). The final potential energy function I will mention here, used in many force fields, is the coupled-torsion correction map (CMAP). This is a potential evaluated on a discrete, interpolating grid that describes how the flexibilities of two torsions may be coupled together in complex ways.

Schematics of various bonded potential terms. A) Bond term. B) Angle and Urey-Bradley distance terms. C) Torsion (or dihedral) term. D) Newman projection of the same torsion as C. E) Improper torsion used for maintaining planarity. F) Coupled-torsion correction map (CMAP) term.
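As a concrete example of the torsion term described above, here is a sketch of a Fourier-series dihedral potential (the force constants and phases are made up for illustration; a single three-fold term reproduces the familiar staggered/eclipsed profile of an ethane-like bond):

```python
import math

def torsion_energy(phi, terms):
    """Fourier-series dihedral: sum over (k, n, phase) of k * (1 + cos(n*phi - phase))."""
    return sum(k * (1.0 + math.cos(n * phi - phase)) for k, n, phase in terms)

# A single three-fold term gives three equivalent minima per full rotation.
threefold = [(1.0, 3, 0.0)]
e_eclipsed = torsion_energy(0.0, threefold)             # barrier top at phi = 0
e_staggered = torsion_energy(math.pi / 3.0, threefold)  # minimum at phi = 60 degrees
```

Adding further $(k, n, \textit{phase})$ terms to the list builds up the more complicated, asymmetric rotational profiles found around substituted bonds.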

This ends the first part of my discussion about typical biomolecular force fields, and hopefully shows how OpenMM can be an invaluable tool in your tool set for carrying out molecular simulations. My next post will focus on arguably the most important [2] part of a force field—the nonbonded part of the potential.

[1] Zhang, L. and Hioaki, T., *Wear*, 1997, **211** (1) p. 44 (http://dx.doi.org/10.1016/S0043-1648(97)00073-2)

[2] Okay, okay. All parts are very important and must be applied collectively for any force field to work.

**“Is an MD simulation possible given my computing resources and the system I am interested in?”**

This is a difficult question to answer as everyone has different resources at their disposal, but I will tell you what has been done successfully in the past and give you a better idea of how to know if your simulation is possible.

Not everyone has a super-computer available to them, but at Stanford University, we use a distributed computing network called Folding@Home [1]. This project allows people around the world to donate their personal computers to a protein folding project. Armed with this massive resource, we have been able to simulate several small proteins that fold in the microsecond to millisecond regime.

There are two important considerations when gauging how computationally difficult a simulation will be: the **number of atoms** in the simulation and the **timescale** of the process you’re interested in simulating. On a state-of-the-art GPU, OpenMM can generate **100 ns/day** for a system of **23,000 atoms** [2]. This means that if you have a single GPU and want 100 µs of data, you will need to run continuously for nearly three years. That’s an awfully long time... but if we have the resources, it’s definitely possible!
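The arithmetic behind that three-year figure is worth making explicit. A trivial helper (the function name is my own, not part of OpenMM):

```python
def wall_clock_days(target_us, ns_per_day=100.0, n_gpus=1):
    """Days of continuous sampling needed to accumulate `target_us` microseconds
    of trajectory at `ns_per_day` per GPU (1 us = 1000 ns)."""
    return target_us * 1000.0 / (ns_per_day * n_gpus)

days = wall_clock_days(100.0)   # 100 us of data at 100 ns/day on one GPU
years = days / 365.0            # a little under three years
```

The same helper also shows why distributed resources matter: spreading the work over many GPUs divides the wall-clock time accordingly (for the aggregate-sampling strategies used with Folding@Home).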

To put all of this in perspective, we can look at a few of the most recent Folding@Home simulations. Obviously, you may not have access to a resource like Folding@Home, but if your system size is bigger than anything we’ve studied and moves slower, then you might want to reconsider using all-atom MD.

| Protein | Number of atoms | Total simulation time | Wall-clock time |
| --- | --- | --- | --- |
| Protein G | 900 | 50 ms | ~9 months |
| NTL9 | 24,000 | 3.2 ms | ~4 months |

*Note: Protein G was simulated on GPUs with GROMACS 4.0.3 powered by OpenMM. NTL9 was simulated on GPUs with OpenMM 5.1.*

**Final Thoughts**

The point of all of this is to encourage you to think critically about running an MD simulation. It’s too easy to use MD to generate terabytes of essentially useless data. When you’re thinking about setting up a simulation remember to ask yourself:

1. Will an MD simulation be able to provide insight that an experiment or a simpler model couldn’t?

2. How many atoms are in my simulation? How much total simulation time will I need? Given my resources, how much time will I have to wait before getting the results?

3. MD is a physical **model**, so how will I validate the results against experimental measurements?

References/Footnotes:

[1] For more information, see: folding.stanford.edu

It can be tempting to think of molecular dynamics (MD) as an atomic-level microscope, able to describe arbitrary molecular interactions that are unobservable in the lab. But it’s important to remember that MD is a **model** that attempts to describe the molecular interactions in a very specific way. There is a particular functional form for the forces between any two atoms, and on top of this, there are many force fields that parameterize these forces differently.

With this in mind, we need to always ask ourselves: “Can I trust the results of my simulation?” This is a difficult question to answer, but at least for the protein folding simulations that we do in the Pande group, we typically ask, “Are our simulations consistent with some experimental measurements of the same protein?” This can be a useful way to validate our simulations.

In addition to these concerns about the accuracy of the MD model, there are several more practical questions we should ask before beginning any simulation, specifically:

- Is an MD simulation appropriate for the question that I'm asking?
- Is an MD simulation possible given my computing resources and the system I am interested in?

I will address this first question here and tackle the second question in my next post.

**“Is an MD simulation appropriate for the question that I’m asking?”**

The power of a simulation is to provide insight that could not be gained in an easier way. Take for instance the field of protein folding. Here, there are numerous experiments that can measure low-dimensional projections of the very complex folding process. Drawing conclusions from these experiments, however, can be difficult since there may be many underlying mechanisms that are consistent with the data. This is where MD can be quite useful: it provides a physical model that can be used to interpret an experiment.

It may be tempting to turn to simulation as another experimental technique for calculating measurable quantities. For instance, imagine that you’re working with a drug that binds to an enzyme and you’re interested in determining its binding constant, but don’t really care about the mechanism of binding. An experiment should be fully capable of measuring this binding constant, and a simulation would provide no further insight that you’re interested in! More importantly, remember that MD is ultimately just a model for the intrinsic dynamics, so it may turn out that the calculated binding constant is simply inconsistent with the experimental measurement.

The same can be said for other questions, where atomic-level detail is not necessary. For instance, consider a researcher that is interested in the relative diffusion constants between the dimeric and trimeric forms of a protein. A simpler model that does not attempt to describe the atomic-level interactions would likely be just as accurate as an MD simulation.

Next, we need to consider the practicality of running an MD simulation.
