View on GitHub

conan

Analysis of contacts in molecular dynamics trajectories

CONAN – Analysis of contacts in molecular dynamics trajectories

Documentation

Contact analysis can be extremely useful to understand structure and dynamics of molecules and molecular complexes as they are atomistic-resolved descriptions of the interactions occurring within and between the investigated molecules.

CONAN is a tool developed for the statistical and dynamical analysis of contacts along molecular dynamics trajectories. It uses a combination of open-source tools for the computation of contacts along trajectories and for the creation of images and videos.

In particular:

What’s in the distribution?

When you download CONAN, you will find two executables: conan.py, (the main program) and conan_comp.py (comparative CONAN). Additionally, you will find three directories called example_input, gnuplot_inputscripts and chimera_visualization_scripts.

While the gnuplot_inputscripts contains gnuplot scripts. Feel free to change them.

In the example_input folder, you will find the following files:

In the folder named chimera_visualization_scripts, you will find the following scripts:

chimera visualize_clusters.py summary.txt xxx.pdb

Launching the script will open chimera (that will prompt a message window asking which format are you wishing to open – no need for that, just click cancel) and will visualize, on the structure that you have specified as xxx.pdb, the cluster that have been identified by CONAN. Molecular structure (in cartoon representation) will be coloured according to each cluster and the atoms of the residues composing the center of the clusters will be shown using vdW radii. From the chimera window the user can then adjust the orientation of the molecule for the best visualisation and save a picture.

chimera fit_structures.py

No arguments need to be specified as all the .gro files in the current working directory will be considered for the alignment.

A few hints about what CONAN can do for you

CONAN has been developed in order to provide information about how inter-atomic contacts evolve over time during molecular dynamics simulations (MD). Beyond providing videos that show how contacts are formed or broken during MD, it can also read in a series of input files describing any observable against which the software calculates the correlation of the contacts made or broken. In this way it is easy for the user to identify events of interest along the simulated trajectory and relate them to the evolution of contact formation or ruptures. As an example, the formation of contacts can be related to increases of radius of gyration or end-to-end distance in a molecule, or to the formation, disruption or change of secondary structure (in this case, for example, the provided observables can be the φ or ψ angles of peptide chains). Additionally, contact maps can also be calculated in defined parts of the trajectory. As another example, in pulling simulations where the user identifies major rupture events of the tertiary and/or secondary structure of a molecule, CONAN can calculate contact maps for a defined amount of time, along the trajectory, defined by the user. The program in this case reads a file (through the keyword STAGES – see below) in which the user specifies beginning and end of the interval for which contact maps need to be calculated. In addition, a title for the set interval will be used to set the title of the contact map created.

Further functionalities (cluster analysis of residues or of frames in a trajectoryies, PCA, blocking analysis, …) are also described below.

Minimal requirements

In order to successfully run CONAN you need to have working installations of:

Optional requirements

If you want to use the scripts provided for the visualization of the clusters onto a molecular structure or the fitting of the cluster conformations (see description of the scripts above)you will need:

All of these softwares naturally run on Linux.

Requirements on disk and memory

Given that CONAN is working in N^2 dimensional space, you should be aware that it can get expensive in terms of disk space and memory. As a rough guide, the following are the requirements of CONAN:

Functions

CONAN can perform the following types of analysis:

There is one more tools for contact analysis, distributed along with CONAN:

Input keywords

CONAN needs to have an input file that defines the parameters of the analysis.

An example input file containing most of the keywords and some explanation is contained in the distribution under: example_input/mdmat.sample

The keywords are case-insensitive (but are given in all-caps in this guide), but the keys are case-sensitive (since some of them are file names). If not specified, CONAN will either pick up the default value of that parameter and pass it to the routine used for the calculations, or will skip the kind of analysis defined by the omitted keyword. At the bottom of the list, we will give specific keywords applying only to the “asymmetric” runs or specific for the “comparative CONAN” tool.

GNUS_PATH

Path to the gnu files.

TRAJ

The trajectory file (-f option. Supported formats: .pdb, .gro, .g96, .cpt, .xtc, .trr, .tng).

COORD

The structure/coordinate file (-s option. Supported formats: .pdb, .gro, .g96, .tpr, .brk, .ent).

TRUNC <float (nm)>

The truncation value, i.e., any distance larger than this one will be set to this value.

NLEVEL

The number of discretization levels between 0 and TRUNC (maximum value: 7744). For example, if you want to use a truncation of 1.0 and a resolution of 0.001, set this to 1001. Setting NLEVEL to a high level will not affect the performance of mdmat or CONAN.

PATCH_TIME <yes|no>

Ignore time information from the .xpm file and just take the first frame and DT and keep increasing the time.

MEAN <yes|no>

Should gmx mdmat create an average XPM contact map?

INDEX

An index file.

BEGIN <time (ps)>

The first time in picoseconds (-b).

END <time (ps)>

The first time in picoseconds (-e).

DT <time (ps)>

The time step to consider, i.e., only take frames that have (t - t_begin)%dt == 0.

TRUNC_INTER <float (nm)>

The truncation threshold to turn interactions on.

TRUNC_INTER_HIGH <float (nm)>

The truncation threshold to turn interactions off.

TRUNC_LIFETIME <float (0.00..1.00)>

Truncation of lifetimes to consider interactions in the “interaction types” plot (if a .pdb file has been given).

DR_MODE <init|prev|both>

Which sort of differential contact maps should ConAn make?

TRUNC_DR <float (nm)>

Truncation of differential contact maps.

RUN_MDMAT <yes|no>

Should ConAn run gmx mdmat?

MATRICES <yes|no>

Should ConAn generate matrices? If set to no, ConAn will only generate aggregate data files.

CLEAN_MATRICES <yes|no>

NTERM

N-terminus residue ID.

PEARSON_TIME <yes|no>

The Pearson correlation with time will be plotted for each residue.

PEARSON_OBS

A file with a time dependence of some external observables. The first line of the file should be the number of observables N, the next N lines contain the titles of the observables, then the following lines contain N+1 columns time and the values of the observables). The times do not need to exactly match the ones from the frames but all the frames need to be present here.

PEARSON_INTER <yes|no>

Should CONAN compute inter-residue cross-correlations?

PCA_MAKE

The number of principal components to analyze and project to.

PCA_READ

Read in the principal components from another run and project this run onto it.

REREAD

A path to existing matrices. This path must also contain a file called index.dat, specifying NTERM, NRES, and TRUNC. CONAN will read all matrices there. Matrices do not need to contain all entries, i.e., data can be sparse.

ZOOM_LIST

A file name with a few interesting residue pairs to be “zoomed in” on. CONAN will plot the time evolution of the contact distance between each of these residues and the correlation with other contacts. Each line of the file should contain two numbers (residue IDs).

MAKE_MOVIE <yes|no>

Should ConAn make a movie out of the frames?

COORD_PDB

A pdb file for ConAn to read and understand the sequence of our protein.

SHADOW_TOL

Remove “shadowed contacts”: consider only contacts (i,j) for which there is NO third residue k where

shadow_tol*(d_ik + d_jk) < d_ij.

IGNORE_OBS_TIME <yes|no>

Ignore the time variable in the observable file.

K_RES_CLUSTERS <0|range of ints>

The number of clusters for residue-based cluster analysis of the protein. If not 0, the user should enter a range of values. Possible are formats as: 1 2 3, 1-3, or even 1-3 5 8-10 (which would mean 1 2 3 5 8 9 10). If 0, the number will be chosen interactively (when the dendrogram is displayed), using the same format as above.

K_TRAJ_CLUSTERS <0|range of ints>

The number of clusters for frame-based cluster analysis of the trajectory. ConAn will perform a k-medoid clustering of the trajectory, based on the RMSD between the contact maps.

DOMAINS

A file giving the domains of the protein. Not all residues must belong to a domain and technically one residue could belong to more than one domain (but the plot will be confusing). The file should contain 3 columns, the first two giving the first and last residue IDs and the third one the name of the given domain enclosed in quotes.

STAGES

A file giving stages of the trajectory. ConAn will not perform a full analysis on each stage but rather plot the difference between the end and the beginning of each stage.

ECONOMY <yes|no>

CONAN can be run in “economy mode”, i.e., the data is not stored in memory in raw format but only aggregate quantities. This means, however, that only frames, matrices, movies, and aggregate plots will be printed. Use this if you have too many residues or too many frames.

Specific keywords for asymmetric runs

Instead of a symmetric run (comparing intra-chain contacts), CONAN can be used to analyze the interface between two sets of residues, group X and group Y (to be shown in the X- and Y-axes of the matrices). The average number of inter-group interactions will be plotted and inserted in a PDB file of each group if given. CONAN will automatically switch to “asymmetric mode” if any of the below keywords are given. Note that all plots by CONAN are square formatted, so if the residue numbers between X and Y are very different, the contact maps can look distorted.

START_X

The starting residue of group X. 1-indexed, based on what is in the XPM file.

NRES_X

The number of residues in group X (i.e., group X will be START_X <= index START_X + NRES_X where index is 1-indexed.)

LIST_X

A file name containing all the residues that are in the “X group”. The format is very simple - all residue IDs (starting at 1!) are in a separate line. Note that CONAN will still output contiguous residues.

NTERM_X

The N-terminus index of group X for plotting purposes and to read the sequence from the PDB (i.e., the residue IDs will be: NTERM_X <= resID < NTERM_X + NRES_X).

DOMAINS_X

Domain file for plotting of group X. This is based on the NTERM_X index.

COORD_PDB_X

PDB coordinate file for group X. The residue IDs in this file must correspond to NTERM_X.

START_Y, NRES_Y, NTERM_Y, DOMAINS_Y, COORD_PDB_Y

As above. If only one of COORD_PDB_X, COORD_PDB_Y is given, the same PDB will be used for both groups X and Y.

DIMER

“Dimer mode”: if both groups X and Y are in fact different copies of the identical group (i.e., a homodimer), CONAN can take this into account for read-in and cluster analysis. This requires that the residue IDs be equal.

Example for an asymmetric run

Suppose we are working with a PDB of 3000 residues but want to analyze the contacts formed between residues 1001..1050 and residues 2001..2050 while keeping this numbering but ignoring all other pairs. We can proceed as follows:

  1. We create a special index file for these residues only.
  1. We set:
START_X### 1
NRES_X    50
NTERM_X 1001
START_Y   51
NRES_Y    50
NTERM_Y 2001

Note that if we want to use a PDB file, conan.py will look for 1001..1050 and 2001..2050 in the residue ID column.

Specific keywords for comparative runs

The “comparative CONAN” script is a simple tool intended to give a quick comparison between two runs (“run A” and “run B”) on similar systems. The two runs must have the same number of residues* and it is recommended that they also have the same simulation time.

* of course, if the residues are similar in number, the user him/herself can create an index file to create the same number of residues, by e.g., skipping over gaps in alignment.

The input supports the following keywords.

RUN_A

The location of “run A”. Compulsory.

RUN_B

The location of “run B”. Compulsory.

GNUS_PATH

The location of the gnuplot scripts. Compulsory.

TITLE_A

The title of “run A” between double quotes. Optional; default is “run A”.

TITLE_B

See description of TITLE_A.

R_CUT <float (nm)>

Consider only pairs of residues for which the average r_ij < r_cut for at least one of the two runs. Optional; default is considering all residue pairs. This is especially useful for changes in distances/RMSF, where large changes between large distances are possibly uninteresting. In particular, the RMSF of a pairwise distance can change drastically, depending on which side it is of the original truncation radius, so it makes sense to include some R_CUT < TRUNC.

LIFE_CUT <float (nm)>

Consider only pairs of residues for which the interaction lifetime is larger than life_cut for at least one of the two runs. Optional, default is considering all residue pairs. Similarly to R_CUT, LIFE_CUT can restrict the analysis to “interesting” pairs. If both R_CUT and LIFE_CUT are given, then residue pairs must pass both tests to be considered.

Note: all “lifetime” values will be taken from the original analyses. The “comparative CONAN” script does not process any contact maps, just compares aggregate quantities it found in the folders!

Output

CONAN will generate output in a series of folders.

backup.{i}/

In case CONAN is run over an older CONAN run, all of the below folders will be moved into backup folders (backup.1 is the oldest run, then backup.2, …), including the input/ folder, for users who prefer not to give meaningful names to input files. dmf.xpm is not copied into the backup folder to save space.

movies/

Hint: movies can be turned off with the keyword MAKE_MOVIE no.

frames/

This folder contains the .png files with frames that form the movies mentioned above.

Hint: the keyword MAKE_MOVIE no also turns off the generation of these frames.

matrices/

This folder contains the raw data (i.e., distances) that form the frames before. The format is <i> <j> <r_ij> (or i j dr_ij). Note: there is an empty line between sections with different i’s, to respect gnuplot requirements.

Hint: the keyword MATRICES no turns off the generation of these data files as well. The keyword CLEAN_MATRICES yes cleans up the raw data files but keeps the .png frames and will generate the movies.

input/

The input file will be copied here for later reference.

aggregate/

This folder has the most important plots and raw data files:

cluster_res/{method}

This folder will contain the results of the cluster analysis of residues. Three different methods are available, {method} = crosscorr, distance, or lifetime. For each of these options, there will be a folder with:

After the user has chosen a desired set of numbers of frames, there will be created a new folder for each number chosen. Under each of these folders, there will be three more files:

cluster_trj/

This folder will contain the results of the cluster analysis of frames (trajectories).

After the user has chosen a desired set of numbers of frames, there will be created a new folder for each number chosen. Under each of these folders n (showing the subdivision into n clusters), there will be another file and n folders:

In the subfolders n/cluster0..n/cluster(n-1), one can find:

pca/

The output of the principal component analysis, as:

zoom/

Analysis “zoomed in” on specific residue pairs.

blocking/

“Blocking analysis” of the contact map to estimate its correlation time (see Flyvbjerg and Petersen, JCP 1989). This is done directly on the data by blocking each non-trivial time series (i.e., any residue pair that has a non-constant distance as a function of time) and normalizing the standard error as: ε_tot=∑ε_i^2N_res

In the limit of decorrelation of all time series, there will be a plateau in ε_tot. ConAn estimates the correlation time as: T_corr≈Δt⋅(ε_platε_0)^2

where i is the number of blocking steps, ε^i is the standard error at the so-far blocked data, ε(ε^i) is the standard error on the standard error, and T_corr is the estimated correlation time.

The user needs to visualize and assess whether or not they see a plateau in the standard error. In case there is no such plateau, the correlation time computed by CONAN is a lower limit.

Hint: note that getting the autocorrelation time from PC projections is another way of defining this quantity (Ernst et al, JCP 2015).

Common recipes

Binary analysis

If you prefer a binary analysis, with contacts defined as any two residues within r_0, just set TRUNC <2*r0> and NLEVEL 2. The factor 2 is necessary as the XPM file will be rounded. For example, if you consider inter-residue interactions to be anything that is higher than 0.5 nm, you can write:

TRUNC 1.0
NLEVEL 0.5

Any two residues within 0.5 nm will appear as d=0 nm and any two residues farther than this distance will appear as d=1 nm.

Analyzing part of a trajectory

If you think you identified a part of a trajectory that you would like to concentrate on, you can re-run CONAN with changed BEGIN and END points and RUN_MDMAT no.

Uniting several trajectories

.xpm files can be concatenated without issues, although time stamps could be duplicated. You can go around this with PATCH_TIME yes, which will make time stamps consecutive.

Smaller than picosecond time steps

CONAN assumes the time steps between frames to be an integer multiple of 1 ps. If your data is more finely spaced than that, you can just use PATCH TIME yes and set DT 1 which will unite all your frames with a fictional 1 ps time step.

Comparing several parts of the same trajectory

If you want a detailed comparison between parts of a trajectory, one option is to run a whole CONAN first, then changing BEGIN and END in the input file as well as RUN_MDMAT no, keeping the outputs in separate subfolders, and then using conan_comp.py for a comparison.

Using metrics other than smallest distance

One option for other metrics can be to just use a special group, for example, repeating the analysis for just the Ca atoms or an atom that you consider to represent the amino acid’s/lipid’s/etc position.

Otherwise, if you have another way of measuring distances and have plain-text outputs, you can transform them into the CONAN matrix format (check them out) and use the REREAD keyword.

Coarse-grained input

CONAN can handle any input, as long as atoms in the same residue have the same residue ID (in fact, by changing the residue IDs in your reference file, you can trick CONAN into inter-domain contacts or any other criteria). If you need the INPUT_PDB file to be read in for the purposes of identifying physical interactions, be sure that the backbone atom is called either BB (as Martini does) or CA in the PDB file.

PCA or cluster analysis on two populations

Just unite the two .xpm files (as mentioned above, they can be concatenated and they can still be read), use the PATCH_TIME yes command, run the PCA or cluster analysis, and later separate the output accordingly.