You may now have a trajectory from your simulations. In case you don’t, you can use the training data sets we have provided for you.
What are RMSD and RMSF?
RMSD: root mean square deviation
RMSD stands for root mean square deviation. RMSD is a numerical measurement representing the difference between two structures: a target structure and a reference. In molecular dynamics, we are interested in how structures and parts of structures change over time as compared to the starting point. For example, if a protein, over a course of a simulation, has a lid opening and closing motion (like pacman opening and closing his mouth), a plot of RMSD vs. time will indicate that with large RMSDs. RMSD is typically plotted vs. time.
RMSD can be used to identify large changes in protein structure as compared to the starting point. A leveling off or flattening of the RMSD curve can also indicate that the protein has equilibrated.
RMSF: root mean square fluctuation
RMSF stands for root mean square fluctuation. This is a numerical measurement similar to RMSD, but instead of indicating positional differences between entire structures over time, RMSF is a calculation of individual residue flexibility, or how much a particular residue moves (fluctuates) during a simulation. RMSF per residue is typically plotted vs. residue number, and can indicate structurally which amino acids in a protein contribute the most to a molecular motion.
In our case of pacman, the parts of pacman’s face that make up his mouth will have high RMSFs because the mouth in constant motion, but the parts making up the rest of his head would maintain consistently low RMSFs.
Create a Jupyter Notebook for Analysis
Let’s set up a brand new jupyter notebook for your RMSD & RMSF calculations that you can use for all your analyses moving forward.
Generate the Notebook
-
Navigate to your scratch directory and start up a jupyter notebook by running the following commands in your terminal:
module load anaconda jupyter-notebook
This should launch a browser opening into the Jupyter Notebook interface. The URL will look something like: localhost:8888/tree. This is the working directory.
-
Create a new notebook by selecting “New” from the Files tab and then selectiong “Python 3” Notebook.
New > Python 3
This should immediately open up a new ‘Untitled’ notebook in a new window or tab. Name the notebook ‘RMSD_RMSF_Analaysis.’
Import Necessary Modules
-
To do our analysis, we will need to import some useful (and, hopefully, familar!) python libraries. In the first code block, write and run:
import numpy as np import mdtraj as md import matplotlib.pyplot as plt %matplotlib inline
Load Training Data
We can’t do any analysis unless we’ve got data to analyze! Your training set data is located in keck2’s /scratch/bcc2018_trajectories/${BCCID}
, where ${BCCID} is the name of the BCCID for your training file.
-
Cd into your trajectory folder and take a look around. You should find a topology file
*.prmtop
and three replicate trajectory folders namedmd1
,md2
, andmd3
. Inside each of thesemd
files you will find the trajectory file,*.nc
. -
In your jupyter notebook, load the data you wish to inspect. It may be helpful for you to create a BCCID variable, as well as a filedir variable for easier coding.
BCCID = #your BCCID filedir = '/scratch/bcc2018_trajectories/'+BCCID+'/' traj1 = md.load(filedir+'md1/'+BCCID+'-Pro01.nc', top=filedir+BCCID+'.prmtop') traj2 = md.load(filedir+'md2/'+BCCID+'-Pro01.nc', top=filedir+BCCID+'.prmtop') traj3 = md.load(filedir+'md3/'+BCCID+'-Pro01.nc', top=filedir+BCCID+'.prmtop')
Key in Shift + Enter
to run this cell, load your trajectories, and define your variables!
Calculate RMSD
Generate RMSD Data
-
Use the built-in MDTraj (
md
) trajectory function to generate RMSD data from your trajectories. In a new code block:protein_sel1 = traj1.topology.select('protein') protein_traj1 = traj1.atom_slice(protein_sel) rmsd1 = md.rmsd(protein_traj1, protein_traj1, 0) print(rmsd1)
The documentation tells you all about what arguments this function takes, and what they each mean.
You should also obtain the data for your replicate trajectories, traj2 & traj3.
Plot RMSD v Time
-
To visualize this data, use a plt scatterplot. In a new code block:
plt.scatter(np.arange(0, len(rmsd1)), rmsd1, marker='.', color='m', s=3, label='Rep 1') ## what does np.arange do? plt.xlabel('Time (ns)') plt.ylabel('RMSD ($\AA$)')
Overlay RMSD replicate plots
-
You may want to overlay your data to compare your three replicates. To do this, you can add two more scatterplots to the above code, after the first call to plt:
plt.scatter(np.arange(0, len(rmsd2)), rmsd2, marker='.', color='c', s=3, label='Rep 2') plt.scatter(np.arange(0, len(rmsd3)), rmsd3, marker='.', color='b', s=3, label='Rep 3')
-
Add a legend to your plot.
plt.legend(loc=4)
I encourage you to play around with the matplotlib pyplot documentation to adjust your plots to your liking. You can vary the colors, legend location, marker sizes and types, etc, until your figures are as clear as possible.
Calculate RMSF
To calculate RMSF, we will need to invoke cpptraj
, which is part of the AmberTools software. In addition to RMSF, cpptraj
can also calculate RMSD and diffusion, as well as many other useful quantities.
Cpptraj requires an executable script and an input file. I’ll walk you through how to make them, and what needs to go in each.
Preparing the Input File
Cpptraj requires an input file that looks like this:
trajin ${our_trajectory_file}.nc
rmsd @C,CA,N first
atomicfluct out ${output_file_name}.dat @C,CA,N byres
The variable ${your_trajectory_file} needs to be replaced with the name of your trajectory file, which is probably something like: filedir+BCCID+'.nc'
.
The second line, ‘rmsd @C,CA,N first,’ is the program aligning your protein to the structure given by the first frame in the simulation. This is done to avoid the biasing of your RMSD and RMSF values by protein diffusion.
The @C, CA, N is the ‘mask,’ or atom selection of your alignment and calculations. This aligns the protein backbone only.
Your ${output_file_name} should be a name of your choice to which you’d like the program to output the data. The flag ‘byres’ means that the data output file will give an residue index and a corresponding RMSF value.
- In your jupyter notebook, you will use python to generate this input file. In a new code block:
mydir = '/scratch/${username}/'
with open(mydir+BCCID+'_rmsf.in', 'w') as file:
file.write('trajin '+filedir+'/md1/'+BCCID+'-Pro01.nc\n')
file.write('rmsd @C,CA,N first\n')
file.write('atomicfluct out '+mydir+BCCID+'.dat @C,CA,N byres\n')
In running this block, you create a file in your scratch directory called BCCID_rmsf.in, which is your cpptraj input file. If you look at it with your favorite text editor, you should see two lines. If you only see one line, double check your code to make sure that you have a line separator ‘\n’ at the end of the first line.
Preparing the Executable
The second thing you need to run cpptraj is the executable script. This is what acts as the command line to tell cpptraj to run. This script needs to have this structure:
module load amber/18
cpptraj ${path_to_topology}.prmtop ${your_input_file_name}.in > ${your_log_file_name}.log
Where ${path_to_topology} indicates the filename of your prmtop file, which should be found in filedir+BCCID
. Your input file name is the name of the input file you created in the above code block, mydir+BCCID+'_rmsf.in'
. The >
is what we use to specify that we’d like the output to be stored into a log file, which you name yourself.
-
We could run all of this in the command line, but for the purposes of this tutorial and showing you how these analyses can be automated, we are going to generate a script instead. In a new code block in Jupyter Notebook:
with open(mydir+BCCID+'_rmsf.sh', 'w') as file: file.write('module load amber/18\n') file.write('cpptraj '+filedir+BCCID+'.prmtop '+mydir+BCCID+'_rmsf.in > '+mydir+BCCID+'_rmsf.log\n')
Double check to make sure that the input file name in your *.sh matches the input file name *.in, and that all your other file names and file paths are correct. If you need to make a change, it is okay to rerun the code block in Jupyter Notebook.
Running the Script
Now that your cpptraj files have been generated, you can now run the program by running the executable shell script (*.sh) in your command line.
-
Execute:
source ./${your_shell_script_name}.sh
Once you run this in the command line, you should wait for it to complete. When $bash pops up again, you can ls
in your scratch directory, where you will hopefully see two new files: the *.dat and *.log files which cpptraj should have created.
If the log file has been created but not the dat file, open and look at the log file to see what errors came up. Double check all your input scripts and file paths, and if you still cannot isolate the source of error, check with a mentor.
Extracting .dat Data
Your *.dat file should contain two columns, headed by “#Res” and “AtomicFlx.”
- See if you can use the python methods you’ve learned so far to extract each column of this data file into a numpy array in your jupyter notebook! Be sure to give your arrays reasonable names. A mentor will go over this with you if you get stuck.
Plotting the Data
RMSF vs. Residue Number
RMSF is typically plotted vs. residue number. At this point, you should have three data sets: one for each md replicate.
- Create a plot of RMSF on the y-axis and residue number on the x-axis. What does this plot tell you about the dynamics of your protein? Do the replicates line up? (Should they line up? Why or why not?)
RMSF vs. IC50
Now you will be challenged to use all of your python skills to generate a plot of RMSF vs. IC50. Recall that IC50 is a measurement of drug potency. High IC50 correlates to a low potency, and low IC50 correlates to a high potency.
Sometimes, IC50 is also expressed as pIC50. Similar to the pH scale, where pH = -log[H+], pIC50 = -log(IC50). This means a high pIC50 corresponds to a high potency, and vice versa.
-
In your jupyter notebook, create a function called calculateMeanRMSF. Have it take the input ${BCCID} and output list [x, y], where x = pIC50 and y = mean RMSF.
def calculateMeanRMSF(id): #return [pIC50, mRMSF]
You should be able to populate this function with everything you need to calculate the RMSF values of a given BCCID and calculate the mean RMSF value of the entire system. The pIC50 values can be found in a dictionary located in the bccHelper.py located here.
-
Once this function has been created, use it to calculate the mean RMSF value for all systems in the training set.
-
Plot this new data on a scatter plot with log(IC50) on the x-axis and meanRMSF on the y-axis. You should have 6 data points. Does mean RMSF of a system correspond to drug efficacy? Is mean RMSF a good
metric
for describing drug efficacy? -
Think about what your data mean, and start coming up with a list of metrics you could use to correlate RMSF or RMSD to IC50 of a system. Do you want to look at one specific residue or a group of residues? Do you want to investigate the dynamics of the ligand? Talk this over with your group, and maybe play around with mdtraj and cpptraj’s atom selection tools to look at different parts of the system.