Cluster Computing

MIT houses several computing clusters that are available for the lab to use. As of 2025, we use the Engaging cluster, though this may change in the future.

Creating and setting up an account

Create an account

Following the instructions on the MIT ORCD docs page, log in to the Engaging cluster through the web portal using your Kerberos ID and password (instructions here). This will automatically trigger a new account to be created.

Note

There may be a delay of a day after creating your account before you can start any jobs. However, you should still be able to log in.

Confirm you can log in to Engaging via your terminal or PowerShell using ssh. Replace [your-kerberos] below with your Kerberos ID.

$ ssh [your-kerberos]@orcd-login.mit.edu

This will prompt you for your Kerberos password and Duo authentication.

First-time setup

Add an ``ssh`` shortcut

Once you’ve confirmed that you can log in, create an ssh shortcut to the cluster:

On your computer (not in the cluster), add the following to your config file using nano ~/.ssh/config:

Host engaging
    HostName orcd-login.mit.edu
    User [your-kerberos]
    ForwardAgent yes

While you’re at it, add a shortcut to the BioMicro Center cluster. This is where they’ll temporarily store your sequencing data.

Host bmc
    HostName bmc-150.mit.edu
    User galloway_ill
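With those entries saved, you can connect using just the alias. To check what a shortcut resolves to without actually connecting, ssh -G prints the resolved configuration:

```shell
# Connect using the shortcut instead of the full hostname
ssh engaging

# Print the resolved config for the alias (no connection is made)
ssh -G engaging | grep -i '^hostname'
```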

Important

You can’t use nano on Windows. Instead, navigate to the folder directly in the File Explorer and edit your config file with a text editor:

  1. In PowerShell, run cd ~/.ssh

  2. Get the directory path by pwd

  3. Copy this path into “File Explorer”. This might look like C:\Users\ChemeGrad2025\.ssh

  4. Once you’ve located the hidden .ssh directory, edit the config file with “Notepad” (or “VSCode”, etc.) and add the entries above.

See the MIT ORCD docs “SSH key setup” if you get stuck.

Setup the link to the shared data folder

We have a 20TB shared data folder on the Engaging cluster. It is located at /orcd/data/katiegal/002, which is an annoying path to type. Instead, we like to put a link in your home directory, which is the place where you start when you SSH in.

You only have to create this symbolic link (symlink) once. To make the symlink, run:

$ ln -s /orcd/data/katiegal/002 ~/katiegal_shared
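To confirm the link points where you expect (paths as above), list it or read it back:

```shell
# -l shows "katiegal_shared -> /orcd/data/katiegal/002"
ls -l ~/katiegal_shared

# readlink prints just the target path
readlink ~/katiegal_shared
```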

The relevant folders here are:

  • data/raw_reads: where we put all our raw data

  • projects: where we clone git repos for analysis pipelines, etc.

  • hpc_infra: infrastructure scripts and other useful items.

katiegal_shared
├── data
│   └── raw_reads
├── hpc_infra
└── projects

Note

2025.12.12 - NBW: We used to have another folder at /orcd/pool/003/katiegal_shared/. If something is missing it is likely there. You should symlink it, e.g. ln -s /orcd/pool/003/katiegal_shared/data/raw_reads/250425Gal/ ~/katiegal_shared/data/raw_reads/250425Gal

Set up an SSH key to make Git repos easier to access

Based on MIT ORCD docs “SSH key setup” and GitHub docs “Using SSH agent forwarding”

At a high level: SSH agent forwarding can be used to make deploying to a server simple. It allows you to use your local SSH keys instead of leaving keys (without passphrases!) sitting on remote servers, like the Engaging cluster.

You can set up ssh-agent for your local computer which runs in the background and keeps your SSH key loaded into memory so you don’t need to enter a passphrase every time you need to use the key. Then, you can give remote servers, like the Engaging cluster, access to your local ssh-agent as if they were running on the server. This is sort of like asking a friend to enter their password so that you can use their computer.

The end result: you can use git clone and other commands on the Engaging cluster without having to re-enter passphrases every time.

We’ll start with GitHub docs “Using SSH agent forwarding”. Check to see if your own SSH key is set up and working by entering ssh -T git@github.com in the terminal. If successful it will look like:

$ ssh -T git@github.com
# Attempt to SSH in to github
> Hi USERNAME! You've successfully authenticated, but GitHub does not provide shell access.

If not, make sure your local computer has an SSH key for GitHub, per GitHub docs “Adding a new SSH key to your GitHub account”:

  1. Check for an existing SSH public key on your local computer: GitHub docs “Checking for existing SSH”

  2. If no key exists, generate a new SSH key: GitHub docs “Generating a new SSH key and adding it to the ssh-agent”

After confirming you have an SSH public key on your local computer, give this key to GitHub so GitHub can identify you without you having to log in every time. To do so:

  1. Copy the SSH public key to your clipboard (on macOS: pbcopy < ~/.ssh/id_ed25519.pub). If your SSH public key file has a different name than the example code, modify the filename to match your setup; for example, NBW’s key is located at ~/.ssh/id_rsa.pub. When copying your key, don’t add any newlines or whitespace.

  2. In the upper-right corner of any page on GitHub, click your profile picture, then click “Settings”.

  3. In the “Access” section of the sidebar, click “SSH and GPG keys”.

  4. Click “New SSH key” or “Add SSH key”.

  5. In the “Title” field, add a descriptive label for the new key. For example, if you’re using a personal laptop, you might call this key “Personal laptop”.

  6. Select the type of key, either authentication or signing. For more information about commit signing, see About commit signature verification.

  7. In the “Key” field, paste your public key.

  8. Click “Add SSH key”.

Check to see if your own SSH key is set up and working again with Github by entering ssh -T git@github.com in the terminal.

Now GitHub has your public key, but you still need to give ssh-agent access to your private key. This way, when a remote server with ForwardAgent yes needs to sign something with your private key, the request gets funneled back to your local ssh-agent, which returns the signed result, so the private key never leaves your local computer. By copying the public key onto remote systems (pasting it into GitHub like we just did, or using ssh-copy-id), your public key gets pre-loaded onto remote systems while you still control access to your private key for each individual remote server.

To make your key available to ssh-agent:

  1. Check that your key is visible to ssh-agent by running the following command on your local computer: ssh-add -L

  2. If the command says that no identity is available, add your key with ssh-add. This adds your “default” keys. You can also add a specific key; for NBW this looks like ssh-add ~/.ssh/id_rsa, which is the private key, not the public key ~/.ssh/id_rsa.pub!

  3. On macOS, ssh-agent will “forget” this key when it restarts after a reboot. You can import your SSH key into Keychain using this command: ssh-add --apple-use-keychain YOUR-KEY
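The checks above, as a single terminal session (the id_rsa filename is NBW's example; substitute your own key):

```shell
# 1. List keys currently loaded into the agent
ssh-add -L

# 2. If no identities are listed, load your default key(s)...
ssh-add

# ...or load a specific private key (note: the private key, not the .pub file)
ssh-add ~/.ssh/id_rsa

# 3. macOS only: keep the key in Keychain across reboots
ssh-add --apple-use-keychain ~/.ssh/id_rsa
```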

Great! You should be done now! The secret was in something we added before:

Host engaging
    HostName orcd-login.mit.edu
    User [your-kerberos]
    ForwardAgent yes

The ForwardAgent yes tells your ssh-agent to let the Engaging cluster use your local keys. This is known as “SSH agent forwarding”.

Check to make sure it’s set up correctly:

  1. Log in to the Engaging cluster using ssh engaging

  2. On the Engaging cluster, test to see if the SSH key is set up and working again with Github by entering ssh -T git@github.com in the terminal. Like before, if successful it should say Hi USERNAME! You've successfully authenticated, but GitHub does not provide shell access.

If it’s not working, check GitHub docs “Using SSH agent forwarding: Troubleshooting SSH agent forwarding” for more tips.

Per-project setup

TODO

  • .gitignore?

Set up project Git repo

After getting your ssh-agent set up as described above, you should clone your project repo into ~/katiegal_shared/projects/. This will let you edit your script files locally or on the server, and track changes. The end result should look something like this:

katiegal_shared
├── data
├── hpc_infra
└── projects
    └── project_repo
        ├── analysisFile.ipynb
        ├── datadir.txt
        └── cluster
            ├── config      # Metadata for configuring
            ├── data        # Data you don't want tracked, like genomes
            │    └── raw_reads
            ├── envs        # TODO DESCRIPTION
            ├── inputs      # Inputs that should be tracked, like transgenes or metadata
            ├── profiles    # TODO DESCRIPTION
            ├── scripts     # Scripts for analysis
            └── Snakefile   # Runs pipeline

Clone the project repo:

  1. ssh engaging and log into the Engaging cluster

  2. Navigate to the projects directory by cd katiegal_shared/projects

  3. Clone your project repo by using the ssh url which you can get from GitHub. This might look like git clone git@github.com:GallowayLabMIT/project_repo.git
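The three steps above, as one session (the repo URL is the example from the text; substitute your own):

```shell
# On your local machine: log in to the cluster
ssh engaging

# On the cluster: go to the shared projects directory and clone over SSH
cd ~/katiegal_shared/projects
git clone git@github.com:GallowayLabMIT/project_repo.git
```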

At the end of this, you should get something like this

katiegal_shared
├── data
├── hpc_infra
└── projects
    └── project_repo
        ├── analysisFile.ipynb
        └── datadir.txt

Set up data folder in project Git repo

You will want to make a new directory to house all of your Engaging cluster files. You can either copy a cluster folder from someone else’s pipeline or make a new one. To make a new one:

  1. Run mkdir ~/katiegal_shared/projects/project_repo/cluster

  2. Run mkdir ~/katiegal_shared/projects/project_repo/cluster/data. This will house any untracked data, like genomes and raw_reads
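Both mkdir calls can be combined into one with the -p flag, which creates any missing parent directories and does nothing if they already exist:

```shell
# Create cluster/ and cluster/data/ in one step; safe to re-run
mkdir -p ~/katiegal_shared/projects/project_repo/cluster/data
```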

It should look like this

katiegal_shared
├── data
├── hpc_infra
└── projects
    └── project_repo
        ├── analysisFile.ipynb
        ├── datadir.txt
        └── cluster
            └── data        # Data you don't want tracked, like genomes

Next we want to symlink in the raw_reads so you can easily access it:

  1. Run ln -s ~/katiegal_shared/data/raw_reads/ ~/katiegal_shared/projects/project_repo/cluster/data/raw_reads

It should look like this

katiegal_shared
├── data
├── hpc_infra
└── projects
    └── project_repo
        ├── analysisFile.ipynb
        ├── datadir.txt
        └── cluster
            └── data
                └── raw_reads   # Symlink to ~/katiegal_shared/data/raw_reads/

Transferring files

There are two major ways to transfer files: rclone for transferring files to and from Smithsonian or the BMC, and sftp for transferring files from your local computer.

rclone

rclone is an all-purpose tool for moving files between servers, especially cloud providers. We have it set up with two “remotes”:

  • bmc: the BioMicroCenter data directory

  • smithsonian: our data storage.

Once per SSH session, you need to activate the rclone module. You do this by running our HPC-infrastructure activate script and adding the module:

$ . ~/katiegal_shared/hpc-infra/modules/activate.sh
$ module add rclone

Then, you can use rclone. See the rclone documentation for more details, but a simple command to copy files stored in Smithsonian to the cluster could be:

$ rclone copy smithsonian:data/NGS/raw_reads/251204_Plasmidsaurus ~/katiegal_shared/data/raw_reads/251204_Plasmidsaurus

This works bidirectionally! Swap the source and destination arguments to copy results back into Smithsonian directly.

sftp

To transfer local files, we use SFTP. On your local computer (not in the cluster), run:

$ sftp [your-kerberos]@orcd-login.mit.edu

This connects your local computer to the Engaging cluster. You should see katiegal_shared. Then use put to upload the sequencing data to katiegal_shared/data/raw_reads:

put path/to/local/directory/filename.extension /path/to/remote/directory/newname.extension

Before you upload your data, make a new directory to hold it, katiegal_shared/data/raw_reads/new_directory_name. Altogether, it should look something like this:

mkdir katiegal_shared/data/raw_reads/251204_Plas
put C:\Users\ChemeGrad2019\Downloads\4Y5Y7T_fastq.zip katiegal_shared/data/raw_reads/251204_Plas/4Y5Y7T_fastq.zip

Then unzip your files and delete the original zip.
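A minimal sketch of that last step, assuming the example filename from above and that unzip is available on the cluster:

```shell
cd ~/katiegal_shared/data/raw_reads/251204_Plas
unzip 4Y5Y7T_fastq.zip   # extracts the fastq.gz files alongside the zip
rm 4Y5Y7T_fastq.zip      # delete the original zip once extracted
```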

Important

We have multiple data folders on the Engaging cluster. Ideally everything should be symlinked into the current folder. TODO ADD MORE DETAILS ``/orcd/data/katiegal/003``

So the next thing to do is to clone your git repo:

$ cd ~/katiegal_shared/projects
$ git clone https://github.com/GallowayLabMIT/[your_project]
$ git config --global --add safe.directory /orcd/data/katiegal/002/projects/[your_project]

A convenient way to organize your project is to add a folder called cluster (or similar) in the root directory of your project repo. Here, you can add pipelines to run on the cluster separate from the other data analysis (e.g., flow) for your project.

Warning

Below needs to be updated

TODO: suggested project folder structure

cluster/

  • data/
    • raw

  • envs/

  • inputs/

  • profiles/

  • scripts/

  • Snakefile

  • .gitignore

cluster
├── config
│   └── samplesheet.csv
├── data
│   └── raw
├── envs
│   ├── deseq2.yaml
│   ├── salmon.yaml
│   └── trim_reads.yaml
├── inputs
│   └── transgenes
│       ├── transgenes-eGFP.fna
│       └── transgenes-eGFP.gtf
├── load_snakemake.sh
├── profiles
│   └── default
│       └── config.yaml
├── scripts
│   └── run_deseq2.R
└── Snakefile

Upload data to Engaging

TODO

Then, symlink the data into your project folder:

ln -s /orcd/data/katiegal/002/data/raw_reads YourPath

Uploading RNA-seq data from Plasmidsaurus

We use sftp to copy data between servers, either remote (e.g. Engaging cluster) or local (your computer). You can look at SFTPCloud docs for more info.

For Plasmidsaurus, download the fastq.zip file (e.g. “4Y5Y7T_fastq.zip”, which contains fastq.gz files). Then open a new terminal or PowerShell and follow the sftp steps above: connect to the cluster, make a new directory under katiegal_shared/data/raw_reads, put the zip file there, then unzip your files and delete the original zip.

Run pipeline

TODO (KL has notes)

Download output to local computer

TODO (KL has notes)

KL notes

Run pipeline

  • in your project folder, run git pull to confirm you are up-to-date

  • run tmux new to start a terminal multiplexer (https://github.com/tmux/tmux/wiki)
    • this will keep things running in the background even if you close your computer

  • add modules
    • . ~/katiegal_shared/hpc-infra/modules/activate.sh

    • module add snakemake

  • do a dry run to check for errors
    • snakemake --dry-run

  • tip: create the conda environments (long step) using a compute node
    • salloc --mem 20G -c 10 -p mit_normal

    • snakemake --conda-create-envs-only

  • then, run your pipeline
    • snakemake --default-resources slurm_partition=mit_preemptable --keep-going --retries 3

    • do this when you know your pipeline is good, otherwise just run snakemake inside the folder with your Snakefile

  • to exit the tmux window, type ctrl-b d (detaches, keeps running in background)

  • to check on progress, run tmux attach

Download plots, etc. from cluster

  • navigate to the directory where you want to download the data
    • e.g., your computer’s Downloads folder, or some output folder in your local project repo

  • log in to the cluster using sftp
    • sftp engaging

    • approve the Duo request (note that nothing will pop up)

  • navigate to what you’d like to transfer
    • e.g., cd katiegal_shared/projects/YourProject

  • download the data using the get command