=================
Cluster Computing
=================

MIT houses several computing clusters that are available for the lab to use. As of 2025, we use the Engaging cluster, though this may change in the future.

Creating and setting up an account
==================================

**Create an account**

Following the instructions on the `MIT ORCD docs page `_, log in to the Engaging cluster through the web portal using your Kerberos ID and password (`instructions here `_). This will automatically trigger a new account to be created.

.. note::
   There may be a delay of a day after creating your account before you can start any jobs. However, you should still be able to log in.

Confirm you can log in to Engaging via your terminal or PowerShell using ``ssh``. Replace ``[your-kerberos]`` below with your Kerberos ID.

.. code-block:: console

   $ ssh [your-kerberos]@orcd-login.mit.edu

This will prompt you for your Kerberos password and Duo authentication.

First-time setup
----------------

**Add an ``ssh`` shortcut**

Once you've confirmed that you can log in, create an ``ssh`` shortcut to the cluster. On your computer (not in the cluster), add the following to your config file using ``nano ~/.ssh/config``:

.. code-block::

   Host engaging
       HostName orcd-login.mit.edu
       User [your-kerberos]
       ForwardAgent yes

While you're at it, add a shortcut to the BioMicro Center cluster. This is where they'll temporarily store your sequencing data.

.. code-block::

   Host bmc
       HostName bmc-150.mit.edu
       User galloway_ill

.. important::
   You can't use ``nano`` on Windows. Instead, navigate to the folder directly in File Explorer and edit your config file with a text editor:

   1. In PowerShell, run ``cd ~/.ssh``
   2. Get the directory path with ``pwd``
   3. Copy this path into File Explorer. It might look like ``C:\Users\ChemeGrad2025\.ssh``
   4. Once you've located the hidden ``.ssh`` directory, edit the config file with Notepad (or VSCode, etc.) and add in the above.

   See `MIT ORCD docs SSH key setup `_ if stuck.

**Set up the link to the shared data folder**

We have a 20 TB shared data folder on the Engaging cluster. It is located at ``/orcd/data/katiegal/002``, which is an annoying path to type. Instead, we like to put a link in your home directory, which is where you start when you SSH in. You only have to create this symbolic link (symlink) once. To make the symlink, run:

.. code-block::

   $ ln -s /orcd/data/katiegal/002 ~/katiegal_shared

The relevant folders here are:

- ``data/raw_reads``: where we put all our raw data
- ``projects``: where we clone git repos for analysis pipelines, etc.
- ``hpc_infra``: infrastructure scripts and other useful items

.. code-block::

   katiegal_shared
   ├── data
   │   └── raw_reads
   ├── hpc_infra
   └── projects

.. note::
   2025.12.12 - NBW: We used to have another folder at ``/orcd/pool/003/katiegal_shared/``. If something is missing, it is likely there. You should symlink it, e.g. ``ln -s /orcd/pool/003/katiegal_shared/data/raw_reads/250425Gal/ ~/katiegal_shared/data/raw_reads/250425Gal``
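Either way, you can double-check where a symlink points with ``ls``. The output below is illustrative and will vary with your username:

.. code-block:: console

   $ ls -l ~/katiegal_shared
   lrwxrwxrwx 1 [your-kerberos] ... katiegal_shared -> /orcd/data/katiegal/002
   $ ls ~/katiegal_shared/
   data  hpc_infra  projects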
**Set up an SSH key to make Git repos easier to access**

Based on `MIT ORCD docs "SSH key setup" `_ and `GitHub docs "Using SSH agent forwarding" `_.

At a high level: SSH agent forwarding can be used to make deploying to a server simple. It allows you to use your local SSH keys instead of leaving keys (without passphrases!) sitting on remote servers, like the Engaging cluster. You can set up ``ssh-agent`` on your local computer; it runs in the background and keeps your SSH key loaded into memory so you don't need to enter a passphrase every time you use the key. Then, you can give remote servers, like the Engaging cluster, access to your local ``ssh-agent`` as if it were running on the server. This is sort of like asking a friend to enter their password so that you can use their computer. The end result is that you can use ``git clone`` and other commands on the Engaging cluster without having to re-enter passphrases every time.

We'll start with `GitHub docs "Using SSH agent forwarding" `_. Check whether your own SSH key is set up and working by entering ``ssh -T git@github.com`` in the terminal. If successful, it will look like:

.. code-block:: console

   # Attempt to SSH in to GitHub
   $ ssh -T git@github.com
   > Hi USERNAME! You've successfully authenticated, but GitHub does not provide shell access.

If not, next make sure your local computer has an SSH public key for GitHub, based on `GitHub docs "Adding a new SSH key to your GitHub account" `_:

1. Check for an existing SSH public key on your local computer: `GitHub docs "Checking for existing SSH" `_
2. If no key exists, generate a new SSH public key: `GitHub docs "Generating a new SSH key and adding it to the ssh-agent" `_

After confirming you have an SSH public key on your local computer, give this SSH key to GitHub so GitHub can identify you without having to log in every time. To do so:

1. Copy the SSH public key to your clipboard, e.g. ``pbcopy < ~/.ssh/id_ed25519.pub`` on macOS. If your SSH public key file has a different name than the example, modify the filename to match your setup; for example, NBW's key is located at ``~/.ssh/id_rsa.pub``. When copying your key, don't add any newlines or whitespace.
2. In the upper-right corner of any page on GitHub, click your profile picture, then click "Settings".
3. In the "Access" section of the sidebar, click "SSH and GPG keys".
4. Click "New SSH key" or "Add SSH key".
5. In the "Title" field, add a descriptive label for the new key. For example, if you're using a personal laptop, you might call this key "Personal laptop".
6. Select the type of key, either authentication or signing. For more information about commit signing, see "About commit signature verification".
7. In the "Key" field, paste your public key.
8. Click "Add SSH key".

Check that your SSH key is now set up and working with GitHub by entering ``ssh -T git@github.com`` in the terminal again.

Now GitHub has your public key, but you still need to let ``ssh-agent`` access your private key. This way, when a remote server with ``ForwardAgent yes`` needs to sign something with your private key, the request gets funneled back to your ``ssh-agent``, which returns the signed request, so the private key never leaves your local computer. By copying the public key onto remote systems (pasting it into GitHub like we just did, or using ``ssh-copy-id``), your public key gets pre-loaded onto remote systems, but you still control access to your private key for each individual remote server.

To make your key available to ``ssh-agent`` (see the example after this list):

1. Check that your key is visible to ``ssh-agent`` by running the following command on your local computer: ``ssh-add -L``
2. If the command says that no identity is available, add your key with ``ssh-add``. This adds any "default" keys. You can also add a specific key; for NBW this looks like ``ssh-add ~/.ssh/id_rsa``, which is different from the public key, ``~/.ssh/id_rsa.pub``!
3. On macOS, ``ssh-agent`` will "forget" this key once it gets restarted during a reboot, but you can import your SSH keys into Keychain using ``ssh-add --apple-use-keychain YOUR-KEY``.
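For example, on macOS the whole check-and-add sequence might look like this (``id_ed25519`` is just an example filename; substitute whatever your key is actually called):

.. code-block:: console

   $ ssh-add -L                                       # list keys the agent currently holds
   The agent has no identities.
   $ ssh-add --apple-use-keychain ~/.ssh/id_ed25519   # macOS: add the key and remember it in Keychain
   $ ssh-add -L                                       # the key should now be listed
   # On Linux or Windows, use plain "ssh-add ~/.ssh/id_ed25519" (there is no Keychain option).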
Great! You should be done now! The secret was in something we added before:

.. code-block::

   Host engaging
       HostName orcd-login.mit.edu
       User [your-kerberos]
       ForwardAgent yes

The ``ForwardAgent yes`` line tells your ``ssh-agent`` to let the Engaging cluster use your local keys. This is known as "SSH agent forwarding". Check to make sure it's set up correctly:

1. Log in to the Engaging cluster using ``ssh engaging`` and authenticate.
2. On the Engaging cluster, test whether the SSH key is set up and working with GitHub by entering ``ssh -T git@github.com`` in the terminal. Like before, if successful it should say ``Hi USERNAME! You've successfully authenticated, but GitHub does not provide shell access.``

If it's not working, check `GitHub docs "Using SSH agent forwarding: Troubleshooting SSH agent forwarding" `__ for more details.

**rclone**

If you have an ``rclone`` remote configured for Smithsonian, a simple command copying files stored in Smithsonian to the cluster could be:

.. code-block:: console

   $ rclone copy smithsonian:data/NGS/raw_reads/251204_Plasmidsaurus ~/katiegal_shared/data/raw_reads/251204_Plasmidsaurus

This works bidirectionally! You can copy results back into Smithsonian directly.

**sftp**

To transfer local files, we use SFTP. On your local computer (not in the cluster), run:

.. code-block:: console

   $ sftp [your-kerberos]@orcd-login.mit.edu

This connects your local computer to the Engaging cluster. You should see ``katiegal_shared``. Then use ``put`` to upload files to ``katiegal_shared/data/raw_reads``:

.. code-block::

   put path/to/local/directory/filename.extension /path/to/remote/directory/newname.extension

See the Plasmidsaurus section below for a full upload example.

.. important::
   We have multiple data folders on the Engaging cluster. Ideally everything should be symlinked into the current folder. TODO ADD MORE DETAILS: ``/orcd/data/katiegal/003``

So the next thing to do is to clone your git repo:

.. code-block::

   $ cd ~/katiegal_shared/projects
   $ git clone https://github.com/GallowayLabMIT/[your_project]
   $ git config --global --add safe.directory /orcd/data/katiegal/002/projects/[your_project]
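If you set up SSH agent forwarding above, you can also clone over SSH rather than HTTPS, which avoids credential prompts on the cluster (``[your_project]`` is a placeholder, as above):

.. code-block::

   $ git clone git@github.com:GallowayLabMIT/[your_project].git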
A convenient way to organize your project is to add a folder called ``cluster`` (or similar) in the root directory of your project repo. Here, you can add pipelines to run on the cluster separate from the other data analysis (e.g., flow) for your project.

.. warning::
   Below needs to be updated.

TODO: suggested project folder structure. A suggested ``cluster`` layout, along with a ``.gitignore``, looks like:

.. code-block::

   cluster
   ├── config
   │   └── samplesheet.csv
   ├── data
   │   └── raw
   ├── envs
   │   ├── deseq2.yaml
   │   ├── salmon.yaml
   │   └── trim_reads.yaml
   ├── inputs
   │   └── transgenes
   │       ├── transgenes-eGFP.fna
   │       └── transgenes-eGFP.gtf
   ├── load_snakemake.sh
   ├── profiles
   │   └── default
   │       └── config.yaml
   ├── scripts
   │   └── run_deseq2.R
   └── Snakefile

**Upload data to Engaging**

TODO. Then, symlink the data to your project folder:

.. code-block::

   ln -s /orcd/data/katiegal/002/data/raw_reads YourPath

**Uploading RNA-seq data from Plasmidsaurus**

We use ``sftp`` to copy data between servers, either remote (e.g. the Engaging cluster) or local (your computer). You can look at the `SFTPCloud docs `_ for more info.

For **Plasmidsaurus**, download the fastq.zip file (e.g. ``4Y5Y7T_fastq.zip``, which contains fastq.gz files). Open a new terminal or PowerShell and connect as described in the **sftp** section above:

.. code-block::

   sftp [your-kerberos]@orcd-login.mit.edu

Once connected, you should see ``katiegal_shared``. Before you upload your data, make a new directory to hold it under ``katiegal_shared/data/raw_reads``, then use ``put`` to upload the sequencing data. It should look something like this:

.. code-block::

   mkdir katiegal_shared/data/raw_reads/251204_Plas
   put C:\Users\ChemeGrad2019\Downloads\4Y5Y7T_fastq.zip katiegal_shared/data/raw_reads/251204_Plas/4Y5Y7T_fastq.zip

Then unzip your files and delete the original zip.

**Run pipeline**

TODO (see KL's notes below)

**Download output to local computer**

TODO (see KL's notes below)

KL notes
========

**Run pipeline**

- In your project folder, do ``git pull`` to confirm you are up to date.
- Do ``tmux new`` to activate a `terminal multiplexer <https://github.com/tmux/tmux/wiki>`_. This will keep things running in the background even if you close your computer.
- Add modules:

  - ``. ~/katiegal_shared/hpc-infra/modules/activate.sh``
  - ``module add snakemake``

- Do a dry run to check for errors: ``snakemake --dry-run``
- Tip: create the conda environments (a long step) using a compute node:

  - ``salloc --mem 20G -c 10 -p mit_normal``
  - ``snakemake --conda-create-envs-only``

- Then, run your pipeline: ``snakemake --default-resources slurm_partition=mit_preemptable --keep-going --retries 3``

  - Do this when you know your pipeline is good; otherwise, just do ``snakemake`` inside the folder with your ``Snakefile``.

- To exit the tmux window, type ``ctrl-b d`` (detaches, keeps running in the background).
- To check on progress, do ``tmux attach``.

**Download plots, etc. from the cluster**

- Navigate to the directory where you want to download the data (e.g., your computer's Downloads folder, or some output folder in your local project repo).
- Log in to the cluster using ``sftp``: ``sftp engaging``. Approve the Duo request (note that nothing will pop up).
- Navigate to what you'd like to transfer, e.g., ``cd katiegal_shared/projects/YourProject``.
- Download the data using the ``get`` command: ``get -R PathToFolderToCopy``.
- See the `sftp manual <https://man.openbsd.org/sftp.1>`_ for more options.
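Putting the download steps together, a session might look like this (the project name and ``output`` folder are placeholders):

.. code-block:: console

   $ cd ~/Downloads                                # local directory to download into
   $ sftp engaging
   sftp> cd katiegal_shared/projects/YourProject
   sftp> get -R output                             # recursively download the "output" folder
   sftp> exit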