Workshop 3 - Workflow Automation with Snakemake

Introduction

In this workshop, you will learn how to make your genome analysis reproducible and shareable by refactoring your scripts into a Snakemake workflow. You will use Large Language Models (LLMs) to help design, implement, and debug your workflow, and you will run your analysis on the compute cluster. This workshop is designed for students with little prior experience in workflow management or Snakemake.

Template Repo

Workshop 3

Supporting Materials

Problem Statement

Your PI wants your ancient genome analysis to be reproducible and easy for other lab members to use. Your job is to:

Refactor your scripts from previous workshops into a Snakemake workflow
Automate the steps for downloading data, running sequence analysis, and summarizing results
Run your workflow on the compute cluster
Summarize your workflow and findings in a brief report

Technical Skills Introduced

Using VS Code and git for collaborative workflow development
Introduction to workflow management with Snakemake
Writing and debugging Snakemake rules
Integrating Python scripts into workflows
Submitting Snakemake jobs to a compute cluster
Prompt engineering and iterative debugging with LLMs

Workshop Structure

Setup: Clone your workshop repository from GitHub Classroom, set up your environment, and review your previous scripts.
Workflow Design: Use LLMs to help you design a Snakemake workflow for your analysis pipeline.
Implementation: Prompt the LLM to help you write Snakemake rules for each analysis step (download, analysis, reporting).
Cluster Execution: Use LLMs to help you generate and debug cluster job submission scripts for Snakemake workflows.
Reporting: Summarize your workflow and findings in a short markdown report. All files should be tracked in git and pushed to GitHub Classroom.

Sample Initial Prompt

I need to refactor my genome analysis scripts into a Snakemake
workflow that downloads a microbial genome FASTA file, runs an
external python script, and summarizes the results. Please
generate a Snakefile and example rule for running the analysis
on a compute cluster.

Clone the GitHub Classroom Repository

Clone the GitHub Classroom repository and open a VS Code session inside it.

Try to accomplish the following milestones by working directly in the repository you cloned.

Milestones

Milestone 1

Topics and Concepts

Snakemake workflow design
Python script and qsub script
Conda environments (YAML files)

Tasks

Refactor your scripts into a Snakemake workflow
Generate a Snakefile and example rule for running the analysis on a compute cluster
Your workflow should download a single FASTA file and run the script you developed last week
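
As a starting point for the script your rule calls, here is a minimal sketch. It assumes last week's analysis was a GC-content calculation, which is only a guess based on the example file name scripts/gc_content.py in the Deliverables. The function names and the tab-separated output format are illustrative, not requirements. The script takes an input FASTA path and an output path as positional arguments, which maps naturally onto a rule's input and output.

```python
"""Hypothetical contents for a script such as scripts/gc_content.py."""
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(sequence):
    """Return the G+C fraction among A, C, G, and T bases (0.0 if there are none)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def main(fasta_path, output_path):
    """Write one 'header<TAB>gc_fraction' line per FASTA record."""
    with open(fasta_path) as handle, open(output_path, "w") as out:
        for header, seq in read_fasta(handle):
            out.write(f"{header}\t{gc_fraction(seq):.4f}\n")


# Only run when called as: python gc_content.py <input.fasta> <output.txt>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Keeping the parsing and the calculation in separate functions makes the script easy to test on its own before you wire it into the Snakefile.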

Milestone 2

Topics and Concepts

Third party tools

Tasks

Instead of using wget, use the NCBI Datasets CLI to download the genome
Add another rule that will run the tool Prokka to annotate the genome
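
Before writing the two rules, it can help to pin down the exact commands they will run. The sketch below builds the argument lists in Python so you can inspect them. The helper names, file paths, and the accession placeholder are hypothetical. The flags shown (`--include genome` and `--filename` for `datasets`; `--outdir`, `--prefix`, and `--cpus` for `prokka`) are ones I believe both tools accept, but confirm them against `datasets download genome accession --help` and `prokka --help` for your installed versions.

```python
import subprocess


def datasets_download_cmd(accession, zip_path):
    """Build an NCBI Datasets CLI command that downloads one genome as a zip archive."""
    return [
        "datasets", "download", "genome", "accession", accession,
        "--include", "genome",   # request only the genomic FASTA
        "--filename", zip_path,  # where to write the zip archive
    ]


def prokka_cmd(fasta, outdir, prefix, cpus=1):
    """Build a Prokka annotation command for one genome FASTA."""
    return [
        "prokka",
        "--outdir", outdir,
        "--prefix", prefix,
        "--cpus", str(cpus),
        fasta,
    ]


def run(cmd):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


# Print the commands instead of running them; the accession is a placeholder.
print(" ".join(datasets_download_cmd("YOUR_ACCESSION", "data/genome.zip")))
print(" ".join(prokka_cmd("data/genome.fna", "results/prokka", "genome", cpus=4)))
```

Note that the Datasets CLI writes a zip archive rather than a bare FASTA file, so the download rule will likely need an unzip step before Prokka can read the genome.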

Milestone 3

Topics and Concepts

Easily test multiple parameters
Adding resources per job (threads)

Tasks

Add another rule that runs Jellyfish (conda package kmer-jellyfish) and counts the k-mers for every value of k from 1 to 31.
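
To sanity-check the tool's output on a small test sequence, a pure-Python counter is handy. The sketch below is not how Jellyfish works internally and will be far too slow for whole genomes; it only illustrates what "counting k-mers for each k" means. It counts k-mers on the forward strand only, so the numbers will differ from a Jellyfish run that uses canonical counting (the `-C` option). The function names are illustrative.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer on the forward strand of a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_summary(sequence, k_values=range(1, 32)):
    """Return {k: (distinct_kmers, total_kmers)} for each requested k."""
    summary = {}
    for k in k_values:
        counts = count_kmers(sequence, k)
        summary[k] = (len(counts), sum(counts.values()))
    return summary


# Small demonstration on a toy sequence.
for k, (distinct, total) in kmer_summary("ACGTAC", k_values=range(1, 4)).items():
    print(f"k={k}\tdistinct={distinct}\ttotal={total}")
```

The loop over `k_values` mirrors what the workflow does with one job per value of k, except that Snakemake can run those jobs in parallel.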

Milestone 4

Topics and Concepts

Docker container integration
Handling multiple samples

Tasks

Switch to running each tool in a Docker container (executed through Singularity)
Instead of one genome, have your workflow download at least three genomes
Store your sample information in a CSV file
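
One way to read a sample sheet is with Python's built-in `csv` module, which works inside a Snakefile because Snakefiles accept ordinary Python. The sketch below assumes a CSV with two columns named `sample` and `accession`. The column names, sample names, accession placeholders, and output path pattern are all hypothetical; adapt them to your own sheet.

```python
import csv
import io


def load_samples(handle):
    """Return {sample_name: accession} from a CSV with 'sample' and 'accession' columns."""
    samples = {}
    for row in csv.DictReader(handle):
        name = row["sample"].strip()
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = row["accession"].strip()
    return samples


def fasta_targets(samples, pattern="data/{sample}.fna"):
    """Build one expected FASTA path per sample (the path pattern is an assumption)."""
    return [pattern.format(sample=name) for name in samples]


# Stand-in for open("samples.csv"); the accessions are placeholders, not real IDs.
sheet = io.StringIO(
    "sample,accession\n"
    "genome_a,ACCESSION_A\n"
    "genome_b,ACCESSION_B\n"
    "genome_c,ACCESSION_C\n"
)
samples = load_samples(sheet)
print(fasta_targets(samples))
```

Rejecting duplicate sample names early avoids two rows silently writing to the same output file.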

Milestone 5

Topics and Concepts

Opening computational notebooks in a managed environment

Tasks

Make a conda environment with pyCirclize and seaborn installed
Open a notebook and do the following:
1. Generate a circos plot for the genome annotations from Prokka
2. Graph the k-mer counts for each value of k

Deliverables

By the end of this workshop, you will have created the following artifacts:

Snakemake Workflow Files
- A complete and well-documented Snakefile and any config or rule files needed for your workflow
- Example: Snakefile, config.yaml, rules/
Integrated Python Scripts
- Python scripts for sequence analysis, adapted for use within the Snakemake workflow
- Example: scripts/gc_content.py
Cluster Submission Script
- A script or command for running your Snakemake workflow on the compute cluster (e.g., with qsub or Snakemake’s cluster integration)
- Example: run_snakemake.qsub
Workflow Output Results
- Output files generated by the workflow, including sequence statistics and any summary files
- Example: results/summary.md, results/gc_content.txt
Brief Report
- A short markdown report (1–2 paragraphs) summarizing your workflow design, results, and any challenges encountered. This should be clear enough to share with your PI or collaborators.
- Example: workflow_report.md
Version-Controlled Repository
- All code and workflow files should be tracked in your git repository and pushed to GitHub Classroom. This keeps your work reproducible and easy to share with instructors and collaborators.