Workshop 3 - Workflow Automation with Snakemake
Introduction
In this workshop, you will learn how to make your genome analysis reproducible and shareable by refactoring your scripts into a Snakemake workflow. You will use Large Language Models (LLMs) to help design, implement, and debug your workflow, and you will run your analysis on the compute cluster. This workshop is designed for students with little prior experience in workflow management or Snakemake.
Template Repo
Supporting Materials
Problem Statement
Your PI wants your ancient genome analysis to be reproducible and easy for other lab members to use. Your job is to:
- Refactor your scripts from previous workshops into a Snakemake workflow
- Automate the steps for downloading data, running sequence analysis, and summarizing results
- Run your workflow on the compute cluster
- Summarize your workflow and findings in a brief report
Technical Skills Introduced
- Using VS Code and git for collaborative workflow development
- Introduction to workflow management with Snakemake
- Writing and debugging Snakemake rules
- Integrating Python scripts into workflows
- Submitting Snakemake jobs to a compute cluster
- Prompt engineering and iterative debugging with LLMs
Workshop Structure
- Setup: Clone your workshop repository from GitHub Classroom, set up your environment, and review your previous scripts.
- Workflow Design: Use LLMs to help you design a Snakemake workflow for your analysis pipeline.
- Implementation: Prompt the LLM to help you write Snakemake rules for each analysis step (download, analysis, reporting).
- Cluster Execution: Use LLMs to help you generate and debug a cluster job submission script for your Snakemake workflow.
- Reporting: Summarize your workflow and findings in a short markdown report. All files should be tracked in git and pushed to GitHub Classroom.
Sample Initial Prompt
I need to refactor my genome analysis scripts into a Snakemake
workflow that downloads a microbial genome FASTA file, runs an
external python script, and summarizes the results. Please
generate a Snakefile and example rule for running the analysis
on a compute cluster.
Clone the GitHub Classroom repository
Clone the GitHub Classroom repository and open a VS Code session inside it.
Try to complete the following milestones by working directly in that repository.
Milestones
Milestone 1
Topics and Concepts
- Snakemake workflow design
- Python script and qsub script
- Conda environments - YML files
Tasks
- Refactor your scripts into a Snakemake workflow
- Generate a Snakefile and example rule for running the analysis on a compute cluster
- Your workflow should download a single FASTA file and run the script you developed last week
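As a rough illustration of adapting a script for Snakemake, the sketch below assumes last week's script computes GC content (as the example path scripts/gc_content.py under Deliverables suggests). The function names and the output columns are hypothetical. The final lines rely on the `snakemake` object that Snakemake injects when a rule runs a file through its `script:` directive, so the same file can still be imported or tested on its own.

```python
"""Compute GC content for every record in a FASTA file (illustrative sketch)."""


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA file handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            parts = line[1:].split()
            # Keep only the first word of the header as the record ID.
            record_id, chunks = (parts[0] if parts else "unnamed"), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_fraction(sequence):
    """Return G+C as a fraction of A, C, G and T bases (N and gaps are ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def write_gc_table(fasta_path, output_path):
    """Write one tab-separated line per FASTA record: ID, length, GC fraction."""
    with open(fasta_path) as fin, open(output_path, "w") as fout:
        fout.write("record\tlength\tgc_fraction\n")
        for record_id, seq in read_fasta(fin):
            fout.write(f"{record_id}\t{len(seq)}\t{gc_fraction(seq):.4f}\n")


# When a rule runs this file via the `script:` directive, Snakemake defines a
# global `snakemake` object holding the rule's input and output paths.
if "snakemake" in globals():
    write_gc_table(snakemake.input[0], snakemake.output[0])  # noqa: F821
```

A rule in the Snakefile would then declare the FASTA file as `input`, the table as `output`, and point `script:` at this file. Keeping the logic in plain functions makes it easy to test outside the workflow.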
Milestone 2
Topics and Concepts
- Third party tools
Tasks
- Instead of using wget, use the NCBI Datasets CLI to download the genome
- Add another rule that will run the tool Prokka to annotate the genome
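The two commands these rules wrap can be sketched as small Python helpers that build the argument lists. This is only an illustration: the helper names, file paths, and default CPU count are assumptions, and the printed strings are what you might paste into a rule's `shell:` section. Prokka's `--force` flag is included because Snakemake typically creates the output directory before the job starts, and Prokka otherwise refuses to write into an existing directory.

```python
import shlex


def datasets_download_cmd(accession, zip_path):
    """Build an NCBI Datasets CLI call that downloads one genome as a zip archive."""
    return [
        "datasets", "download", "genome", "accession", accession,
        "--include", "genome",      # request only the genomic FASTA
        "--filename", zip_path,     # where to save the zip archive
    ]


def prokka_cmd(fasta, outdir, prefix, cpus=4):
    """Build a Prokka call that annotates one genome FASTA file."""
    return [
        "prokka",
        "--outdir", outdir,
        "--prefix", prefix,         # base name for all output files
        "--cpus", str(cpus),
        "--force",                  # allow writing into an existing directory
        fasta,
    ]


if __name__ == "__main__":
    # Example accession only; substitute the genome you are analysing.
    print(shlex.join(datasets_download_cmd("GCF_000005845.2", "data/genome.zip")))
    print(shlex.join(prokka_cmd("data/genome.fna", "results/prokka", "genome")))
```

The Datasets CLI delivers a zip archive, so the download rule will likely also need an unzip step before the FASTA file is available to Prokka.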
Milestone 3
Topics and Concepts
- Easily test multiple parameters
- Adding resources per job (threads)
Tasks
- Add another rule that runs kmer-jellyfish and counts the number of k-mers for every value of k from 1 to 31.
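Because a Snakefile is Python, the parameter sweep can be expressed with ordinary Python at the top of the file. The sketch below lists the 31 output files the workflow would need and builds the matching Jellyfish command for one value of k. The output path layout, helper names, and the hash size are assumptions chosen for illustration.

```python
K_VALUES = list(range(1, 32))  # k = 1, 2, ..., 31 inclusive


def kmer_targets(sample, k_values=K_VALUES):
    """List one Jellyfish output file per value of k for a sample."""
    return [f"results/kmers/{sample}/k{k}.jf" for k in k_values]


def jellyfish_count_cmd(fasta, out_jf, k, threads=1, hash_size="100M"):
    """Build a `jellyfish count` call for a single k-mer length."""
    return [
        "jellyfish", "count",
        "-m", str(k),         # k-mer length
        "-s", hash_size,      # initial hash table size
        "-t", str(threads),   # worker threads
        "-C",                 # count canonical k-mers (both strands together)
        "-o", out_jf,
        fasta,
    ]


if __name__ == "__main__":
    targets = kmer_targets("genome")
    print(len(targets), targets[0], targets[-1])
    print(" ".join(jellyfish_count_cmd("data/genome.fna", targets[0], k=1, threads=4)))
```

In a Snakefile, the same list of targets is what Snakemake's `expand()` would produce from a pattern containing a `{k}` wildcard, and the thread count would come from the rule's `threads:` setting rather than a hard-coded number.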
Milestone 4
Topics and Concepts
- Docker container integration
- Handling multiple samples
Tasks
- Switch to using Docker containers (run through Singularity) for each tool
- Instead of one genome, have your workflow download at least three genomes
- Store your sample information in a CSV file
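One possible way to read the sample sheet is sketched below, using only the standard library. The column names `sample` and `accession`, the file name, and the function name are assumptions; adjust them to whatever your CSV actually contains.

```python
import csv


def load_samples(csv_path):
    """Read a sample sheet and return a {sample_name: accession} dictionary."""
    samples = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            name = row["sample"].strip()
            if name in samples:
                # Duplicate names would make two jobs write the same output files.
                raise ValueError(f"duplicate sample name: {name}")
            samples[name] = row["accession"].strip()
    return samples


if __name__ == "__main__":
    import os

    # Hypothetical location of the sample sheet.
    if os.path.exists("samples.csv"):
        for name, accession in load_samples("samples.csv").items():
            print(name, accession)
```

Placed at the top of a Snakefile, the resulting dictionary can supply both the list of sample names for the final targets and the accession each download job needs.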
Milestone 5
Topics and Concepts
- Opening computational notebooks in a managed environment
Tasks
- Make a conda environment with pyCirclize and seaborn installed
- Open a notebook and do the following:
- Generate a circos plot for the genome annotations from Prokka
- Graph the values of the k-mer counting for each tool
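Before plotting, it can help to check what the annotation file actually contains. The sketch below tallies feature types in a GFF3 file such as the one Prokka writes; it uses only the standard library, and the function name and file path are hypothetical. It stops at the `##FASTA` marker because Prokka appends the genome sequence to the end of its GFF output.

```python
from collections import Counter


def count_gff_features(gff_path):
    """Count how many features of each type (CDS, tRNA, ...) a GFF3 file holds."""
    counts = Counter()
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # everything after this marker is sequence, not annotation
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                counts[fields[2]] += 1  # column 3 holds the feature type
    return counts


if __name__ == "__main__":
    import os

    # Hypothetical path to a Prokka annotation file.
    gff = "results/prokka/genome.gff"
    if os.path.exists(gff):
        for feature_type, n in count_gff_features(gff).most_common():
            print(feature_type, n)
```

In a notebook, these counts give a quick sanity check that the annotation ran as expected before the features are drawn on a circos plot.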
Deliverables
By the end of this workshop, you will have created the following artifacts:
- Snakemake Workflow Files
- A complete and well-documented Snakefile and any config or rule files needed for your workflow
- Example: Snakefile, config.yaml, rules/
- Integrated Python Scripts
- Python scripts for sequence analysis, adapted for use within the Snakemake workflow
- Example: scripts/gc_content.py
- Cluster Submission Script
- A script or command for running your Snakemake workflow on the compute cluster (e.g., with qsub or Snakemake’s cluster integration)
- Example: run_snakemake.qsub
- Workflow Output Results
- Output files generated by the workflow, including sequence statistics and any summary files
- Example: results/summary.md, results/gc_content.txt
- Brief Report
- A short markdown report (1–2 paragraphs) summarizing your workflow design, results, and any challenges encountered. This should be clear enough to share with your PI or collaborators.
- Example: workflow_report.md
- Version-Controlled Repository
- All code and workflow files should be tracked in your git repository and pushed to GitHub Classroom, so that your work is reproducible and easy to share with instructors and collaborators.