Workshop 2 Introduction Slides
BU Bioinformatics Computational Skills Workshop

Scaling Up: Large Genome Analysis on the Cluster

  • You now have a much larger dataset: a full bacterial genome (>100M bp)
  • The file is too big for your laptop! Time to use the compute cluster.
  • With LLM assistance, you’ll learn to adapt your scripts for large files and submit them as jobs to the cluster.

Problem Statement

  • Download a large genome FASTA file from a public database.
  • Adapt your code to efficiently process large files.
  • Submit your analysis as a job to the compute cluster using qsub.
  • Summarize your findings for your PI.
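
One way to "efficiently process large files" is to read the FASTA file line by line instead of loading it into memory all at once. The sketch below is a minimal illustration under stated assumptions, not the workshop's official solution: the function name fasta_stats is hypothetical, all records in the file are pooled into one total, and ambiguous bases such as N count toward length but not toward GC.

```python
def fasta_stats(handle):
    """Stream a FASTA file and return (total_length, gc_fraction).

    Only one line is held in memory at a time, so memory use stays
    small no matter how large the genome file is.
    """
    total = 0  # total number of sequence characters seen
    gc = 0     # number of G or C characters seen
    for line in handle:
        if line.startswith(">"):
            # Header line: marks a new record, carries no sequence.
            continue
        seq = line.strip().upper()  # normalize soft-masked (lowercase) bases
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
    # Guard against empty input to avoid dividing by zero.
    gc_fraction = gc / total if total else 0.0
    return total, gc_fraction
```

To use it, open the downloaded file and pass the handle, for example `with open("genome.fasta") as fh: length, gc = fasta_stats(fh)`, where genome.fasta is a placeholder file name.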

Why Use a Compute Cluster?

  • Large files require more memory and processing power than a laptop can provide
  • Clusters allow parallel, high-throughput analysis
  • Learning to use clusters is essential for modern genomics research

Workshop Workflow: Problem → Prompt → Code → Debug → Result

  • Problem: Define the computational challenge
  • Prompt: Craft an effective LLM prompt
  • Code: Generate and run scalable code
  • Debug: Identify and fix errors (locally and on the cluster)
  • Result: Summarize and interpret findings

Getting Started: Example LLM Prompt

I need to process a large genome FASTA file (>100M bp)
that is too big for my laptop. Please generate Python code to
efficiently compute sequence length and GC content, and provide
an example qsub script to run this analysis on a compute cluster.
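
A response to this prompt will usually include a job script for qsub. As a rough sketch of what to expect, the Python helper below assembles one as text. It assumes a Grid Engine style scheduler that reads "#$" directive lines. The helper name build_job_script, the file names gc_content.py and genome.fasta, and the one-hour runtime are all hypothetical. Directives and required options differ between clusters, so check your cluster's documentation before submitting.

```python
def build_job_script(script="gc_content.py", fasta="genome.fasta",
                     job_name="gc_content", runtime="01:00:00"):
    """Return the text of a Grid Engine style job script (assumed format)."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",       # job name shown in the queue
        "#$ -cwd",                 # run from the submission directory
        f"#$ -l h_rt={runtime}",   # requested wall-clock time limit
        "#$ -j y",                 # merge stderr into the stdout log
        f"python {script} {fasta}",  # the analysis command itself
    ]
    return "\n".join(lines) + "\n"


def write_job_script(path, **kwargs):
    """Write the job script to disk so it can be passed to qsub."""
    with open(path, "w") as out:
        out.write(build_job_script(**kwargs))
```

After calling `write_job_script("run_gc.sh")`, the job would be submitted from the cluster login node with `qsub run_gc.sh`, where run_gc.sh is again a placeholder name.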