Workshop 2 Background Slides
BU Bioinformatics Computational Skills Workshop

Workshop 2 Background Slides

Large Genomes: New Challenges

  • Large genome files can be hundreds of millions of base pairs
  • File size and complexity require efficient code and high-performance computing

Downloading Large Genomic Data

  • Use tools like wget, curl, or programmatic Python to download large files
  • Always verify file integrity (e.g., checksums)
  • Consider storage and transfer limitations

Efficient File Handling in Python

  • Streaming and chunking allow you to process large files without loading them into memory
  • Use generators, file handles, and libraries like BioPython

Introduction to Compute Clusters

  • Clusters provide distributed computing resources
  • Typical workflow: write a job script, submit with qsub, monitor progress
  • Learn basic cluster commands and job submission syntax

Using LLMs for Scaling Up

  • LLMs can help you adapt scripts for efficiency and cluster compatibility
  • Effective prompts include file size, resource needs, and error handling
  • Always review and test generated code before running on the cluster