Workshop 2 Background Slides

BU Bioinformatics Computational Skills Workshop

Large Genomes: New Challenges

Large genomes can span hundreds of millions to billions of base pairs, so sequence files often reach gigabytes in size
File size and complexity require efficient code and high-performance computing
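
A back-of-the-envelope estimate shows why file size matters. The sketch below assumes an uncompressed FASTA file that stores one byte per base with fixed-width sequence lines; the function name, the 60-character line width, and the 3-billion-base example are illustrative assumptions, not figures from these slides.

```python
import math


def estimate_fasta_bytes(n_bases, line_width=60, header_bytes=0):
    """Rough size of an uncompressed FASTA file (hypothetical helper).

    Assumes one byte per base plus one newline byte per sequence line.
    """
    newline_bytes = math.ceil(n_bases / line_width)
    return n_bases + newline_bytes + header_bytes


# Example: a genome of 3 billion bases (roughly human-sized).
size = estimate_fasta_bytes(3_000_000_000)
print(f"Estimated size: {size / 1e9:.2f} GB")
```

Reading a file of that size into a single Python string would need at least that much RAM, which is the motivation for the streaming techniques covered later.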

Downloading Large Genomic Data

Use command-line tools like wget or curl, or a Python script, to download large files
Always verify file integrity (e.g., checksums)
Consider storage and transfer limitations
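
The download and verification steps above can be sketched in Python using only the standard library. This is a minimal sketch, assuming the data provider publishes a SHA-256 checksum (many sites publish MD5 instead, in which case hashlib.md5 would be the swap); the function names, the chunk size, and the URL in the comment are placeholders.

```python
import hashlib
import shutil
import urllib.request

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary illustrative choice


def download_file(url, dest_path, chunk_size=CHUNK_SIZE):
    """Stream a URL to disk in chunks instead of reading it all at once."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Compute a file's SHA-256 digest without loading the file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_sha256):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Hypothetical usage (placeholder URL and checksum):
# download_file("https://example.org/genome.fa.gz", "genome.fa.gz")
# assert verify_checksum("genome.fa.gz", "<checksum from the provider>")
```

Both functions read in fixed-size chunks, so memory use stays roughly constant regardless of file size.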

Efficient File Handling in Python

Streaming and chunking allow you to process large files without loading them entirely into memory
Use generators, file handles, and libraries like Biopython
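
As an illustration of streaming with a generator, the sketch below reads a FASTA file one record at a time and reports GC content per record. It is written in plain Python so it has no dependencies; read_fasta and gc_fraction_per_record are hypothetical names, and the parser assumes a simple, well-formed FASTA file.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs, holding one record in memory at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:  # a file handle yields lines lazily
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction_per_record(path):
    """Yield (header, GC fraction) for each record in the file."""
    for header, seq in read_fasta(path):
        seq = seq.upper()
        gc = seq.count("G") + seq.count("C")
        yield header, (gc / len(seq) if seq else 0.0)


# Hypothetical usage (placeholder file name):
# for name, gc in gc_fraction_per_record("genome.fa"):
#     print(name, round(gc, 3))
```

Biopython offers a comparable iterator through Bio.SeqIO.parse(handle, "fasta"), which yields one SeqRecord at a time. Note that either approach still holds a whole record in memory, so a single very long chromosome may call for processing line by line instead.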

Introduction to Compute Clusters

Clusters provide distributed computing resources
Typical workflow: write a job script, submit it with qsub, and monitor its progress
Learn basic cluster commands and job submission syntax
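
The workflow above can be sketched from Python. This example assumes a Grid Engine style scheduler, where job options are written as "#$" directive lines; PBS-style clusters also use qsub but expect "#PBS" lines with different options. The parallel environment name "omp", the default runtime, and both function names are assumptions, so check your cluster's documentation for the exact syntax.

```python
import subprocess


def build_job_script(job_name, command, runtime="01:00:00", cores=1):
    """Return the text of a batch job script (Grid Engine style, assumed)."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",      # job name shown in the queue
        f"#$ -l h_rt={runtime}",  # hard runtime limit, HH:MM:SS
        "#$ -j y",                # merge stderr into stdout
        "#$ -cwd",                # run from the submission directory
    ]
    if cores > 1:
        # "omp" is a site-specific parallel environment name (assumption).
        lines.append(f"#$ -pe omp {cores}")
    lines.append("")
    lines.append(command)
    return "\n".join(lines) + "\n"


def submit_job(script_path):
    """Submit a script with qsub and return the scheduler's reply."""
    result = subprocess.run(
        ["qsub", script_path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Hypothetical usage (placeholder file names):
# with open("gc_job.sh", "w") as fh:
#     fh.write(build_job_script("gc_content", "python gc.py genome.fa", cores=4))
# print(submit_job("gc_job.sh"))
```

After submission, qstat is the usual command for checking a job's status in the queue.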

Using LLMs for Scaling Up

LLMs can help you adapt scripts for efficiency and cluster compatibility
Effective prompts state the input file size, resource needs (memory, cores, runtime), and error-handling expectations
Always review and test generated code before running on the cluster
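
One way to make prompts consistent is to build them from a template that always includes the details listed above. The helper below is a hypothetical sketch: the function name, parameters, and wording are illustrative assumptions, and the resulting text would be pasted into whichever LLM you use.

```python
def build_scaling_prompt(script_text, file_size_gb, memory_gb, cores,
                         scheduler="qsub"):
    """Assemble a prompt that states file size, resources, and error handling."""
    return (
        "Adapt the Python script below to run efficiently on a compute cluster.\n"
        f"- Input file size: about {file_size_gb} GB, so stream the file "
        "instead of loading it into memory.\n"
        f"- Available resources: {memory_gb} GB RAM and {cores} core(s).\n"
        f"- Jobs are submitted with {scheduler}; include a job script.\n"
        "- Error handling: fail with a clear message if the input file is "
        "missing or malformed.\n"
        "\n"
        "Script:\n"
        f"{script_text}\n"
    )


# Hypothetical usage with a tiny placeholder script:
prompt = build_scaling_prompt("print(open('genome.fa').read())", 3, 16, 4)
print(prompt)
```

Whatever the model returns, run it first on a small test file before submitting it to the cluster.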