ai-for-genomic-science

Chapter 12: Foundation Models for Genomics

Consider a chef who has spent 20 years cooking every cuisine in the world — French, Japanese, Ethiopian, Peruvian. They’ve never cooked Korean food. But when they walk into a Korean kitchen for the first time, they don’t start from zero. They already understand heat, timing, flavor balance, fermentation. They know that doenjang is in the same family as miso. Their first attempt at kimchi jjigae will be far better than someone who has never cooked at all.

This is the core idea behind foundation models: train on a massive, diverse dataset to learn general principles, then specialize for a new task with very little additional data. The chef’s 20 years of broad cooking experience is the pre-training phase. Walking into the Korean kitchen for the first time — and succeeding quickly — is fine-tuning. In genomics, “cooking every cuisine” means pre-training on millions of DNA sequences from across the tree of life. “Walking into a Korean kitchen” means fine-tuning for your specific research question — with perhaps only a hundred labeled examples.

Why does this matter? Because the hardest biological questions are rarely the well-funded ones with thousands of labeled samples. A rare neurological disorder affecting 50 patients worldwide. A previously unstudied cell type from pancreatic islets. A pathogen that no one has sequenced before. In each case, a model trained from scratch would fail — you simply don’t have enough data to teach it genomics from the ground up. But a foundation model already knows genomics. You’re just teaching it the last mile.

The deep learning models we’ve encountered so far — DeepSEA, Basenji, Enformer — are powerful but face a critical limitation: every new biological question requires new labeled data and new training. Foundation models break this cycle. They learn the grammar of the genome once, from massive unlabeled sequence data, and transfer that grammar wherever it’s needed. This chapter is about how they do it.

The Biological Challenge

The deep learning models we’ve encountered so far—DeepSEA, Basenji, Enformer—are powerful but face a critical limitation: they require task-specific training data. DeepSEA needs chromatin accessibility measurements. Basenji needs expression data across tissues. Each model must be trained from scratch on labeled examples for its specific task.

This creates several problems for biological research:

The Data Scarcity Problem: Many important biological questions involve rare cell types, rare conditions, or novel experimental contexts where labeled training data is limited or nonexistent. You might have RNA-seq from 30 patients with a rare disorder, or ATAC-seq from 100 cells of a newly discovered cell type. Traditional supervised learning fails here—you can’t train a deep neural network with 30 examples.

The Task-Specific Problem: Every new biological question requires collecting new training data and training a new model. Want to predict enhancers? Train a model. Want to predict promoters? Train another model. Want to predict splice sites? Train yet another model. Each task requires thousands of labeled examples and days of training time.

The Generalization Problem: Models trained on one species, tissue, or condition often fail when applied to different contexts. A model trained on human heart tissue might not work for brain tissue. A model trained on reference genomes might not handle patient-specific variants well.

These challenges highlight a fundamental inefficiency: we’re asking each model to learn the basic grammar of genomic sequences from scratch, over and over again. It’s like teaching someone to read Shakespeare by only showing them Shakespeare—they’ll learn Shakespeare specifically, but not the general rules of English that could help them read anything.

The experimental alternative—generating comprehensive labeled data for every possible biological question—would cost billions of dollars and take decades. A model trained to predict chromatin states in one cell type can’t simply be asked “What about this other cell type?” without retraining.

What we need is a different approach: models that learn general genomic principles from massive unlabeled data, then transfer that knowledge to specific tasks with minimal additional training. This is the promise of foundation models.

Learning Objectives

Understand the concept of transfer learning and why it’s crucial for genomics with limited labeled data
Explain the two-stage process of pretraining and fine-tuning in foundation models
Describe what zero-shot and few-shot learning mean and how they address data scarcity
Recognize the difference between task-specific models and foundation models in genomics
Identify biological scenarios where foundation models offer advantages over training from scratch
Understand how self-supervised learning enables training on unlabeled genomic sequences
Evaluate the trade-offs between model size, training cost, and transfer learning performance

12.1 Introduction: The Foundation Model Revolution

12.1.1 What Is a Foundation Model?

The term “foundation model” emerged in 2021 to describe a new paradigm in artificial intelligence. A foundation model is a large-scale model trained on broad, unlabeled data that can be adapted to a wide range of downstream tasks with minimal task-specific training.

The key characteristics are:

Scale: Foundation models are large—often billions of parameters—and trained on massive datasets
Generality: They learn general patterns applicable to many tasks, not just one specific task
Adaptability: They can be fine-tuned or even used directly for new tasks with little or no additional training
Self-supervised learning: They typically use clever training objectives that don’t require manual labels

The foundation model concept originated in natural language processing (NLP) with models like BERT and GPT. These models are trained on billions of words of text to understand language structure, then can be adapted to specific tasks like sentiment analysis, question answering, or translation with relatively little task-specific data.

12.1.2 Why Genomics Needs Foundation Models

Genomics is actually an ideal domain for foundation models, perhaps even more so than natural language:

Abundant Unlabeled Data: There are billions of DNA sequences available—complete genomes from thousands of species, metagenomic data from environments, and sequence databases like GenBank containing trillions of base pairs. This data is unlabeled in the sense that we don’t have ground-truth annotations for every function, but it contains patterns.

Conserved Grammar: Just as all English text follows grammatical rules, all genomic sequences follow biological rules—promoter motifs, splice signals, regulatory grammars, codon usage patterns. A model that learns these rules could apply them across contexts.

Transfer Across Tasks: The same sequence features relevant for predicting enhancers might also be relevant for predicting transcription factor binding, chromatin accessibility, or evolutionary constraint. Knowledge should transfer.

Data Scarcity for Specific Tasks: While we have abundant sequence data, we often have limited labeled data for specific biological questions—especially for rare cell types, rare conditions, or novel organisms.

Cost of Experimental Validation: Generating labeled training data in genomics is expensive. A single ChIP-seq experiment costs $1,000-5,000. Whole-genome sequencing with deep phenotyping for thousands of individuals costs millions. If a foundation model could reduce the labeled data requirement from 10,000 examples to 100 examples, the cost savings would be enormous.

12.1.3 From Task-Specific to Foundation Models: A Paradigm Shift

Let’s contrast the old paradigm with the new:

Old Paradigm (Task-Specific Models):

Identify biological question (e.g., “predict enhancers in liver”)
Collect labeled training data (e.g., 10,000 regions with ChIP-seq labels)
Design neural network architecture
Train model from random initialization on your specific data
Model works for your task but can’t easily transfer to other tasks

New Paradigm (Foundation Models):

Pretraining: Train one large model on billions of unlabeled genomic sequences to learn general genomic patterns
Fine-tuning: For your specific question, fine-tune the pretrained model on a small amount of task-specific labeled data
Model works for your task AND can be adapted to many other tasks with minimal retraining

The key insight: learning general genomic patterns from unlabeled data is the expensive part. Once learned, adapting to specific tasks is relatively cheap.

This is analogous to how humans learn. You don’t learn to read from scratch every time you encounter a new book. You learned general reading skills once, and now you can read anything—scientific papers, novels, recipes—with minimal adjustment.

12.2 Transfer Learning: The Core Concept

12.2.1 What Is Transfer Learning?

Transfer learning is the ability of a model to apply knowledge learned from one task to a different but related task. Instead of starting from a blank slate (random parameters), you start from a model that already knows something useful.

The biological analogy: imagine you’re a cell biologist who studies yeast. You’ve spent years learning molecular biology techniques, experimental design, and data analysis. Now you switch to studying human cells. You don’t start from zero—much of your knowledge transfers. You understand Western blots, PCR, microscopy, statistical analysis. You only need to learn the specifics of human cell culture and human-specific biology. Your “pretrained” expertise transfers.

In neural networks, transfer learning works similarly:

Source Task (Pretraining): The model learns general features from a large dataset. For genomics, this might be “predict the next nucleotide in a sequence” trained on 3 billion base pairs of genomic DNA.

Target Task (Fine-tuning): The model adapts to a specific task using a smaller dataset. For example, “classify whether a sequence is an enhancer” using 5,000 labeled examples.

The pretrained model’s parameters encode general genomic knowledge—motif patterns, dinucleotide frequencies, splice signals, regulatory grammars. This knowledge helps even on tasks the model wasn’t explicitly trained for.

12.2.2 Why Transfer Learning Works in Genomics

Transfer learning is effective when the source and target tasks share underlying structure. Genomics has abundant shared structure:

Shared Motifs: Transcription factor binding motifs like TATA boxes, E-boxes, and CTCF sites appear in many contexts. A model that learns to recognize these motifs when predicting chromatin accessibility can transfer that knowledge to predicting transcription factor binding.

Shared Regulatory Grammar: Enhancers in liver and enhancers in brain have different specific sequences, but they follow similar structural principles—transcription factor binding site clustering, appropriate distances from promoters, GC content patterns.

Conserved Evolutionary Patterns: Functionally important regions tend to be conserved across species. A model trained to recognize conservation patterns in one context can apply that to other contexts.

Shared Sequence Context: The nucleotide patterns surrounding splice sites, start codons, and poly-A signals are similar across genes and tissues.

12.2.3 Mathematical View of Transfer Learning

Let’s formalize this intuition. Suppose we have:

A source task with abundant data: $D_{\text{source}} = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}$ where $n$ is large
A target task with limited data: $D_{\text{target}} = {(x’_1, y’_1), (x’_2, y’_2), …, (x’_m, y’_m)}$ where $m$ is small

A neural network can be thought of as two components:

\[f(x) = g(h(x))\]

Where:

$h(x)$ is the feature extractor (early layers) that transforms input $x$ into learned features
$g()$ is the task-specific head (final layers) that maps features to predictions

Standard Training (No Transfer):

Initialize all parameters randomly
Train both $h()$ and $g()$ on $D_{\text{target}}$
Problem: with small $m$, overfitting is severe

Transfer Learning:

Pretraining: Train $h()$ on $D_{\text{source}}$ to learn general features
Fine-tuning: Initialize with pretrained $h()$, train on $D_{\text{target}}$
Benefit: $h()$ already encodes useful features, so even small $m$ can train effective $g()$

The pretrained $h()$ learns features like “transcription factor motif detected at position 47” or “high GC content in this window”—features useful for many tasks, not just the source task.

12.2.4 Types of Transfer Learning

Feature Extraction (Frozen Features):

Use pretrained $h()$ as-is, only train task-specific head $g()$
Fast, prevents overfitting
Works when source and target tasks are very similar

Fine-tuning (Adapted Features):

Start with pretrained $h()$, allow parameters to update during training on target task
Slower, but allows adaptation to target task specifics
Works better when tasks differ somewhat

Multi-task Learning:

Train on multiple related tasks simultaneously
Forces model to learn features useful across tasks
Common in genomics where you might predict multiple chromatin marks together

12.3 Self-Supervised Learning: Training Without Labels

12.3.1 The Labeling Problem

Foundation models need to learn from massive datasets, but manually labeling billions of sequences is impossible. Even if you had the funding, what would you label them with? Most sequences don’t have functional annotations.

This is where self-supervised learning comes in: using the data itself to create training signals, without needing human-provided labels.

The key insight: you can create a supervised learning problem from unlabeled data by hiding parts of the data and asking the model to predict them.

12.3.2 Language Modeling as Self-Supervised Learning

The most successful self-supervised approach is language modeling: predicting masked or future tokens.

For Text (BERT approach):

Original sentence: “The ribosome translates mRNA into protein”
Masked sentence: “The ribosome [MASK] mRNA into protein”
Task: predict the masked word (“translates”)

The model must understand grammar, context, and semantics to predict masked words correctly. By training on billions of sentences, models learn language structure without anyone manually labeling what each sentence means.

For DNA Sequences (Genomic Language Models):

Original sequence: ATGCGATTACGATCGTACGAT
Masked sequence: ATGCGA[MASK][MASK]ACGATCGTACGAT
Task: predict the masked nucleotides

To predict masked nucleotides, the model must learn:

Motif patterns (if ATG is masked, it might be a start codon)
Dinucleotide frequencies (CpG islands have different patterns)
Long-range dependencies (distant regulatory elements)
Sequence constraints (splice donor sites follow GT-AG rule)

This self-supervised task doesn’t require any experimental data—just raw sequences. Yet it forces the model to learn genomic grammar.

12.3.3 Self-Supervised Learning Objectives for Genomics

Several self-supervised objectives have proven effective:

Masked Language Modeling (MLM):

Randomly mask 15% of nucleotides
Model predicts masked nucleotides
Forces learning of local and global sequence context
Used by: DNABERT, Nucleotide Transformer

Next Token Prediction (Autoregressive):

Given sequence prefix ATGCGA, predict next nucleotide
Model learns sequential dependencies
Used by: GPT-style models

Sequence Order Prediction:

Shuffle or scramble sequences
Model predicts whether sequence is in correct order
Forces learning of positional relationships

Contrastive Learning:

Learn to distinguish similar sequences from dissimilar ones
Model learns what makes sequences functionally similar
Can incorporate evolutionary information

12.3.4 Why Self-Supervised Learning Works

The effectiveness of self-supervised learning might seem magical—how can predicting masked nucleotides teach a model about enhancers, splice sites, or gene expression?

The key is that genomic sequences have structure, and that structure is relevant to function:

Local Structure: To predict masked nucleotides in motifs, the model must learn motif patterns. A model that learns TATA[MASK][MASK] is often TATAAA (TATA box) has learned something about promoters.

Context Dependencies: To predict a nucleotide, the model must consider surrounding context. This forces learning of regulatory grammar—transcription factor binding sites cluster near enhancers, splice donor sites appear in specific contexts.

Evolutionary Constraints: Functionally important sequences are conserved. A model trained on sequences from multiple species will learn that certain patterns are preserved, indicating functional importance.

Statistical Patterns: Gene-rich regions, repeat elements, CpG islands, and other large-scale genomic features have characteristic sequence statistics. Learning to predict nucleotides captures these patterns.

The result: a model pretrained to predict masked nucleotides develops internal representations that encode functional properties of genomic sequences, even though those properties were never explicitly labeled during training.

[Optional: The Math]

Math Box: The Masked Language Modeling Objective

Let’s formalize the masked language modeling objective mathematically.

Given a genomic sequence $\mathbf{s} = (s_1, s_2, …, s_L)$ where each $s_i \in {A, C, G, T}$ and $L$ is the sequence length:

Step 1: Masking

Randomly select positions $M \subset {1, 2, …, L}$ to mask (typically 15% of positions)

Create masked sequence $\tilde{\mathbf{s}}$ where $\tilde{s}_i = [MASK]$ if $i \in M$, otherwise $\tilde{s}_i = s_i$

Step 2: Prediction

Model $f_\theta$ (parameterized by $\theta$) takes masked sequence and outputs probability distribution over nucleotides for each position: $P_\theta(s_i | \tilde{\mathbf{s}}) = \text{softmax}(f_\theta(\tilde{\mathbf{s}})_i)$

Step 3: Loss Function

Objective is to maximize probability of correct nucleotides at masked positions: $\mathcal{L}(\theta) = -\sum_{i \in M} \log P_\theta(s_i | \tilde{\mathbf{s}})$

This is negative log-likelihood (cross-entropy loss). We want to maximize log-probability, which is equivalent to minimizing negative log-probability.

Biological Interpretation:

The model learns to encode each position’s context (surrounding nucleotides) into a representation

This representation must contain information about motifs, patterns, and constraints

These learned representations transfer to downstream tasks like enhancer prediction

Example Calculation: Suppose at a masked position, the true nucleotide is A, and the model outputs probabilities:

$P(A) = 0.7$

$P(C) = 0.1$

$P(G) = 0.1$

$P(T) = 0.1$

Loss at this position: $-\log(0.7) = 0.357$

If the model is uncertain and outputs $P(A) = 0.25$ (random guessing), loss would be $-\log(0.25) = 1.386$—much higher penalty.

Over billions of sequences, minimizing this loss forces the model to learn predictive patterns in genomic sequences.

12.4 Pretraining and Fine-Tuning: The Two-Stage Paradigm

12.4.1 Stage 1: Pretraining

Pretraining is the computationally expensive stage where the model learns general genomic knowledge from massive unlabeled data.

Pretraining Data:

Entire reference genomes (human, mouse, hundreds of other species)
Total size: often billions to hundreds of billions of nucleotides
No experimental labels required—just sequence

Pretraining Task:

Self-supervised objective (typically masked language modeling)
Model learns to predict masked nucleotides
Requires days to weeks on powerful GPUs or TPUs

Pretraining Outcome:

A model with billions of parameters encoding genomic patterns
Model hasn’t seen your specific task (e.g., enhancer prediction)
But has learned features useful for many genomic tasks

Biological Analogy: Pretraining is like learning to read and write in your native language. You read thousands of books, articles, and documents. You haven’t specifically trained for writing scientific papers, but you’ve learned grammar, vocabulary, and structure that will help with any writing task.

Biological Analogy: Like a medical residency program: after years of seeing diverse patients and building broad clinical experience, you then choose a specialty. The model reads millions of unlabeled sequences, developing general biological intuition.

12.4.2 Stage 2: Fine-Tuning

Fine-tuning is the task-specific stage where the pretrained model adapts to your biological question using a small amount of labeled data.

Fine-tuning Data:

Task-specific labeled examples
Can be small: hundreds to thousands of examples (versus millions needed to train from scratch)
Examples: sequences labeled as enhancers/non-enhancers, genes with expression measurements, variants with functional annotations

Fine-tuning Process:

Start with pretrained model parameters
Add task-specific output layer (e.g., binary classifier for enhancer prediction)
Train on labeled data, allowing parameters to update
Use lower learning rate than pretraining (don’t want to “forget” pretrained knowledge)

Fine-tuning Outcome:

A model adapted to your specific task
Fast: hours instead of days
Data-efficient: works with limited labeled examples

Biological Analogy: Fine-tuning is like taking a general biology course (pretraining) then specializing in neuroscience (fine-tuning). You don’t relearn cell biology from scratch—you build on foundation and add specialization.

Biological Analogy: Like choosing a residency specialty: building on foundational knowledge, you can specialize for a specific task (e.g., variant effect prediction) with relatively few labeled examples.

12.4.3 Fine-Tuning Strategies

There are several approaches to fine-tuning, trading off between computation and adaptation:

Full Fine-Tuning:

Update all model parameters during fine-tuning
Most flexible, best performance
Slower, requires more data to prevent overfitting
Use when you have moderate amounts of labeled data (thousands of examples)

Feature Extraction (Frozen Backbone):

Freeze pretrained parameters, only train final task-specific layers
Fastest, most data-efficient
Less flexible—can’t adapt pretrained features
Use when you have very little labeled data (hundreds of examples)

Partial Fine-Tuning:

Freeze early layers, fine-tune later layers
Middle ground between above approaches
Early layers learn general features (nucleotide patterns), later layers learn task-specific combinations

Low-Rank Adaptation (LoRA):

Keep pretrained parameters frozen
Add small trainable “adapter” modules
Very parameter-efficient
Emerging approach for very large foundation models

12.4.4 Case Study: Enhancer Prediction with Fine-Tuning

Let’s walk through a concrete example.

Problem: Predict enhancers in pancreatic beta cells. You have ChIP-seq data for the enhancer-associated histone mark H3K27ac in beta cells, giving you 3,000 positive examples (enhancers) and 10,000 negative examples (non-enhancers).

Approach 1: Training from Scratch

Initialize random neural network
Train on your 13,000 examples
Results: moderate performance (~75% accuracy)
Problem: model hasn’t seen enough data to learn general sequence patterns

Approach 2: Using Pretrained Foundation Model

Start with model pretrained on 100 billion nucleotides
Pretrained model already recognizes transcription factor motifs, regulatory grammar
Fine-tune on your 13,000 examples
Results: high performance (~88% accuracy)
Why better: model leverages general enhancer patterns learned during pretraining, only needs to learn beta-cell specifics

Actual Numbers from Research: Studies comparing these approaches on ENCODE enhancer prediction tasks found:

Training from scratch with 10,000 examples: 72% accuracy
Fine-tuning pretrained model with 10,000 examples: 86% accuracy
Fine-tuning with only 1,000 examples: 82% accuracy (better than training from scratch with 10x more data!)

This demonstrates the power of transfer learning: pretrained knowledge dramatically reduces labeled data requirements.

12.5 Zero-Shot and Few-Shot Learning

12.5.1 The Few-Shot Learning Spectrum

Fine-tuning requires some labeled examples. But what if you have almost no labeled data? Foundation models enable a spectrum of learning paradigms:

Traditional Supervised Learning:

Training examples: 10,000 - 1,000,000+
How it works: learn from scratch
Genomics example: predicting chromatin states with comprehensive ENCODE data

Fine-Tuning:

Training examples: 100 - 10,000
How it works: adapt pretrained model with moderate labeled data
Genomics example: predicting enhancers in a specific cell type

Few-Shot Learning:

Training examples: 5 - 100
How it works: adapt with minimal examples, or learn from examples at inference time
Genomics example: predicting binding sites for a transcription factor with only 20 known sites

One-Shot Learning:

Training examples: 1
How it works: generalize from a single example
Genomics example: identifying similar sequences to one characterized regulatory element

Zero-Shot Learning:

Training examples: 0
How it works: generalize to new tasks without any task-specific training
Genomics example: predicting function of sequences for which no experimental data exists

12.5.2 Zero-Shot Learning: Using Models Without Fine-Tuning

Biological Analogy: Like a trained immunologist recognizing a pathogen they have never seen before. Prior knowledge generalizes to new situations.

Zero-shot learning is the ability to perform a task without any task-specific training examples. This might seem impossible—how can a model do something it was never trained to do?

The key: the pretrained model’s internal representations encode genomic properties. Even without explicit training on your task, the model might have learned relevant features during pretraining.

How It Works:

Extract embeddings (vector representations) from pretrained model
Use embeddings directly for downstream tasks
Embeddings capture sequence properties relevant to many tasks

Example: Predicting Regulatory Function

Suppose you want to predict whether a sequence is a promoter, enhancer, or silencer, but you have no labeled training data.

With a pretrained foundation model:

Extract embeddings for sequences from known promoters, enhancers, silencers
These form clusters in embedding space (similar functions have similar embeddings)
For a new unknown sequence, extract embedding and see which cluster it’s nearest to
Predict the function based on nearest neighbors

This works because the model learned during pretraining that promoters have certain sequence characteristics (TATA boxes, CpG islands, proximity to transcription start sites), enhancers have others (TF motif clusters, H3K27ac association), etc. Even though it was never explicitly trained to classify regulatory elements, it learned features that distinguish them.

Limitations:

Performance is typically lower than fine-tuned models
Works best when zero-shot task is similar to patterns seen during pretraining
Can’t learn truly novel task-specific patterns

When to Use:

Exploring new hypotheses without labeled data
Quick preliminary analysis
Comparing sequences across contexts where labels don’t transfer
Generating candidate predictions to guide experimental validation

12.5.3 Few-Shot Learning: Learning from Minimal Examples

Few-shot learning extends zero-shot by allowing a handful of examples. The goal is to adapt to a new task with 5-100 examples instead of thousands.

Approach 1: Fine-Tuning with Regularization

Fine-tune on small labeled set
Use strong regularization to prevent overfitting
Lower learning rate
Early stopping

Approach 2: Metric Learning

Learn to measure sequence similarity in embedding space
Given a few positive examples, find similar sequences
Doesn’t require updating model parameters

Approach 3: Prompt-Based Learning

Provide examples as “prompts” to guide model behavior
Model sees examples at inference time, not during training
Inspired by how large language models like GPT work

Example: Discovering Binding Sites with Few Examples

Suppose you’re studying a transcription factor with only 10 known binding sites (from ChIP-seq peaks).

Traditional approach: Can’t train a deep learning model with 10 examples—would severely overfit.

Few-shot approach with foundation model:

Extract embeddings from pretrained model for your 10 known sites
Scan genome, extract embeddings for all candidate sites
Rank candidates by similarity to your 10 examples in embedding space
Top-ranked candidates are likely binding sites

Research has shown this can achieve ~70% recall at 5% false positive rate with just 10 examples—far better than motif-based methods.

12.5.4 Case Study: Zero-Shot Variant Effect Prediction

Let’s examine a practical case study from recent research.

Problem: Predict which genomic variants affect gene splicing. Standard approach requires collecting thousands of variants with experimentally validated splicing outcomes—expensive and time-consuming.

Zero-Shot Approach Using Pretrained Model:

Researchers used a genomic foundation model (similar to models we’ll discuss in Chapters 13-14) and asked: can embeddings predict splicing effects without any splicing-specific training?

Method:

Extract embeddings for reference sequence
Extract embeddings for variant sequence (reference with variant introduced)
Calculate embedding difference (L2 distance)
Hypothesis: larger embedding differences indicate functional impact

Results:

Zero-shot approach achieved 0.78 AUC for predicting splice-disrupting variants
Compare to SpliceAI (trained specifically on splice data): 0.95 AUC
Zero-shot is less accurate but requires NO labeled splice data
Useful for quick screening before experimental validation

Key Insight: The model learned during pretraining that certain sequence patterns near exon-intron boundaries are important. When variants disrupt these patterns, embeddings change substantially. The model never saw splice labels during training, but learned features relevant to splicing.

Practical Application:

Screen 1000 variants in 10 minutes (zero-shot)
Select top 50 candidates
Validate experimentally ($$)
Much more efficient than validating all 1000

12.6 The Architecture and Training of Foundation Models

12.6.1 Model Architecture Choices

Foundation models for genomics are typically based on transformer architectures (which we introduced in Chapter 10). Key design decisions:

Model Size:

Small models: 10-100 million parameters
Medium models: 100-500 million parameters
Large models: 1-10 billion parameters
Larger models generally learn better representations but require more compute

Context Length:

How many nucleotides the model can process at once
Critical for genomics: regulatory elements can be far from genes they regulate
Standard: 512-4,096 bp
Long-range models: up to 100,000 bp or more
Longer context enables learning long-range dependencies

Tokenization:

How to split sequences into tokens
Single nucleotide: each A/C/G/T is a token
k-mers: overlapping or non-overlapping k-mer tokens (e.g., “ATG”, “TGC”, “GCA”…)
Byte-pair encoding: learned tokenization
Choice affects what patterns model can learn

Positional Encoding:

How to encode nucleotide positions
Absolute: each position has unique encoding
Relative: encode distances between positions
Important because position matters in genomics (promoters are upstream of genes)

12.6.2 Pretraining Data and Compute Requirements

Training a genomic foundation model requires substantial resources:

Data Scale:

Human genome: 3 billion bp
Multi-species: 100+ billion bp (including mouse, fly, worm, yeast, bacteria)
Metagenomic databases: trillions of bp
More diverse data generally improves transfer learning

Computational Requirements:

Small model on human genome: ~100 GPU-hours (~$500 on cloud)
Large model on multi-species data: ~100,000 GPU-hours (~$500,000 on cloud)
Pretraining is done once, then model is shared

Training Time:

Small model: days
Large model: weeks to months
Requires specialized hardware (GPUs or TPUs)

Data Preprocessing:

Quality control: remove low-quality sequences
Repeat masking: handle repetitive elements
Data augmentation: reverse complement, random crops
Creating training batches: millions of sequence chunks

12.6.3 Evaluation of Foundation Models

How do we know if a foundation model is good? Unlike task-specific models, we can’t evaluate on a single task.

Pretraining Metrics:

Masked nucleotide prediction accuracy
Perplexity (how surprised model is by sequences)
These measure if model learned to predict sequences, but not if it learned useful features

Downstream Task Performance:

The real test: does fine-tuning work well?
Evaluate on multiple diverse tasks: enhancer prediction, splice site prediction, variant effect prediction, etc.
Good foundation model performs well across many tasks

Probing Tasks:

Design specific tasks to test if model learned particular features
Example: can embeddings predict evolutionary conservation?
Example: can embeddings cluster sequences by function?

Data Efficiency:

How many labeled examples needed for fine-tuning?
Better foundation models need fewer examples
Measure accuracy vs. training set size curves

12.6.4 The Scaling Laws of Foundation Models

Research on language models has revealed “scaling laws”: predictable relationships between model size, data size, and performance.

Scaling Laws for Genomics:

Similar patterns emerge in genomic foundation models:

Larger models → better transfer learning (but with diminishing returns)
- 100M parameters: baseline
- 1B parameters: significant improvement
- 10B parameters: marginal additional improvement
More diverse data → better generalization
- Training on human genome only: good for human-specific tasks
- Adding model organisms: better cross-species transfer
- Adding bacteria/archaea: learns more universal genomic principles
Longer training → better, but eventually plateaus
- Optimal training is not “until perfect”—model learns general patterns relatively quickly
- Overtraining on pretraining task can reduce transfer ability

Practical Implications:

For most research labs, using existing pretrained models is more practical than training from scratch
Model developers should focus on good data diversity and moderate model sizes
Community benefits from a few well-trained foundation models shared widely

12.7 Transfer Learning Across Species and Conditions

12.7.1 Cross-Species Transfer Learning

One of the most exciting applications of genomic foundation models is transferring knowledge across species.

The Challenge:

You want to predict gene expression in zebrafish
But most labeled data exists for human and mouse
Can a model trained on mammals help with zebrafish?

Transfer Learning Approach:

Pretrain on multi-species data (human, mouse, zebrafish, fly, etc.)
- Model learns sequence patterns conserved across evolution
- Learns which patterns are universal vs. species-specific
Fine-tune on target species with limited data
- Model adapts to species-specific patterns
- Leverages conserved patterns learned during pretraining

Results from Research: Studies show impressive cross-species transfer:

Model pretrained on vertebrates, fine-tuned on zebrafish with 500 examples: 75% accuracy
Model trained only on zebrafish with 500 examples: 55% accuracy
Cross-species pretraining provides +20% accuracy boost

Why It Works:

Many regulatory mechanisms are conserved (transcription factor binding, splice signals)
Model learns these conserved patterns during pretraining
Only needs to learn species-specific details during fine-tuning

12.7.2 Cross-Tissue and Cross-Condition Transfer

Similar benefits apply within a species across tissues or conditions:

Example: Tissue-Specific Enhancers

Pretrain on enhancer data from multiple tissues (liver, brain, heart, kidney)
Model learns general enhancer characteristics (TF motif clustering, distance from promoters)
Fine-tune on new tissue (pancreas) with limited data
Works better than training only on pancreas data

Example: Condition-Specific Expression

Pretrain on gene expression across many conditions (developmental stages, drug treatments, knockouts)
Model learns gene regulatory relationships
Fine-tune on specific condition (rare disorder) with few samples
Can predict expression changes from limited patient data

12.7.3 Domain Adaptation: Handling Distribution Shift

A challenge in transfer learning is domain shift: when test data differs systematically from training data.

Examples in Genomics:

Model trained on Europeans applied to African populations (different allele frequencies)
Model trained on reference genomes applied to cancer genomes (somatic mutations)
Model trained on bulk RNA-seq applied to single-cell RNA-seq (technical differences)

Approaches to Handle Domain Shift:

Domain-Adversarial Training:

Train model to make good predictions while being unable to distinguish source vs. target domain
Forces learning domain-invariant features

Unsupervised Domain Adaptation:

Use unlabeled data from target domain during fine-tuning
Helps model adapt to target domain statistics

Multi-Task Learning:

Train on multiple domains simultaneously
Learns features that generalize across domains

Calibration:

Adjust prediction thresholds based on target domain
Simple but often effective

12.8 Practical Considerations for Using Foundation Models

12.8.1 When to Use Foundation Models

Foundation models aren’t always the best choice. Use them when:

✅ Good Use Cases:

Limited labeled data for your task (< 10,000 examples)
New biological context (rare cell type, rare condition, new species)
Multiple related tasks (transfer learning amortizes pretraining cost)
Need quick results (fine-tuning is faster than training from scratch)
Exploratory analysis (zero-shot embeddings for initial screening)

❌ Less Suitable Cases:

Abundant task-specific data (>100,000 examples)—might not need transfer learning
Task very different from anything in pretraining data (transfer may not help)
Strict computational constraints (foundation models are large)
Need fully interpretable models (complex models harder to interpret)

12.8.2 Choosing a Pretrained Model

Many genomic foundation models are now publicly available. How to choose?

Consider:

Pretraining Data:
- What species/sequences was model trained on?
- Choose model pretrained on data similar to your target
Model Size:
- Larger is usually better but slower
- Balance performance vs. computational resources
Context Length:
- Can model process sequences as long as you need?
- Longer context needed for long-range regulatory predictions
Task Performance:
- Check published benchmarks on tasks similar to yours
- Look for papers evaluating multiple models
Availability and Documentation:
- Is model easy to download and use?
- Are there code examples and tutorials?

12.8.3 Fine-Tuning Best Practices

To get best results when fine-tuning:

Data Preparation:

Balance positive and negative examples
Use data augmentation (reverse complement for DNA)
Split data properly (train/validation/test)

Hyperparameter Selection:

Lower learning rate than pretraining (typically 10-100x smaller)
Smaller batch sizes often work better
Monitor validation performance to prevent overfitting

Regularization:

Use dropout (especially with small datasets)
Weight decay (L2 regularization)
Early stopping based on validation performance

Gradual Unfreezing:

Start by only training final layers
Gradually unfreeze earlier layers
Helps prevent catastrophic forgetting

Task-Specific Architecture:

Design appropriate output head for your task
Classification: softmax layer
Regression: linear layer
Sequence labeling: per-position predictions

12.8.4 Interpreting Foundation Model Predictions

Foundation models are complex and can be hard to interpret. Approaches to understand them:

Attention Visualization:

Examine which input positions model attends to
Reveals which nucleotides influence predictions

Gradient-Based Methods:

Compute gradient of prediction with respect to input
Shows which nucleotides are most important

In Silico Mutagenesis:

Systematically mutate sequence positions
See which mutations change predictions most
Reveals functional motifs

Embedding Analysis:

Project embeddings to 2D (t-SNE, UMAP)
Visualize how sequences cluster
Helps understand what model learned

Probing Classifiers:

Train simple classifiers on embeddings
Test if embeddings capture specific properties (GC content, conservation, etc.)

12.9 Current Limitations and Future Directions

12.9.1 Current Limitations

Despite their promise, genomic foundation models face challenges:

Limited Context Length:

Most models process 1,000-10,000 bp
Genomic regulation can span megabases
Long-range interactions (enhancer-gene pairs 1Mb apart) are difficult to capture

Computational Cost:

Large models require significant compute for inference
Limits use in resource-constrained settings
Processing whole genomes is slow

Data Bias:

Most training data from well-studied species (human, mouse)
Model organisms over-represented
Reference genomes may not represent population diversity

Interpretability:

Complex models are “black boxes”
Hard to understand why predictions are made
Difficult to extract biological insights

Generalization to Rare Events:

Pretraining on common sequences
May not capture rare but important events (e.g., rare variants, rare cell states)

12.9.2 Emerging Solutions

Research is addressing these limitations:

Longer Context:

New architectures (e.g., Mamba, discussed in Chapter 14) can process 100,000+ bp efficiently
Enables modeling long-range regulatory interactions

Efficient Models:

Distillation: creating smaller models that mimic large ones
Quantization: reducing numerical precision
Sparse models: only activating relevant parameters

Diverse Training Data:

Including more species, populations, and conditions
Metagenomic data for microbial diversity
Ancient DNA for evolutionary perspective

Interpretable Architectures:

Designing models with interpretable components
Attention mechanisms that highlight important regions
Modular architectures where each module has clear function

Active Learning:

Foundation models suggest which experiments to do next
Iterative improvement: experiment → update model → new predictions
Reduces total experimental cost

12.9.3 Future Directions

Exciting developments on the horizon:

Multi-Modal Foundation Models:

Integrate DNA sequence with other data types
Sequence + chromatin accessibility + 3D structure + expression
Holistic understanding of genome function

Foundation Models for Protein Sequences:

Already successful (e.g., ESM, ProteinBERT)
Predict structure, function, interactions from sequence
Chapter 15 will explore single-cell RNA-seq foundation models

Genome-Scale Simulations:

Foundation models as “genome simulators”
Predict phenotype from genotype
Test hypotheses in silico before experiments

Personalized Genomics:

Foundation models adapted to individual genomes
Predict disease risk, drug response for specific patients
Precision medicine applications

Automated Scientific Discovery:

Foundation models generate hypotheses
Design experiments to test hypotheses
Close the loop between computation and experimentation

Summary

Key Takeaways:

Foundation models are large-scale models pretrained on broad data that can be adapted to many downstream tasks, addressing the challenge of limited labeled data in genomics
Transfer learning enables applying knowledge from one task to another, dramatically reducing the labeled data requirement from tens of thousands to hundreds of examples
Self-supervised learning trains models on unlabeled genomic sequences using objectives like masked nucleotide prediction, learning genomic grammar without experimental data
Pretraining and fine-tuning is a two-stage paradigm: first learn general genomic patterns from massive unlabeled data, then adapt to specific tasks with small labeled datasets
Zero-shot and few-shot learning extend foundation models to scenarios with minimal or no labeled data, using learned embeddings for prediction without task-specific training
Cross-species and cross-condition transfer allows knowledge learned in one biological context (e.g., human liver) to improve predictions in another (e.g., mouse pancreas) by leveraging conserved genomic principles
Foundation models work because genomic sequences have shared structure—transcription factor motifs, regulatory grammar, evolutionary conservation—that transfers across contexts and tasks

📖 Key Terms

| Term | Definition | |------|-----------| | **Domain adaptation** | Adjusting a model to perform well on a new domain (e.g., different species or tissue) that differs from the training domain | | **Embedding** | Vector representation of a sequence produced by a neural network, encoding its properties in continuous space | | **Feature extraction** | Using a pretrained model's learned representations without updating its parameters, only training task-specific output layers | | **Few-shot learning** | Learning to perform a task with very limited labeled examples (typically 5-100), leveraging pretrained knowledge | | **Fine-tuning** | Adapting a pretrained model to a specific task by training on task-specific labeled data with small learning rate adjustments | | **Foundation model** | Large-scale model trained on broad data that can be adapted to many downstream tasks through transfer learning | | **Masked language modeling (MLM)** | Self-supervised learning objective where random tokens are masked and the model learns to predict them from context | | **One-shot learning** | Extreme case of few-shot learning where only a single example is available for a new task | | **Pretraining** | Initial training phase where a model learns general patterns from large-scale unlabeled data before being adapted to specific tasks | | **Self-supervised learning** | Learning from unlabeled data by creating training signals from the data itself (e.g., predicting masked parts) | | **Transfer learning** | Applying knowledge learned from one task or domain to improve performance on a different but related task | | **Zero-shot learning** | Performing a task without any task-specific training examples, relying only on pretrained knowledge and embeddings |

Conceptual Questions

Explain why a genomic foundation model pretrained on human sequences might help predict enhancers in mouse liver, even though it never saw mouse liver ChIP-seq data during pretraining. What genomic features transfer across this species and tissue boundary?
A researcher has RNA-seq data from only 20 patients with a rare neurological disorder. Should they train a deep learning model from scratch or use a pretrained foundation model? Explain your reasoning considering both accuracy and the biology of rare disorders.
Masked language modeling requires a model to predict hidden nucleotides from context. How does this self-supervised task force the model to learn about biologically relevant features like transcription factor binding motifs or splice sites, even though these features are never explicitly labeled during training?
Compare the computational costs of: (a) training a task-specific model from scratch for 10 different genomic prediction tasks, versus (b) pretraining one foundation model then fine-tuning it for the same 10 tasks. When would each approach be more efficient?
Zero-shot learning allows making predictions without any task-specific training data by using sequence embeddings. Describe a biological scenario where zero-shot predictions would be particularly valuable, and explain what limitations you would expect compared to fine-tuned predictions.
Explain why longer context length (the number of nucleotides a model can process at once) is particularly important for foundation models in genomics, compared to foundation models in natural language processing.
A foundation model trained only on reference human genome sequences might perform poorly when applied to cancer genomes with many somatic mutations. This is called “domain shift.” Suggest two approaches to address this problem while still leveraging the pretrained model.
Some researchers argue that foundation models are “black boxes” that provide predictions without biological insight, while others claim they can reveal biological principles. Evaluate both perspectives using specific examples of what we can and cannot learn from foundation model embeddings.

Discussion Questions

Ethical Considerations in Population Genomics: Most genomic foundation models are pretrained predominantly on European ancestry genomes. Discuss how this could lead to performance disparities when models are applied to individuals of other ancestries. What steps should researchers take to address this bias, and what are the broader equity implications?
The Cost-Benefit of Foundation Models: Pretraining a large genomic foundation model might cost $500,000 in computational resources, while task-specific models for individual labs cost ~$1,000 each. From a scientific community perspective, when does it make sense to invest in foundation models versus encouraging labs to train task-specific models? Consider issues of access, reproducibility, and innovation.
Interpretability vs. Performance: Foundation models often provide better predictions than simpler, more interpretable models (like position weight matrices for motifs). In what biological contexts is it acceptable to use “black box” predictions, and when is mechanistic interpretability essential? Give specific examples.
Data Privacy and Sensitive Information: Genomic data is highly sensitive and can identify individuals. If foundation models are trained on genomic databases, could they inadvertently memorize and reveal information about individuals in the training data? What safeguards should be implemented?
Generalization to Truly Novel Biology: Foundation models excel at transferring knowledge to scenarios similar to their training data. However, what about truly novel biology—undiscovered regulatory mechanisms, synthetic genomes, or organisms from extreme environments? Discuss the limitations of transfer learning for genuinely new biological phenomena and how these might be addressed.

What’s Next?

You’ve now learned the conceptual foundations of genomic foundation models—what they are, why they work, and how they’re changing genomics research. Foundation models address the fundamental challenge of data scarcity by learning general genomic principles from massive unlabeled data, then transferring that knowledge to specific tasks with minimal labeled examples.

In the next three chapters, we’ll examine specific genomic foundation models in detail:

Chapter 13: DNA Language Models will cover the first generation of genomic foundation models based on BERT-style architectures:

DNABERT: Applying BERT’s masked language modeling to k-mer sequences
DNABERT-2: Improved tokenization and longer context
LOGO: Combining sequence and structure information
Nucleotide Transformer: Scaling to billions of parameters
GROVER: Multi-task pretraining on diverse genomic tasks

Chapter 14: Next-Generation DNA Models will explore cutting-edge architectures that address current limitations:

HyenaDNA: Processing sequences up to 1 million bp
Mamba: Efficient long-context modeling with state-space models
Caduceus: Bidirectional Mamba for genomics
Hybrid architectures combining multiple approaches

Chapter 15: Introduction to Single-Cell Omics will shift from DNA sequences to gene expression, preparing for:

Chapter 16: Single-Cell Foundation Models covering models like Geneformer and scGPT

Before proceeding, make sure you’re comfortable with:

The concept of transfer learning and why pretrained knowledge helps with limited data
How self-supervised learning (masked language modeling) works for DNA sequences
The difference between pretraining and fine-tuning
What zero-shot and few-shot learning mean
When to use foundation models versus task-specific models
How to evaluate whether a foundation model is effective

If any of these concepts are unclear, review the relevant sections in this chapter. The specific models in Chapters 13-14 all build on these foundational concepts, so understanding them now will make the next chapters much easier to follow.

Ready to dive into specific DNA language models? Let’s explore how BERT-style transformers have been adapted for genomic sequences in Chapter 13!

Chapter 12 Complete
Next: Chapter 13 - DNA Language Models