
Chapter 14: Next-Generation DNA Models


Start with a single nucleotide — an A at one position in the genome. Now zoom out to 100 bases: you see a transcription factor binding motif. Zoom to 1,000: an exon boundary with splice signals. Zoom to 10,000: a complete gene with its promoter. Zoom to 100,000: an entire regulatory domain with enhancers, insulators, and their target genes interacting across vast distances. Zoom to 1,000,000: a topologically associating domain where the 3D folding of chromatin determines which enhancers can reach which promoters. At every scale, new biology emerges. The genome is not a flat string of letters — it is a deeply hierarchical document, and understanding it requires reading at multiple resolutions simultaneously.

But until recently, most AI models could only see a few hundred to a few thousand bases at a time — like trying to understand a city by looking through a keyhole. The transformer models from the previous chapter are powerful, but they carry a fundamental computational burden: their attention mechanism has quadratic complexity. Double the sequence length and the computation quadruples. Increase it tenfold and the computation increases a hundredfold. At 1 million base pairs, the attention matrix alone would require terabytes of GPU memory. The biology demands scale. The architecture forbids it.

This collision between biological necessity and computational constraint has driven a new wave of architectural innovation. Researchers have asked: what if we could capture long-range dependencies without computing all pairwise attention? What if the architecture scaled linearly with sequence length instead of quadratically? The answer has come from an unlikely direction — state space models and convolution operators originally developed for signal processing, now repurposed to read DNA at scales transformers cannot reach.

This chapter is about models that finally open the door: HyenaDNA, Mamba, and Caduceus. They represent not just engineering improvements but a genuine expansion in what questions genomics AI can ask.


The Biological Challenge

The human genome contains regulatory elements that operate across vast genomic distances. Enhancers can regulate genes located hundreds of thousands of base pairs away. Topologically associating domains (TADs) span megabase-scale regions. Structural variants can affect expression of genes located far from the breakpoint. Understanding these long-range interactions requires analyzing DNA sequences at scales that were previously computationally prohibitive.

Standard Transformer models face a fundamental limitation: their attention mechanism has quadratic complexity, meaning the computational cost scales with the square of the sequence length. Doubling the input quadruples the compute; extending a model's context from 1 kb to 100 kb multiplies the cost of its attention layers by 10,000.

This quadratic scaling creates practical barriers: attention matrices that exceed GPU memory, training runs that become prohibitively slow, and models forced to truncate their inputs to a few thousand bases.

Traditional solutions involve breaking long sequences into short chunks, which discards exactly the long-range information we care about. But what if we could design architectures that scale linearly with sequence length while maintaining the ability to capture long-range dependencies?

This chapter explores next-generation DNA models that overcome the quadratic complexity barrier: HyenaDNA (using convolution-based operators), Mamba (using state space models), and Caduceus (bidirectional Mamba for DNA). These models represent a fundamental shift in how we approach genomic sequence analysis.


Learning Objectives

After completing this chapter, you will be able to:


14.1 The Quadratic Complexity Problem

Understanding Attention Complexity

Let’s make the computational challenge concrete. In a Transformer, every position in the sequence attends to every other position, so a sequence of length L produces an L × L matrix of attention scores: both compute and memory grow as O(L²).

Let’s calculate what this means for real genomic sequences:

Example: BRCA1 gene region (100,000 bp)

Example: Chromosome 21 (46 million bp)
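
The two example regions above can be costed with a few lines of arithmetic. A rough sketch, assuming a single attention matrix stored at float16 (2 bytes per score):

```python
# Back-of-the-envelope memory for one L x L attention score matrix.
BYTES_PER_SCORE = 2  # float16; float32 would double these numbers

def attention_matrix_bytes(length: int) -> int:
    """Memory for a single-head, single-layer attention matrix."""
    return length * length * BYTES_PER_SCORE

for name, length in [("BRCA1 region", 100_000), ("Chromosome 21", 46_000_000)]:
    gigabytes = attention_matrix_bytes(length) / 1e9
    print(f"{name}: L = {length:,} -> {gigabytes:,.0f} GB")
```

Even at half precision, a single 100 kb attention matrix needs about 20 GB, and chromosome scale climbs into the petabytes — before counting multiple heads and layers.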

Why Can’t We Just Use Short Sequences?

You might ask: why not analyze DNA in short chunks? This approach has several limitations:

  1. Lost long-range interactions: Enhancers regulating distant genes are missed
  2. Structural variant context: Large deletions or duplications span boundaries
  3. Haplotype information: Phasing requires analyzing linked variants across distances
  4. TAD structure: Topologically associating domains require chromosome-scale context
  5. Splicing patterns: Alternative splicing depends on distant regulatory signals

Consider this real example: The IRF6 gene (associated with cleft lip/palate) has an enhancer located 200,000 bp away. Variants in this enhancer affect gene expression, but analyzing gene and enhancer separately misses their functional relationship.

Previous Attempts at Solutions

Before next-generation models, researchers tried several approaches:

Sparse Attention (e.g., Longformer, BigBird): attend only to a subset of position pairs, typically a local window plus a few global tokens. Cost drops, but detailed information still cannot travel arbitrarily far.

Sliding Windows: process the genome in overlapping chunks and stitch the outputs together. Interactions that span a window boundary are still lost.

Hierarchical Models: compress the sequence into coarser units and model those. This trades away resolution exactly where single-base precision may matter.

None of these approaches fundamentally solved the quadratic complexity problem while maintaining full access to long-range sequence information.


14.2 HyenaDNA: Convolution-Based Long-Range Modeling

The Core Innovation

Biological Analogy (HyenaDNA long context): Instead of reading in short 512 bp windows, the entire genomic locus (100 kb) can be read at once — distant enhancer-promoter relationships are not missed.

HyenaDNA replaces the attention mechanism with a convolutional operator that can be computed efficiently using the Fast Fourier Transform (FFT). The key insight: convolutions can capture long-range dependencies through their filter design, and FFT makes them computationally efficient.

The model uses what the authors call the Hyena operator, which combines:

  1. Data-dependent gating (like attention’s value mechanism)
  2. Long convolution filters (capturing long-range patterns)
  3. FFT-based computation (achieving near-linear O(L log L) complexity)

Complexity comparison: per layer, attention costs O(L²) in time and memory, while the FFT-based Hyena operator runs in O(L log L) time with O(L) memory.

For L = 1,000,000, this means the two approaches differ by several orders of magnitude.
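
Concretely, raw operation counts at L = 1,000,000 can be compared in a few lines (a rough sketch that ignores constant factors):

```python
import math

L = 1_000_000
attention_ops = L * L          # O(L^2): one score per position pair
hyena_ops = L * math.log2(L)   # O(L log L): FFT-based long convolution

print(f"attention: {attention_ops:.1e} operations")
print(f"hyena:     {hyena_ops:.1e} operations")
print(f"ratio:     ~{attention_ops / hyena_ops:,.0f}x fewer operations")
```

The ratio works out to roughly 50,000×, which is the difference between an intractable computation and a routine one.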

Architecture Details

The Hyena operator works in three steps:

Step 1: Projection. The input sequence X is mapped through parallel linear projections (playing roles similar to Q, K, V in attention).

Step 2: Long Convolution. Instead of attention scores, learned convolutional filters, potentially as long as the sequence itself, mix information across positions.

Step 3: Gating. The convolution output is modulated element-wise by the other projections, letting the data control what passes through.
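
The three steps can be sketched in NumPy. This is a simplified, single-channel toy (scalar projection weights, a random explicit filter, circular convolution), not the actual Hyena parameterization, which learns implicit filters with dense projections:

```python
import numpy as np

def fft_conv(x, h):
    """Circular convolution of two equal-length signals via FFT: O(L log L)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

rng = np.random.default_rng(0)
L = 1024
x = rng.standard_normal(L)          # toy input track

# Step 1: projections (scalar weights here; real models use learned dense layers)
v = 0.5 * x                         # "value"-like projection
g = np.tanh(0.3 * x)                # gate projection

# Step 2: long convolution with a filter as long as the sequence itself
h = rng.standard_normal(L) / np.sqrt(L)
u = fft_conv(v, h)

# Step 3: gating -- element-wise modulation by the data-dependent gate
y = g * u
assert y.shape == (L,)
```

Every output position mixes information from the full 1,024-position window, yet no L × L matrix is ever formed.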

Training HyenaDNA

HyenaDNA was pretrained on the human reference genome using single-nucleotide tokenization and a next-token (autoregressive) prediction objective, with context lengths scaled during training up to 1 million bp.

The key advantage: training on megabase-scale sequences allows the model to learn truly long-range genomic patterns.

What HyenaDNA Learned

Analysis of the trained model and downstream tasks suggests it can capture sequence patterns across much longer windows than standard attention models, including long-range sequence context, interactions between regulatory elements, and patterns within repetitive elements.


[Optional: The Math]

Math Box: FFT and Convolution

Why FFT Makes Convolution Fast

A convolution between sequence X (length L) and filter H (length K) requires O(L × K) operations when computed directly; for long filters (K ≈ L) that is O(L²). Computing the same convolution through the FFT takes O(L log L), which is dramatically faster.

The Convolution Theorem: Convolution in time domain = Multiplication in frequency domain

Conv(X, H) = IFFT(FFT(X) ⊙ FFT(H))

Where FFT and IFFT denote the forward and inverse fast Fourier transforms, and ⊙ is element-wise multiplication in the frequency domain.

Total complexity: O(L log L) instead of O(L²)

For genomic sequences at L = 1,000,000, direct convolution would cost on the order of 10¹² operations, while the FFT route costs about 2 × 10⁷, a speedup of roughly 50,000×.

Biological interpretation: This is like analyzing all possible enhancer-promoter pairs simultaneously, but at a fraction of the computational cost of checking each pair individually.
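
The convolution theorem is easy to verify numerically. A small sketch comparing the direct O(L²) sum against the FFT route (circular convolution, NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 512
x = rng.standard_normal(L)   # "sequence"
h = rng.standard_normal(L)   # "filter", as long as the sequence

# Direct circular convolution: O(L^2) operations
direct = np.array([sum(x[j] * h[(i - j) % L] for j in range(L)) for i in range(L)])

# FFT route: O(L log L) operations, same result
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
```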


14.3 State Space Models: Mamba and S4

Introduction to State Space Models

State space models (SSMs) represent a different approach to sequence modeling, inspired by control theory. Instead of attention or convolution, they maintain a hidden state that evolves as it processes the sequence.

Key concept: At each position, the model:

  1. Updates its hidden state based on previous state and current input
  2. Produces an output based on the hidden state
  3. The state can “remember” information from distant positions

Think of it like a cell integrating signals over time: the current state summarizes everything received so far, and each new input nudges that state rather than being compared against every past input.

The S4 Model

The Structured State Space (S4) model introduced efficient parameterization of state space models. The core equations:

h(t+1) = A × h(t) + B × x(t)    # State update
y(t) = C × h(t)                  # Output

Where h(t) is the hidden state, x(t) the input, y(t) the output, and A, B, C are learned matrices governing state dynamics, input mixing, and readout, respectively.

Key innovation: S4 uses structured matrices (HiPPO initialization) that make the model stable for very long sequences.

Complexity: O(L) for sequence length L
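
The recurrence can be sketched directly. This toy uses fixed scalars instead of S4's structured matrices, but it shows the key property: one state update per position, constant memory, and a state that carries information forward indefinitely:

```python
import numpy as np

def ssm_scan(x, A=0.9, B=1.0, C=0.5):
    """Minimal scalar SSM: h <- A*h + B*x[t], then emit y[t] = C*h."""
    h, ys = 0.0, []
    for xt in x:              # single O(L) pass, O(1) state memory
        h = A * h + B * xt    # state update
        ys.append(C * h)      # output readout
    return np.array(ys)

x = np.zeros(50)
x[0] = 1.0                    # an impulse at position 0
y = ssm_scan(x)
# With |A| < 1 the impulse decays geometrically (y[t] = 0.5 * 0.9**t)
# but never fully vanishes: the state "remembers" position 0 throughout.
```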

Mamba: Selective State Space Models

Biological Analogy (Mamba / state space models): Like simultaneously holding short-term and long-term memory while reading a genome — processed efficiently without the computational cost of attention.

Mamba improves upon S4 by making the state space parameters data-dependent. Instead of fixed A, B, C matrices, Mamba computes them based on the input:

B(t) = Linear_B(x(t))    # Input-dependent
C(t) = Linear_C(x(t))    # Input-dependent  
Δ(t) = Softplus(Linear_Δ(x(t)))  # Input-dependent discretization
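
The effect of input-dependent parameters can be shown with a toy numeric sketch. This is not Mamba's actual parameterization (the scalar state and weights here are hypothetical), but it captures the behavior: the input itself controls how strongly each position is written into the state and how quickly old state is forgotten:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, w_B=1.0, w_C=1.0):
    """Toy selective SSM: step size and write strength depend on the input."""
    h, ys = 0.0, []
    for xt in x:
        delta = softplus(xt)         # input-dependent step size
        A = np.exp(-delta)           # big input -> small A -> forget old state
        B = w_B * xt                 # input-dependent write strength
        h = A * h + (1 - A) * B * xt
        ys.append(w_C * h)
    return np.array(ys)

y = selective_scan(np.array([2.0, 0.0, 0.0, 0.0, 0.0]))
# The salient first input is written strongly into the state; the near-zero
# inputs that follow barely write anything, so the memory of position 0 persists.
```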

Why this matters for DNA:

Think of it like a researcher scanning a chromosome:

Mamba’s Computational Advantages

Mamba achieves linear complexity while maintaining long-range capability:

  1. Memory efficiency: No attention matrix needed
  2. Fast inference: Single sequential pass through sequence
  3. Long context: Can process 1M+ bp sequences
  4. Parallel training: Uses efficient parallel scan algorithms

Compared to Transformers, Mamba gives up the all-pairs view of attention in exchange for a compressed running state, gaining linear time and memory at the cost of deciding on the fly what to keep.


14.4 Caduceus: Bidirectional Mamba for DNA

The Directionality Problem

Both strands of DNA are biologically meaningful:

Transcription factor binding sites can appear on either strand. Genes can be encoded on either strand. Regulatory elements work regardless of orientation.

Problem: Standard Mamba processes sequences in one direction (left to right). This creates an asymmetry that doesn’t reflect biological reality.

Caduceus Architecture

Caduceus solves this by using bidirectional Mamba layers:

Approach 1: RC-Augmentation. Run the model on both the sequence and its reverse complement and combine the two outputs, so predictions do not depend on which strand was provided.

Approach 2: BiMamba Layers. Build bidirectionality into the architecture itself: each block scans the sequence in both directions and mixes the results.

The BiMamba block:

# Forward pass
h_fwd = Mamba(x, direction='forward')

# Reverse pass  
h_rev = Mamba(x, direction='reverse')

# Mix
output = h_fwd + h_rev  # or learnable combination
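
Why strand symmetry matters can be seen with a tiny, model-free example. The helper below uses a hypothetical strand-symmetric score (the larger motif count over the two strands); because it treats both strands the same way, the score is invariant under reverse complementation, which is exactly the property Caduceus builds into its layers:

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement: the same DNA read off the opposite strand."""
    return seq.translate(COMP)[::-1]

def motif_count(seq: str, motif: str) -> int:
    """Count (possibly overlapping) exact occurrences of a motif."""
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

def strand_symmetric_score(seq: str, motif: str = "GGAA") -> int:
    """Score both strands and take the max, so orientation does not matter."""
    return max(motif_count(seq, motif), motif_count(revcomp(seq), motif))

s = "TTGGAACCGGAAT"
assert strand_symmetric_score(s) == strand_symmetric_score(revcomp(s))
```

A model without this property can assign different scores to the two representations of the same double-stranded locus.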

Caduceus Training

Caduceus uses a family of bidirectional and reverse-complement-aware MambaDNA blocks, pretrained with masked language modeling on the human reference genome. Reported checkpoints vary in size and context length, so the safest description is architectural rather than a single fixed recipe.

Performance Comparison

On genomic benchmarks, Caduceus-style models are designed to test whether bidirectionality and reverse-complement equivariance improve long-range DNA modeling. Reported gains are benchmark-dependent and concentrate in three areas: regulatory element prediction, variant effect prediction, and downstream fine-tuning. Treat published numbers as qualitative expectations rather than fixed rankings.


14.5 Comparing Next-Generation Models

HyenaDNA vs Mamba vs Caduceus

Let’s compare the three architectures:

| Feature | HyenaDNA | Mamba | Caduceus |
|---------|----------|-------|----------|
| Core mechanism | Long convolution | State space model | Bidirectional SSM |
| Complexity | O(L log L) | O(L) | O(L) |
| Max context | 1M bp | 1M+ bp | 1M bp |
| Directionality | Bidirectional (conv) | Unidirectional | Explicitly bidirectional |
| Training speed | Fast | Very fast | Very fast |
| Inference speed | Very fast | Very fast | Very fast |
| Memory usage | Low | Very low | Very low |

When to Use Each Model

HyenaDNA: a good fit when you want very long context with a convolutional inductive bias, for example scanning 100 kb to 1 Mb loci for dispersed regulatory signal.

Mamba: the leanest option for the longest contexts; its O(L) scan and small state make it attractive when memory is tight or sequences must be processed in a single streaming pass.

Caduceus: preferred when strand symmetry matters, such as variant effect prediction, since its reverse-complement-aware blocks score both strands consistently.

Still use Transformers when: sequences are short (a few kb), attention maps are wanted for interpretability, or strong pretrained Transformer checkpoints already exist for the task.


14.6 Long-Context Applications

Application 1: Structural Variant Analysis

Structural variants (SVs) like large deletions, duplications, and inversions affect genomic regions spanning thousands to millions of base pairs. Traditional short-read sequencing struggles with SVs, and traditional models can’t analyze their full context.

Case example: Analyzing a 500 kb deletion

Using Caduceus with 1M bp context:

  1. Encode sequence with deletion and flanking regions
  2. Model captures:
    • Disrupted TAD boundaries
    • Lost enhancer-promoter contacts
    • Affected gene expression patterns
    • Compensatory regulatory changes

This enables in silico assessment of the deletion’s regulatory consequences, helping prioritize which predicted disruptions merit experimental follow-up.

Application 2: Enhancer-Promoter Prediction

Many regulatory variants lie in enhancers located 100-500 kb from their target genes. Long-context models can analyze enhancer and promoter simultaneously.

Example workflow:

  1. Input: 200 kb region containing enhancer and gene
  2. Model identifies:
    • Enhancer boundaries
    • Promoter location
    • CTCF insulator positions
    • Chromatin looping probability
  3. Variant effect prediction:
    • Test variant in enhancer
    • Predict change in gene expression
    • Account for 3D genomic context
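
Step 3's ref-versus-alt comparison is model-agnostic and can be sketched with a placeholder scorer. Here `expression_score` is a hypothetical stand-in (GC fraction of the window) for a real long-context model's expression prediction; only the delta pattern is the point:

```python
def expression_score(seq: str) -> float:
    """Placeholder for a long-context model's predicted expression.
    Real use would call the model on the full enhancer+promoter window."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def variant_effect(seq: str, pos: int, alt: str) -> float:
    """Predicted expression change when position `pos` is mutated to `alt`."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return expression_score(alt_seq) - expression_score(seq)

enhancer = "ATGCGCGCATATGCGC"                      # toy 16 bp enhancer window
delta = variant_effect(enhancer, pos=4, alt="A")   # G -> A in a GC-rich motif
# A negative delta flags the variant as predicted to reduce the signal.
```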

Real example: The BCL11A erythroid enhancer (controlling fetal hemoglobin) lies in an intron of the gene, roughly 62 kb from the promoter. Variants in this enhancer affect hemoglobin levels in sickle cell disease. Long-context models can directly model this regulatory relationship.

Application 3: Haplotype Analysis

Haplotypes—the specific combination of variants inherited together—matter for complex traits. Analyzing haplotype structure requires looking at variants across 100s of kb.

Using HyenaDNA for haplotype analysis:

  1. Input: Personal genome sequence spanning 500 kb
  2. Model identifies:
    • Haplotype blocks (regions of linkage)
    • Recombination hotspots
    • Compound heterozygous combinations
  3. Applications:
    • Pharmacogenomics (drug response haplotypes)
    • Risk prediction (combined variant effects)
    • Evolutionary history

Application 4: Locus-Wide Association

Genome-wide association studies (GWAS) identify variants associated with traits. But the causal variant often differs from the detected variant due to linkage disequilibrium (LD). Long-context models can analyze entire associated loci.

Approach:

  1. Input: 1 Mb region around GWAS signal
  2. Model evaluates:
    • Every variant’s functional impact
    • Regulatory context
    • Gene targets
    • Epistatic interactions
  3. Output: Prioritized causal variant list

Example: The FTO locus associated with obesity spans 500 kb. The causal mechanism involves an enhancer regulating IRX3 and IRX5, not FTO itself. Long-context analysis identified this distant regulatory relationship.


Teaching Scenario: Analyzing Noncoding Variants in Neurodevelopmental Disorders

Background: Dr. Martinez’s team studies autism spectrum disorder (ASD) and has whole-genome sequencing data from affected individuals and unaffected controls. Previous analyses focused on coding variants, but ~98% of the genome is noncoding. Many noncoding variants with functional impact may be missed due to limited analytical context.

Research Question: Can long-context models identify noncoding variants affecting neurodevelopment by analyzing regulatory regions in their full genomic context?

Approach:

  1. Data preparation:
    • Extracted all rare noncoding variants (MAF < 0.1%)
    • Identified 125,000 candidate regulatory variants
    • For each variant, extracted 500 kb genomic context
  2. Model application (using Caduceus):
    • Analyzed each variant in 500 kb context window
    • Predicted impact on:
      • Nearby gene expression
      • Chromatin accessibility
      • Transcription factor binding
      • Long-range regulatory interactions
  3. Prioritization:
    • Ranked variants by predicted functional impact
    • Considered gene targets expressed in brain
    • Evaluated enrichment in ASD cases vs. controls

Illustrative results:

Example validation plan: select the top 10 variants for CRISPR-based validation in neuronal cell models.

Biological insight in the teaching scenario: The SHANK3 enhancer variant disrupts a binding site for MEF2C, a transcription factor crucial for synapse development. The variant reduces SHANK3 expression by 40% in neurons. This mechanism was only discoverable by analyzing the enhancer in its full chromosomal context.

Impact:

Reference note: This case study is a hypothetical synthesis based on principles from the long-context modeling papers listed under Further Reading.


Teaching Scenario: Predicting Chromatin Structure at Megabase Scale

Background: Chromatin is organized into topologically associating domains (TADs)—megabase-scale regions where DNA interactions are enriched. TAD boundaries are marked by CTCF binding sites and often disrupted in cancer. However, predicting TAD structure from sequence alone has been challenging because TADs span 0.5-2 Mb.

Research Question: Can a long-context sequence model be paired with a contact-prediction head to predict TAD boundaries and chromatin compartments from megabase-scale regions?

Approach:

  1. Training data:
    • Hi-C contact maps from 30 cell types (ENCODE)
    • Shows which genomic regions physically interact
    • Resolution: 10 kb bins across genome
  2. Model training:
    • Input: 1 Mb DNA sequence
    • Output: Predicted contact probability for all pairs
    • Architecture: long-context sequence encoder with contact prediction head
    • Loss: Mean squared error on contact frequencies
  3. Validation:
    • Tested on held-out cell types
    • Compared to experimental Hi-C
    • Measured TAD boundary prediction accuracy

Illustrative results:

Variant Effect Prediction: The trained model can then be applied to structural variants, predicting how a boundary-disrupting deletion or inversion reshapes the contact map.

Biological Insights:

  1. TAD boundaries enriched for:
    • Convergent CTCF motifs
    • Housekeeping genes
    • GC-rich sequences
  2. Cell-type-specific TADs driven by:
    • Tissue-specific enhancers
    • Variable CTCF binding strength
    • Cohesin recruitment sites

Clinical Relevance:

Limitations:

Reference note: This scenario is inspired by sequence-to-contact-map approaches built on the long-context architectures listed under Further Reading.


14.7 Limitations and Future Directions

Current Limitations

Computational Challenges:

Biological Limitations:

Interpretability:

Validation Challenges:

Future Research Directions

1. Hybrid Architectures: combining multiple mechanisms, for example local attention for fine-grained interactions alongside convolutions or state space layers for long range, within a single model.

2. Multi-Modal Models: integrating sequence with other data, such as chromatin accessibility, methylation, and expression measurements.

3. Larger Context Windows: pushing toward chromosome-scale context beyond today's roughly 1 Mb.

4. Improved Pretraining: better training objectives than generic masked or next-token prediction, emphasizing regulatory and evolutionary signal.

5. Clinical Translation: the calibration, validation, and reporting work needed to make predictions clinically useful.

6. Cross-Species Models: learning from comparative genomics, where conservation across species highlights functional sequence.


Summary

Key Takeaways


📖 Key Terms

| Term | Definition |
|------|------------|
| **Bidirectional processing** | Analyzing DNA sequences in both forward and reverse directions simultaneously to capture biology of both strands. |
| **Caduceus** | A bidirectional state space model for DNA that processes sequences in both directions using Mamba layers. |
| **Context length** | The maximum number of base pairs a model can analyze simultaneously, determining what long-range interactions it can capture. |
| **Fast Fourier Transform (FFT)** | An efficient algorithm for computing convolutions in O(L log L) time instead of O(L²). |
| **Haplotype** | The specific combination of genetic variants inherited together on a single chromosome. |
| **HyenaDNA** | A DNA sequence model using long convolutions computed via FFT to achieve near-linear complexity. |
| **Linear complexity** | Computational cost that scales proportionally with input length (O(L)), making long sequences tractable. |
| **Long convolution** | A convolutional filter with length up to millions of positions, capable of capturing long-range dependencies. |
| **Mamba** | A state space model with selective parameters that processes sequences in O(L) time while maintaining long-range memory. |
| **Quadratic complexity** | Computational cost that scales with the square of input length (O(L²)), limiting Transformers to short sequences. |
| **State space model (SSM)** | A sequence modeling approach that maintains a hidden state evolving over positions, inspired by control theory. |
| **Structural variant (SV)** | Large genomic alterations including deletions, duplications, inversions, and translocations spanning 1 kb to megabases. |
| **Topologically associating domain (TAD)** | A genomic region spanning 0.5-2 Mb where DNA interactions are enriched, bounded by CTCF sites. |

Conceptual Questions

  1. Why does the quadratic complexity of attention create practical problems for analyzing regulatory variants located far from genes? Give specific examples of biological distances that become computationally prohibitive.

  2. Explain how HyenaDNA’s use of convolution with FFT achieves better computational complexity than attention. What is the tradeoff between these two approaches?

  3. Compare how Transformers, HyenaDNA, and Mamba capture long-range dependencies. Which biological scenarios favor each architecture?

  4. Why is bidirectional processing important for DNA sequence analysis? What biological features require seeing both strands?

  5. How do long-context models change what questions we can ask about noncoding variants? What new types of analyses become possible?

  6. A researcher wants to analyze a 300 kb deletion that spans three genes. Which model architecture would you recommend and why?

  7. Explain why state space models can process sequences in linear time. What is the key difference from attention that enables this?

  8. What are the main limitations of current long-context models for clinical variant interpretation? How might these be addressed in future work?


Discussion Questions

  1. Ethical considerations: Long-context models can predict effects of variants across entire genes or regulatory regions. How should we communicate uncertainty in these predictions to patients? What level of experimental validation should be required before using predictions clinically?

  2. Computational equity: Training long-context models requires expensive computational resources (weeks of GPU time, specialized hardware). How does this affect which research groups can develop these models? What strategies could make long-context analysis more accessible?

  3. Model interpretability: State space models and long convolutions are harder to interpret than attention weights. For clinical use, how important is interpretability versus accuracy? Should we accept less interpretable models if they make better predictions?

  4. Scaling limits: Current models analyze up to 1M bp. The human genome is 3.2 billion bp. What biological questions require even longer context (e.g., chromosome-scale or genome-scale)? Are there fundamental limits to how much context is useful?

  5. Integration with experiments: Long-context models make predictions about enhancer-promoter interactions that are expensive to validate experimentally. How should we prioritize which predictions to validate? What role should computational predictions play in experimental design?


Further Reading

Foundational Papers

  1. Poli et al. (2023) “Hyena Hierarchy: Towards Larger Convolutional Language Models” ICML
    • Introduces the Hyena operator that HyenaDNA builds on
    • https://arxiv.org/abs/2302.10866
  2. Gu & Dao (2023) “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” arXiv
    • Introduces Mamba architecture
    • https://arxiv.org/abs/2312.00752
  3. Schiff et al. (2024) “Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling” arXiv
    • Bidirectional state space models for DNA
    • https://arxiv.org/abs/2403.03234

Reviews and Perspectives

  1. Gu et al. (2022) “On the Parameterization and Initialization of Diagonal State Space Models” NeurIPS
    • Theory behind efficient state space models
    • S4 architecture and variants
  2. Nguyen et al. (2023) “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution” NeurIPS
    • Original HyenaDNA paper
    • Applications to genomics and benchmark results

Online Resources

  1. Mamba Documentation: https://github.com/state-spaces/mamba
    • Official implementation and examples
  2. HyenaDNA GitHub: https://github.com/HazyResearch/hyena-dna
    • Code, pretrained models, tutorials
  3. Long Context Models Survey: https://github.com/Strivin0311/long-llms-learning
    • Comprehensive list of long-context architectures

What’s Next?

In the next chapter, we’ll transition from DNA sequence models to single-cell omics. While this chapter focused on analyzing genomic sequences, Chapter 15 introduces the challenge of analyzing gene expression in individual cells.

Preview of Chapter 15: Introduction to Single-Cell Omics

Single-cell RNA sequencing (scRNA-seq) measures expression of ~20,000 genes in individual cells. A typical experiment generates data from 10,000-1,000,000 cells. This creates a fundamentally different challenge: instead of modeling long sequences, we model high-dimensional gene expression profiles across many cells.

You’ll learn:

Prerequisites for Chapter 15:

Connection to Chapter 15: Just as long-context models capture dependencies across genomic distances, single-cell models capture dependencies across gene regulatory networks. Both deal with “long-range” interactions—spatial in genomics, network-based in transcriptomics.

Ready to explore how cells differ at the molecular level? → [Continue to Chapter 15: Introduction to Single-Cell Omics]