
Chapter 14: Next-Generation DNA Models


Start with a single nucleotide — an A at one position in the genome. Now zoom out to 100 bases: you see a transcription factor binding motif. Zoom to 1,000: an exon boundary with splice signals. Zoom to 10,000: a complete gene with its promoter. Zoom to 100,000: an entire regulatory domain with enhancers, insulators, and their target genes interacting across vast distances. Zoom to 1,000,000: a topologically associating domain where the 3D folding of chromatin determines which enhancers can reach which promoters. At every scale, new biology emerges. The genome is not a flat string of letters — it is a deeply hierarchical document, and understanding it requires reading at multiple resolutions simultaneously.

But until recently, most AI models could only see a few hundred to a few thousand bases at a time — like trying to understand a city by looking through a keyhole. The transformer models from the previous chapter are powerful, but they carry a fundamental computational burden: their attention mechanism has quadratic complexity. Double the sequence length and the computation quadruples. Increase it tenfold and the computation increases a hundredfold. At 1 million base pairs, the attention matrix alone would require terabytes of GPU memory. The biology demands scale. The architecture forbids it.

This collision between biological necessity and computational constraint has driven a new wave of architectural innovation. Researchers have asked: what if we could capture long-range dependencies without computing all pairwise attention? What if the architecture scaled linearly with sequence length instead of quadratically? The answer has come from an unlikely direction — state space models and convolution operators originally developed for signal processing, now repurposed to read DNA at scales transformers cannot reach.

This chapter is about models that finally open the door: HyenaDNA, Mamba, and Caduceus. They represent not just engineering improvements but a genuine expansion in what questions genomics AI can ask.


The Biological Challenge

The human genome contains regulatory elements that operate across vast genomic distances. Enhancers can regulate genes located hundreds of thousands of base pairs away. Topologically associating domains (TADs) span megabase-scale regions. Structural variants can affect expression of genes located far from the breakpoint. Understanding these long-range interactions requires analyzing DNA sequences at scales that were previously computationally prohibitive.

Standard Transformer models face a fundamental limitation: their attention mechanism has quadratic complexity, meaning the computational cost scales with the square of the sequence length. Doubling the input quadruples the compute; extending a model's context from 1 kb to 100 kb multiplies the cost of its attention layers by 10,000.

This quadratic scaling creates practical barriers: attention matrices that exceed GPU memory, training runs that become prohibitively slow, and models forced to truncate their inputs to a few thousand bases.

Traditional solutions involve breaking long sequences into short chunks, which discards exactly the long-range information we care about. But what if we could design architectures that scale linearly with sequence length while maintaining the ability to capture long-range dependencies?

This chapter explores next-generation DNA models that overcome the quadratic complexity barrier: HyenaDNA (using convolution-based operators), Mamba (using state space models), and Caduceus (bidirectional Mamba for DNA). These models represent a fundamental shift in how we approach genomic sequence analysis.


Learning Objectives

After completing this chapter, you will be able to:


14.1 The Quadratic Complexity Problem

Understanding Attention Complexity

Let’s make the computational challenge concrete. In a Transformer, every position in the sequence attends to every other position, so a sequence of length L produces an L × L matrix of attention scores: both compute and memory grow as O(L²).

Let’s calculate what this means for real genomic sequences:

Example: BRCA1 gene region (100,000 bp)

Example: Chromosome 21 (46 million bp)
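
The two example regions above can be costed with a few lines of arithmetic. A rough sketch, assuming a single attention matrix stored at float16 (2 bytes per score):

```python
# Back-of-the-envelope memory for one L x L attention score matrix.
BYTES_PER_SCORE = 2  # float16; float32 would double these numbers

def attention_matrix_bytes(length: int) -> int:
    """Memory for a single-head, single-layer attention matrix."""
    return length * length * BYTES_PER_SCORE

for name, length in [("BRCA1 region", 100_000), ("Chromosome 21", 46_000_000)]:
    gigabytes = attention_matrix_bytes(length) / 1e9
    print(f"{name}: L = {length:,} -> {gigabytes:,.0f} GB")
```

Even at half precision, a single 100 kb attention matrix needs about 20 GB, and chromosome scale climbs into the petabytes — before counting multiple heads and layers.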

Why Can’t We Just Use Short Sequences?

You might ask: why not analyze DNA in short chunks? This approach has several limitations:

  1. Lost long-range interactions: Enhancers regulating distant genes are missed
  2. Structural variant context: Large deletions or duplications span boundaries
  3. Haplotype information: Phasing requires analyzing linked variants across distances
  4. TAD structure: Topologically associating domains require chromosome-scale context
  5. Splicing patterns: Alternative splicing depends on distant regulatory signals

Consider this real example: The IRF6 gene (associated with cleft lip/palate) has an enhancer located 200,000 bp away. Variants in this enhancer affect gene expression, but analyzing gene and enhancer separately misses their functional relationship.

Previous Attempts at Solutions

Before next-generation models, researchers tried several approaches:

Sparse Attention (e.g., Longformer, BigBird): attend only to a subset of position pairs, typically a local window plus a few global tokens. Cost drops, but detailed information still cannot travel arbitrarily far.

Sliding Windows: process the genome in overlapping chunks and stitch the outputs together. Interactions that span a window boundary are still lost.

Hierarchical Models: compress the sequence into coarser units and model those. This trades away resolution exactly where single-base precision may matter.

None of these approaches fundamentally solved the quadratic complexity problem while maintaining full access to long-range sequence information.


14.2 HyenaDNA: Convolution-Based Long-Range Modeling

The Core Innovation

Biological Analogy (HyenaDNA long context): Instead of reading in short 512 bp windows, the entire genomic locus (100 kb) can be read at once — distant enhancer-promoter relationships are not missed.

HyenaDNA replaces the attention mechanism with a convolutional operator that can be computed efficiently using the Fast Fourier Transform (FFT). The key insight: convolutions can capture long-range dependencies through their filter design, and FFT makes them computationally efficient.

The model uses what the authors call the Hyena operator, which combines:

  1. Data-dependent gating (like attention’s value mechanism)
  2. Long convolution filters (capturing long-range patterns)
  3. FFT-based computation (achieving near-linear O(L log L) complexity)

Complexity comparison: per layer, attention costs O(L²) in time and memory, while the FFT-based Hyena operator runs in O(L log L) time with O(L) memory.

For L = 1,000,000, this means the two approaches differ by several orders of magnitude.
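
Concretely, raw operation counts at L = 1,000,000 can be compared in a few lines (a rough sketch that ignores constant factors):

```python
import math

L = 1_000_000
attention_ops = L * L          # O(L^2): one score per position pair
hyena_ops = L * math.log2(L)   # O(L log L): FFT-based long convolution

print(f"attention: {attention_ops:.1e} operations")
print(f"hyena:     {hyena_ops:.1e} operations")
print(f"ratio:     ~{attention_ops / hyena_ops:,.0f}x fewer operations")
```

The ratio works out to roughly 50,000×, which is the difference between an intractable computation and a routine one.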

Architecture Details

The Hyena operator works in three steps:

Step 1: Projection. The input sequence X is mapped through parallel linear projections (playing roles similar to Q, K, V in attention).

Step 2: Long Convolution. Instead of attention scores, learned convolutional filters, potentially as long as the sequence itself, mix information across positions.

Step 3: Gating. The convolution output is modulated element-wise by the other projections, letting the data control what passes through.
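
The three steps can be sketched in NumPy. This is a simplified, single-channel toy (scalar projection weights, a random explicit filter, circular convolution), not the actual Hyena parameterization, which learns implicit filters with dense projections:

```python
import numpy as np

def fft_conv(x, h):
    """Circular convolution of two equal-length signals via FFT: O(L log L)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

rng = np.random.default_rng(0)
L = 1024
x = rng.standard_normal(L)          # toy input track

# Step 1: projections (scalar weights here; real models use learned dense layers)
v = 0.5 * x                         # "value"-like projection
g = np.tanh(0.3 * x)                # gate projection

# Step 2: long convolution with a filter as long as the sequence itself
h = rng.standard_normal(L) / np.sqrt(L)
u = fft_conv(v, h)

# Step 3: gating -- element-wise modulation by the data-dependent gate
y = g * u
assert y.shape == (L,)
```

Every output position mixes information from the full 1,024-position window, yet no L × L matrix is ever formed.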

Training HyenaDNA

HyenaDNA was pretrained on the human reference genome using single-nucleotide tokenization and a next-token (autoregressive) prediction objective, with context lengths scaled during training up to 1 million bp.

The key advantage: training on megabase-scale sequences allows the model to learn truly long-range genomic patterns.

What HyenaDNA Learned

Analysis of the trained model and downstream tasks suggests it can capture sequence patterns across much longer windows than standard attention models, including long-range sequence context, interactions between regulatory elements, and patterns within repetitive elements.


[Optional: The Math]

Math Box: FFT and Convolution

Why FFT Makes Convolution Fast

A convolution between sequence X (length L) and filter H (length K) requires O(L × K) operations when computed directly; for long filters (K ≈ L) that is O(L²). Computing the same convolution through the FFT takes O(L log L), which is dramatically faster.

The Convolution Theorem: Convolution in time domain = Multiplication in frequency domain

Conv(X, H) = IFFT(FFT(X) ⊙ FFT(H))

Where FFT and IFFT denote the forward and inverse fast Fourier transforms, and ⊙ is element-wise multiplication in the frequency domain.

Total complexity: O(L log L) instead of O(L²)

For genomic sequences at L = 1,000,000, direct convolution would cost on the order of 10¹² operations, while the FFT route costs about 2 × 10⁷, a speedup of roughly 50,000×.

Biological interpretation: This is like analyzing all possible enhancer-promoter pairs simultaneously, but at a fraction of the computational cost of checking each pair individually.
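
The convolution theorem is easy to verify numerically. A small sketch comparing the direct O(L²) sum against the FFT route (circular convolution, NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 512
x = rng.standard_normal(L)   # "sequence"
h = rng.standard_normal(L)   # "filter", as long as the sequence

# Direct circular convolution: O(L^2) operations
direct = np.array([sum(x[j] * h[(i - j) % L] for j in range(L)) for i in range(L)])

# FFT route: O(L log L) operations, same result
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
```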


14.3 State Space Models: Mamba and S4

Introduction to State Space Models

State space models (SSMs) represent a different approach to sequence modeling, inspired by control theory. Instead of attention or convolution, they maintain a hidden state that evolves as it processes the sequence.

Key concept: At each position, the model:

  1. Updates its hidden state based on previous state and current input
  2. Produces an output based on the hidden state
  3. The state can “remember” information from distant positions

Think of it like a cell integrating signals over time: the current state summarizes everything received so far, and each new input nudges that state rather than being compared against every past input.

The S4 Model

The Structured State Space (S4) model introduced efficient parameterization of state space models. The core equations:

h(t+1) = A × h(t) + B × x(t)    # State update
y(t) = C × h(t)                  # Output

Where h(t) is the hidden state, x(t) the input, y(t) the output, and A, B, C are learned matrices governing state dynamics, input mixing, and readout, respectively.

Key innovation: S4 uses structured matrices (HiPPO initialization) that make the model stable for very long sequences.

Complexity: O(L) for sequence length L
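
The recurrence can be sketched directly. This toy uses fixed scalars instead of S4's structured matrices, but it shows the key property: one state update per position, constant memory, and a state that carries information forward indefinitely:

```python
import numpy as np

def ssm_scan(x, A=0.9, B=1.0, C=0.5):
    """Minimal scalar SSM: h <- A*h + B*x[t], then emit y[t] = C*h."""
    h, ys = 0.0, []
    for xt in x:              # single O(L) pass, O(1) state memory
        h = A * h + B * xt    # state update
        ys.append(C * h)      # output readout
    return np.array(ys)

x = np.zeros(50)
x[0] = 1.0                    # an impulse at position 0
y = ssm_scan(x)
# With |A| < 1 the impulse decays geometrically (y[t] = 0.5 * 0.9**t)
# but never fully vanishes: the state "remembers" position 0 throughout.
```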

Mamba: Selective State Space Models

Biological Analogy (Mamba / state space models): Like simultaneously holding short-term and long-term memory while reading a genome — processed efficiently without the computational cost of attention.

Mamba improves upon S4 by making the state space parameters data-dependent. Instead of fixed A, B, C matrices, Mamba computes them based on the input:

B(t) = Linear_B(x(t))    # Input-dependent
C(t) = Linear_C(x(t))    # Input-dependent  
Δ(t) = Softplus(Linear_Δ(x(t)))  # Input-dependent discretization
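
The effect of input-dependent parameters can be shown with a toy numeric sketch. This is not Mamba's actual parameterization (the scalar state and weights here are hypothetical), but it captures the behavior: the input itself controls how strongly each position is written into the state and how quickly old state is forgotten:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, w_B=1.0, w_C=1.0):
    """Toy selective SSM: step size and write strength depend on the input."""
    h, ys = 0.0, []
    for xt in x:
        delta = softplus(xt)         # input-dependent step size
        A = np.exp(-delta)           # big input -> small A -> forget old state
        B = w_B * xt                 # input-dependent write strength
        h = A * h + (1 - A) * B * xt
        ys.append(w_C * h)
    return np.array(ys)

y = selective_scan(np.array([2.0, 0.0, 0.0, 0.0, 0.0]))
# The salient first input is written strongly into the state; the near-zero
# inputs that follow barely write anything, so the memory of position 0 persists.
```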

Why this matters for DNA:

Think of it like a researcher scanning a chromosome:

Mamba’s Computational Advantages

Mamba achieves linear complexity while maintaining long-range capability:

  1. Memory efficiency: No attention matrix needed
  2. Fast inference: Single sequential pass through sequence
  3. Long context: Can process 1M+ bp sequences
  4. Parallel training: Uses efficient parallel scan algorithms

Compared to Transformers, Mamba gives up the all-pairs view of attention in exchange for a compressed running state, gaining linear time and memory at the cost of deciding on the fly what to keep.


14.4 Caduceus: Bidirectional Mamba for DNA

The Directionality Problem

Both strands of DNA are biologically meaningful:

Transcription factor binding sites can appear on either strand. Genes can be encoded on either strand. Regulatory elements work regardless of orientation.

Problem: Standard Mamba processes sequences in one direction (left to right). This creates an asymmetry that doesn’t reflect biological reality.

Caduceus Architecture

Caduceus solves this by using bidirectional Mamba layers:

Approach 1: RC-Augmentation. Run the model on both the sequence and its reverse complement and combine the two outputs, so predictions do not depend on which strand was provided.

Approach 2: BiMamba Layers. Build bidirectionality into the architecture itself: each block scans the sequence in both directions and mixes the results.

The BiMamba block:

# Forward pass
h_fwd = Mamba(x, direction='forward')

# Reverse pass  
h_rev = Mamba(x, direction='reverse')

# Mix
output = h_fwd + h_rev  # or learnable combination
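
Why strand symmetry matters can be seen with a tiny, model-free example. The helper below uses a hypothetical strand-symmetric score (the larger motif count over the two strands); because it treats both strands the same way, the score is invariant under reverse complementation, which is exactly the property Caduceus builds into its layers:

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement: the same DNA read off the opposite strand."""
    return seq.translate(COMP)[::-1]

def motif_count(seq: str, motif: str) -> int:
    """Count (possibly overlapping) exact occurrences of a motif."""
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

def strand_symmetric_score(seq: str, motif: str = "GGAA") -> int:
    """Score both strands and take the max, so orientation does not matter."""
    return max(motif_count(seq, motif), motif_count(revcomp(seq), motif))

s = "TTGGAACCGGAAT"
assert strand_symmetric_score(s) == strand_symmetric_score(revcomp(s))
```

A model without this property can assign different scores to the two representations of the same double-stranded locus.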

Caduceus Training

Caduceus uses a family of bidirectional and reverse-complement-aware MambaDNA blocks, pretrained with masked language modeling on the human reference genome. Reported checkpoints vary in size and context length, so the safest description is architectural rather than a single fixed recipe.

Performance Comparison

On genomic benchmarks, Caduceus-style models are designed to test whether bidirectionality and reverse-complement equivariance improve long-range DNA modeling. Reported gains are benchmark-dependent and concentrate in three areas: regulatory element prediction, variant effect prediction, and downstream fine-tuning. Treat published numbers as qualitative expectations rather than fixed rankings.


14.5 Comparing Next-Generation Models

HyenaDNA vs Mamba vs Caduceus

Let’s compare the three architectures:

| Feature | HyenaDNA | Mamba | Caduceus |
|---------|----------|-------|----------|
| Core mechanism | Long convolution | State space model | Bidirectional SSM |
| Complexity | O(L log L) | O(L) | O(L) |
| Max context | 1M bp | 1M+ bp | 1M bp |
| Directionality | Bidirectional (conv) | Unidirectional | Explicitly bidirectional |
| Training speed | Fast | Very fast | Very fast |
| Inference speed | Very fast | Very fast | Very fast |
| Memory usage | Low | Very low | Very low |

When to Use Each Model

HyenaDNA: a good fit when you want very long context with a convolutional inductive bias, for example scanning 100 kb to 1 Mb loci for dispersed regulatory signal.

Mamba: the leanest option for the longest contexts; its O(L) scan and small state make it attractive when memory is tight or sequences must be processed in a single streaming pass.

Caduceus: preferred when strand symmetry matters, such as variant effect prediction, since its reverse-complement-aware blocks score both strands consistently.

Still use Transformers when: sequences are short (a few kb), attention maps are wanted for interpretability, or strong pretrained Transformer checkpoints already exist for the task.


14.6 Long-Context Applications

Application 1: Structural Variant Analysis

Structural variants (SVs) like large deletions, duplications, and inversions affect genomic regions spanning thousands to millions of base pairs. Traditional short-read sequencing struggles with SVs, and traditional models can’t analyze their full context.

Case example: Analyzing a 500 kb deletion

Using Caduceus with 1M bp context:

  1. Encode sequence with deletion and flanking regions
  2. Model captures:
    • Disrupted TAD boundaries
    • Lost enhancer-promoter contacts
    • Affected gene expression patterns
    • Compensatory regulatory changes

This enables in silico assessment of the deletion’s regulatory consequences, helping prioritize which predicted disruptions merit experimental follow-up.

Application 2: Enhancer-Promoter Prediction

Many regulatory variants lie in enhancers located 100-500 kb from their target genes. Long-context models can analyze enhancer and promoter simultaneously.

Example workflow:

  1. Input: 200 kb region containing enhancer and gene
  2. Model identifies:
    • Enhancer boundaries
    • Promoter location
    • CTCF insulator positions
    • Chromatin looping probability
  3. Variant effect prediction:
    • Test variant in enhancer
    • Predict change in gene expression
    • Account for 3D genomic context
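
Step 3's ref-versus-alt comparison is model-agnostic and can be sketched with a placeholder scorer. Here `expression_score` is a hypothetical stand-in (GC fraction of the window) for a real long-context model's expression prediction; only the delta pattern is the point:

```python
def expression_score(seq: str) -> float:
    """Placeholder for a long-context model's predicted expression.
    Real use would call the model on the full enhancer+promoter window."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def variant_effect(seq: str, pos: int, alt: str) -> float:
    """Predicted expression change when position `pos` is mutated to `alt`."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return expression_score(alt_seq) - expression_score(seq)

enhancer = "ATGCGCGCATATGCGC"                      # toy 16 bp enhancer window
delta = variant_effect(enhancer, pos=4, alt="A")   # G -> A in a GC-rich motif
# A negative delta flags the variant as predicted to reduce the signal.
```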

Real example: The BCL11A erythroid enhancer (controlling fetal hemoglobin) lies in an intron of the gene, roughly 62 kb from the promoter. Variants in this enhancer affect hemoglobin levels in sickle cell disease. Long-context models can directly model this regulatory relationship.

Application 3: Haplotype Analysis

Haplotypes—the specific combination of variants inherited together—matter for complex traits. Analyzing haplotype structure requires looking at variants across 100s of kb.

Using HyenaDNA for haplotype analysis:

  1. Input: Personal genome sequence spanning 500 kb
  2. Model identifies:
    • Haplotype blocks (regions of linkage)
    • Recombination hotspots
    • Compound heterozygous combinations
  3. Applications:
    • Pharmacogenomics (drug response haplotypes)
    • Risk prediction (combined variant effects)
    • Evolutionary history

Application 4: Locus-Wide Association

Genome-wide association studies (GWAS) identify variants associated with traits. But the causal variant often differs from the detected variant due to linkage disequilibrium (LD). Long-context models can analyze entire associated loci.

Approach:

  1. Input: 1 Mb region around GWAS signal
  2. Model evaluates:
    • Every variant’s functional impact
    • Regulatory context
    • Gene targets
    • Epistatic interactions
  3. Output: Prioritized causal variant list

Example: The FTO locus associated with obesity spans 500 kb. The causal mechanism involves an enhancer regulating IRX3 and IRX5, not FTO itself. Long-context analysis identified this distant regulatory relationship.


Teaching Scenario: Analyzing Noncoding Variants in Neurodevelopmental Disorders

Background: Dr. Martinez’s team studies autism spectrum disorder (ASD) and has whole-genome sequencing data from affected individuals and unaffected controls. Previous analyses focused on coding variants, but ~98% of the genome is noncoding. Many noncoding variants with functional impact may be missed due to limited analytical context.

Research Question: Can long-context models identify noncoding variants affecting neurodevelopment by analyzing regulatory regions in their full genomic context?

Approach:

  1. Data preparation:
    • Extracted all rare noncoding variants (MAF < 0.1%)
    • Identified 125,000 candidate regulatory variants
    • For each variant, extracted 500 kb genomic context
  2. Model application (using Caduceus):
    • Analyzed each variant in 500 kb context window
    • Predicted impact on:
      • Nearby gene expression
      • Chromatin accessibility
      • Transcription factor binding
      • Long-range regulatory interactions
  3. Prioritization:
    • Ranked variants by predicted functional impact
    • Considered gene targets expressed in brain
    • Evaluated enrichment in ASD cases vs. controls

Illustrative results:

Example validation plan: select the top 10 variants for CRISPR-based validation in neuronal cell models.

Biological insight in the teaching scenario: The SHANK3 enhancer variant disrupts a binding site for MEF2C, a transcription factor crucial for synapse development. The variant reduces SHANK3 expression by 40% in neurons. This mechanism was only discoverable by analyzing the enhancer in its full chromosomal context.

Impact:

Reference note: This case study is a hypothetical synthesis based on principles from the long-context modeling papers listed under Further Reading.


Teaching Scenario: Predicting Chromatin Structure at Megabase Scale

Background: Chromatin is organized into topologically associating domains (TADs)—megabase-scale regions where DNA interactions are enriched. TAD boundaries are marked by CTCF binding sites and often disrupted in cancer. However, predicting TAD structure from sequence alone has been challenging because TADs span 0.5-2 Mb.

Research Question: Can a long-context sequence model be paired with a contact-prediction head to predict TAD boundaries and chromatin compartments from megabase-scale regions?

Approach:

  1. Training data:
    • Hi-C contact maps from 30 cell types (ENCODE)
    • Shows which genomic regions physically interact
    • Resolution: 10 kb bins across genome
  2. Model training:
    • Input: 1 Mb DNA sequence
    • Output: Predicted contact probability for all pairs
    • Architecture: long-context sequence encoder with contact prediction head
    • Loss: Mean squared error on contact frequencies
  3. Validation:
    • Tested on held-out cell types
    • Compared to experimental Hi-C
    • Measured TAD boundary prediction accuracy

Illustrative results:

Variant Effect Prediction: The trained model can then be applied to structural variants, predicting how a boundary-disrupting deletion or inversion reshapes the contact map.

Biological Insights:

  1. TAD boundaries enriched for:
    • Convergent CTCF motifs
    • Housekeeping genes
    • GC-rich sequences
  2. Cell-type-specific TADs driven by:
    • Tissue-specific enhancers
    • Variable CTCF binding strength
    • Cohesin recruitment sites

Clinical Relevance:

Limitations:

Reference note: This scenario is inspired by sequence-to-contact-map approaches built on the long-context architectures listed under Further Reading.


14.7 Limitations and Future Directions

Current Limitations

Computational Challenges:

Biological Limitations:

Interpretability:

Validation Challenges:

Future Research Directions

1. Hybrid Architectures: combining multiple mechanisms, for example local attention for fine-grained interactions alongside convolutions or state space layers for long range, within a single model.

2. Multi-Modal Models: integrating sequence with other data, such as chromatin accessibility, methylation, and expression measurements.

3. Larger Context Windows: pushing toward chromosome-scale context beyond today's roughly 1 Mb.

4. Improved Pretraining: better training objectives than generic masked or next-token prediction, emphasizing regulatory and evolutionary signal.

5. Clinical Translation: the calibration, validation, and reporting work needed to make predictions clinically useful.

6. Cross-Species Models: learning from comparative genomics, where conservation across species highlights functional sequence.


Summary

Key Takeaways


📖 Key Terms

| Term | Definition |
|------|------------|
| **Bidirectional processing** | Analyzing DNA sequences in both forward and reverse directions simultaneously to capture biology of both strands. |
| **Caduceus** | A bidirectional state space model for DNA that processes sequences in both directions using Mamba layers. |
| **Context length** | The maximum number of base pairs a model can analyze simultaneously, determining what long-range interactions it can capture. |
| **Fast Fourier Transform (FFT)** | An efficient algorithm for computing convolutions in O(L log L) time instead of O(L²). |
| **Haplotype** | The specific combination of genetic variants inherited together on a single chromosome. |
| **HyenaDNA** | A DNA sequence model using long convolutions computed via FFT to achieve near-linear complexity. |
| **Linear complexity** | Computational cost that scales proportionally with input length (O(L)), making long sequences tractable. |
| **Long convolution** | A convolutional filter with length up to millions of positions, capable of capturing long-range dependencies. |
| **Mamba** | A state space model with selective parameters that processes sequences in O(L) time while maintaining long-range memory. |
| **Quadratic complexity** | Computational cost that scales with the square of input length (O(L²)), limiting Transformers to short sequences. |
| **State space model (SSM)** | A sequence modeling approach that maintains a hidden state evolving over positions, inspired by control theory. |
| **Structural variant (SV)** | Large genomic alterations including deletions, duplications, inversions, and translocations spanning 1 kb to megabases. |
| **Topologically associating domain (TAD)** | A genomic region spanning 0.5-2 Mb where DNA interactions are enriched, bounded by CTCF sites. |

Conceptual Questions

  1. Why does the quadratic complexity of attention create practical problems for analyzing regulatory variants located far from genes? Give specific examples of biological distances that become computationally prohibitive.

  2. Explain how HyenaDNA’s use of convolution with FFT achieves better computational complexity than attention. What is the tradeoff between these two approaches?

  3. Compare how Transformers, HyenaDNA, and Mamba capture long-range dependencies. Which biological scenarios favor each architecture?

  4. Why is bidirectional processing important for DNA sequence analysis? What biological features require seeing both strands?

  5. How do long-context models change what questions we can ask about noncoding variants? What new types of analyses become possible?

  6. A researcher wants to analyze a 300 kb deletion that spans three genes. Which model architecture would you recommend and why?

  7. Explain why state space models can process sequences in linear time. What is the key difference from attention that enables this?

  8. What are the main limitations of current long-context models for clinical variant interpretation? How might these be addressed in future work?


Discussion Questions

  1. Ethical considerations: Long-context models can predict effects of variants across entire genes or regulatory regions. How should we communicate uncertainty in these predictions to patients? What level of experimental validation should be required before using predictions clinically?

  2. Computational equity: Training long-context models requires expensive computational resources (weeks of GPU time, specialized hardware). How does this affect which research groups can develop these models? What strategies could make long-context analysis more accessible?

  3. Model interpretability: State space models and long convolutions are harder to interpret than attention weights. For clinical use, how important is interpretability versus accuracy? Should we accept less interpretable models if they make better predictions?

  4. Scaling limits: Current models analyze up to 1M bp. The human genome is 3.2 billion bp. What biological questions require even longer context (e.g., chromosome-scale or genome-scale)? Are there fundamental limits to how much context is useful?

  5. Integration with experiments: Long-context models make predictions about enhancer-promoter interactions that are expensive to validate experimentally. How should we prioritize which predictions to validate? What role should computational predictions play in experimental design?


Further Reading

Foundational Papers

  1. Poli et al. (2023) “Hyena Hierarchy: Towards Larger Convolutional Language Models” ICML
    • Introduces the Hyena operator that HyenaDNA builds on
    • https://arxiv.org/abs/2302.10866
  2. Gu & Dao (2023) “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” arXiv
    • Introduces Mamba architecture
    • https://arxiv.org/abs/2312.00752
  3. Schiff et al. (2024) “Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling” arXiv
    • Bidirectional state space models for DNA
    • https://arxiv.org/abs/2403.03234

Reviews and Perspectives

  1. Gu et al. (2022) “On the Parameterization and Initialization of Diagonal State Space Models” NeurIPS
    • Theory behind efficient state space models
    • S4 architecture and variants
  2. Nguyen et al. (2023) “HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution” NeurIPS
    • Original HyenaDNA paper
    • Applications to genomics and benchmark results

Online Resources

  1. Mamba Documentation: https://github.com/state-spaces/mamba
    • Official implementation and examples
  2. HyenaDNA GitHub: https://github.com/HazyResearch/hyena-dna
    • Code, pretrained models, tutorials
  3. Long Context Models Survey: https://github.com/Strivin0311/long-llms-learning
    • Comprehensive list of long-context architectures

What’s Next?

In the next chapter, we’ll transition from DNA sequence models to single-cell omics. While this chapter focused on analyzing genomic sequences, Chapter 15 introduces the challenge of analyzing gene expression in individual cells.

Preview of Chapter 15: Introduction to Single-Cell Omics

Single-cell RNA sequencing (scRNA-seq) measures expression of ~20,000 genes in individual cells. A typical experiment generates data from 10,000-1,000,000 cells. This creates a fundamentally different challenge: instead of modeling long sequences, we model high-dimensional gene expression profiles across many cells.

You’ll learn:

Prerequisites for Chapter 15:

Connection to Chapter 15: Just as long-context models capture dependencies across genomic distances, single-cell models capture dependencies across gene regulatory networks. Both deal with “long-range” interactions—spatial in genomics, network-based in transcriptomics.

Ready to explore how cells differ at the molecular level? → [Continue to Chapter 15: Introduction to Single-Cell Omics]